Difference between revisions of "OCR"

From Gramps
(Using OCR with GRAMPS)
(Example and related informations)
(20 intermediate revisions by 5 users not shown)
Line 1: Line 1:
While researching your family tree, you will use books or administrative documents. You may avoid long and annoying work or transcribing texts into GRAMPS by using optical character recognition ([http://en.wikipedia.org/wiki/Optical_character_recognition OCR]).
+
{{languages|OCR}}
 +
 
 +
While researching your family tree, you will use books or administrative documents. You may avoid long and annoying work or transcribing texts into Gramps by using optical character recognition ([http://en.wikipedia.org/wiki/Optical_character_recognition OCR]).
  
 
Here we show you how to work on your picture or scan of a document to change it into typed text.
 
Here we show you how to work on your picture or scan of a document to change it into typed text.
 
[[Category:Genealogy]]
 
  
 
==How does this work?==
 
==How does this work?==
Line 12: Line 12:
 
* Some programs recognize bold, italic or custom fonts size.
 
* Some programs recognize bold, italic or custom fonts size.
  
==Using OCR with GRAMPS==
+
==Using OCR with Gramps==
  
 
There is not a lot of OCR open sources programs, and those that exist mostly use the same backend. For ''Intelligent Word Recognition (IWR)'' or ''[http://en.wikipedia.org/wiki/Intelligent_Character_Recognition Intelligent Character Recognition (ICR)]'', used for written certificates, they are even rarer.
 
There is not a lot of OCR open sources programs, and those that exist mostly use the same backend. For ''Intelligent Word Recognition (IWR)'' or ''[http://en.wikipedia.org/wiki/Intelligent_Character_Recognition Intelligent Character Recognition (ICR)]'', used for written certificates, they are even rarer.
  
You can use some of the programs alongside GRAMPS. For the backend tools, only command line tools are available (use ''-h'' for the options), but fortunately some GUI's have been made too:
+
You can use some of the programs alongside Gramps. For the backend tools, only command line tools are available (use ''-h'' for the options), but fortunately some GUI's have been made too:
 +
 
 +
* [http://www.claraocr.org/en.html ClaraOCR] is intended for large scale digitalization projects. It features a powerful GUI and a web interface for cooperative digitalization of books.
 +
 
 +
* [http://conjecture.sourceforge.net/ Conjecture] is an OCR third party tool who incorporate both open sources programs code bases.
 +
 
 +
* The [http://gamera.informatik.hsnr.de/ Gamera Project] is a framework for the creation of custom OCR applications. It supports the training of custom character shapes.
  
* [http://code.google.com/p/tesseract-ocr/ Tesseract] may be a good solution for english reader but it currently only recognizes US-ASCII characters ...
 
 
* [http://jocr.sourceforge.net/ GOCR/JOCR] is used by [http://www.xsane.org/ xsane] and [http://kooka.kde.org/ kooka]. It can generate custom database characters from a picture with the command:
 
* [http://jocr.sourceforge.net/ GOCR/JOCR] is used by [http://www.xsane.org/ xsane] and [http://kooka.kde.org/ kooka]. It can generate custom database characters from a picture with the command:
 
  mkdir ./db
 
  mkdir ./db
Line 26: Line 31:
 
* With [http://www.gnu.org/software/ocrad/ocrad.html Ocrad], you need to use pgm file format.
 
* With [http://www.gnu.org/software/ocrad/ocrad.html Ocrad], you need to use pgm file format.
 
* People using KDE will probably know [http://kooka.kde.org/ Kooka] the standard KDE scanning tool with builtin OCR (using [http://jocr.sourceforge.net/ GOCR], [http://www.gnu.org/software/ocrad/ocrad.html Ocrad], which are OSS, or the commercial KADMOS).
 
* People using KDE will probably know [http://kooka.kde.org/ Kooka] the standard KDE scanning tool with builtin OCR (using [http://jocr.sourceforge.net/ GOCR], [http://www.gnu.org/software/ocrad/ocrad.html Ocrad], which are OSS, or the commercial KADMOS).
 
* [http://www.geocities.com/claraocr/ claraocr] seems to be able to [http://stderr.org/doc/clara/clara-tut.html#1.3 learn]. It opens natively [http://en.wikipedia.org/wiki/Portable_pixmap PNM] file formats.
 
  
* Also, [http://www.corollarium.com/conjecture/index.php Conjecture] is an OCR third party tool who incorporate both open sources programs code bases.
+
* [http://live.gnome.org/OCRFeeder OCRFeeder] is a document layout analysis and optical character recognition system.
 +
 
 +
* The [http://code.google.com/p/ocropus/ OCRopus] engine is based on two research projects: a high-performance handwriting recognizer developed in the mid-90's and deployed by the US Census bureau, and novel high-performance layout analysis methods. See, also [http://docs.google.com/Doc?id=dfxcv4vc_67g844kf hOCR format] and [http://code.google.com/p/hocr-tools/ hOCR tools].
  
* [http://ldp.library.jhu.edu/projects/gamera Gamera Project] looks a very promising framework written in python ...
+
* [http://code.google.com/p/tesseract-ocr/ Tesseract] may be a good solution. Version 2.00 and later support English, French, Italian, German, Spanish, Dutch.
  
==Example==
+
==Example and related information==
 
I think the easiest would be to use kooka or xsane, scan an image and do OCR or importing an existing image into kooka and do OCR.
 
I think the easiest would be to use kooka or xsane, scan an image and do OCR or importing an existing image into kooka and do OCR.
  
A [http://groundstate.ca/ocr review] of free optical character recognition software under Linux.
+
*[http://en.wikipedia.org/wiki/CuneiForm_(software) CuneiForm] is another OCR program for converting an image to text.
  
A [http://www.wired.com/gadgets/miscellaneous/news/2007/06/iliad_scan robot scans Ancient Manuscript in 3-D]. Also, This summer a group of graduate and undergraduate students of Greek will gather at the [http://chs.harvard.edu/ Center for Hellenic Studies] in Washington, D.C., to produce XML transcriptions of the text. Eventually, their work will be posted online for anyone to search under [http://creativecommons.org/about/licenses/ Creative Commons] licence.
+
*A [http://www.wired.com/gadgets/miscellaneous/news/2007/06/iliad_scan robot scans an ancient manuscript in 3-D]. Also, this summer a group of graduate and undergraduate students of Greek will gather at the [http://chs.harvard.edu/ Center for Hellenic Studies] in Washington, D.C., to produce XML transcriptions of the text. Eventually, their work will be posted online for anyone to search under a [http://creativecommons.org/about/licenses/ Creative Commons] licence.
 +
 
 +
*[http://www.geocities.jp/takascience/lego/fabs_en.html How to make a full Auto Book Scanner] with LEGO ...
 +
 
 +
*[http://spectrum.ieee.org/automaton/robotics/robotics-software/book-flipping-scanning Superfast scanner lets you digitize a book by rapidly flipping pages]
 +
 
 +
*[http://bookliberator.com bookliberator] is a set of free software and hardware to digitize books: it lets you photograph all the pages in a book without harming the book. The resulting images can be processed with free, open source software to make user-friendly files in a variety of formats.
 +
 
 +
*[http://sites.google.com/site/decapodproject/ decapodproject] is a project focused on building a low-cost digitization solution that will allow for rare materials, materials held in collections without large budgets, and other scholarly content to be digitized into a high-quality PDF format. This project will work to incorporate the hardware and software necessary to accomplish this goal.
 +
 
 +
*[http://www.diybookscanner.org/ Do It Yourself Book Scanner] shows how to make a book scanner out of stuff we can find in dumpsters or buy cheaply, including cheap off-the-shelf digital cameras. See also the related [http://code.google.com/p/diy-ebook-creator/ ebook project].
 +
 
 +
*[http://www.bookscanner.fr/ Build your own book scanner (in French)].
 +
 
 +
[[Category:Sources]]
 +
[[Category:Genealogy]]

Revision as of 06:46, 28 October 2013

While researching your family tree, you will use books or administrative documents. You can avoid the long and tedious work of transcribing texts into Gramps by using optical character recognition (OCR).

Here we show you how to work on your picture or scan of a document to change it into typed text.

How does this work?

  • Your picture needs to have strong contrast (black text, white background) and a good resolution.
  • The OCR program scans the picture and uses glyph libraries to detect the characters. Each shape that is recognised is converted into the corresponding character.
  • Dictionaries are used to minimize errors. The program compares the resulting words with existing words to decide on the outcome and to guess the words that were not fully recognized.
  • Some programs recognize bold, italic or custom font sizes.
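
The contrast and dictionary steps above can be illustrated with a small sketch. This is not how any particular OCR program is implemented: the function names, the fixed threshold of 128, the similarity cutoff and the tiny word list are all assumptions chosen for the illustration, and the fuzzy matching uses Python's standard difflib module rather than a real OCR dictionary.

```python
import difflib


def binarize(gray_rows, threshold=128):
    """Map grayscale pixels (0-255) to 0 (black text) or 255 (white background).

    The threshold of 128 is an arbitrary assumption for this sketch.
    """
    return [[0 if px < threshold else 255 for px in row] for row in gray_rows]


def correct_word(word, dictionary, cutoff=0.7):
    """Return the closest dictionary word, or the word unchanged if none is close."""
    lowered = word.lower()
    if lowered in dictionary:
        return word
    matches = difflib.get_close_matches(lowered, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


# Hypothetical word list; a real dictionary would be far larger.
WORDS = ["baptism", "marriage", "burial", "parish", "witness"]

# Typical OCR confusions: "m" read as "rn", "i" read as "l".
raw = "rnarriage in the parlsh"
fixed = " ".join(correct_word(w, WORDS) for w in raw.split())
print(fixed)  # prints: marriage in the parish
```

Words with no close match (here "in" and "the") are left untouched, which is why a low cutoff or a small dictionary can easily make results worse instead of better.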

Using OCR with Gramps

There are not many open source OCR programs, and those that exist mostly use the same backend. Programs for Intelligent Word Recognition (IWR) or Intelligent Character Recognition (ICR), which are used for handwritten certificates, are even rarer.

You can use some of these programs alongside Gramps. The backend tools are command-line only (use -h to list the options), but fortunately some GUIs have been made too:

  • ClaraOCR is intended for large-scale digitization projects. It features a powerful GUI and a web interface for cooperative digitization of books.
  • Conjecture is a third-party OCR tool that incorporates the code bases of both open source programs.
  • The Gamera Project is a framework for the creation of custom OCR applications. It supports the training of custom character shapes.
  • GOCR/JOCR is used by xsane and kooka. It can build a custom character database from a picture with these commands:
mkdir ./db
gocr -p ./db/ -m 130 -m 256 certificate.pnm
For each new letter it recognizes, this will ask you what its value is (a, b, ...), and it will generate a new index (db.list) plus a portable bitmap (pbm) file for each of your letters. Each entry in db.list refers to one of the .pbm files and links it to your custom value (a, b, c, ...).
However, this is not very successful on handwritten text.
  • With Ocrad, the input image must be in a PNM format such as pgm.
  • People using KDE will probably know Kooka, the standard KDE scanning tool with built-in OCR (using GOCR or Ocrad, which are open source, or the commercial KADMOS).
  • OCRFeeder is a document layout analysis and optical character recognition system.
  • The OCRopus engine is based on two research projects: a high-performance handwriting recognizer developed in the mid-1990s and deployed by the US Census Bureau, and novel high-performance layout analysis methods. See also the hOCR format and hOCR tools.
  • Tesseract may be a good solution. Versions 2.00 and later support English, French, Italian, German, Spanish and Dutch.
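
Since several of the backends listed above read PNM files, here is a minimal sketch of producing a binary PGM image from Python and of assembling the GOCR training command shown earlier. The helper names encode_pgm and gocr_training_command are hypothetical, the pixel data is made up, and the command is only built as a list here, not executed.

```python
def encode_pgm(gray_rows, maxval=255):
    """Encode rows of grayscale pixels (0..maxval) as a binary PGM ("P5") image."""
    height = len(gray_rows)
    width = len(gray_rows[0])
    if any(len(row) != width for row in gray_rows):
        raise ValueError("all rows must have the same width")
    # PGM header: magic number, width and height, maximum gray value.
    header = f"P5\n{width} {height}\n{maxval}\n".encode("ascii")
    body = bytes(px for row in gray_rows for px in row)
    return header + body


def gocr_training_command(image, db_dir="./db/"):
    """Build the argument list for the GOCR database command quoted in this page."""
    return ["gocr", "-p", db_dir, "-m", "130", "-m", "256", str(image)]


# A made-up 2x2 image: black, white / white, black.
data = encode_pgm([[0, 255], [255, 0]])
with open("sample.pgm", "wb") as handle:
    handle.write(data)

print(gocr_training_command("certificate.pnm"))
```

The argument list can be handed to subprocess.run from a terminal session if gocr is installed; because GOCR asks about each new letter interactively, it should not be run unattended.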

Example and related information

I think the easiest approach is to use kooka or xsane: scan an image and run OCR on it, or import an existing image into kooka and run OCR on that.
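
Whichever tool produces the text, the raw OCR output usually needs some tidying before it is pasted into Gramps. The following is a rough sketch of one possible cleanup, not a feature of kooka, xsane or Gramps; the function name tidy_ocr_text and the sample text are invented for the illustration. Note that it also joins genuine hyphenated compounds that happen to break at the end of a line.

```python
import re


def tidy_ocr_text(text):
    """Rejoin hyphenated line breaks and collapse whitespace, keeping paragraphs."""
    # "par-\nish" becomes "parish".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Blank lines separate paragraphs; single newlines are treated as spaces.
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())


sample = "Born in the par-\nish of  St. Mary\n\nWitness: J. Smith\n"
print(tidy_ocr_text(sample))
```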

  • bookliberator is a set of free software and hardware to digitize books: it lets you photograph all the pages in a book without harming the book. The resulting images can be processed with free, open source software to make user-friendly files in a variety of formats.
  • decapodproject is a project focused on building a low-cost digitization solution that will allow for rare materials, materials held in collections without large budgets, and other scholarly content to be digitized into a high-quality PDF format. This project will work to incorporate the hardware and software necessary to accomplish this goal.