Difference between revisions of "OCR"

From Gramps
Line 1: Line 1:
While researching your family tree, you will find textual publications or administrative documents. You may avoid long and annoying work into GRAMPS by using optical character recognition ([http://en.wikipedia.org/wiki/Optical_character_recognition OCR]).
+
While researching your family tree, you will use books or administrative documents. You may avoid long and annoying work or transcribing texts into GRAMPS by using optical character recognition ([http://en.wikipedia.org/wiki/Optical_character_recognition OCR]).
  
Here we show how you can work on your picture to make it to text !!!
+
Here we show you how to work on your picture or scan of a document to change it into typed text.
  
 
[[Category:Genealogy]]
 
[[Category:Genealogy]]
Line 7: Line 7:
 
==How this work ?==
 
==How this work ?==
  
* picture need to be contrasted with a good resolution
+
* You picture needs to have a strong contrast (black text, white backgrount), and have a good resolution.
* OCR programs read the picture and with forms librairies, detect the characters in order to make some correspond the form to the awaited character
+
* The OCR programs scans the picture and uses a glyphs libraries to detect the characters. Those that are recognised are transformed in the corresponding character.
* dictionnaries will be used for minimized errors. They make comparison between existing words and your result.
+
* Dictionnaries will be used to minimize errors. These compare the resulting words with existing words to determine the outcome and guess not fully recognized words.
* Some programs allow bold, italic or custom fonts size.
+
* Some programs recognize bold, italic or custom fonts size.
  
==Using into GRAMPS==
+
==Using OCR with GRAMPS==
 +
 
 +
There is not a lot of OCR open sources programs, and those that exist mostly use the same backend. For ''Intelligent Word Recognition (IWR)'' or ''[http://en.wikipedia.org/wiki/Intelligent_Character_Recognition Intelligent Character Recognition (ICR)]'', used for written certificates, they are even rarer.
 +
 
 +
You can use some of the programs alongside GRAMPS. For the backend tools, only command line tools are available (use ''-h'' for the options), but fortunately some GUI's have been made too:
  
There is not a lot of OCR open sources programs.
 
Intelligent Word Recognition (IWR), [http://en.wikipedia.org/wiki/Intelligent_Character_Recognition Intelligent Character Recognition (ICR)] for written certificates are hight level. They are used on financial, historical sectors. Some programs may be used as third party of GRAMPS. ''-h'' on command line for output options.
 
 
* [http://code.google.com/p/tesseract-ocr/ Tesseract] may be a good solution for english reader but it currently only recognizes US-ASCII characters ...
 
* [http://code.google.com/p/tesseract-ocr/ Tesseract] may be a good solution for english reader but it currently only recognizes US-ASCII characters ...
* [http://www.geocities.com/claraocr/ claraocr] seems to be able to learn  but I do not find any documentation. Also, need to use pgm or pbm file format.
+
* [http://jocr.sourceforge.net/ GOCR/JOCR] is used by [http://www.xsane.org/ xsane] and [http://kooka.kde.org/ kooka]. It can generate custom database characters from a picture with the command:
* [http://jocr.sourceforge.net/ GOCR/JOCR] is using by [http://www.xsane.org/ xsane] and [http://kooka.kde.org/ kooka], may generate a custom database characters with:
+
mkdir db
'''<code>mkdir db'''<p>
+
gocr -p db -m 130 -m 256 certificate.png
'''gocr -p db -m 130 -m 256 certificate.png '''</p></code>
+
:This will ask you of each new letter it recognizes what the value is (a,b,...), and will generate a new index (db.list) + portable-bitmap (pbm) for your letters. Each key entry on db.list is one of the .pbm files and is connected to your custom value (a, b, c ...)  
<blockquote>
+
:This is however not very successfull on written text.
This will ask you for each new letters and will generate a new index (db.list) + portable-bitmap (pbm) for your letters. Each key entry on db.list is one of this .pbm related to your custom value (a, b, c ...)  
+
</blockquote>
+
Not very successfull on written text.
+
 
* With [http://www.gnu.org/software/ocrad/ocrad.html Ocrad], you need to use pgm file format
 
* With [http://www.gnu.org/software/ocrad/ocrad.html Ocrad], you need to use pgm file format
 +
* People using KDE will probably know [http://kooka.kde.org/ Kooka] the standard KDE scanning tool with builtin OCR (using [http://jocr.sourceforge.net/ GOCR], [http://www.gnu.org/software/ocrad/ocrad.html Ocrad] which are OSS, or the commercial KADMOS).
 +
 +
* [http://www.geocities.com/claraocr/ claraocr] seems to be able to learn  but I do not find any documentation. Also, need to use pgm or pbm file format.
 +
 +
*Also, [http://www.corollarium.com/conjecture/index.php Conjecture] is an OCR third party tool who incorporate both open sources programs code bases.
  
Also, [http://www.corollarium.com/conjecture/index.php Conjecture] is an OCR third party tool who incorporate both open sources programs code bases.
+
==Example==
 +
I think the easiest would be to use kooka or xsane, scan an image and do OCR....

Revision as of 13:57, 5 April 2007

While researching your family tree, you will use books or administrative documents. You can avoid the long and tedious work of transcribing these texts into GRAMPS by using optical character recognition (OCR).

This page shows how to turn a picture or scan of a document into typed text.

How does this work?

  • Your picture needs strong contrast (black text on a white background) and a good resolution.
  • The OCR program scans the picture and uses a library of glyphs to detect the characters. Each shape that is recognised is converted into the corresponding character.
  • Dictionaries are used to minimize errors. The program compares the resulting words with existing words to settle the outcome and to guess words that were not fully recognized.
  • Some programs recognize bold, italic or custom font sizes.
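
As a rough illustration of the dictionary step above, the following sketch lists the OCR words that do not appear in a word list, so they can be checked by hand. The file names words.txt and ocr_words.txt and their contents are made up for this example; real OCR programs do this internally and in a more refined way.

```shell
# Work in a temporary directory so no existing files are touched.
workdir=$(mktemp -d)

# Hypothetical word list, one known word per line.
printf '%s\n' baptism born married > "$workdir/words.txt"

# Hypothetical OCR result, one word per line; "rnarried" is a misread "married".
printf '%s\n' born rnarried baptism > "$workdir/ocr_words.txt"

# Keep only the OCR words that match no line of the word list exactly.
suspects=$(grep -vxFf "$workdir/words.txt" "$workdir/ocr_words.txt")
echo "$suspects"   # prints: rnarried
```

This only flags suspect words; it does not try to guess the intended word.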

Using OCR with GRAMPS

There are not many open source OCR programs, and those that exist mostly use the same backend. Programs for Intelligent Word Recognition (IWR) or Intelligent Character Recognition (ICR), which are used for handwritten certificates, are even rarer.

You can use some of the programs alongside GRAMPS. The backend tools are command line only (use -h to list the options), but fortunately some GUIs have been made too:

  • Tesseract may be a good solution for English readers, but it currently only recognizes US-ASCII characters.
  • GOCR/JOCR is used by xsane and kooka. It can build a custom character database from a picture with the commands:
mkdir db
gocr -p db -m 130 -m 256 certificate.png
This asks you for the value (a, b, ...) of each new letter it finds, and generates a new index (db.list) plus a portable bitmap (pbm) file for each letter. Each entry in db.list names one of the .pbm files and links it to your custom value (a, b, c ...).
This is however not very successful on handwritten text.
  • With Ocrad, you need to use the pgm file format.
  • People using KDE will probably know Kooka, the standard KDE scanning tool with built-in OCR (using GOCR or Ocrad, which are open source, or the commercial KADMOS).
  • claraocr seems to be able to learn, but I could not find any documentation. It also needs the pgm or pbm file format.
  • Also, Conjecture is a third-party OCR tool that incorporates the code bases of both open source programs.
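
Ocrad and claraocr above expect pgm or pbm input, and OCR in general works best on high-contrast images. The sketch below shows what the two formats look like and how a grey image can be reduced to pure black and white. It assumes a hand-written plain (ASCII) pgm file with a three-line header and no comment lines; the file names and the threshold of 128 are arbitrary choices for the example. In practice an image editor or converter would do this step on a real scan.

```shell
# Work in a temporary directory so no existing files are touched.
workdir=$(mktemp -d)

# A tiny plain PGM (P2): 3x2 pixels, grey values from 0 (black) to 255 (white).
cat > "$workdir/tiny.pgm" <<'EOF'
P2
3 2
255
10 200 30
250 20 240
EOF

# Threshold at 128 to get a plain PBM (P1): dark pixels become 1 (black),
# light pixels become 0 (white). The maxval line (line 3) is dropped,
# because PBM has no such header field.
awk 'NR == 1 { print "P1"; next }
     NR == 2 { print; next }
     NR == 3 { next }
     { for (i = 1; i <= NF; i++)
           printf "%s%d", (i > 1 ? " " : ""), ($i < 128 ? 1 : 0)
       print "" }' "$workdir/tiny.pgm" > "$workdir/tiny.pbm"

cat "$workdir/tiny.pbm"
```

The resulting file contains only the values 0 and 1, which is the strong black-on-white contrast that the OCR programs work best with.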

Example

The easiest approach is probably to use kooka or xsane: scan an image and run OCR on it.