OCR

While researching your family tree, you will use books and administrative documents. You can avoid the long and tedious work of transcribing these texts into GRAMPS by using optical character recognition (OCR).

This page shows how to turn a picture or scan of a document into typed text.

How does this work?

Your picture needs strong contrast (black text on a white background) and a good resolution.
The OCR program scans the picture and uses glyph libraries to detect the characters. Each shape that is recognised is converted into the corresponding character.
Dictionaries are used to minimize errors. The resulting words are compared with existing words to settle the outcome and to guess words that were not fully recognized.
Some programs also recognize bold, italic or custom font sizes.
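
The dictionary step can be imitated by hand after the OCR run. The following is only a rough sketch of the idea, not how any particular OCR engine works internally. The function name unknown_words and both file arguments are hypothetical, the word list is assumed to hold one lowercase word per line, and only unaccented letters are handled reliably.

```shell
# Sketch: list the words of an OCR result that are missing from a word list.
# $1 = text file produced by OCR, $2 = word list (one lowercase word per line).
unknown_words() {
    ocr_text="$1"
    word_list="$2"
    # Split the text into one word per line, lowercase it, drop duplicates
    # and empty lines, then keep only words with no exact match in the list.
    tr -cs '[:alpha:]' '\n' < "$ocr_text" \
        | tr '[:upper:]' '[:lower:]' \
        | sort -u \
        | grep -v '^$' \
        | grep -v -x -F -f "$word_list"
}
```

Calling unknown_words certificate.txt words.txt would print the words worth checking against the original scan. Names and places are rarely in a dictionary, so they will usually show up in that list.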

Using OCR with GRAMPS

There are not many open source OCR programs, and those that exist mostly use the same backends. Programs for Intelligent Word Recognition (IWR) or Intelligent Character Recognition (ICR), which are used for handwritten certificates, are even rarer.

You can use some of these programs alongside GRAMPS. The backends themselves are command line tools only (use -h to list the options), but fortunately some GUIs have been built on top of them:

Tesseract may be a good solution. Versions 2.00 and later support English, French, Italian, German, Spanish and Dutch.
GOCR/JOCR is used by xsane and kooka. It can build a custom character database from a picture with the commands:

mkdir ./db
gocr -p ./db/ -m 130 -m 256 certificate.pnm

For each new letter it finds, GOCR asks you which character it represents (a, b, ...). It then generates a new index (db.list) plus a portable bitmap (pbm) file for each letter. Each entry in db.list refers to one of the .pbm files and links it to the value you entered (a, b, c, ...).

However, this is not very successful on handwritten text.

With Ocrad, you need to use the pgm file format.
KDE users will probably know Kooka, the standard KDE scanning tool with built-in OCR (using GOCR or Ocrad, which are open source, or the commercial KADMOS).
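
If you prefer the command line to a GUI, a small shell function can run a backend over a whole folder of scans. This is a minimal sketch that assumes Tesseract is installed and that the scans are .tif files. The function name ocr_scans is hypothetical, and the language codes (eng, fra, ita, deu, spa, nld) must match the language data installed on your system.

```shell
# Sketch: run Tesseract on every .tif scan in a directory.
# Usage: ocr_scans DIRECTORY [LANGUAGE]
ocr_scans() {
    scan_dir="$1"
    lang="${2:-eng}"    # Tesseract language code, English by default
    for img in "$scan_dir"/*.tif; do
        [ -e "$img" ] || continue    # skip when no .tif file matches
        base="${img%.tif}"
        # Tesseract adds the .txt extension to the output name itself.
        tesseract "$img" "$base" -l "$lang"
    done
}
```

For example, ocr_scans ./certificates fra would write one .txt file next to each French scan, ready to be checked and pasted into a GRAMPS note.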

Conjecture is a third-party OCR tool that incorporates the code bases of both open source programs.

The Gamera Project is a framework for the creation of custom OCR applications. It supports the training of custom character shapes.

Examples and related information

The easiest approach is probably to use kooka or xsane: either scan an image and run OCR on it, or import an existing image into kooka and run OCR there.

CuneiForm is another OCR program for converting images to text.

A robot scans an ancient manuscript in 3-D. This summer a group of graduate and undergraduate students of Greek will gather at the Center for Hellenic Studies in Washington, D.C., to produce XML transcriptions of the text. Eventually, their work will be posted online under a Creative Commons licence for anyone to search.

How to make a full Auto Book Scanner with LEGO ...

Superfast scanner lets you digitize a book by rapidly flipping pages
