Optical Character Recognition (OCR) for Coptic
This page describes how you can convert scanned documents (for example, older books) into text or Word files using free tools. OCR stands for Optical Character Recognition.
There are many very good and accurate commercial OCR tools, such as FineReader from ABBYY or OmniPage from Nuance, but they all lack a convenient way to add new Unicode languages. They are also relatively expensive.
One popular free OCR engine that also supports Unicode is tesseract-ocr. It was developed at HP Labs between 1985 and 1995 and is now available as open source, hosted by Google. Tesseract-ocr is only an engine; several front ends (GUIs) have been built on top of it, such as VietOCR and Softi FreeOCR.
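Because tesseract-ocr is only an engine, it can also be driven directly from the command line without a front end. The following is a minimal sketch in Python, assuming that the tesseract executable is on your PATH and that trained data for the language code "cop" is already present in its tessdata directory. The function names and the file names in the usage note are hypothetical.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="cop"):
    """Return the argument list for one tesseract run.

    tesseract writes the recognized text to <output_base>.txt.
    The language code "cop" assumes Coptic trained data is installed.
    """
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="cop"):
    """Run tesseract on one image (assumes tesseract is on the PATH)."""
    command = build_tesseract_command(image_path, output_base, lang)
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(command, check=True)
```

For example, run_ocr("page.tif", "page") should leave the recognized text in page.txt, provided the assumptions above hold.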
On this page I describe in more detail how you can use VietOCR to convert Coptic documents into Unicode Coptic text. Before you read on, please be aware of the following important facts:
- tesseract/VietOCR do not use sophisticated page-analysis techniques like those in commercial tools. To get usable results, you must scan the documents at good quality, using a resolution of at least 300 dpi. I also recommend using an image program to denoise, deskew and clean the scanned image.
- Scan only documents containing pure Coptic text. Recognition quality for Coptic texts set in old fonts will be very poor, depending on the trained data. If your scanned document contains images, you must mark the text-only regions manually before starting recognition.
- The overall performance cannot keep up with commercial tools.
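If you are unsure whether an existing scan reaches the recommended 300 dpi, you can estimate its resolution from the image width in pixels and the physical width of the page. The sketch below is only an illustration of that arithmetic; the function names and the example page size are assumptions, not part of tesseract or VietOCR.

```python
MM_PER_INCH = 25.4


def scan_dpi(pixel_width, page_width_mm):
    """Estimate the scan resolution in dots per inch."""
    page_width_inches = page_width_mm / MM_PER_INCH
    return pixel_width / page_width_inches


def meets_minimum(pixel_width, page_width_mm, minimum_dpi=300):
    """Return True if the scan reaches at least minimum_dpi."""
    return scan_dpi(pixel_width, page_width_mm) >= minimum_dpi
```

For a page that is 127 mm (5 inches) wide, an image 1600 pixels wide corresponds to about 320 dpi and is sufficient, while an image 1000 pixels wide corresponds to about 200 dpi and should be rescanned.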
Follow these steps if you want to use my trained data with tesseract and VietOCR.NET:
- Download and install VietOCR.NET.
- Open the file "ISO639-3.xml", which is installed (normally in C:\Program Files\VietUnicode\VietOCR.NET\Data), with any editor and add the following line:
<entry key="cop">Coptic</entry>
- Download the Coptic files I have generated from here and unzip them into the directory:
C:\Program Files\VietUnicode\VietOCR.NET\tessdata
- Now, if you start VietOCR.NET, you should be able to select Coptic as the language. Use the Font menu to select a Coptic Unicode font.
- Open a scanned document (TIFF format, black and white, no compression), mark the region you would like to recognize and then press "OCR". You can use this sample for testing.
- If you are not satisfied with the results, you can optimize the recognition for your font type. Follow the instructions on the tesseract home page for training your own data. CAUTION: it is time-consuming!
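If you prefer not to edit "ISO639-3.xml" by hand, the language entry from the steps above can also be added with a small script. This is a minimal sketch, assuming the file is UTF-8 text whose root element closes with </properties>; check your copy of the file first, since that layout is an assumption on my part. The function names are hypothetical, and writing below C:\Program Files normally requires administrator rights.

```python
def add_language_entry(xml_text, key="cop", name="Coptic"):
    """Return xml_text with an <entry> line for the language added.

    Assumes the root element closes with </properties>.
    The text is returned unchanged if the key is already present.
    """
    if '<entry key="%s">' % key in xml_text:
        return xml_text
    closing_tag = "</properties>"
    index = xml_text.rfind(closing_tag)
    if index == -1:
        raise ValueError("closing </properties> tag not found")
    entry = '<entry key="%s">%s</entry>\n' % (key, name)
    # Insert the new entry directly before the closing root tag.
    return xml_text[:index] + entry + xml_text[index:]


def register_language(xml_path, key="cop", name="Coptic"):
    """Update the file in place (assumes the file is UTF-8 encoded)."""
    with open(xml_path, encoding="utf-8") as handle:
        original = handle.read()
    updated = add_language_entry(original, key, name)
    if updated != original:
        with open(xml_path, "w", encoding="utf-8") as handle:
            handle.write(updated)
```

With the default installation path you would call register_language(r"C:\Program Files\VietUnicode\VietOCR.NET\Data\ISO639-3.xml"). Keeping a backup copy of the file before running the script is a sensible precaution.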
Good luck!
Last updated: 09.04.2009
Moheb Mekhaiel


