Moheb's Coptic Pages

Optical Character Recognition for Coptic (OCR)


This page describes how you can convert scanned documents (for example, older books) into text or Word files using free tools (OCR = Optical Character Recognition).


There are many very good and accurate commercial OCR tools, such as FineReader from ABBYY or OmniPage from Nuance, but they all lack a convenient way of adding new Unicode languages. Besides, they are relatively expensive.
One popular free OCR engine that also supports Unicode is tesseract-ocr. It was developed at HP Labs between 1985 and 1995 and is now available as open source at Google. Tesseract-ocr is only an engine. There are several free graphical front ends for Windows, such as VietOCR and Paperfile FreeOCR.
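
Because tesseract-ocr is only an engine, it can also be run directly from the command line without a front end. The following is a minimal Python sketch of that, under several assumptions: the `tesseract` program is on the PATH, Coptic language data is installed under the language code "cop" (the code used later on this page), and the file names are purely illustrative.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="cop"):
    """Return the argument list for a basic Tesseract command-line run.

    Tesseract appends ".txt" to output_base itself, so the name is
    passed without an extension.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def run_ocr(image_path, output_base, lang="cop"):
    """Run Tesseract and return the recognized text as a string."""
    command = build_tesseract_command(image_path, output_base, lang)
    subprocess.run(command, check=True)  # raises if Tesseract fails
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # "page.tif" is a hypothetical scanned page.
    print(" ".join(build_tesseract_command("page.tif", "page")))
```

Calling `run_ocr("page.tif", "page")` would write `page.txt` next to the script and return its contents, provided the assumptions above hold.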

On this page I describe in more detail how you can OCR Coptic documents, that is, convert Coptic images (TIFF, image PDFs) into Unicode Coptic text.
You have 3 options, each described in its own section below: Tesseract, VietOCR.NET and FreeOCR.
But before you read on, I would like to make you aware of the following important facts:
  • These tools do not use sophisticated page-analysis techniques like those used in commercial software. To get usable results, you must scan the documents in good quality, at a resolution of at least 300 dpi. I also recommend using an image program to denoise, deskew and clean the scanned image.
  • Scan only documents containing pure Coptic text. Recognition quality for Coptic texts set in old fonts will be very poor, depending on the trained data. If your scanned document contains images, you must manually mark the text-only regions before starting recognition.
  • The overall performance cannot keep up with commercial tools.
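
If a scan carries no resolution information, the 300 dpi recommendation can be checked from the pixel width of the image and the physical width of the page. The following is only a small arithmetic sketch. The function names are made up for illustration, and the page width has to be measured by hand.

```python
MM_PER_INCH = 25.4


def effective_dpi(pixels, physical_mm):
    """Dots per inch for a scan `pixels` wide across `physical_mm` millimetres."""
    return pixels / (physical_mm / MM_PER_INCH)


def is_sharp_enough(pixels, physical_mm, minimum_dpi=300):
    """True if the scan reaches the recommended minimum resolution."""
    return effective_dpi(pixels, physical_mm) >= minimum_dpi


if __name__ == "__main__":
    # Hypothetical book page, 127 mm wide, scanned 1600 pixels wide.
    print(round(effective_dpi(1600, 127)))
    print(is_sharp_enough(1600, 127))
```

A scan that fails this check is better rescanned at a higher resolution than enlarged afterwards, since enlarging adds no detail.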

Tesseract

You can download tesseract at: http://code.google.com/p/tesseract-ocr/. Follow the instructions there to install it.

Out of the box, Tesseract cannot recognize Coptic. You must train it first. This means that you need some sample TIFF pages with Coptic text, and you must then "tell" tesseract which letter is found where in the image. The training process is described in more detail on the tesseract home page. The more samples with different fonts we let tesseract learn, the better the overall recognition quality. Training it on just one font will lead to almost perfect recognition quality, but only for that font.

The training process is somewhat tedious. There are graphical tools, such as bbtesseract, that simplify it to some extent. I have already done the training for most of the font types I know. You can get my files here. You can also use these files as a starting point for more training or for adapting to particular fonts.
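
"Telling" tesseract which letter is where is done through so-called box files, which pair each character with its bounding box in the image. The sketch below reads such a file. It assumes the usual layout of one character per line followed by left, bottom, right and top pixel coordinates (measured from the bottom-left corner of the image) and, in newer versions, a page number. The sample coordinates are invented, and the tesseract training documentation remains the authority on the exact format for your version.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_line(line):
    """Parse one box-file line: symbol, left, bottom, right, top[, page]."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"not a box line: {line!r}")
    left, bottom, right, top = (int(value) for value in fields[1:5])
    # A trailing page number, if present, is ignored here.
    return Box(fields[0], left, bottom, right, top)


def parse_box_file(text):
    """Parse all non-empty lines of a box file given as a string."""
    return [parse_box_line(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    sample = "ⲁ 10 20 30 45 0\nⲃ 35 20 52 46 0\n"  # invented coordinates
    for box in parse_box_file(sample):
        print(box.symbol, box.right - box.left, box.top - box.bottom)
```

Reading the boxes this way makes it easy to spot obviously wrong entries, for example boxes of zero width, before starting a long training run.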

VietOCR.NET

Follow these steps if you want to use my trained data with tesseract and VietOCR.NET:
  • Download and install VietOCR.NET.
  • Open the file "ISO639-3.xml" (normally installed at C:\Program Files\VietUnicode\VietOCR.NET\Data) with any editor and add the following line:
    <entry key="cop">Coptic</entry>
  • Download the Coptic files I have generated from here and unzip them into the directory:
    C:\Program Files\VietUnicode\VietOCR.NET\tessdata
  • Now, when you start VietOCR.NET, you should be able to select Coptic as the language. Use the Font menu to select a Coptic Unicode font.
  • Open a scanned document (TIFF format, black/white, no compression), mark the region you would like to recognize and then press "OCR". You can use this sample for testing.
  • If you are not satisfied with the results, you can optimize the recognition for your font type. Follow the instructions on the tesseract home page for training your own data. CAUTION: it is time-consuming!
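
The edit to "ISO639-3.xml" can also be scripted. The sketch below works on the file contents as plain text and places the new line after the last existing entry, so everything else in the file stays untouched. It assumes only that the file already contains `<entry ...>` lines like the one shown above. The sample content in the usage part, including the `properties` root element, is invented for illustration.

```python
COPTIC_ENTRY = '<entry key="cop">Coptic</entry>'


def add_coptic_entry(xml_text):
    """Return xml_text with the Coptic entry added after the last entry."""
    if 'key="cop"' in xml_text:
        return xml_text  # already present, nothing to do
    closing = "</entry>"
    position = xml_text.rfind(closing)
    if position == -1:
        raise ValueError("no <entry> elements found")
    insert_at = position + len(closing)
    return xml_text[:insert_at] + "\n" + COPTIC_ENTRY + xml_text[insert_at:]


if __name__ == "__main__":
    # Invented sample content, not the real file.
    sample = (
        "<properties>\n"
        '<entry key="eng">English</entry>\n'
        '<entry key="vie">Vietnamese</entry>\n'
        "</properties>\n"
    )
    print(add_coptic_entry(sample))
```

To apply it to the real file, read the file into a string, pass it through `add_coptic_entry` and write the result back. Keep a backup copy first. Writing under C:\Program Files may require administrator rights.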

FreeOCR

Working with the FreeOCR front end is similar to working with VietOCR above. You can get FreeOCR at: Paperfile FreeOCR. The Coptic files must be unzipped to C:\WINDOWS\tessdata.
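
Unzipping the Coptic files into a tessdata directory can be done with any archive tool. As a minimal sketch, the same step in Python might look as follows. The archive name "coptic.zip" is hypothetical, and the target directory is whichever tessdata folder your front end uses.

```python
import zipfile
from pathlib import Path


def install_language_files(archive_path, tessdata_dir):
    """Extract every file in the archive into tessdata_dir.

    Returns the sorted names of the archive members.
    """
    target = Path(tessdata_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        return sorted(archive.namelist())


if __name__ == "__main__":
    # Hypothetical archive name. Adjust both paths to your system.
    for name in install_language_files("coptic.zip", r"C:\WINDOWS\tessdata"):
        print(name)
```

If the archive wraps the files in a folder, they will land in a subfolder of the target and need to be moved up one level by hand.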

Good luck!




last updated: 24.08.2010
Moheb Mekhaiel
