Optical Character Recognition for Coptic (OCR)
This page describes how to convert scanned documents (for example, older books) into text or Word files using free tools. OCR stands for Optical Character Recognition.
There are many very good and accurate commercial OCR tools, such as FineReader from ABBYY or OmniPage from Nuance, but they all lack a convenient way of adding new Unicode languages. They are also relatively expensive.
One popular free OCR engine that supports Unicode is tesseract-ocr. There are several free front ends (GUIs) for Windows, such as VietOCR and Paperfile FreeOCR.
On this page I describe in more detail how to OCR Coptic documents, that is, how to convert Coptic images (TIFF, image PDFs) into Unicode Coptic text.
You have 3 options:
- To achieve good results, scan the documents at a resolution of at least 300 dpi and save the scans in black-and-white TIFF format. I also recommend using an image program to de-noise, deskew and clean the scanned image.
- Scan only documents containing pure Coptic text. Recognition quality for Coptic texts set in old fonts can be very poor, depending on the trained data.
- The overall performance cannot keep up with commercial tools, but you will get an output file in Coptic Unicode.
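As a rough check of the 300 dpi advice above, the effective resolution of a scan can be estimated from its pixel width and the physical width of the scanned page. The following is a minimal Python sketch using only the standard library; the function names `effective_dpi` and `is_sharp_enough` and the example page size are illustrative assumptions and do not belong to any of the tools mentioned on this page.

```python
MM_PER_INCH = 25.4
MIN_DPI = 300  # minimum resolution recommended above


def effective_dpi(pixel_width: int, page_width_mm: float) -> float:
    """Return the scan resolution in dots per inch."""
    if pixel_width <= 0 or page_width_mm <= 0:
        raise ValueError("pixel width and page width must be positive")
    return pixel_width / (page_width_mm / MM_PER_INCH)


def is_sharp_enough(pixel_width: int, page_width_mm: float) -> bool:
    """True if the scan meets the recommended minimum resolution."""
    # Small tolerance so that rounding in the mm-to-inch conversion does
    # not reject a scan that sits exactly at the limit.
    return effective_dpi(pixel_width, page_width_mm) >= MIN_DPI - 1e-6


if __name__ == "__main__":
    # Hypothetical example: a 152.4 mm (6 inch) wide page, 1800 px wide.
    print(round(effective_dpi(1800, 152.4)))
    print(is_sharp_enough(900, 152.4))
```

If the check fails, rescanning at a higher resolution is usually better than upscaling the image afterwards, since upscaling adds no detail for the recognizer to work with.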
- jTessBoxEditor (my favorite), from the VietOCR developer.
- bbtesseract, with a very clear GUI and options.
- Tesseract version 2 (right click on link to download) - trained with several fonts.
- Tesseract version 3 (right click on link to download) - trained with several fonts.
- Download and install VietOCR.NET.
- Open the file "ISO639-3.xml", which is installed (normally at C:\Program Files\VietUnicode\VietOCR.NET\Data), with any editor and add the following line:
- Download the Coptic files I have generated (see above) and copy them into the directory:
- Now, when you start VietOCR.NET, you should be able to select Coptic as the language. Use the Font menu to select a Coptic Unicode font.
- Open a scanned document (TIFF format, black/white, no compression), mark the region you would like to recognize and then press "OCR". You can use this sample for testing.
- If you are not satisfied with the results, you can optimize the recognition for your font. Follow the instructions on the tesseract home page for training your own data. CAUTION: this is time-consuming!
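The recognition step can also be run without a GUI by calling the tesseract command-line program directly. The sketch below is a minimal Python wrapper around that call. It assumes that the `tesseract` executable is on the PATH and that the Coptic trained data is installed under the language code `cop` (the ISO 639-3 code for Coptic; the name your installation expects depends on the trained-data files you copied). The helper names and the file name in the comment are hypothetical.

```python
import subprocess
from pathlib import Path


def build_command(image: str, output_base: str, lang: str = "cop") -> list[str]:
    """Build the argument list: tesseract <image> <output base> -l <lang>."""
    return ["tesseract", image, output_base, "-l", lang]


def ocr_image(image: str, lang: str = "cop") -> str:
    """Run tesseract on one image and return the recognized text."""
    # Use the image name without its extension as the output base.
    output_base = str(Path(image).with_suffix(""))
    subprocess.run(build_command(image, output_base, lang), check=True)
    # tesseract appends ".txt" to the output base and writes UTF-8 text.
    return Path(output_base + ".txt").read_text(encoding="utf-8")


# Example call (hypothetical file name, requires tesseract to be installed):
#     text = ocr_image("page001.tif")
```

Unlike the "mark the region" step in the GUI, this processes the whole image, so crop the scan to the Coptic text beforehand if the page contains other material.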