Converting to digital text | PanLex development

Introduction

Most of the world’s bilingual and multilingual dictionaries are printed on paper, so some of the sources acquired by PanLex are non-digital, and others are digital without text (made of images of pages). Assimilating those sources requires us to digitize the text of any information that we intend to include in the PanLex database.

Triage

Before you decide that a source document (for example, a PDF file) is a candidate for conversion to digital text, check whether it really is what it appears to be.

Does it look as if it consists merely of page images? Try selecting text from it to verify that there is no text behind the image.

Does it look as if it contains text as well as images? Try copying some of the text and pasting it into a text editor. If the pasted text is unrecognizable, if some of its characters are incorrect, or if its order deviates from the logical flow of the document, the text may have been added by automatic recognition after the document was created. In that case, evaluate what value, if any, that text adds to the images, and whether another attempt at automatic text recognition could do better. If the pasted text looks generally correct, then the document is probably not a set of page images, or at least does not need to be treated as one. Instead, your task is to simplify the format of the text that already exists.
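As a rough aid to the copy-and-paste check, the following Python sketch measures how much of a pasted sample consists of characters that rarely occur in legitimate text (control, private-use, and unassigned characters, plus the Unicode replacement character). The function names and the 2% threshold are assumptions made for illustration, not PanLex conventions. The check does not detect wrong but valid characters or a scrambled reading order, so it supplements reading the sample and does not replace it.

```python
import unicodedata

# Unicode general categories that rarely appear in legitimate dictionary text:
# Cc = control, Cn = unassigned, Co = private use, Cs = surrogate.
SUSPECT_CATEGORIES = {"Cc", "Cn", "Co", "Cs"}
REPLACEMENT_CHAR = "\ufffd"


def suspect_ratio(sample: str) -> float:
    """Return the fraction of non-whitespace characters that look like debris."""
    chars = [c for c in sample if not c.isspace()]
    if not chars:
        return 1.0  # nothing was copied, so there is no usable text layer
    suspect = sum(
        1
        for c in chars
        if c == REPLACEMENT_CHAR or unicodedata.category(c) in SUSPECT_CATEGORIES
    )
    return suspect / len(chars)


def looks_usable(sample: str, max_suspect_ratio: float = 0.02) -> bool:
    """Apply a hypothetical threshold: tolerate at most 2% suspect characters."""
    return suspect_ratio(sample) <= max_suspect_ratio
```

A sample that passes this check may still contain misrecognized letters, so compare it with the page images before you rely on it.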

Human text recognition

The most obvious method for creating digital text from page images is for a person who can read the source to read it and enter the relevant items of information as digital text into a file.

This is not always the most practical method, but often it is. Experience indicates that human text recognition is a good method when sources are small, complex in layout, or made of images showing complex and rarely used scripts.

Automatic text recognition

Under some conditions, the job of converting page images to digital text can be effectively automated. Where the images have high resolution and show characters in the Latin or Cyrillic script with no diacritical marks or only common ones, it is rational to use tools that convert images to text, usually called optical-character-recognition (OCR) programs, because under those conditions they make few errors.

Research on OCR (e.g., Dreuw et al., 2012) has explained some deficiencies in the simplistic and linear models underlying the main commercial and open-source OCR products. Successful automatic interpretation of bi- and multilingual dictionaries also requires that the layouts peculiar to dictionaries be parsed (Kanungo and Mao, 2003; Ma et al., 2003; Karagöl-Ayan, 2007).

In principle, OCR tools can be trained, or customized, to convert text in additional scripts, but, before you try to do that for a source, you should estimate the effort and evaluate the cost and benefit of doing so.

Among the available OCR tools are:

Heliński et al. (2012) and Mabee (2012a) have performed comparative evaluations of Tesseract and ABBYY FineReader. There is also a more extensive report on the issues that arose in Mabee’s use of these OCR applications with multilingual dictionaries.
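If you try Tesseract on a source, a minimal Python sketch for recognizing one page image might look like the following. It assumes that the `tesseract` command-line program and the language data you name (here `eng` and `rus`) are installed. The function names and file names are hypothetical, chosen only for this illustration.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image: Path, output_base: Path, languages: list[str]) -> list[str]:
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG[+LANG...]."""
    if not languages:
        raise ValueError("at least one language code is required")
    return ["tesseract", str(image), str(output_base), "-l", "+".join(languages)]


def recognize_page(image: Path, output_base: Path, languages: list[str]) -> Path:
    """Run Tesseract on one page image and return the path of the text file it writes."""
    subprocess.run(build_tesseract_command(image, output_base, languages), check=True)
    # Tesseract appends ".txt" to the output base for its default plain-text output.
    return Path(str(output_base) + ".txt")


if __name__ == "__main__":
    # Hypothetical file names: one scanned page of an English-Russian dictionary.
    text_file = recognize_page(Path("page-001.png"), Path("page-001"), ["eng", "rus"])
    print(text_file.read_text(encoding="utf-8")[:500])
```

Naming every language that appears on the page lets Tesseract draw on the character sets of each, which matters for bilingual dictionaries. The output still needs to be checked against the page images, and the dictionary layout still has to be parsed separately.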