Simplifying complex text formats

Introduction

During assimilation we usually simplify the text that sources contain. Most of the simplification work takes place in the tabularization stage, where text is reduced to a matrix of rows and columns. But some text formats are complex enough to make us think of them as not ready for tabularization. Instead, we think of them as requiring text standardization before we begin tabularization.

Triage

Before you decide to simplify the format of the text in a source, verify that the source indeed contains text.

Source documents may look alike whether or not they contain text. A PDF file, for example, may contain only page images, or it may contain text as well as a superimposed image made from the text, without looking any different.

By attempting to copy some text from a source document into another document, you can determine whether any text exists and, if so, whether it has been added afterwards and/or contains errors. This investigation helps determine what standardization approach to use.

If it turns out that the document contains no usable text, your problem is not simplification of the text format, but conversion of images to text.
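The manual copy-and-paste check can be approximated in a script. The following is a minimal sketch, assuming that the pdftotext tool described later in this document is installed; the function names and the 20-character threshold are hypothetical choices, not part of any tool. It only tells you whether any text exists. It cannot tell you whether that text was added afterwards or contains errors, so a visual comparison is still needed.

```python
import subprocess


def has_usable_text(extracted: str, min_chars: int = 20) -> bool:
    """Return True if the string holds at least min_chars non-whitespace characters.

    The threshold is an arbitrary assumption; an image-only PDF typically
    yields nothing but whitespace and page-break characters.
    """
    visible = [ch for ch in extracted if not ch.isspace()]
    return len(visible) >= min_chars


def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return what it extracts.

    Passing "-" as the output file name makes pdftotext write to standard output.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

A call such as has_usable_text(extract_text("source.pdf")), where source.pdf is a placeholder name, would then give a first indication of whether the file is a candidate for text standardization or for image-to-text conversion.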

PDF

The most important complex text format that we need to simplify is the Portable Document Format (PDF).

This format is too complex to make it practical, in most cases, for analysts to manipulate the text fragments that appear in a PDF file. Instead, we usually use tools that convert PDF files to text files in more tractable formats, and then we tabularize those simpler files. The best destination format will depend on your source. The more format-based information you need to preserve, the more complex you need the destination format to be.

PDF files are sometimes constructed in a way that makes the output from conversion tools seriously defective. Layouts, characters, and character orders may be corrupted. It may be necessary to try several tools, and within each tool to try various options, in order to minimize the defects. If no PDF conversion tool can remedy the defects, you may need to use additional methods to deal with them after converting a PDF file, or you may conclude that the file is more economically consulted with the interpretation strategy rather than analysis.

Tools

PDF viewers

PDF viewers (such as Preview in OS X) generally permit you to extract text from PDF files. To do this, you usually select the text you want, copy it, and paste it into a text-editor window.

PDF editors

PDF editors (such as Adobe Acrobat) permit you to save PDF files in text formats simpler than PDF, including plain text, rich text, and HTML.

TET Plugin

TET Plugin is a third-party plugin for Adobe Acrobat that aims to make that program more effective and more customizable in its extraction of text from PDF files.

pdftotext

pdftotext is a tool included with the Xpdf library and also with the poppler library built on Xpdf.

The Xpdf library is available as source code and as binaries for several operating systems at the Xpdf website.

The poppler library can be installed as follows:

OS X: Install the homebrew package manager, if it is not already installed, by executing the command ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)". Then, execute the command brew install poppler to install poppler. If the installation fails because XQuartz is not present, follow the instructions to install it and then repeat the above command.
Windows under Cygwin: Run Cygwin’s setup.exe and install the poppler package, under the “Text” category.

To use pdftotext, open a command line and change to the directory that contains the PDF file you are interested in. The basic pdftotext command is pdftotext -enc UTF-8 -layout <filename>. You should replace <filename> with the name of your PDF file. An output file with the extension .txt instead of .pdf will be generated. Warning: If you already have a file with this name, it will be overwritten.

The -enc UTF-8 option specifies that the text output should be UTF-8. You should always include this option if it is not the default.

The -layout option indicates that the text output should follow the PDF’s visual layout as closely as possible. Otherwise, pdftotext will try to guess paragraph boundaries, which usually does harm with PanLex source files.

If your PDF file contains columns, you will probably find that they are preserved in the text output, rather than being unwrapped into a single column. You can often get single-column output by using the -raw option instead of -layout, which causes the order of the text output to correspond to the raw stream order of the PDF file. (PDF files can specify the elements on a page in any order. They usually specify the first column’s text before the second column’s text.)
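If you convert many files, the options above can be wrapped in a small script. The following is a minimal sketch, assuming that pdftotext is on your path; the function names are hypothetical. It uses only the -enc UTF-8, -layout, and -raw options described above, and it refuses to run if the output file already exists, as a guard against the overwriting mentioned in the warning.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, raw=False):
    """Return the pdftotext argument list for one PDF file.

    raw=False requests -layout (follow the visual layout);
    raw=True requests -raw (follow the raw stream order).
    """
    mode = "-raw" if raw else "-layout"
    return ["pdftotext", "-enc", "UTF-8", mode, str(pdf_path)]


def convert(pdf_path, raw=False):
    """Convert one PDF file, unless the .txt file it would produce already exists."""
    pdf = Path(pdf_path)
    target = pdf.with_suffix(".txt")
    if target.exists():
        raise FileExistsError(f"{target} already exists and would be overwritten")
    subprocess.run(build_pdftotext_command(pdf, raw=raw), check=True)
    return target
```

Trying convert first with the default and then with raw=True (after moving the first output aside) is one way to compare the two orderings for a source that contains columns.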

pdf2txt.py

pdf2txt.py is a tool included with the PDFMiner Python module. Denis Papathanasiou has published an extensive example of the use of PDFMiner.

Installation of pdf2txt.py is as follows:

OS X: Python should already be installed. Check if pip, a Python package installer, is already installed by running pip from the command line. If not, run the command sudo easy_install pip to install it. Finally, run sudo pip install pdfminer to install PDFMiner.
Windows under Cygwin: Run Cygwin’s setup.exe and install the python and python-setuptools packages, under the “Python” category. Next, run the command easy_install pip to install pip, a Python package installer. Finally, run pip install pdfminer to install PDFMiner.

Usage

The command to run pdf2txt.py is pdf2txt.py -c UTF-8 -o <basename>.txt <basename>.pdf. If you want the output file to have a different name from the PDF file, you can make the two basenames different.

By default, pdf2txt.py tries to replicate the visual layout of the PDF file. Usually, this is helpful. You can disable this layout mode by including the -n option.

If you want pdf2txt.py to process fewer than all the pages, you can use the -p pgnum,pgnum,…,pgnum option, with the page numbers separated by commas and no spaces. For a large range of contiguous pages, you can embed a command that generates the list, such as:

pdf2txt.py -c UTF-8 -o <basename>.txt -p `perl -e 'print join(",", 10 .. 85)'` <basename>.pdf
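The page list can also be generated in Python rather than Perl. The following is a minimal sketch with hypothetical function names; it only builds the argument list, using the -c, -o, and -p options described in this section, and leaves it to you to run the result.

```python
def page_list(first, last):
    """Return the page numbers first..last (inclusive) joined by commas."""
    return ",".join(str(n) for n in range(first, last + 1))


def build_pdf2txt_command(basename, first=None, last=None):
    """Return the pdf2txt.py argument list for <basename>.pdf.

    The output file name directly follows -o. The -p option is added
    only when both ends of a page range are given.
    """
    command = ["pdf2txt.py", "-c", "UTF-8", "-o", f"{basename}.txt"]
    if first is not None and last is not None:
        command += ["-p", page_list(first, last)]
    command.append(f"{basename}.pdf")
    return command
```

The list returned by build_pdf2txt_command("source", 10, 85), where source is a placeholder basename, can be passed to subprocess.run to perform the conversion.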

Plain text output

One type of output from pdf2txt.py is plain text. For this kind of output, you can open a command line window, change to the directory that contains the PDF file, and execute the command pdf2txt.py -c UTF-8 -o <basename>.txt <basename>.pdf.

You should replace <basename> with the name of your PDF file preceding the extension. The -o option specifies the name of the output file; you may choose a name other than <basename>.txt.

The -c UTF-8 option specifies that the text output should be in UTF-8. You should always include this option.

XML output

It is also possible to make pdf2txt.py output a file in XML format by including the option -t xml in your command. This format contains richer information than the plain-text format. It lets you use typographical information (such as typefaces, font sizes, and colors) that the plain-text format would lose. In the XML file, every character in the original text is on a distinct line, and it is accompanied by information about its font, size, color, and position on the page, as in this example:

XML output can be useful for sources with content like this, where bold-face type, italic type, and indented lines make a difference:
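Once an XML file exists, its per-character information can be regrouped by script. The following is a minimal sketch, assuming that each character appears in an element named text with a font attribute; these names and the sample data are assumptions made for illustration, so check them against the XML that your version of pdf2txt.py actually produces. The function merges consecutive characters that share a font, which is one way to separate, say, bold headwords from italic labels.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, not taken from a real source file.
SAMPLE = """<pages>
<page id="1">
<textline>
<text font="ABCDEF+Times-Bold" size="10.0">c</text>
<text font="ABCDEF+Times-Bold" size="10.0">a</text>
<text font="ABCDEF+Times-Bold" size="10.0">t</text>
<text> </text>
<text font="GHIJKL+Times-Italic" size="10.0">n</text>
<text font="GHIJKL+Times-Italic" size="10.0">.</text>
</textline>
</page>
</pages>"""


def runs_by_font(xml_text):
    """Group consecutive characters that share a font into (font, text) runs."""
    runs = []
    for element in ET.fromstring(xml_text).iter("text"):
        font = element.get("font")  # None when the attribute is absent
        char = element.text or ""
        if runs and runs[-1][0] == font:
            # Same font as the previous character: extend the current run.
            runs[-1] = (font, runs[-1][1] + char)
        else:
            runs.append((font, char))
    return runs
```

For the sample above, the function yields one run for the bold word, one for the space that carries no font, and one for the italic abbreviation. A real source file would normally be read from disk and passed to the same function.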