Tabularizing a PDF by converting it to XML

If your PDF contains formatting information (styles, colors, font differences, etc.) that provides crucial cues to its structure, one tabularization method is to convert the PDF to XML and use a stateful parser to process the XML.

Consider the following typical entries from a Spanish–Zapotec dictionary:

[Figure: sample dictionary entries (image ‘revivir’)]

The general principle is that the script reads the generated XML file one line at a time and, based on the font, font size, and other attributes of the character on that line, keeps track of the state that the system is currently in (i.e. the type of information being read: in this example, a Spanish word, a Zapotec word, a part of speech, etc.). The script also keeps an accumulated string of characters for each state. With every subsequent line of the XML file, the character on that line is either appended to the accumulated string for the current state (if the state has not changed) or added to the string for the new state.
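The read-classify-accumulate loop can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes the XML holds one `<text>` element per line with `font` and `size` attributes (similar to the per-character output of tools such as pdfminer), and the font names `Times-Bold` and `Times-Italic` are hypothetical stand-ins for whatever fonts your source uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical font-to-state mapping; take the real font names from your XML.
FONT_TO_STATE = {
    "Times-Bold": "spanish",        # assumed font for Spanish headwords
    "Times-Italic": "pos",          # assumed font for parts of speech
    "LongZapSILDoulosL": "zapotec",
}


def parse_char_line(line):
    """Return (char, font, size) for a line holding one <text> element, else None."""
    line = line.strip()
    if not line.startswith("<text "):
        return None
    try:
        elem = ET.fromstring(line)
    except ET.ParseError:
        return None  # e.g. an element that is split across several lines
    return elem.text or "", elem.get("font", ""), float(elem.get("size", "0"))


def accumulate(lines):
    """Append each character to the string of the state its font maps to."""
    buffers = {state: "" for state in FONT_TO_STATE.values()}
    state = None
    for line in lines:
        parsed = parse_char_line(line)
        if parsed is None:
            continue
        char, font, _size = parsed
        # An unknown font leaves the system in its current state.
        state = FONT_TO_STATE.get(font, state)
        if state is not None:
            buffers[state] += char
    return buffers


sample = [
    '<text font="Times-Bold" size="12.0">r</text>',
    '<text font="Times-Bold" size="12.0">e</text>',
    '<text font="Times-Italic" size="12.0">v</text>',
    '<text font="Times-Italic" size="12.0">t</text>',
    '<text font="LongZapSILDoulosL" size="12.902">c</text>',
    '<text font="LongZapSILDoulosL" size="12.902">h</text>',
]
print(accumulate(sample))
```

In a real script the lines would come from the XML file itself, for example by iterating over an open file object instead of the `sample` list.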

In any given source there are likely to be only a limited number of possible transitions between the information states, and the XML tabularization script should be written to reflect them. For example, in the bilingual source above, Spanish words can only be followed by either a part of speech or a numerical meaning delimiter, whereas Zapotec words can only be followed by a numerical meaning delimiter, parenthetical/definitional information, or the next Spanish word.
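One way to encode the possible transitions is a lookup table that the script consults whenever the state changes. In this sketch only the `spanish` and `zapotec` rows come from the description above; the remaining rows, the state names, and the function `check_transitions` are assumptions made for illustration.

```python
# Allowed transitions: state -> set of states that may follow it.
ALLOWED = {
    "spanish": {"pos", "number"},
    "zapotec": {"number", "paren", "spanish"},
    "pos": {"number", "zapotec"},     # assumption, not stated in the text
    "number": {"zapotec"},            # assumption, not stated in the text
    "paren": {"number", "spanish"},   # assumption, not stated in the text
}


def check_transitions(states):
    """Return (index, previous, new) for every disallowed change of state."""
    problems = []
    prev = None
    for i, state in enumerate(states):
        changed = prev is not None and state != prev
        if changed and state not in ALLOWED.get(prev, set()):
            problems.append((i, prev, state))
        prev = state
    return problems


# A Spanish word followed directly by a Zapotec word is flagged.
print(check_transitions(["spanish", "spanish", "zapotec"]))
```

Flagging unexpected transitions rather than silently accepting them makes it easier to spot places where the font cues in the source are inconsistent.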

The script should include an instruction that, upon a particular transition between states, writes the accumulated strings (delimited by tabs) to the tabularization output file and resets all accumulated strings to empty. In the example above, for instance, the script should write the accumulated strings to the output file on a transition from either the Zapotec or the Parenthetical state to the Spanish state. In other words, on the transition from the final character ‘a’ of the Zapotec word ‘checa’a’ to the first character ‘r’ of the Spanish word ‘revolcar’, the line ‘revocar – vt – checa’a’ is written to the tabularization output file, and all strings are reset before the ‘r’ of ‘revolcar’ is added as the first character of the new Spanish state string.
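The write-and-reset step might look like the sketch below. It assumes the characters have already been classified into (state, character) pairs; the function name `tabularize`, the field order, and the final flush at end of input are illustrative choices rather than part of the method described here.

```python
FIELDS = ["spanish", "pos", "zapotec", "paren"]


def tabularize(chars):
    """Turn (state, character) pairs into tab-delimited rows.

    A row is written, and every buffer reset, when the state changes from
    zapotec or paren to spanish.
    """
    buffers = dict.fromkeys(FIELDS, "")
    rows = []
    prev = None
    for state, ch in chars:
        if state == "spanish" and prev in ("zapotec", "paren"):
            rows.append("\t".join(buffers[f].strip() for f in FIELDS))
            buffers = dict.fromkeys(FIELDS, "")
        buffers[state] += ch
        prev = state
    if any(buffers.values()):
        # Flush whatever is left when the input ends.
        rows.append("\t".join(buffers[f].strip() for f in FIELDS))
    return rows


def tag(state, text):
    """Helper for building sample input: pair every character with a state."""
    return [(state, ch) for ch in text]


chars = (tag("spanish", "revocar") + tag("pos", "vt")
         + tag("zapotec", "checa'a") + tag("spanish", "revolcar"))
for row in tabularize(chars):
    print(row)
```

In a full script each row would be written to the output file instead of being collected in a list.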

Depending on the source, the instruction to write the accumulated strings to the output file may need to be more complex. In the example above, the script should also output the accumulated strings on a transition from the Zapotec state to the Numerical Delimiter state, so that the line ‘revolcar – vt – chchix chtole’ is correctly written to the output file. In this case, however, the accumulated Spanish state string should not be reset, since ‘revolcar’ must be repeated as the Spanish expression on the subsequent line of the tabularization file.
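Partial resets can be expressed by mapping each output-triggering transition to the set of buffers it clears. The sketch below is one possible arrangement under stated assumptions: keeping the part of speech across senses, the rule for the `paren` to `number` transition, and the placeholder string `ZZZ` (standing in for a second-sense translation not shown here) are all illustrative.

```python
FIELDS = ["spanish", "pos", "zapotec", "paren"]

# (previous state, new state) -> buffers to clear after writing a row.
FLUSH_RULES = {
    ("zapotec", "spanish"): set(FIELDS),
    ("paren", "spanish"): set(FIELDS),
    # A new numbered sense keeps the Spanish word (and, by assumption, the POS).
    ("zapotec", "number"): {"zapotec", "paren"},
    ("paren", "number"): {"zapotec", "paren"},   # assumption
}


def tabularize(chars):
    """Turn (state, character) pairs into rows, resetting buffers per rule."""
    buffers = dict.fromkeys(FIELDS, "")
    rows = []
    prev = None
    for state, ch in chars:
        to_clear = FLUSH_RULES.get((prev, state))
        if to_clear is not None:
            rows.append("\t".join(buffers[f].strip() for f in FIELDS))
            for field in to_clear:
                buffers[field] = ""
        if state in buffers:   # delimiter characters are not stored
            buffers[state] += ch
        prev = state
    if buffers["zapotec"] or buffers["paren"]:
        rows.append("\t".join(buffers[f].strip() for f in FIELDS))
    return rows


def tag(state, text):
    """Helper for building sample input: pair every character with a state."""
    return [(state, ch) for ch in text]


chars = (tag("spanish", "revolcar") + tag("pos", "vt") + tag("number", "1.")
         + tag("zapotec", "chchix chtole") + tag("number", "2.")
         + tag("zapotec", "ZZZ"))   # ZZZ is a placeholder, not real data
for row in tabularize(chars):
    print(row)
```

Keeping the rules in a table means that adding or adjusting a transition for a new source does not require touching the loop itself.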

Identifying the allowable transitions between states, and determining which transitions trigger writing a line to the output file and resetting all or some of the accumulated strings, will be a challenge for the source editor.

Common Problems

A potential issue is that there may not be a unique font or font size corresponding to each state. We saw this in the example above, where both Zapotec words and numerical meaning delimiters were in ‘LongZapSILDoulosL’ with a font size of 12.902. The workaround is to build an additional criterion or two into the conditional statement in the script. The criterion may depend on the value of the character currently being read (e.g. whether or not it is a numeral), or on the previous state of the system (e.g. if the character currently being read is a period and the previous state was Numerical Delimiter, then the current state is also Numerical Delimiter).
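A conditional with these extra criteria could be sketched as follows. The font name and size for Zapotec come from the example; the other font names, the state names, and the function `classify` are hypothetical.

```python
# Hypothetical fonts for the states that do have a unique typeface.
FONT_TO_STATE = {
    "Times-Bold": "spanish",
    "Times-Italic": "pos",
}

SHARED_FONT = "LongZapSILDoulosL"   # used by Zapotec words and by delimiters
SHARED_SIZE = 12.902


def classify(font, size, char, prev_state):
    """Pick a state from the font, the character itself, and the previous state."""
    if font == SHARED_FONT and abs(size - SHARED_SIZE) < 0.01:
        if char.isdigit():
            return "number"
        if char == "." and prev_state == "number":
            return "number"   # the period that closes a delimiter such as "1."
        return "zapotec"
    # Fonts not listed leave the state unchanged.
    return FONT_TO_STATE.get(font, prev_state)


print(classify(SHARED_FONT, 12.902, "1", "pos"))
print(classify(SHARED_FONT, 12.902, ".", "number"))
print(classify(SHARED_FONT, 12.902, "c", "number"))
```

Comparing the size with a small tolerance rather than with `==` guards against rounding differences in how the XML reports font sizes.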

Another issue that the editor may encounter is that dictionary headings, page numbers, etc., are in the same typeface as the real data that needs to be extracted from the source, as in this example:

[Figure: page headings example (image ‘headings’)]

The fix is to incorporate information from the XML file about the position of the main body of data on the page. Search the XML file for characters that belong to the headings and page numbers, and note the minimum and maximum y-coordinates of their bounding boxes. Then, to extract only the real data from the source, add a condition to the script that ignores lines of the XML file whose characters have bounding boxes in the heading region at the top of the page (or in the footer region, if you are dealing with page numbers at the bottom of the page). Whether that means y-values above or below the noted thresholds depends on whether the XML measures y from the bottom or the top of the page.
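A position filter of this kind might be sketched as below. It assumes each character carries a `bbox` attribute of the form `x0,y0,x1,y1`; the limits `BODY_Y_MIN` and `BODY_Y_MAX` are hypothetical values that you would replace with the ones noted from your own XML file.

```python
# Hypothetical vertical limits of the main body, noted from the XML file.
BODY_Y_MIN = 60.0    # anything lower on the page is a footer or page number
BODY_Y_MAX = 740.0   # anything higher on the page is a running heading


def in_body(bbox, y_min=BODY_Y_MIN, y_max=BODY_Y_MAX):
    """Return True if the bounding box lies inside the vertical body range."""
    _x0, y0, _x1, y1 = (float(value) for value in bbox.split(","))
    low, high = min(y0, y1), max(y0, y1)
    return low >= y_min and high <= y_max


print(in_body("72.0,700.0,78.0,712.0"))   # a character in the main body
print(in_body("72.0,760.0,78.0,772.0"))   # a character in a heading
```

Because the test is a range check, it works the same way whether the y-axis starts at the bottom or the top of the page, as long as the two limits are taken from the same XML file.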

Another common source of problems is idiosyncratic typesetting errors in the PDF source document. For example, some parentheses may be in regular font while others are in italics, or the first character of a word that should be in bold may inadvertently be in regular font. Such typographical errors are likely to cause the XML parser to transition between states incorrectly, and so produce errors in what is written to the tabularization output file. It may be possible to correct some of the consistent errors automatically, for example by treating all parentheses as belonging to the same state, regardless of the typeface used. However, there will likely remain manual fixes to be made in a second stage of the tabularization process.
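The parenthesis correction can be a small override applied after the font-based classification. The function name and the state name `paren` in this sketch are assumptions for illustration.

```python
def apply_overrides(char, font_state):
    """Force parentheses into one state, whatever their typeface suggests."""
    if char in ("(", ")"):
        return "paren"
    return font_state


# An italic "(" that the font alone would classify as a part of speech:
print(apply_overrides("(", "pos"))
print(apply_overrides("v", "pos"))
```

Other consistent typesetting slips in a given source can be handled by adding further character-based rules to the same function.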

A final warning: watch the font and font size attributes that the XML file gives for whitespace characters. Erroneous typeface information on whitespace characters can seriously disrupt the state transitions during XML parsing. The best practice is generally to have the script ignore all whitespace characters unless they are in the same typeface as the preceding character. In other words, be sure not to ignore bona fide whitespace characters that appear between words in a multi-word expression in the same language, such as ‘yej ros’ as the Zapotec translation of the Spanish expression ‘rosa’ above.
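The whitespace rule can be sketched as a filter over (character, font) pairs. The function name and the font name `Times-Roman` are hypothetical, and "preceding character" is interpreted here as the last character that was not ignored.

```python
def filter_whitespace(chars):
    """Drop whitespace unless it shares a font with the preceding kept character."""
    kept = []
    prev_font = None
    for ch, font in chars:
        if ch.isspace() and font != prev_font:
            continue   # stray whitespace in a different typeface is ignored
        kept.append((ch, font))
        prev_font = font
    return kept


ZAP = "LongZapSILDoulosL"
chars = [(c, ZAP) for c in "yej ros"] + [(" ", "Times-Roman")]
print("".join(ch for ch, _font in filter_whitespace(chars)))
```

The space inside ‘yej ros’ survives because it shares the Zapotec font with the ‘j’ before it, while the trailing space in a different font is dropped.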

Further information

There is a presentation on how tabularization using XML parsing can be implemented in Python.