Tabularization methods

Introduction

After source selection, retrieval, and text standardization, the next step in the typical assimilation workflow is tabularization.

Elsewhere we have documented the principles of tabularization and given an example. Here we document methods that editors commonly use in tabularization.

Interpretation

In some cases it is best to do some or all of the tabularization using the common methods of text and spreadsheet editing, including keyboard entry, copying, and pasting. Small and complex sources are sometimes easier to tabularize this way, in whole or in part, than by designing or adapting a program. There is more detailed documentation on interpretation methods.

Analysis

In other cases our editors mostly or entirely use automated tabularization (as is true also for serialization). Tools for this work include some created by the PanLex team. You may also find it useful to develop new ones of your own, either for one-time use on a particular source (stored with the source’s files) or for repeated operations (stored in the PanLex tools). There is more detailed documentation on existing and custom tabularization scripts.

Delimiters

In simple cases, every column of your output file contains a unitary item in each row. But more complex situations exist in which you can use delimiters to partition an item. Any character can serve as a delimiter, as long as it appears in the file only in that role. But PanLex has defined some default delimiters, and, if you use these, you will have less work to do in the serialization phase.

Multiple translations give rise to a need for delimiters. It would be impractical to insist that every translation of the same expression get its own column, because you can’t know in advance how many columns you might need. So, instead, you can put all the translations into the same column and separate them with delimiters.

As explained in the documentation on tabularization principles, you must distinguish multiple translations by whether they express the same meaning or distinct meanings.

If multiple meanings are in separate entries, you don’t need to do anything special with them except remove any attached serial numbers. If they are in the same entry, you can convert them to separate entries, as in the tabularization example (two rows for “asustar”). But you can instead insert (or leave, if already there) a meaning delimiter before each new-meaning translation. The default PanLex meaning delimiter is ⁋ (the reverse pilcrow symbol, at Unicode code-point U+204B). If this character doesn’t display correctly for you, your system needs a font that includes it.

If multiple translations are synonymous, you must separate them with a synonym delimiter. The PanLex default synonym delimiter is ‣ (the triangular bullet, at Unicode code-point U+2023).
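For illustration, here is a hypothetical tabularized entry (the two columns are separated by a tab) in which an English expression has two synonymous Spanish translations for one meaning and a third translation with a distinct meaning:

bank	banco‣entidad bancaria⁋orilla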

Many sources use commas as synonym delimiters and semicolons as meaning delimiters. But they often use commas for other purposes, too, and in those cases commas are useless as delimiters for PanLex. Two examples:

couch, sofa, divan (in a home, hotel)
feed a stranger, guest or visitor

With PanLex tools you can deal with such behavior if it is consistent, such as by:

  • ignoring commas inside parentheses when replacing commas with the default synonym delimiter (a sketch of this approach follows the list)
  • converting commas to synonym delimiters only if all the items that they separate are single words
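Here is a small Perl sketch of the first approach. The sample item comes from the example above; the assumption that parentheses are never nested is illustrative.

#!/usr/bin/perl
use strict;
use warnings;
binmode STDOUT, ':encoding(UTF-8)';

# Replace commas with the default synonym delimiter (U+2023), skipping
# commas that fall inside parentheses. Assumes parentheses are never nested.
my $item = 'couch, sofa, divan (in a home, hotel)';
$item =~ s/,\s*(?![^()]*\))/\x{2023}/g;
print "$item\n";    # couch‣sofa‣divan (in a home, hotel)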

Output

The output of tabularization is one or more new files. For project coordination, please follow the PanLex file-naming conventions. Typically, the name of the text file of raw data has the form aaa-bbb-Author-0.txt and the tabularization stage output file has the form aaa-bbb-Author-1.txt. However, depending on the condition and the format of the data in the source you are working with, you may find it easier to carry out the tabularization in two (or more) stages, first creating an intermediate file of parsed raw data, and then using this to create the final tabularized output file in the correct format. In this case, the intermediate file will be named aaa-bbb-Author-1.txt and the final tabularization file will be named aaa-bbb-Author-2.txt. (You should document this process in the notes file for the source.)

File types

The strategies for tabularization principally depend upon the content structure of the file. Methods applicable to some common content structures are described here.

PDF

If you are working with a PDF file, it may be practical to convert it to a text file. Once this has been done, you can proceed to tabularize the text file using regular expressions.
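For example, the pdftotext utility from the Poppler suite can do the conversion; its -layout option, which tries to preserve the original physical layout of the page, is often helpful for dictionary columns (the file names here are just the conventional placeholders):

pdftotext -layout aaa-bbb-Author-0.pdf aaa-bbb-Author-0.txt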

Alternatively, if typographic features of the PDF (such as fonts, font sizes, and colors) would let you extract or distinguish information that the raw text alone cannot, you can instead convert your PDF file to XML format and then carry out the tabularization with a stateful XML parser.
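One way to get such an XML file is with the pdftohtml utility from the Poppler suite, whose -xml option emits each text run with its font and position attributes (-i ignores images):

pdftohtml -xml -i aaa-bbb-Author-0.pdf

A minimal stateful sketch over such output might then look like the following. The font ids here are hypothetical; check the fontspec declarations in your own file to see which font marks headwords and which marks translations.

#!/usr/bin/perl
use strict;
use warnings;
binmode STDOUT, ':encoding(UTF-8)';

my ($headword, @translations);
while (<>) {
    if (m{<text[^>]*font="1"[^>]*>(.*?)</text>}) {
        # A new headword closes the previous entry.
        print "$headword\t", join("\x{2023}", @translations), "\n"
            if defined $headword;
        ($headword, @translations) = ($1);
    }
    elsif (m{<text[^>]*font="2"[^>]*>(.*?)</text>}) {
        push @translations, $1;
    }
}
print "$headword\t", join("\x{2023}", @translations), "\n"
    if defined $headword;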

HTML

If you are analyzing an HTML file and it has a simple structure, such as that of a table, one simple method is to copy and paste the content from a browser into a text editor or spreadsheet application. Sometimes this produces a multi-column tabular file ready for further tabularization or serialization.

More complex HTML files can be handled with HTML parsing libraries, which make it easier to extract tabular data while exploiting any informative formatting markup. Among the parsing libraries that PanLex editors have found helpful are the following (a Perl sketch follows the list):

  • Perl: Mojo::DOM (jQuery-like API). To install it, like any other Perl module, you should follow the Perl module installation instructions.
  • Python: BeautifulSoup, or pyquery (jQuery-like API) with the BeautifulSoup parser (soupparser).
  • JavaScript: Node.js and Cheerio (provides an API similar to jQuery). jQuery is a JavaScript library that facilitates common client-side functionality for web sites, such as traversing HTML. Node.js is an implementation of JavaScript that can be run from the command line. Cheerio provides simple parsing and manipulation of HTML via the jQuery API. Together, they permit efficient extraction of data from HTML. To learn how, you can follow the cheerio tabularization tutorial.
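Here is a small Mojo::DOM sketch. It assumes a hypothetical source whose entries are table rows, with the headword in the first cell and the translation in the second; the file name is the conventional placeholder.

#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;
use Mojo::File 'path';
use Mojo::Util 'decode';
binmode STDOUT, ':encoding(UTF-8)';

my $dom = Mojo::DOM->new(decode 'UTF-8', path('aaa-bbb-Author-0.html')->slurp);
for my $row ($dom->find('table tr')->each) {
    # Collect the text content of each cell in the row.
    my @cells = $row->find('td')->map('all_text')->each;
    print "$cells[0]\t$cells[1]\n" if @cells >= 2;
}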

XML

Many HTML parsing libraries support the analysis of XML sources as well. For particular XML formats (for example, RDF XML), there may be a library or module available that can read it.
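In Perl, the XML::Twig module is one option for stateful parsing of large XML sources. A minimal sketch, assuming a hypothetical format in which each entry element holds a headword and one or more trans children:

#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
binmode STDOUT, ':encoding(UTF-8)';

# Handlers fire as each entry finishes parsing, so memory use stays
# low even for large files.
XML::Twig->new(
    twig_handlers => {
        entry => sub {
            my ($twig, $entry) = @_;
            my $headword = $entry->first_child_text('headword');
            my @trans    = map { $_->text } $entry->children('trans');
            print join("\t", $headword, join("\x{2023}", @trans)), "\n";
            $twig->purge;    # free the parsed entry
        },
    },
)->parsefile('aaa-bbb-Author-0.xml');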

Shoebox

SIL International has defined a Shoebox (MDF) standard for plain-text dictionary source files. Many sources conform to this standard. It employs field markers that SIL has documented. Field markers can have source-specific meanings, so you should check the source, too, for any explanations of what the field markers mean.
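For instance, an MDF record typically begins with a \lx (lexeme) field, and \ge fields give English glosses. Here is a minimal Perl sketch that pairs each lexeme with its glosses; real sources often deviate from this pattern, so verify it against your source first.

#!/usr/bin/perl
use strict;
use warnings;
binmode STDOUT, ':encoding(UTF-8)';

my ($lx, @ge);
while (<>) {
    chomp;
    if (/^\\lx\s+(.+)/) {
        # A new \lx field closes the previous record.
        print "$lx\t", join("\x{2023}", @ge), "\n" if defined $lx && @ge;
        ($lx, @ge) = ($1);
    }
    elsif (/^\\ge\s+(.+)/) {
        push @ge, $1;
    }
}
print "$lx\t", join("\x{2023}", @ge), "\n" if defined $lx && @ge;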

Plain text

“Plain text” is really a miscellaneous category. Often it refers to a file that uses tabs or punctuation to define structural elements. Such a file may already be close to a tabularized file, may have a messy and inconsistent format, or may be anywhere in between.

If the source document is already in or can be converted easily to a text format (for instance, a txt file, MS Word or Excel, or Toolbox/MDF), the first question to ask might be whether a tool already exists to help tabularize it. If the file has a common or standard format, such a tool may exist, either in the PanLex tools or elsewhere. Among the external tools that may help with some file standards are:

  • makedict (converts among several standard formats)
  • dictconv (found to be flaky by PanLex editors)
  • PorDiBle (converts between Palm .doc and plain text)
  • Sdict and Sdictionary (decompile Sdict or Sdictionary .dct dictionary databases)
  • DictUnifier (converts to or from StarDict format)

If no tool exists, or if the existing tools don’t do the entire job, you can copy and modify the main-0to1.pl script from the PanLex tools to read the source file line by line and parse it using regular expressions. This script, like many in the tools, is written in Perl.

Start by copying the file to your working directory by typing the following at the command line:

plx cp main-0to1.pl

In a text editor, open the main-0to1.pl file, and change the basename of the source file to match the raw source text file that will be the input to this part of the process.

Parsing with regular expressions works very well for simple cases, and with a little ingenuity can be made to work for more complex cases too. For example, if your source file has expressions of language A in one column and expressions of language B in another, you could use regular expressions to match strings separated by a minimum number of whitespace characters. (Be careful, however, that this minimum is greater than the length of any whitespace run found within a single expression.) Alternatively, you might be able to exploit the source’s consistent use of numerical delimiters, or even its punctuation, in a regular expression match.
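A minimal sketch of the whitespace approach, assuming (hypothetically) that columns are separated by at least three whitespace characters and that no single expression contains such a run internally:

#!/usr/bin/perl
use strict;
use warnings;

while (<>) {
    chomp;
    # Split on the first run of three or more whitespace characters.
    print "$1\t$2\n" if /^(.+?)\s{3,}(.+)$/;
}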

Yet another strategy would be to make use of any word class (part of speech) information in your source. For instance, a frequent format for source data is three columns:

  1. Expression in language A
  2. Part of speech
  3. Expression in language B

In this case, you should be able to generate a list of the (probably few) parts of speech used (e.g. n, v, adj, and adv), and then match each line against a regular expression in which the alternation 'n|v|adj|adv' appears between two other patterns.

Handy tip: first replace any whitespace characters within a given part of speech with an underscore; in other words, turn a complex part of speech such as ‘v personal’ into ‘v_personal’ before you run the regular expression match. If you do not, the regex is likely to match the string v by itself (assuming that v alone is also an attested part of speech in the source), so the string personal will be treated as part of the expression in language B. Also, make sure that you use parentheses in your regular expressions to create capturing groups; this lets you parse each line into the language-A, part-of-speech, and language-B components more easily. More details are available on regular expressions.
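Putting both points together, here is one possible sketch. The part-of-speech inventory is hypothetical; derive the real one from your source. Note that v_personal precedes v in the alternation, so the longer form is matched first.

#!/usr/bin/perl
use strict;
use warnings;

while (my $line = <>) {
    chomp $line;
    # Protect the multi-word part of speech before matching.
    $line =~ s/\bv personal\b/v_personal/g;
    if ($line =~ /^(.+?)\s+(n|v_personal|v|adj|adv)\s+(.+)$/) {
        my ($lang_a, $pos, $lang_b) = ($1, $2, $3);
        $pos =~ tr/_/ /;    # restore the original form
        print "$lang_a\t$pos\t$lang_b\n";
    }
}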

Printed and page-image sources

If your source exists only in hardcopy, or only as a scanned page-image version, you may have a lot of work to do in order to get it into a format from which you can extract text. A set of presentation slides about Printed Source Analysis can introduce you to this task.

Normalization

Normalization can take place both during tabularization and during serialization. During serialization, the normalize script rewrites some deviant expressions and classifies some candidate expressions as definitions. Other kinds of normalization are performed during tabularization.

Normalization tasks during tabularization have been sufficiently idiosyncratic that we do not have a large set of tools to support this task. You can, however, reuse code in the tabularization scripts of already-assimilated sources, if your normalization task resembles one performed with another source. Some sources whose normalization code might be of use include:

  • mul:HeiNER (selective case lowering of plant and animal names)

For the normalization of Traditional and Simplified Chinese the PanLex staff has some resources available for limited use. For more information about these, ask the staff.

If you develop substantial and potentially reusable code to normalize expressions in a source, please notify the PanLex team so that it can be mentioned here.

Some projects have produced lists of lemmas in particular languages. For example, the Jacy project has produced a list of lemmas in Japanese (encoded in EUC-JP rather than UTF-8). If such a list is more extensive than the inventory of expressions in the same language (variety) in PanLex, you can use it for normalization during tabularization.
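A minimal sketch of such a check, assuming a hypothetical lemma list already converted to UTF-8 (one lemma per line) and candidate expressions arriving on standard input:

#!/usr/bin/perl
use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';

# Load the lemma list into a hash for fast lookup.
my %lemma;
open my $fh, '<', 'lemmas-jpn.txt' or die $!;
while (<$fh>) { chomp; $lemma{$_} = 1 }
close $fh;

# Flag candidate expressions that are absent from the lemma list.
while (<>) {
    chomp;
    print "check: $_\n" unless $lemma{$_};
}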

Miscellaneous

The PanLex tools include several tabularization scripts, designed for some of the common formats that have been found in source files.

For some source types, tools developed in other projects may help with tabularization. If you find any to be particularly useful, please call them to the PanLex team’s attention, so we can give them more prominent places here. Tools that might be helpful with tabularization include:

  • Google Ngram Viewer (might help with normalization and language-variety identification)
  • New Updated Guthrie List (can help in interpreting codes identifying Bantu languages)
  • Linguae (intended to parse dictionaries in various formats)
  • Chinese language-identification and word-classification files (available as needed to PanLex editors on request)
  • Moby Words (may help with validation of English expressions)
  • Open AI Resources (NLP and other AI tools and corpora)