Tabularization scripts and tables

Introduction

Automated tabularization makes use of scripts and tables designed to deal with commonly encountered problems.

PanLex tool tabularization script templates

The PanLex tools contain a number of pre-composed Perl scripts available for editors to use in the tabularization process. These are as follows:

  • main/*: Scripts that have been most often used in the analysis of source files. Editors typically start here when there is no reason to use one of the other scripts.
  • html/*: Scripts that have been used in the analysis of HTML source files, including files that are concatenations of multiple HTML pages (typically acquired with the curl or wget utility).
  • xml/*: Scripts that have been used in the analysis of XML files, including Apertium and LL-LIFT source files. Even though there is purportedly a standard for the LL-LIFT format, in fact files labeled as LL-LIFT files differ substantially in format, and four different series of scripts (a, b, c, and d) have been developed to deal with the differences.
  • misc/*: Scripts that have been used in the analysis of miscellaneous types of source files, notably MDF (the format used by SIL’s Toolbox software).
  • arabic/*: Scripts that were used in the analysis of eng-ckb:Mehmûdî, a source that included Kurdish written in the Arabic script.

PanLex tool tables

The PanLex tools contain tables for use in the tabularization process:

  • data/isoconv.txt: This file is a table that maps ISO 639-1 alpha-2 language codes to PanLex language varieties. It can be used in tabularization scripts for the analysis of multilingual sources where these alpha-2 codes identify the languages of expressions. The mapping in this table is not guaranteed to be correct for every source. If you discover a discrepancy for a particular source, you can copy the table into your source directory and amend the table. If you believe that the table contains a systematic error, please report it to the PanLex team.
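A script can load such a table into a hash for lookup during analysis. The sketch below assumes the table is tab-delimited with one alpha-2 code and one language-variety code per line; check the actual isoconv.txt before relying on this, and the rows shown here are invented for illustration.

```perl
use strict;
use warnings;

# Build a code-to-variety lookup from table rows.
# ASSUMPTION: each row is "<alpha-2 code>\t<variety code>".
sub LoadCodeMap {
    my ($lines) = @_;
    my %map;
    for my $line (@$lines) {
        chomp $line;
        next if $line eq '';
        my ($alpha2, $var) = split /\t/, $line, -1;
        $map{$alpha2} = $var;
    }
    return \%map;
}

# demonstration with hypothetical rows
my $map = LoadCodeMap([ "en\teng-000\n", "de\tdeu-000\n" ]);
print "$map->{en}\n"; # eng-000
```

A copy of the table amended for one source can be loaded the same way from the source directory.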

PanLex::Util subroutines

The PanLex tools include the PanLex::Util module, which contains subroutines for some common tabularization tasks:

  • Trim: Deletes superfluous spaces from a string. It takes one argument: the string. This subroutine assumes that the string may contain spaces, tabs, standard PanLex synonym delimiters (‣), and standard PanLex meaning delimiters (⁋). It collapses all multiple spaces to a single space each, deletes any leading and trailing spaces, and deletes any space immediately before or after any tab, synonym delimiter, or meaning delimiter.
  • NormTrim: Deletes superfluous spaces from a string, like Trim, and also normalizes the characters in the string in accord with the PanLex standard (what the Nml subroutine in plxu.cgi does). It takes one argument: the string. This subroutine prevents errors that may arise from the existence of multiple expressions in the same language variety for the same meaning. If the expressions differ in ways that disappear with character normalization, then NormTrim will make them identical, and this duplication can be eliminated during tabularization or serialization.
  • Delimiter: Replaces delimiters not found inside parentheses with a standard delimiter. Takes three arguments: the input string, a string containing a set of delimiters to match, and the standard delimiter. Any whitespace surrounding matched delimiters will be removed. For example, Delimiter('one / two, three', ',/', '‣') returns the string one‣two‣three.
  • DelimiterIf: Replaces delimiters with a standard delimiter in the same way as Delimiter, but only if the delimited expressions all meet a particular condition. Takes four arguments: the same three arguments as Delimiter, plus a subroutine reference. The subroutine takes the delimited expression as its argument and returns a true value if the condition is met, false otherwise. For example, DelimiterIf('table, chart', ',', '‣', sub { $_[0] =~ '^[^ ]+$' }) returns the string table‣chart, whereas DelimiterIf('table, in a book', ',', '‣', sub { $_[0] =~ '^[^ ]+$' }) returns the input string.
  • ExpandParens: Expands an expression containing an optional parenthesized portion or portions into two or more expressions separated by the standard synonym delimiter. Takes one argument: a string containing an expression with zero or more optional parenthesized portions. For example, ExpandParens('(s)he') returns she‣he and ExpandParens('(in)flammable(s)') returns inflammables‣flammables‣inflammable‣flammable.
  • EachEx: Applies a function to each expression in a list of expressions delimited by the standard PanLex synonym and meaning delimiters. Takes two arguments: the delimited string and a reference to a subroutine that will transform each expression and return the result. For example, EachEx('One‣Two⁋Three', sub { lcfirst $_[0] }) will apply lcfirst to each expression, returning one‣two⁋three.
  • Dedup: Deletes duplicates from a delimited string. It takes two arguments: the string and the delimiter. For example, Dedup('abc:def:abc:defg:ab', ':') returns the string defg:ab:abc:def. The elements of the de-duplicated string are returned in an unpredictable order.
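The documented behavior of these subroutines can be illustrated with a self-contained sketch. The following simplified reimplementations of Trim and Dedup only approximate that behavior (the real PanLex::Util code may differ in detail, and unlike the documented Dedup, this sketch preserves first-seen order):

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(utf8)';

# Simplified stand-in for Trim: collapse space runs, strip leading and
# trailing spaces, and delete spaces adjacent to tabs and delimiters.
sub MyTrim {
    my ($str) = @_;
    $str =~ s/ {2,}/ /g;          # collapse runs of spaces
    $str =~ s/^ | $//g;           # delete leading and trailing spaces
    $str =~ s/ ?([\t‣⁋]) ?/$1/g;  # delete spaces next to tabs and delimiters
    return $str;
}

# Simplified stand-in for Dedup: remove duplicate elements.
sub MyDedup {
    my ($str, $delim) = @_;
    my %seen;
    return join $delim, grep { !$seen{$_}++ } split /\Q$delim\E/, $str, -1;
}

print MyTrim('  one ‣ two  three  '), "\n"; # one‣two three
print MyDedup('abc:def:abc:defg:ab', ':'), "\n"; # abc:def:defg:ab
```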

The above subroutines may be imported with the following code:

use lib "$ENV{PANLEX_TOOLDIR}/lib";
use PanLex::Util;

Custom scripts

In addition to using the PanLex tools and other available special-purpose tools, you can write your own tabularization scripts, analogous to those in the tabularization directory. We offer some tips for doing this in Perl, but you can use any programming language you prefer.

We suggest that you use the following code at the beginning of every Perl script you write:

use strict;
use warnings;
use utf8; # interpret literal strings as UTF-8
use open IN => ':crlf :encoding(utf8)', OUT => ':raw :encoding(utf8)'; # set default 'open' layer to UTF-8 and LF newlines
binmode STDOUT, ':encoding(utf8)'; # print output as UTF-8
binmode STDERR, ':encoding(utf8)'; # print errors as UTF-8

The idiom for opening a file and reading it line by line is as follows:

open my $in, '<', 'test.txt' or die $!;

# loop through the file's contents line by line
while (<$in>) {
   chomp; # strip the final newline character

   # put the rest of your code here
}

close $in;

The idiom for reading in a file all at once into a scalar (string) is as follows:

my $txt = do { local $/; <$in> }; # undefine the input record separator so <$in> reads the whole file

The proper way to read in the columns of a tab-delimited line is as follows:

# assumes filehandles $in and $out have already been opened
while (<$in>) {
    chomp;
    my @col = split /\t/, $_, -1; # the -1 limit preserves trailing empty fields, which a plain split /\t/ would silently drop

    # example of doing a regex substitution on every column
    s/foo/bar/g for @col;

    # output the columns in tab-delimited format
    print $out join("\t", @col), "\n";
}
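The reason for the -1 limit can be seen in a short demonstration: without it, split silently discards trailing empty fields, so a row whose final columns are empty would yield too few columns.

```perl
use strict;
use warnings;

my $line = "a\tb\t\t"; # a row whose last two columns are empty

my @bad  = split /\t/, $line;     # trailing empty fields are dropped
my @good = split /\t/, $line, -1; # all fields are preserved

print scalar(@bad), "\n";  # 2
print scalar(@good), "\n"; # 4
```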

Retabularize script

This tool may be useful when you wish to make systematic (i.e. not isolated) improvements to a source that has already been imported into PanLex. You can modify and resubmit the source files in the source archive to do this, but that method is risky: it destroys any changes that editors have made since the previous importation.

To preserve any already-made changes while systematically improving a source, you can retrieve a semitabular source file from PanLex with PanLem. Just choose “improve” (rather than “customary”) in the “file—get” dialog. The semitabular format is designed to facilitate editorial or automated improvements.

It is sometimes useful to modify or create tabularization scripts, or modify the serialization parameters, in order to improve a source. You can’t use those scripts on a semitabular file, but you can use them on a tabular file. So, by using retabularize.pl, you can recreate a tabular file to which to apply such scripts, producing a new final source file. Thus, in general, using retabularize.pl and then the usual tabularization and serialization scripts allows more complex improvements than doing search-and-replace modifications followed by the reserialize.pl script.

The script requires two arguments: the file’s basename and the current version. Typically, your file to be retabularized is the one produced by PanLem, with a name such as aaa-bbb-Author-0.txt. Then you can run the script with the command retabularize.pl aaa-bbb-Author 0. It will output the tabular file as aaa-bbb-Author-2.txt.

A warning is necessary before you use this approach. The retabularize.pl script produces a tabular file, but it may not contain a set of columns identical to those that the existing scripts were written to work on. The column orders may be different. All expressions that were formerly within the same entry but had distinct meanings and were operated on with mnsplit will be in distinct entries (so, unless you add expressions to an entry with distinct meanings, mnsplit will not be needed). In addition, no wc or md column is ever produced. Instead, word classifications and metadata are always appended to the expressions that they belong to. The reason for this last rule is that otherwise an indefinitely large set of columns might be required to accommodate synonymous expressions, their word classifications, and their metadata.

Work in progress

English infinitive normalization

Normalization of English verb expressions could be facilitated by a table of determiners and other non-verb words that can follow an initial “to”. Such a table would permit automatic normalization of English expressions beginning with “to”: where the “to” is an infinitive marker, the marker would be removed and a denotation classification identifying the expression as a verb would be added.

A table of 1,005 words following initial “to” in existing English expressions in PanLex (as of late August 02015) shows that most should be retroactively converted as stated above. If it were annotated to identify those that should not be converted, the above-described table could be derived and a normalization routine using it could be created.
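Once annotated, such a table could drive a routine along the following lines. This is a hypothetical sketch: the stop list of non-verb words does not yet exist, and the few entries below (and the subroutine name) are invented for illustration.

```perl
use strict;
use warnings;

# Invented sample of the (not yet existing) table of non-verb words
# that can follow an initial "to".
my %non_verb_after_to = map { $_ => 1 } qw(the a an that this school);

# Returns the normalized expression and a flag indicating whether a
# verb denotation classification should be added.
sub NormalizeInfinitive {
    my ($ex) = @_;
    if ($ex =~ /^to ([^ ]+)(.*)$/ && !$non_verb_after_to{$1}) {
        return ("$1$2", 1); # strip the infinitive marker; classify as verb
    }
    return ($ex, 0); # leave other "to" phrases untouched
}

my ($norm, $is_verb) = NormalizeInfinitive('to run');
print "$norm $is_verb\n"; # run 1

($norm, $is_verb) = NormalizeInfinitive('to school');
print "$norm $is_verb\n"; # to school 0
```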
