Source analysis tutorial

Introduction

This tutorial will help you get acquainted with the PanLex tools. It leads you through the process of source analysis: automatically extracting and normalizing data from a dictionary or other resource and producing a specially formatted file that can be imported into the database.

Prerequisites

This tutorial presumes that you have arrived here after reading the documentation on PanLex development and assimilation, in particular assimilation principles and assimilation tools. You will have seen links to PanLem, one of the tools used in assimilation; we expect that you have read about it as well and studied its operation via its tutorial.

In order to run the tutorial on this page, you will need to have installed and configured the PanLex tools and their prerequisites, according to these instructions.

Steps

1. Download the sample resource document

Download tgl-deu-Grassau.zip and unzip it. Be sure to unzip it into its own directory, which should be named tgl-deu-Grassau. The zip file contains a sample resource document, tgl-deu-Grassau.html, containing translations from Tagalog into German. It is good practice to copy the entire folder and its contents to a location on your computer where you will work with this and other PanLex resources. Once you have done this, open a command line window and change the working directory to the tgl-deu-Grassau directory.
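The exact commands vary by operating system. On macOS or Linux, assuming the downloaded zip file is in your current directory and does not already contain a top-level folder, the steps might look like this (if the archive does contain a tgl-deu-Grassau folder, a plain unzip is enough):

unzip tgl-deu-Grassau.zip -d tgl-deu-Grassau
cd tgl-deu-Grassau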

Open the tgl-deu-Grassau.html file in a web browser, and familiarize yourself with the information it contains.

2. Tabularization

You will see in the browser that the file says “Langenscheidt Vokabeltrainer” at the top. It so happens that the PanLex tools already include a tabularize Perl script for files in this format: tabularize/html/vokabel-0to1.pl. You will need to copy this script into the working directory. To do so, at the command line (making sure you are in the tgl-deu-Grassau directory), run the command:

plx cp vokabel-0to1.pl

This command will copy the script into the working directory and rename it to tgl-deu-Grassau.vokabel-0to1.pl. (Note that tabularize scripts are conventionally named after their source, since they are generally specific to it.)

Next, open tgl-deu-Grassau.vokabel-0to1.pl in a text editor. Near the top of the file, you will see two variables, $BASENAME and $VERSION. These appear in all tabularize and serialize scripts, and should be modified for the source you are currently working on.

$BASENAME is the name of the input file without its extension and version number. The plx command should have already set the value of $BASENAME to 'tgl-deu-Grassau' when it copied the file, since that was the name of the working directory the script was copied to.

$VERSION is a number indicating the version of the file specified in $BASENAME that the script should use as input. The script will look for a file named $BASENAME-$VERSION with the appropriate extension, in this case .html. The filename tgl-deu-Grassau.html obviously does not conform to this required format. You could modify the script to fix this. However, a better solution is to duplicate the original resource file and rename it as tgl-deu-Grassau-0.html. This allows you to hand-edit the file in case you encounter any issues that are difficult to solve with parsing, while still preserving the original resource document. Go ahead and create tgl-deu-Grassau-0.html in this way.
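On macOS or Linux, the duplicate can be made at the command line (on Windows, use copy in place of cp):

cp tgl-deu-Grassau.html tgl-deu-Grassau-0.html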

Now, look at the newly created tgl-deu-Grassau-0.html file’s source code (in your browser or in a text editor) and also at the main loop in tgl-deu-Grassau.vokabel-0to1.pl. Try to understand what the code is doing, given the structural content of the source file. If you suspect the script will not parse the resource document correctly, you should modify either the script or the resource document (whichever is easier). In this case, all looks good. Try it out by running the script. To do this, at the command line (again, first making sure that you are in the correct directory), enter the command:

perl tgl-deu-Grassau.vokabel-0to1.pl

If successful, the script should output a file to the working directory named tgl-deu-Grassau-1.txt. Open this text file in a text editor and compare it with the original HTML file to be sure there are no character encoding or parsing issues. The output file should contain one line for each entry in the HTML file. Note that only the core data from the resource document (i.e. the bilingual translation pairs) has been extracted into the tabularized file.
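For orientation, and judging from the tagged excerpt shown later in this tutorial, the first few lines of tgl-deu-Grassau-1.txt should look roughly like this, with a tab separating the Tagalog and German columns:

araw	Tag
ánim, seis	sechs
siyám, nuwebe	neun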

You have now successfully tabularized a resource document: each entry is on a single line, with columns separated by tabs. However, the data will need to be serialized before the file can be imported into PanLex.

3. Serialization

The first step in serialization is to copy the serialize.pl Perl serialization script from the PanLex tools into the working directory. This script helps automate the process of running several serialize scripts in a row, and should be used in most (if not all) cases. To do so, open a command line, make sure that you are in the tgl-deu-Grassau directory, and enter:

plx cp serialize.pl

Next, open serialize.pl in a text editor. $BASENAME should already be automatically set to 'tgl-deu-Grassau'. Set $VERSION to 1, since that is the version that will be the starting point for serialization. In other words, the tabularization stage of the process resulted in an output file ending in version number 1.
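The exact declaration syntax at the top of serialize.pl may vary between versions of the PanLex tools, but after editing, the two settings should amount to something like this:

my $BASENAME = 'tgl-deu-Grassau';
my $VERSION  = 1;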

The @TOOLS array in serialize.pl contains a list of serialize scripts to run, and their arguments. In general, a PanLex editor chooses which serialization scripts to run based upon the format and content of the tabularization file. There is detailed information on the scripts in the inline comments in serialize.pl and on the serialization scripts page. A script is activated by uncommenting the relevant line in serialize.pl (that is, removing the ‘#’ symbol at the beginning of the line) and customizing the script’s arguments.
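Schematically, each entry in @TOOLS pairs a script name with a hash of arguments, and an inactive entry differs from an active one only in the leading ‘#’ (the name below is a placeholder, not an actual tool):

# 'some-script' => { some_argument => 'value' },
'some-script' => { some_argument => 'value' },

The concrete entries you will activate in this tutorial are shown in the following steps.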

For this tutorial we will use only three scripts: extag, mnsplit, and out-full-0.

The first script, extag, goes through each line of the input and places a specified tag before each expression and meaning in the source. For this example, we can assume that terms separated by a comma (such as ‘ánim, seis’ in the second line of the tabularized file) are synonymous expressions in the language, whereas terms separated by a semicolon (such as ‘alles; allgemein’ in the ninth line) have substantially different meanings.

Locate the extag line in serialize.pl, delete the initial ‘#’ to activate the script, and edit the line so that it reads:

'extag'  => { syndelim => '\s*,\s*', mndelim => '\s*;\s*', cols => [0, 1] },

Explanation of the script’s arguments:

  • syndelim => '\s*,\s*':  This is a regular expression that matches the assumed synonym separator (‘,’) between two words. (In general, the editor may have to do some investigation to determine whether something is a synonym or meaning separator. The source may also be ambiguous or inconsistent.)
  • mndelim => '\s*;\s*':  This is a regular expression that matches the assumed meaning separator (‘;’) between two words.
  • cols => [0, 1]:  This is an array giving the indices of the columns in the tabularization file that contain expressions (0 = first column, 1 = second column). In this case, the tabularization file has only two columns, and both contain expressions. In other cases, however, the tabularization file might contain additional columns with data that are not expressions, such as definitions, parts of speech, etc.

Note that the extag script will insert the standard expression tag ‘⫷ex⫸’ before every term it finds in the columns given by the cols argument, and the standard meaning delimiter tag ‘⫷mn⫸’ before every new meaning.
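If you want to sanity-check a delimiter regular expression before running the whole pipeline, a quick throwaway Perl one-liner (not part of the PanLex tools) can help. For example, the following splits a sample entry on the assumed meaning delimiter:

perl -e 'print join("|", split(/\s*;\s*/, "alles; allgemein"))'

It should print alles|allgemein, confirming that the pattern separates the two meanings as intended.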

Now, try running serialize.pl with the command perl serialize.pl. It should produce the following output:

extag:         tgl-deu-Grassau-1.txt => tgl-deu-Grassau-2.txt

This means that the extag script was run with tgl-deu-Grassau-1.txt as input, producing the output file tgl-deu-Grassau-2.txt. This file will in turn be used as the input file for the next serialize script.

Open the generated output file tgl-deu-Grassau-2.txt in a text editor.  It should begin as follows:

⫷ex⫸araw	⫷ex⫸Tag
⫷ex⫸ánim⫷ex⫸seis	⫷ex⫸sechs
⫷ex⫸siyám⫷ex⫸nuwebe	⫷ex⫸neun
⫷ex⫸málaman	⫷ex⫸wissen
⫷ex⫸kúnin	⫷ex⫸fassen⫷ex⫸nehmen
⫷ex⫸tatló⫷ex⫸tres	⫷ex⫸drei
⫷ex⫸maliít	⫷ex⫸gering⫷ex⫸klein
⫷ex⫸waló⫷ex⫸otso	⫷ex⫸acht
⫷ex⫸lahát	⫷ex⫸alles⫷mn⫸⫷ex⫸allgemein
…

Now we need to make sure that the lines containing multiple meanings are split into multiple lines, one per meaning. To do this we will use the mnsplit script.  Find its line in serialize.pl, remove the comment marker #, and edit it so that it reads:

'mnsplit'      => { col => 1 },

The single argument indicates that the script should look in the second column of the input file for the standard meaning delimiter tag (which was inserted by the extag script) and split lines with multiple meanings into separate lines, one for each meaning. (Note that in this case there is no need to process the first column: by inspection, we can see that the meaning separator never appears there. However, in sources where more than one column contains meaning delimiters, you should run the mnsplit script once per such column, duplicating the relevant line in serialize.pl and changing the index of the targeted column, as in the sketch below.)
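For example, in a hypothetical source whose first column also contained meaning delimiters, the relevant portion of @TOOLS would contain two entries rather than one (this is not needed for the present source):

'mnsplit'      => { col => 0 },
'mnsplit'      => { col => 1 },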

Run serialize.pl again. You will see that a new output file, tgl-deu-Grassau-3.txt, has been generated by applying the mnsplit script to the input file tgl-deu-Grassau-2.txt.

Open tgl-deu-Grassau-3.txt in a text editor and confirm that the ninth line of the excerpt above has now been split into two lines, one per meaning:

⫷ex⫸lahát	⫷ex⫸alles
⫷ex⫸lahát	⫷ex⫸allgemein

OK, we are almost done. The last step is to transform this file into a final source file. Final source files are UTF-8-encoded text files in a format that can be parsed and validated by PanLem. They are generated by the serialize script out-full-0. (There used to be four such scripts, but there is now only one.)

Find the line in serialize.pl relating to out-full-0, remove the comment marker #, and edit it so that it reads:

'out-full-0' => { specs => [ '0:tgl-000', '1:deu-000' ] },

The arguments indicate which language varieties may be found in which columns: tgl-000 (Tagalog) in the first column (column 0) and deu-000 (German) in the second column (column 1). Running serialize.pl one more time, we generate tgl-deu-Grassau-final.txt, which begins as follows:

:
0

mn
  dn
    tgl-000
    araw
  dn
    deu-000
    Tag

mn
  dn
    tgl-000
    ánim
  dn
    tgl-000
    seis
  dn
    deu-000
    sechs

…

(The colon and 0 on the first two lines are the header, present for historical reasons.)

The body of the file contains records separated by blank lines (double newlines). Each record, introduced by mn, lists a set of expressions sharing a meaning. Each expression’s denotation, indicated by dn, is coded as a line containing the language variety’s uniform identifier (UID), followed by a line containing the expression’s text.
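As an optional, quick sanity check (using a standard command-line tool rather than the PanLex tools), you can count how many meaning records the final file contains, since each record begins with a line consisting only of mn:

grep -c '^mn$' tgl-deu-Grassau-final.txt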

You are now done. Congratulations! You have successfully processed a resource document through the stages of tabularization and serialization. The final source file tgl-deu-Grassau-final.txt is now ready to be uploaded, via PanLem, into the PanLex database. Since this is just an example file, we will skip that final step, but you can read about it on the importation page.