Interpretation methods | PanLex development

Interpretation is an assimilation strategy in which an editor performs manual tabularization. Interpretation may be contrasted with analysis, where an editor writes a program to perform tabularization. The tabular file format is the same in both cases, and both strategies are followed by semi-automated serialization with the PanLex tools.

Interpretation has its own advantages and challenges. The best method is not equivalent to mechanically performing the steps that a program would perform. Interpreters can use their judgment (for example, based on their knowledge of the languages in the source) in a way that programs cannot. On the other hand, programs excel at performing the same rote task on large amounts of data, which is tedious to do by manual data entry. To maximize the advantages of interpretation and minimize the challenges, we have developed certain best practices, which are documented here.

Software

The first choice to make in interpretation is what software to use to enter the data. Spreadsheet programs such as Microsoft Excel, LibreOffice Calc, and Google Sheets are generally the best choice. You can also directly enter tab-delimited text with a text editor.

If you enter your data into a spreadsheet, you will need to export it to create a tabular file. While it is fine to enter data into Excel, Excel cannot directly export a correctly formatted tabular file. We therefore recommend that you use LibreOffice to export your Excel spreadsheet. The export procedure is as follows:

From LibreOffice: Choose Save As… and select Text CSV as the file type. After you click Save, a dialog box with further options will appear. Choose Unicode (UTF-8) as the character set and {Tab} as the field delimiter. Make sure Quote all text cells is not selected.
From Google Sheets: Choose Download As > Tab-separated values from the File menu.

Once you have saved the tabular file, you should delete the first line if it contains a header with column names.
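
If you would rather not delete the header by hand, the step can be sketched in Python. This is only an illustration and is not part of the PanLex tools: the function names drop_header and column_counts are hypothetical, and the sketch assumes the export is UTF-8, tab-delimited text with no quoted cells. The second function is an optional sanity check that every line has the same number of columns.

```python
def drop_header(tabular_text: str) -> str:
    """Return the text with its first line (the header row) removed."""
    # partition splits at the first newline only, so the rest is kept verbatim.
    _, _, rest = tabular_text.partition("\n")
    return rest


def column_counts(tabular_text: str) -> set:
    """Return the distinct column counts found across all non-empty lines."""
    return {
        len(line.split("\t"))
        for line in tabular_text.split("\n")
        if line  # skip empty lines, such as the one after a trailing newline
    }


# Hypothetical two-column export with a header row.
exported = "eng\tfra\ndog\tchien\ncat\tchat\n"
body = drop_header(exported)
consistent = len(column_counts(body)) == 1
```

Only call drop_header when the first line really is a header; otherwise it discards a line of data. If column_counts returns more than one value, some line probably has a missing or extra tab.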

Techniques

Since you will run your tabular file through serialization, you can make use of all available serialization scripts. Understanding how they work can save you time during data entry. Here are some useful techniques:

The exdftag script will remove definitional elements in parentheses. Therefore, if your source contains the string “paternal aunt”, you can enter “(paternal) aunt” in the relevant expression column and leave the definition column (if any) blank.
The normalize script will convert putative expressions to definitions based on their attestation in the database. Therefore, when you are uncertain whether something should count as an expression or definition, you can put it in the relevant expression column and leave the definition column (if any) blank.
The csppmap and mcsmap scripts map various abbreviations to the appropriate classifications and properties. You can customize the mappings by creating your own mapping file. Therefore, you can use whatever abbreviations you want in your tabular file, as long as they are unambiguous and can later be mapped.
The replace script allows you to perform regular expression replacements on any column.
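
To make the last two techniques more concrete, here is a minimal Python sketch of the general ideas behind abbreviation mapping and per-column regular expression replacement. It does not reproduce the PanLex serialization scripts or their interfaces. The function names, the sample rows, and the mapping values ("noun", "verb") are all assumptions made for illustration.

```python
import re


def replace_in_column(rows, col, pattern, repl):
    """Apply a regular expression replacement to one column of every row."""
    regex = re.compile(pattern)
    return [
        [regex.sub(repl, cell) if i == col else cell for i, cell in enumerate(row)]
        for row in rows
    ]


def map_abbreviations(rows, col, mapping):
    """Replace values in one column via a lookup table; unknown values stay as-is."""
    return [
        [mapping.get(cell, cell) if i == col else cell for i, cell in enumerate(row)]
        for row in rows
    ]


# Hypothetical rows: expression, part-of-speech abbreviation, translation.
rows = [["dog", "n.", "chien"], ["run", "v.", "courir "]]

# Strip trailing whitespace in the third column (index 2).
rows = replace_in_column(rows, 2, r"\s+$", "")

# Expand the abbreviations in the second column (index 1).
rows = map_abbreviations(rows, 1, {"n.": "noun", "v.": "verb"})
```

The sketch shows why consistent, unambiguous abbreviations matter during data entry: a lookup table can only expand a value reliably if that value always means the same thing.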