Assimilation strategies | PanLex development

We perform assimilation in distinct ways, called strategies. After you select a source, the rest of the work depends on the assimilation strategy you choose. There are two main strategies: interpretation and analysis.

Interpretation

Interpretation is the activity of a human editor who reads a source’s content and, for each piece of it (typically each entry in a dictionary), decides:

whether to make use of it;
if so, then:
- what part of it to use;
- how to manipulate that part into data for the database.

The editor then carries out the work, creating new data (usually by typing it) based on the selected parts.

There are several ways for an editor doing interpretation to get the resulting data into the database. The editor can:

Enter the new data into a spreadsheet and convert it to a tabular file, or enter it directly into a tabular file. This file is then semi-automatically serialized, so that a final source file is generated and then imported into the database.
Enter the new data directly into a final source file, which is then imported into the database. (This is usually less efficient than the previous method.)
Enter the new data into PanLem’s interface for meaning creation and editing. (This is usually efficient only for very small sets of data.)
Enter data into an interactive form, tailored for a particular source, that accepts all the details of one meaning at a time. (This is a hypothetical possibility not currently implemented.)
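As an illustration of the first method, the conversion from a spreadsheet export to a tabular file might look like the following minimal sketch. It assumes the spreadsheet was exported as CSV and that the tabular file is tab-delimited; the function name csv_to_tab is hypothetical and is not part of the PanLex tools.

```python
import csv
import io


def csv_to_tab(csv_text):
    """Convert CSV text (as exported from a spreadsheet) to tab-delimited text.

    Assumption: one spreadsheet row becomes one tab-delimited line.
    """
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    out_lines = []
    for row in reader:
        # Trim stray whitespace that often creeps into hand-typed cells.
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # skip rows that are entirely empty
        # A tab or newline inside a cell would corrupt the tabular layout.
        if any("\t" in cell or "\n" in cell for cell in cells):
            raise ValueError(f"cell contains a tab or newline: {row!r}")
        out_lines.append("\t".join(cells))
    return "".join(line + "\n" for line in out_lines)
```

Rejecting cells that contain tabs or newlines, rather than silently rewriting them, leaves the decision about how to repair such cells with the editor.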

If you are doing interpretation, you should read the documentation on assimilation tools (including the PanLex tools), tabularization, interpretation methods, and serialization. You can ignore the aspects of tabularization that involve writing code.

Analysis

Analysis is the activity of a human programmer who:

examines a source’s content;
determines the content’s structure;
adapts existing programs and writes new programs to:
- select useful parts of the content;
- perform manipulations on the selected parts;
- generate a new document.

If you are doing analysis, you should read the documentation on assimilation tools (including the PanLex tools), tabularization, and serialization. You can write your tabularization scripts in any language that you are comfortable with. So far, most analysts have used Python or Perl.
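The three programming tasks listed above (selecting, manipulating, generating) can be sketched in a short Python script. This is only an illustration under stated assumptions: the entry format "headword : translation, translation", the "; " joiner between translations, and the function name tabularize are all hypothetical, and real sources will need their own patterns.

```python
import re

# Hypothetical entry format: "headword : translation1, translation2"
ENTRY = re.compile(r"^\s*(?P<head>[^:]+?)\s*:\s*(?P<trans>.+?)\s*$")


def tabularize(lines):
    """Return one tab-delimited row per usable dictionary entry."""
    rows = []
    for line in lines:
        match = ENTRY.match(line)
        if match is None:
            continue  # select: skip titles, blank lines, and other non-entries
        head = match.group("head")
        # manipulate: split the translations and trim whitespace
        translations = [t.strip() for t in match.group("trans").split(",")]
        translations = [t for t in translations if t]
        if not translations:
            continue
        # generate: one row with the headword and its joined translations
        rows.append(head + "\t" + "; ".join(translations))
    return rows


sample = [
    "Dictionary of Examples",
    "dog : chien, cabot",
    "",
    "cat :  chat",
]
print("\n".join(tabularize(sample)))
```

Because the selection and manipulation rules live in the script, the same steps can be rerun unchanged on a revised version of the source.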

Comparison

Interpretation and analysis use distinct methods, but they produce a similar result: new data that is based on the source’s data but differs from it.

Each strategy has advantages. Interpretation produces high-quality results if conducted by a competent editor. It is also economical for sources that are small or very complex. Analysis is economical for sources that are large and have simple, consistent structures. It is also easier to modify and repeat when we need to correct assimilation errors or when we acquire a revised version of a source.

Most assimilation in the project has been done with the analysis strategy, but this has depleted the supply of sources amenable to analysis. Thus, we expect interpretation to become more common.

Combination

If you can do assimilation with the analysis strategy, you can also mix it with interpretation, employing a hybrid strategy. In this way you use programs and human interpretation each for the tasks it is best suited to.

Combining the two strategies works as follows. You produce successive versions of the source file, sequentially numbered. In some cases the transition from one version to the next is produced by analysis, and in other cases it is produced by interpretation. It is usually most efficient to do the interpretation at the beginning of the sequence (if possible), so that the analysis can be repeated without having to redo the interpretation.
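A minimal sketch of how an analytical transition between numbered versions could be scripted follows. The naming pattern (a name ending in a hyphen and a number, such as src-0.txt) and the helper names next_version_path and apply_step are assumptions made for illustration, not the project's actual conventions or tools.

```python
import re
from pathlib import Path


def next_version_path(path):
    """Return the path of the next version, e.g. 'src-0.txt' -> 'src-1.txt'."""
    path = Path(path)
    match = re.fullmatch(r"(?P<base>.+)-(?P<num>\d+)", path.stem)
    if match is None:
        raise ValueError(f"no version number in {path.name!r}")
    new_stem = f"{match.group('base')}-{int(match.group('num')) + 1}"
    return path.with_name(new_stem + path.suffix)


def apply_step(in_path, transform):
    """Read one version, apply a transform to its lines, write the next version.

    The input file is never modified, so any step can be rerun later.
    """
    in_path = Path(in_path)
    out_path = next_version_path(in_path)
    lines = in_path.read_text(encoding="utf-8").splitlines()
    out_lines = transform(lines)
    out_path.write_text("".join(l + "\n" for l in out_lines), encoding="utf-8")
    return out_path
```

An interpretive transition would instead be made by hand: copy the current version to the next number and edit the copy. Keeping every version on disk is what allows the later analytical steps to be repeated without redoing the manual work.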

When you combine analysis with interpretation, the analytical transitions are documented in your scripts, while the interpretive transitions are not. It is therefore advisable to describe the interpretive portions of your work in a note file.