Serialization scripts | PanLex development

Introduction

The serialization stage of assimilation involves a multi-step process. The main step in that process is to choose a set of serialization scripts to call and to define their arguments. You do this by editing a copy of the serialize.pl script, which, in turn, can call the various special-purpose serialization scripts. This step requires some understanding of the purposes and behaviors of the scripts, documented in the serialize.pl file (in comments below the calling lines) and, when supplementary documentation is necessary, on this page.

The arguments to scripts include numeric column identifiers. The first column is always identified as column 0.

Contents

Tagging
Script documentation
Reserialization

Tagging

The serialization scripts use a standard tagging syntax for manipulating tabular files. Tags are opened and closed with two rare Unicode symbols: U+2AF7 (triple-nested less-than sign “⫷”) and U+2AF8 (triple-nested greater-than sign “⫸”). This is helpful, because these symbols almost never appear in sources, so by using them we avoid collisions between these meta-characters and ordinary characters. (If you are not seeing these symbols above—compare with the images on the linked pages—you should install fonts that cover most of Unicode. You will need such fonts anyway in order to view the text in some sources.)

Each tag has the following syntax:

⫷type:UID⫸content

The type indicates what kind of data it contains: ex for expressions, df for definitions, and so on. The UID indicates what language variety the tag data are in. The content is a string containing the data. For example, the tag ⫷ex:eng-000⫸book indicates the expression “book” in English.

There is also a reduced tag syntax which omits the UID and the colon before it. Many serialization scripts create reduced tags by default, deferring UID specification until the out-full-0 script.

We refer to the tags described above as simple tags. There are also complex tags, which consist of two interrelated simple tags in a row. Complex tags are currently used only for classifications and properties. For example, the complex tag ⫷dcs2:art-303⫸PartOfSpeechProperty⫷dcs:art-303⫸Noun indicates a binary classification where the superclass expression is “PartOfSpeechProperty” in art-303 and the class expression is “Noun” in art-303. (This is the standard way to classify an expression in a meaning as a noun.)
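The tag syntax is regular enough to take apart with a single pattern. The following is a minimal Python sketch, not part of the PanLex tools: the names TAG_RE and parse_tags are hypothetical, and the pattern assumes that tag content never contains the opening delimiter.

```python
import re

# One simple tag: ⫷type⫸content (reduced) or ⫷type:UID⫸content (full).
# Content runs until the next opening delimiter or the end of the string.
TAG_RE = re.compile(r"⫷([^:⫸]+)(?::([^⫸]+))?⫸([^⫷]*)")


def parse_tags(text):
    """Return a list of (type, uid_or_None, content) tuples (hypothetical helper)."""
    return [(m.group(1), m.group(2), m.group(3)) for m in TAG_RE.finditer(text)]


simple = parse_tags("⫷ex:eng-000⫸book")
reduced = parse_tags("⫷ex⫸book")
# A complex tag is just two simple tags in a row.
complex_tag = parse_tags("⫷dcs2:art-303⫸PartOfSpeechProperty⫷dcs:art-303⫸Noun")
```

A reduced tag yields None in the UID position, which mirrors the way UID specification is deferred until out-full-0.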

When you create a tabular file, it can be useful to pretag certain cells and columns. For example, if your tabularization script is able to classify some strings as definitions, then rather than placing these definitions in their own column, you can pretag them with ⫷df⫸. When you tag the column as containing expressions using extag, the pretagged cells will be ignored (and thus preserved as definitions). Another case where pretagging is useful is when a column generally contains expressions in one language variety but occasionally contains another—for example, species names in lat-003. If you pretag these with ⫷ex:lat-003⫸, the specified UID will override the column language variety you pass to out-full-0.

It is up to you to determine when pretagging is appropriate. Use whatever method is most convenient to produce a properly-formatted final source file.

Script documentation

apostrophe

The apostrophe script standardizes apostrophe-like characters. The most common such characters are:

U+0027: apostrophe (')
U+02BB: modifier letter turned comma (ʻ)
U+02BC: modifier letter apostrophe (ʼ)
U+05F3: Hebrew punctuation geresh (׳)
U+2019: right single quotation mark (’)
U+A78C: Latin small letter saltillo (ꞌ)

As a generalization, U+02BB is used in Uzbek and some Austronesian languages, U+2019 in European languages, U+A78C in indigenous Mexican languages, U+05F3 in languages that use the Hebrew script, and U+02BC in all other languages.

The U+0027 (apostrophe) character appears in many sources as a legacy substitute for one of the others. The apostrophe script looks up the language variety in PanLex, finds out which of these characters are recorded as standard ones for that variety, and converts U+0027 (apostrophe) to one of those characters.

In special cases, the script may not do what you want. For example, some language varieties have more than one standard apostrophe-like character, and borrowed words that contain apostrophe-like characters should, perhaps, use those that are standard in the lending language in those words. In such cases, you can convert any deviant characters during tabularization and use the script for all the rest. If you don’t, the script will choose a character approved for the language variety. If more than one is approved, it will choose the first approved one in this list: U+05F3, U+A78C, U+02BB, U+2019, U+02BC. It will choose U+02BC if none of these is approved.
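The selection rule can be sketched in a few lines of Python. This is an illustration only: the real script looks up the approved characters in PanLex, whereas here they are passed in as a plain set, and the names PRIORITY, choose_apostrophe, and standardize are hypothetical.

```python
# Priority order and fallback as described above.
PRIORITY = ["\u05F3", "\uA78C", "\u02BB", "\u2019", "\u02BC"]
FALLBACK = "\u02BC"


def choose_apostrophe(approved):
    """Pick the first approved character in priority order, else the fallback."""
    for ch in PRIORITY:
        if ch in approved:
            return ch
    return FALLBACK


def standardize(text, approved):
    """Convert every U+0027 in text to the chosen character."""
    return text.replace("\u0027", choose_apostrophe(approved))


# Hypothetical variety with two approved characters: U+02BB wins over U+2019.
chosen = choose_apostrophe({"\u2019", "\u02BB"})
```

Only U+0027 is converted in this sketch; any other apostrophe-like characters already present in the text are left as they are.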

extag

Each denotation in a tabular file is represented by the text of an expression. extag tags each denotation in the specified columns as type ex, and also inserts an empty tag of type mn before any denotation that has a different meaning from the one before it.

In some sources commas and semicolons are used for distinguishing synonyms and meanings, respectively. If the input file conforms to this convention and contains “chair, stool, bench; chairperson, head” in column 3, you can properly tag these denotations by specifying the arguments:

cols => [3], syndelim => ', ', mndelim => '; '

The script will then output:

⫷ex⫸chair⫷ex⫸stool⫷ex⫸bench⫷mn⫸⫷ex⫸chairperson⫷ex⫸head

However, in analysis (rather than interpretation), it generally makes sense to convert the source’s delimiters during tabularization to the standard PanLex synonym and meaning delimiters. The standard synonym delimiter is U+2023 (triangular bullet “‣”) and the meaning delimiter is U+204B (reversed pilcrow sign “⁋”).

If extag inserts any meaning-delimiter tags, you will also need to call mnsplit on the columns that contain them.
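To make the tagging of a single cell concrete, here is a minimal Python sketch. It is not the actual extag implementation: the function name extag_cell is hypothetical, the default delimiters are assumed to be the standard PanLex ones, and pretagged cells are not handled.

```python
def extag_cell(cell, syndelim="‣", mndelim="⁋"):
    """Tag one cell: ⫷ex⫸ before each synonym, ⫷mn⫸ between meanings."""
    if not cell:
        return cell  # nothing to tag in an empty cell
    tagged_meanings = []
    for meaning in cell.split(mndelim):
        synonyms = meaning.split(syndelim)
        tagged_meanings.append("".join("⫷ex⫸" + syn for syn in synonyms))
    return "⫷mn⫸".join(tagged_meanings)


tagged = extag_cell(
    "chair, stool, bench; chairperson, head", syndelim=", ", mndelim="; "
)
```

With the comma and semicolon delimiters, the sketch reproduces the output shown above for “chair, stool, bench; chairperson, head”.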

normalizedf

If your source includes complex would-be denotation expressions in well-documented language varieties, the normalizedf script may be useful. It compares each proposed expression’s degraded text with the degraded texts of all definitions in the same language variety. If it finds a match, it converts the expression’s text to whichever definition text has the highest attestation score. It still leaves the text tagged as an expression (⫷ex⫸).

For example, taking the English string “kind of tree”, the script would query the API to find other English definitions with the same degraded text (“kindoftree” in this case), and their scores. The definition “(kind of) tree” would likely have the highest score, so the expression would be rewritten. Later, using exdftag, the expression “tree” would be extracted and “(kind of) tree” would be converted to a definition.
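The matching step can be illustrated with a rough Python sketch. Two things here are assumptions: rough_degrade only approximates PanLex text degradation (it case-folds and drops everything that is not a letter or digit), and the scored definitions are passed in as a plain dictionary with made-up scores instead of being fetched from the API.

```python
import unicodedata


def rough_degrade(text):
    """Approximate degradation: case-fold, decompose, keep letters and digits."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if ch.isalnum())


def pick_definition(expr, scored_definitions):
    """Return the best-scoring definition with the same degraded text, else expr."""
    target = rough_degrade(expr)
    matches = {
        text: score
        for text, score in scored_definitions.items()
        if rough_degrade(text) == target
    }
    if not matches:
        return expr
    return max(matches, key=matches.get)


# Hypothetical scores, for illustration only.
scores = {"(kind of) tree": 12, "kind of tree": 3, "tree": 40}
best = pick_definition("kind of tree", scores)
```

In this sketch “tree” is never a candidate, because its degraded text differs from “kindoftree”; only texts that degrade identically compete on score.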

exdftag

The exdftag script is used to tag definitional portions contained within expressions, generally indicated by parentheses. Definitions are by default tagged as type ⫷df⫸. If you chose, during tabularization, to insert parentheses around non-critical or definitional information in expressions, the exdftag script handles this appropriately during serialization.

For example, if the tabular file has an entry such as “ataque (epiléptico)” in the column containing expressions, the successive application of the extag and exdftag scripts on this entry will result in an output of the form ⫷df⫸ataque (epiléptico)⫷ex⫸ataque. In other words, only the bare form of the lemma (“ataque”) is identified as an expression in the language. The original, full entry (“ataque (epiléptico)”) is retained as a definition.

The exdftag script can also deal with expressions containing characters or patterns unlikely to occur in valid expressions, and with longer expressions that are more appropriately categorized as definitions (but were not indicated as definitions during the tabularization stage). You specify a regular expression matching prohibited expressions and the maximum allowable length of an expression, measured either in characters or in words. Any expressions matching the pattern or exceeding the maximum length are reclassified as definitions.
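The two behaviors described above can be combined in one small Python sketch. It operates on a bare expression text and is not the real exdftag: maxchar and maxword echo the parameter names used elsewhere on this page, while the prohibited parameter, the function name, and the exact handling of whitespace are assumptions.

```python
import re

# A parenthesized portion together with the whitespace around it.
PAREN_RE = re.compile(r"\s*\([^()]*\)\s*")


def exdftag_sketch(expr, maxchar=None, maxword=None, prohibited=None):
    """Tag one expression text as ⫷ex⫸, ⫷df⫸, or both (hypothetical helper)."""
    bare = re.sub(r"\s+", " ", PAREN_RE.sub(" ", expr)).strip()
    too_long = (maxchar is not None and len(bare) > maxchar) or (
        maxword is not None and len(bare.split()) > maxword
    )
    bad = prohibited is not None and re.search(prohibited, bare) is not None
    if not bare or too_long or bad:
        return "⫷df⫸" + expr  # reclassified as a definition
    if bare != expr:
        # Keep the full text as a definition, the bare form as an expression.
        return "⫷df⫸" + expr + "⫷ex⫸" + bare
    return "⫷ex⫸" + expr


split_result = exdftag_sketch("ataque (epiléptico)")
demoted = exdftag_sketch("a thing used for sitting on", maxword=3)
```

For “ataque (epiléptico)” the sketch yields the same tagged string as in the example above; an overly long or prohibited text is kept only as a definition.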

dftag

The dftag script is used to tag definitions contained within their own column of the tabular file. Definitions are by default tagged as type ⫷df⫸. The difference between exdftag and dftag is that the former assumes that definitions appear alongside expressions in the same column of data in the tabular file, demarcated by parentheses or some other pattern, whereas the latter assumes that definitions are in a separate column in the tabular file.

csppmap

The csppmap script converts the content of the specified columns to classifications or properties by looking it up in a mapping file. For example, if one of the columns contains the string “n.m.”, csppmap looks that up and replaces it with the mapping that it finds in the file, namely art-303:PartOfSpeechProperty:art-303:Noun‣art-303:GenderProperty:art-303:MasculineGender. The resulting specifications are then tagged as ⫷dcs2:art-303⫸PartOfSpeechProperty⫷dcs:art-303⫸Noun⫷dcs2:art-303⫸GenderProperty⫷dcs:art-303⫸MasculineGender.

When csppmap cannot identify a mapping in the file, by default it creates a property with the attribute expression art-303 “LinguisticProperty”. You can specify a different attribute expression. Other options are to leave the content unchanged or to delete it.

The default mapping file is stored in the panlex-tools directory as serialize/data/csppmap.txt (also available as the symlink cspp/csppmap.txt). You can create your own custom mapping file, either from scratch or by using the default file as a starting point. To copy the default file to your source directory so that you may modify it, you can run plx cp csppmap.txt.

Each line of the mapping file must contain three columns in tab-delimited format:

Column 1: the text to convert (all values in this column should be distinct)
Column 2: zero or more classifications
Column 3: zero or more properties

The values in columns 2 and 3 are in the same format as is required for classification and property columns. Multiple classifications and properties are separated by a customizable delimiter defaulting to ‣.
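As an illustration of the lookup, the following Python sketch loads mapping lines in the three-column format and converts a cell to tags. It is a simplified stand-in for csppmap, not its implementation: the function names are hypothetical, the delimiters are fixed at their defaults, and the fallback assumes that the unmapped cell content becomes the value of the “LinguisticProperty” property.

```python
def load_mapping(lines, delim="‣"):
    """Parse tab-delimited lines: text, classifications, properties."""
    mapping = {}
    for line in lines:
        text, classes, props = line.rstrip("\n").split("\t")
        mapping[text] = (
            classes.split(delim) if classes else [],
            props.split(delim) if props else [],
        )
    return mapping


def tag_classification(spec):
    """superclass UID:superclass expression:class UID:class expression"""
    super_uid, super_ex, class_uid, class_ex = spec.split(":")
    return f"⫷dcs2:{super_uid}⫸{super_ex}⫷dcs:{class_uid}⫸{class_ex}"


def tag_property(spec):
    """attribute UID:attribute expression:value"""
    attr_uid, attr_ex, value = spec.split(":", 2)
    return f"⫷dpp:{attr_uid}⫸{attr_ex}⫷dpp⫸{value}"


def csppmap_sketch(cell, mapping):
    if cell not in mapping:
        # Assumed default: property with attribute art-303 "LinguisticProperty".
        return tag_property(f"art-303:LinguisticProperty:{cell}")
    classes, props = mapping[cell]
    return "".join(map(tag_classification, classes)) + "".join(
        map(tag_property, props)
    )


sample_lines = [
    "n.m.\tart-303:PartOfSpeechProperty:art-303:Noun"
    "‣art-303:GenderProperty:art-303:MasculineGender\t\n",
]
mapping = load_mapping(sample_lines)
mapped = csppmap_sketch("n.m.", mapping)
```

The third column is empty in the sample line, so “n.m.” produces two classifications and no properties.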

dcstag

The dcstag script tags strings in the specified columns (format) as denotation classifications. This means prefixing them (if necessary) and replacing their delimiters with tags. For example, dcstag would replace art-303:GenderProperty:art-303:VegetableGender with ⫷dcs2:art-303⫸GenderProperty⫷dcs:art-303⫸VegetableGender.

dpptag

The dpptag script tags strings in the specified columns (format) as denotation properties. This means prefixing them (if necessary) and replacing their delimiters with tags. For example, dpptag would replace art-303:phonemicRep:inoho with ⫷dpp:art-303⫸phonemicRep⫷dpp⫸inoho.

copydntag

The copydntag script takes a sequence of already-tagged denotation classifications and/or properties in the specified column and applies them to all of the expressions in the specified other columns. This script is useful when some fact about all the denotations of a meaning appears only once. In some cases, for example, grammatical classes such as “noun” appear once per entry but are intended to apply to all the expressions with that meaning.

mcsmap

The mcsmap script converts the content of the specified columns to meaning classifications by looking it up in a mapping file. For example, if one of the columns contains the string “domain:eng-000:physics”, mcsmap looks up “domain” and uses the conversion that it finds in the table, namely art-300:HasContext, to tag the string as ⫷mcs2:art-300⫸HasContext⫷mcs:eng-000⫸physics.

The default mapping file is stored in the panlex-tools directory as serialize/data/mcsmap.txt. You can create your own custom mapping file, either from scratch or by using the default file as a starting point. To copy the default file to your source directory so that you may modify it, you can run plx cp mcsmap.txt.

Each line of the mapping file must contain two columns in tab-delimited format:

Column 1: the text to convert, when preceding a delimiter (all values in this column should be distinct)
Column 2: one classification

The value in column 2 is in the same format as is required for classification columns. Note that the delimiter in column 2 and the within-classification delimiter used in the tabular file (the intradelim parameter) must match. The default value of intradelim is : (colon).
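The role of the shared delimiter is easiest to see in code. This Python sketch is an illustration under assumptions, not the mcsmap implementation: the mapping is a plain dictionary, the function name is hypothetical, and leaving unmapped cells unchanged is a choice made only for the sketch.

```python
def mcsmap_sketch(cell, mapping, intradelim=":"):
    """Convert 'key:UID:text' to a meaning classification via the mapping."""
    key, _, rest = cell.partition(intradelim)
    if key not in mapping or not rest:
        return cell  # assumption: unmapped content is left as it is
    # The same delimiter splits both the mapping value and the cell.
    super_uid, super_ex = mapping[key].split(intradelim, 1)
    class_uid, class_ex = rest.split(intradelim, 1)
    return f"⫷mcs2:{super_uid}⫸{super_ex}⫷mcs:{class_uid}⫸{class_ex}"


mapping = {"domain": "art-300:HasContext"}
tagged = mcsmap_sketch("domain:eng-000:physics", mapping)
```

If the mapping file used one delimiter and the tabular file another, one of the two split calls would fail to find its boundary, which is why the two must match.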

mcstag

The mcstag script tags strings in the specified columns (format) as meaning classifications. This means prefixing them (if necessary) and replacing their delimiters with tags. For example, mcstag would replace art-300:MadeOf:eng-000:steel with ⫷mcs2:art-300⫸MadeOf⫷mcs:eng-000⫸steel.

mpptag

The mpptag script tags strings in the specified columns (format) as meaning properties. This means prefixing them (if necessary) and replacing their delimiters with tags. For example, mpptag would replace art-301:identifier:23044 with ⫷mpp:art-301⫸identifier⫷mpp⫸23044.

mnsplit

The mnsplit script converts tabular file lines containing multiple meanings into multiple lines, one line per meaning. It should be used if any ⫷mn⫸ tags were inserted by the application of the extag script.

The script splits an entry into as many lines as there are segments in the specified column, where the segments are the parts delimited by the ⫷mn⫸ tags. Each resulting line is a copy of the entry that contains one distinct segment of what was originally in that column.
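A minimal Python sketch of the splitting, assuming each line of the tabular file has already been read into a list of cells; the function name mnsplit_sketch is hypothetical and this is not the script's actual code.

```python
def mnsplit_sketch(rows, col):
    """Emit one row per ⫷mn⫸-delimited segment of column col."""
    out = []
    for row in rows:
        for segment in row[col].split("⫷mn⫸"):
            copy = list(row)      # all other columns are repeated unchanged
            copy[col] = segment
            out.append(copy)
    return out


rows = [["silla", "⫷ex⫸chair⫷ex⫸stool⫷mn⫸⫷ex⫸chairperson"]]
split_rows = mnsplit_sketch(rows, col=1)
```

A row with no ⫷mn⫸ tag in the column passes through as a single unchanged row.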

normalize

The normalize script judges denotation or binary classification expressions in a given column, to determine whether they appear to be valid (in their language variety), as measured by a score reflecting the expressions’ existing attestation in the PanLex database. Each expression without a high enough score is then compared with all similar expressions in its language variety. Whichever of the expressions has the highest score, if it’s high enough, is accepted as the expression. For example, if the source uses “pine-apple” as an expression but “pineapple” has a higher score, then normalize can replace “pine-apple” with “pineapple”. If neither the proposed expression nor a similar alternative has a high enough score, the expression can be demoted. A denotation expression can become a definition; a class expression can become the value of a property, where the superclass expression is also changed to an attribute expression.

To use normalize effectively, you should review its detailed documentation.

spellcheck

The spellcheck script provides access to the multilingual spelling-correction libraries aspell and hunspell.

replace

The replace script does a regular expression search and replace inside a column.

One simple use of replace is to delete marks that you have used to protect expressions from normalization. This method is discussed in the documentation on normalization. For this purpose, you define from as the mark that you used, and you define to as an empty string.

Here is a more complex example. Suppose you want to accept very well-attested expressions; replace other expressions with well-attested respellings, if there are any; downgrade any remaining strings that are longer than some maximum to definitions; and accept the remaining poorly-attested or unattested strings as expressions, because you trust this source to introduce new valid expressions. You can do this as follows:

Apply exdftag to the column. Don’t define any maximum lengths.
Apply normalize to the column, defining the parameter failtag as ⫷fail⫸. This tags poorly attested or unattested strings with this tag.
Apply exdftag again, this time assigning the value ⫷fail⫸ to the parameter extag. Thus, it will operate only on the strings tagged with ⫷fail⫸; those tagged with ⫷ex⫸ or ⫷df⫸ will be left alone. Set values for maxchar and maxword. This converts candidates that normalize rejected to definitions if they are overly long and leaves the ⫷fail⫸ tag on the rest.
Apply replace to the column, setting from as ⫷fail⫸ and to as ⫷ex⫸.
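The last step of this workflow amounts to a regular expression substitution in one column. The Python sketch below shows that step in isolation; the function name and the list-of-cells row representation are assumptions, and the parameter is called from_pattern only because from is a reserved word in Python.

```python
import re


def replace_sketch(rows, col, from_pattern, to):
    """Apply a regex search and replace to one column of every row."""
    return [
        row[:col] + [re.sub(from_pattern, to, row[col])] + row[col + 1:]
        for row in rows
    ]


rows = [
    ["0", "⫷fail⫸newword"],        # rejected by normalize, short enough to keep
    ["1", "⫷df⫸a very long gloss"],  # already demoted by the second exdftag pass
    ["2", "⫷ex⫸pineapple"],         # accepted by normalize
]
final_rows = replace_sketch(rows, col=1, from_pattern="⫷fail⫸", to="⫷ex⫸")
```

After the substitution, the remaining ⫷fail⫸ candidates are ordinary expressions, while the ⫷df⫸ and ⫷ex⫸ cells are untouched.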

out-full-0

The out-full-0 script generates a final source file from the output of the scripts described above. To use it, you need to specify the columns in the file that are language-variety-specific (those containing expressions, definitions, or not fully specified classifications) as arguments to the script, giving the UID for each of those columns’ language variety.

The editor can direct out-full-0 to ignore entries in the file that do not contain at least a minimum count of expressions (minex) or of expressions and definitions combined (mindf).

The special tag type rm will be removed by default from any line in which it occurs. This is occasionally useful in more complex workflows.

out-full-0 performs error checking in order to avoid some common errors prior to submission. These include the following:

blank lines (for example, an expression with a blank text)
lines containing Unicode characters prohibited by PanLex
lines containing U+0027 (apostrophe)
lines containing an improperly corrected ellipsis (for example, three periods instead of U+2026 “…”)
lines containing “⫷” or “⫸” (for example, because you forgot to run mnsplit)
lines containing unbalanced parentheses or brackets
binary classifications and properties where the superclass or attribute expression is not in an immutable language variety
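Several of these checks are simple enough to run yourself before calling out-full-0. The Python sketch below covers five of them as a rough pre-flight check; it is not the script's own logic, the function names and message strings are hypothetical, and the checks for prohibited Unicode characters and immutable language varieties are omitted because they depend on PanLex data.

```python
def balanced(text):
    """True if parentheses and square brackets are properly nested."""
    pairs = {")": "(", "]": "["}
    stack = []
    for ch in text:
        if ch in "([":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack


def check_line(line):
    """Return a list of problems found in one line of output text."""
    problems = []
    if not line.strip():
        problems.append("blank line")
    if "\u0027" in line:
        problems.append("contains U+0027")
    if "..." in line:
        problems.append("three periods instead of U+2026")
    if "⫷" in line or "⫸" in line:
        problems.append("leftover tag delimiter")
    if not balanced(line):
        problems.append("unbalanced parentheses or brackets")
    return problems


report = check_line("wait... (now")
```

An empty list means that none of the sketched checks fired; it does not guarantee that out-full-0 will accept the line.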