Normalization is the process of checking expressions before they are entered into PanLex in order to promote orthographic consistency and screen out bogus expressions. For instance, in different sources the same expression might be written with whitespace characters (e.g. “chicken pox”), with a hyphen (“chicken-pox”), or as a single word (“chickenpox”). If you normalize such expressions, you create more connections in the database, facilitating translation inference. Alternatively, a source might contain strings such as “red dwarf”, “red ink”, and “red lipstick”. The first two are genuine lemmas (at least with certain senses), but the third isn’t. If you accept the former but revise the latter to “lipstick” with “(red) lipstick” added as a definition, you again enhance the ability of PanLex to support translation inference.
normalize serialization script interrogates the database to perform some of this normalization. For example, it can determine that “chickenpox” is the consensus form and that only one PanLex source attests “red lipstick” as an expression but many sources attest “red dwarf” and “red ink”. The script can convert deviant forms to preferred forms and can convert fatally deviant expressions to definitions.
Not all properties of expression texts are handled by
normalize. At present it doesn’t have access to language-specific data on lexical alternates. Instead, it relies on the PanLex degradation algorithm to identify similar forms.
Normalization is appropriate mainly for language varieties with many expressions already in the PanLex database. Typically, these are major world languages (English, French, Spanish, Chinese, etc.). The
normalize script within the
serialize.pl script permits a few parameters to be set by the editor (see below).The richer the data for a language variety are, the higher you can set the minimum-score parameters and get plausible results. The
normalize script can run on only one column at a time. If you want to perform normalization on more than one column, you can duplicate the line that calls
serialize.pl so it will be invoked multiple times, referring to a different column, and possibly with different parameters, each time.
To normalize or not?
You shouldn’t always use normalization, even on richly documented language varieties. If you are analyzing a trusted source, it may be better to skip normalization, so that the source is permitted to add new expressions to the database and dictate the forms of expressions.
Other than the column of the tabularization file to apply normalization to, and the language variety of the expressions contained within that column, the main input parameters for the
normalize script are as follows:
This is the minimum score (essentially, a number representing the frequency of the expression already in PanLex) that a proposed expression must have in order to be accepted outright as an expression. A proposed expression with a lower score than
min will be replaced by its highest-scoring degraded form (provided that this degraded form itself has a score exceeding a minimum value set by the parameter
This is the minimum score that the degraded form of a proposed expression must have in order to be accepted as an expression.
If a proposed expression does not have a score greater than
min, but has a degraded form with a score higher than
mindeg, the original expression will be retagged with the pre-normalized expression tag ‘⫷exp⫸’, and the highest-scoring degraded form will be tagged as an expression. For example:
⫷ex⫸chicken-pox => ⫷exp⫸chicken-pox⫷ex⫸chickenpox
The values of the
mindeg parameters will vary with the language variety. For English expressions – the most attested language for expressions in PanLex – reasonable values for
mindeg are 50 and 10, respectively, whereas for Spanish, values of 30 and 7 are more appropriate. However, these are not hard-and-fast parameter values, and the editor can experiment with
mindeg in order to get the best results.
This optional parameter is useful for determining the optimal values of
mindeg for a particular source. If set to
1, a log file in JSON format called
normalizeX.json (where X is the column index) will be produced containing the normalization scores returned from the PanLex API (see here for a description of the format). Exact scores are shown under
stage1, degraded scores under
Proposed expressions that fail to meet the minimum requirements of both
mindeg will not be accepted as expressions in PanLex. By default, these expressions will be converted to definitions and tagged with the ‘⫷df⫸’ tag. However, the editor is able to set the
failtag parameter to some other tag, if desired. For instance, if the editor is sure that the source is highly reliable and does not want to reject any low frequency vocabulary in the source that is not already attested in PanLex, she might set the parameter
failtag to ‘⫷ex⫸’. By doing this, the editor is trusting every expression in this language variety in the source to be an expression, but is allowing PanLex to substitute an alternative form (differing in letter case, spacing, diacritical marks, and/or punctuation) that is better-attested.
This optional parameter is a list of columns to which the change to
failtag should be propagated. It is useful if you expect that invalid expressions in one column will typically be invalid expressions in another column. For example, suppose that the English column sometimes translates a less common language variety’s expression with full sentences such as ‘the apple ripened’, ‘the bell jingled’, ‘the dog barked’, and so on. These are likely to be non-lemmatic in the less common variety as well. You can use use
normalize to detect non-lemmatic English expressions and propagate the change of ‘⫷ex⫸’ to ‘⫷df⫸’ to the other column.
Propagation requires that all the expressions fail normalization. If the column contains multiple expressions and at least one of them survives,
propcols doesn’t propagate.
This optional parameter is a regular expression matching any expression that should be ignored by the normalization process. For example, if a source contains capitalized expressions that are the proper names of local villages, gods, holidays, etc., it may be appropriate for these to be ignored by the normalization process and thereby accepted as expressions by PanLex even if these are not already attested in PanLex. In this particular case, the
ignore parameter would be set to ‘^[A-Z]’.
You can use the
ignore parameter to exempt particular expressions from normalization. Choose any mark that isn’t already in the source file, attach it to the expressions that you wish to exempt, and give the ignore parameter a value that will match it. For example, prefix the expressions with ‘ok:’ and set the value of
ignore to ‘^ok:’. This leaves the marked expressions as-is, marks and all. To remove the marks, use the
replace script, changing ‘\bok:’ to the empty string. (Using ‘^ok:’ would not work, since
replace replaces over the whole column, and the column begins with an expression tag.)
This optional parameter is a list of the IDs of one or more source groups that should be ignored by the normalization process. Typically, you should use this parameter if you are re-analyzing an already imported source. By specifying that source’s group ID, you can ensure that the scores assigned to its expressions are not inflated by the fact that the source is already in PanLex.
It is possible to run
normalize on the class expression of a meaning or denotation classification. To do so, you should set the
extag parameter (normally ‘⫷ex⫸’) to ‘⫷mcs⫸’ or ‘⫷dcs⫸’. If the class expression’s UID is already tagged you will have to specify it as well, for example ‘⫷mcs:eng-000⫸’.
You should also set the
failtag parameter. If you set
failtag to ‘⫷mpp⫸’ or ‘⫷dpp⫸’, class expressions that fail normalization will have their class expression converted to a property text and their superclass expression converted to an attribute expression. For example, ‘⫷mcs2:art-300⫸IsA⫷mcs⫸large spider’ would become ‘⫷mpp:art-300⫸IsA⫷mpp⫸large spider’. If you set
failtag to ‘⫷mcs⫸’ or ‘⫷dcs⫸’, only the class expression’s tag (including any language variety specified in
failtag) will be affected. If you set
failtag to any other tag, it will replace the class expression tag, while the superclass expression will be removed. For example, with
failtag set to ‘⫷rm⫸’, ‘⫷mcs2:art-300⫸IsA⫷mcs⫸large spider’ would become ‘⫷rm⫸large spider’. (The special tag ‘⫷rm⫸’ indicates that the tag and its contents should be removed in the final source file.)
The flow of normalization
In summary, the logic of the
normalize script is shown in this diagram (assuming the default option for failtag):