Lemmatization

Introduction

When you assimilate a source, the most basic information that you document from it is translations among expressions. You identify each expression with two attributes: a language variety and a text. The text is, generally, the lemma of a lexeme, namely the dictionary or citation form of a word or phrase, written in the conventional orthography of its language variety.

Often, PanLex sources do the same thing, documenting translations among lexemes and representing the lexemes as lemmas. For example, a source may say that English “fragment” can be translated into Yoruba as “ẹ́rúnrúnibùèrún”. If you have no reason to suspect this, you can use the lemmas as-is, as the texts of expressions. If, however, a source tells you that English “hash” is translated into Bulgarian as “хеширане (резултат от)”, you should suspect the latter isn’t a lemma and do something different.

Grammatical indicators

Some sources add grammatical information to conventional lemmas to represent lexemes. Here are some examples:

pomotu:    a windshield
Wallova:   the sun
elentah:   to fall asleep
¡fogu!:    alas!

An appropriate analysis of such a source converts the nonlemmatic forms to lemmatic ones, while also capturing the grammatical information. One method is to replace grammatical particles and punctuation marks with word-class annotations. Assuming that you have read each line into an array of columns named @col, statements doing that in this case might be:

$col[1] =~ s/^a /⫷wc:noun⫸/;
$col[1] =~ s/^the /⫷wc:noun⫸/;
$col[1] =~ s/^to /⫷wc:verb⫸/;
$col[0] =~ s/^¡(.+)!$/$1⫷wc⫸ijec/;
$col[1] =~ s/!$/⫷wc⫸ijec/;

This would result in a tabularization output such as:

pomotu          ⫷wc:noun⫸windshield
Wallova         ⫷wc:noun⫸sun
elentah         ⫷wc:verb⫸fall asleep
fogu⫷wc⫸ijec   alas⫷wc⫸ijec

Later, during the serialization process, one of the scripts you would use on this output would be wcshift. It would move the preposed word-class specifications to the right sides of their expressions and convert them to standard PanLex word-class tags (e.g., changing “⫷wc:noun⫸sun” to “sun⫷wc⫸noun”).

Alternatively, you might wish to write a line or two of code in your tabularization script to isolate word-class information in its own column, resulting in output like this:

pomotu            windshield    noun
Wallova           sun           noun
elentah           fall asleep   verb
fogu      ijec    alas          ijec

Which method is easier or more convenient depends on the source and on your coding style.
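
The column-based alternative could be sketched as follows. This is a minimal illustration rather than part of the PanLex tools: the helper name split_word_class and the four-column layout (text, word class, text, word class) are assumptions made for this example, and treating “an” like “a” goes beyond the statements shown earlier.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper: takes the source-language text and the English text
# and returns (source text, source word class, English text, English word class).
sub split_word_class {
    my ($src, $eng) = @_;
    my ($src_wc, $eng_wc) = ('', '');

    # Articles mark nouns, the infinitive marker "to" marks verbs,
    # and a final "!" marks interjections.
    if    ($eng =~ s/^(?:a|an|the) //) { $eng_wc = 'noun' }
    elsif ($eng =~ s/^to //)           { $eng_wc = 'verb' }
    elsif ($eng =~ s/!$//)             { $eng_wc = 'ijec' }

    # Interjections in the source language are wrapped in "¡...!".
    if ($src =~ s/^¡(.+)!$/$1/) { $src_wc = 'ijec' }

    return ($src, $src_wc, $eng, $eng_wc);
}

# Usage: convert tab-delimited input lines to four tab-delimited columns.
binmode(STDOUT, ':encoding(UTF-8)');
for my $line ("pomotu\ta windshield", "elentah\tto fall asleep", "¡fogu!\talas!") {
    my @col = split /\t/, $line;
    print join("\t", split_word_class(@col)), "\n";
}
```

Keeping the word classes in their own columns leaves the expression texts clean, at the cost of two extra columns to handle during serialization.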

You might also decide that “the sun” and “a windshield” should not be treated alike, since “sun” as a common noun denotes a class of stars while “the sun” names a particular star. If you found this always true for this source (“the moon”, “the universe”, etc.), you could modify the second statement above to one of the following, producing the output shown below it:

$col[1] =~ s/^the /⫷wc:name⫸/;
Wallova         ⫷wc:name⫸sun

$col[1] =~ s/^the (.+)$/$1⫷wc⫸name⫷df:eng-000⫸(the) $1/;
Wallova         sun⫷wc⫸name⫷df:eng-000⫸(the) sun

We find grammatical information bundled with lexemes in some sources that translate between grammatically divergent languages. Such sources might give us translations such as:

enui [v]      be bored
ebriiĝi [v]   get drunk
dormadi [v]   keep on sleeping

This can become more extreme, as in:

quinnonotz    he admonished them

The PanLex team has developed standard practices to cover some such situations; see below. Otherwise, you must exercise your judgment as to what constructions are lemmas.

Morphologically inflected forms

You may deal with sources that represent lexemes with inflected forms instead of, or in addition to, the forms generally considered their lemmas.

One pattern is for a source to provide not only an entry for the lemmatic form but also additional entries for various tenses, persons, numbers, genders, etc. In such cases our practice is to assimilate the lemmatic entries and ignore the others. For example, if in addition to the above entry translating a word into “he admonished them” there were also a translation of a related word into just plain “admonish”, you could ignore the entry with “he admonished them”. This policy may not, however, be easy to automate, and it may require language knowledge that you don’t have.

Another pattern is for a source to bundle inflections in with the lemmatic form in a single entry, such as “rojo/-a(-s)” (meaning “red”) in Spanish. Our aim is to document the lemmatic form only (in this case “rojo”) and ignore the rest.

When you deal with a particular case using a general rule, it’s good practice to check on how the rule works on other cases. If, for example, you eliminated all final instances of “/-a” in the Spanish column, then you would be doing that not only for “rojo/-a” but also for “niño/-a”, but you might consider the latter to represent two distinct lexemes (meaning “boy” and “girl”), unlike the former. Even if you figured out how to distinguish such cases automatically, you would still need to decide how to treat the multi-lexeme cases. For example, should you treat “niño” and “niña” as synonyms or as expressions with distinct meanings? You would need to decide whether to deal with an entry like

çocuk     niño/-a

by outputting

çocuk     niño‣niña

or

çocuk     niño⁋niña

(PanLex’s serialization scripts treat “‣” as a synonym delimiter and “⁋” as a meaning delimiter. The mnsplit script splits entries that contain meaning delimiters into multiple entries.)
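
One way to keep that decision explicit is to make it a parameter. The sketch below is illustrative only: the function name expand_gender_suffix and its mode names are invented for this example, and the pattern assumes the Spanish “-o/-a” notation shown above, with an optional “(-s)” plural marker. Deciding which entries deserve which mode still takes judgment or language knowledge.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper: rewrites "stem-o/-a" according to the chosen mode.
#   lemma    : keep only the lemmatic (masculine) form
#   synonyms : output both forms joined by the delimiter used in the
#              first output shown above
#   meanings : output both forms joined by the delimiter used in the
#              second output shown above
sub expand_gender_suffix {
    my ($text, $mode) = @_;

    # Leave texts that do not follow the "-o/-a" notation untouched.
    return $text unless $text =~ m{^(.+)o/-a(?:\(-s\))?\z};

    my ($masc, $fem) = ("${1}o", "${1}a");
    return $masc            if $mode eq 'lemma';
    return "${masc}‣${fem}" if $mode eq 'synonyms';
    return "${masc}⁋${fem}" if $mode eq 'meanings';
    die "unknown mode: $mode\n";
}

# Usage: an adjective is reduced to its lemma, a noun pair is kept.
binmode(STDOUT, ':encoding(UTF-8)');
print expand_gender_suffix('rojo/-a(-s)', 'lemma'), "\n";
print expand_gender_suffix('niño/-a', 'meanings'), "\n";
```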

Punctuation marks

Expressions in PanLex mostly don’t contain punctuation marks, because the conventional lemmatic forms of lexemes usually don’t contain them. When you see punctuation marks in lexemes in sources, you have to decide how to handle them.

One treatment that may be appropriate is to parenthesize certain punctuation marks. If, for example, your tabularization code included the statement

$col[1] =~ s/([?¿])/($1)/g;

you would parenthesize every instance of “?” or “¿” in column 1. The string “¿cómo?” would be output as “(¿)cómo(?)”. Your serialization process would include the exdftag script, which would convert that string to ⫷ex⫸cómo⫷df⫸(¿)cómo(?), documenting the interrogative meaning while using the lemmatic form for the expression.
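
The conversion described above can be imitated in a few lines. This is not the exdftag script itself, whose implementation is not shown here. It is a hypothetical stand-in (ex_df_tags) that reproduces only the behavior described in this section: the full string becomes the definition, and the text left after removing the parenthesized parts becomes the expression. How strings without parentheses, or strings that are entirely parenthesized, should be handled is an assumption of this sketch.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical stand-in for the conversion described above.
sub ex_df_tags {
    my ($text) = @_;

    # Assumption: a string without parentheses is simply an expression.
    return "⫷ex⫸$text" unless $text =~ /\(/;

    # Remove the parenthesized parts, then tidy the remaining spaces.
    (my $ex = $text) =~ s/\([^()]*\)//g;
    $ex =~ s/\s+/ /g;
    $ex =~ s/^ | $//g;

    # Assumption: if nothing is left, output only a definition.
    return $ex eq '' ? "⫷df⫸$text" : "⫷ex⫸${ex}⫷df⫸$text";
}

# Usage: parenthesize the question marks, then derive the tagged output.
binmode(STDOUT, ':encoding(UTF-8)');
my $text = '¿cómo?';
$text =~ s/([?¿])/($1)/g;
print ex_df_tags($text), "\n";
```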

Here you might wonder why PanLex doesn’t instead recognize “how” and “how?” as two different expressions. It could do so. If it did, then the analysis of sources that make this distinction would become easier. At the same time, the analysis of sources that don’t make it would become harder. Whenever you encountered “how” in a source, you would want to figure out whether it represents the “how” or the “how?” expression. You might be unable to do so without a consultant who knows the other language.

You might also question the use of parentheses in definitions like the one shown above. Why not just make the definition “¿cómo?”? The benefit of using parentheses here is explained below under “Parenthetical definitions”.

Complex lexemes

In traditional lexicography, complex lexemes are usually subordinated to simple ones, but it is not always obvious which simple one to treat as the parent. For example, should “kick the bucket” be placed within the “kick” entry or the “bucket” entry? That problem doesn’t affect PanLex. We simply treat “kick the bucket” as the text of an English expression. It isn’t subordinate to “kick” or “bucket”; it’s an expression equal to them.

Many phrases do, however, confront you with the question: Is this an expression, or is it a definition? The need to choose doesn’t end there. If you opt for a definition, you then decide whether to derive any expressions from it. This decision is important for PanLex. Suppose your source translates “atanak” into “grind into tiny particles”. It’s easy to say that this English phrase is a definition rather than an expression. But if you do no more than that you are possibly condemning “atanak” to a dead-end status. Anybody who wants to translate “atanak” into, say, Russian will be out of luck if this is the only source documenting “atanak”. On the other hand, if you extract “grind” from this phrase as an expression and/or add “pulverize” as an expression, you then make it possible for users to translate “atanak” through English into many other language varieties. The challenge, of course, is that this is likely to require meticulous, non-automatic interpretation.

Complex quasi-lexemes

Some complex translations would not usually be considered lexemes, but it has become a PanLex practice to do more than merely call them definitions. Our practice in some of the most common cases is described below.

Inchoatives

If a complex expression is detectable as a conventional representation of inchoativity in its language variety, PanLex has adopted the practice of treating it as an expression, rather than a definition, and also adding an inchoative classification to its meaning.

In English, for example, “become X” is a conventional verbal representation of inchoativity for “X”. (“Get X” is sometimes used for this purpose, too.) If a source translates many expressions “Z” into English as “become X”, we analyze their entries as meanings with 2 expressions (“Z” and “become X”) and no definitions.

We also add classifications to those meanings, to capture the detected semantic relations. You usually can’t determine what expression “Z” is inchoative of in its language variety, but you can infer that “become X” is inchoative of “X”, and you can classify the meaning accordingly. We do this by creating a classification in which art-316:Inchoative_of is the superclass expression and eng-000:X is the class expression.
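
A rule of this kind might be sketched as below. The hash layout (keys ex and mcs) and the function name analyze_inchoative are assumptions made for illustration, not a PanLex format. Only the “become X” pattern is handled, since “get X” is described above as an occasional variant rather than a dependable one.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash reference holding the expression text
# and a list of meaning classifications, each stored as
# [superclass expression, class expression].
sub analyze_inchoative {
    my ($eng) = @_;
    my %entry = (ex => $eng, mcs => []);

    # "become X" stays an expression; the meaning is classified as
    # inchoative of "X".
    if ($eng =~ /^become (\S.*)$/) {
        push @{ $entry{mcs} }, ['art-316:Inchoative_of', "eng-000:$1"];
    }
    return \%entry;
}

# Usage
my $entry = analyze_inchoative('become red');
print "$entry->{ex}: @{ $entry->{mcs}[0] }\n";
```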

Causatives

PanLex has adopted for causativity a practice similar to that for inchoativity. Causative constructions are, however, more often hard to detect reliably, so we apply the practice to them with more caution.

In English, “make X”, “cause to X”, and “cause to be X” are conventional verbal representations of causativity for “X”. When they are reliably detectable, you can treat entries with such constructions analogously to the treatment for inchoativity. The analogous meaning classification has art-316:Causative_of as the superclass expression.

In English “make X” is not a very reliable indicator of causativity, because of the existence of common non-causative expressions embodying the same pattern, such as “make out” and “make up”, and because some causative “make” expressions, such as “make believe”, “make love”, “make do”, and “make space”, also have non-causative meanings. The simplistic creation of meaning classifications where there is a “make X” translation entails substantial risk of error. If you determine that the risk is significant, you can simply not create meaning classifications in that situation. Such classifications can be created later within the database with the aid of more sophisticated detection methods.
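
A cautious rule along these lines might look like the following sketch. The function name, the return format, and above all the exclusion list are assumptions: the list contains only the examples cited above and would need to be much longer in practice. Where the risk remains significant, skipping the “make X” branch altogether is consistent with the practice just described.

```perl
use strict;
use warnings;

# Complements of "make" cited above as unreliable indicators of causativity.
# Illustrative only; a real list would be longer.
my %UNRELIABLE_AFTER_MAKE = map { $_ => 1 } qw(out up believe love do space);

# Hypothetical helper: returns [superclass expression, class expression]
# when a causative pattern is detected, and nothing otherwise.
sub causative_classification {
    my ($eng) = @_;

    # Try the longer "cause to be X" pattern before "cause to X".
    if ($eng =~ /^cause to be (\S.*)$/ || $eng =~ /^cause to (\S.*)$/) {
        return ['art-316:Causative_of', "eng-000:$1"];
    }

    # Accept "make X" only when X is not on the exclusion list.
    if ($eng =~ /^make (\S.*)$/ && !$UNRELIABLE_AFTER_MAKE{$1}) {
        return ['art-316:Causative_of', "eng-000:$1"];
    }
    return;
}

# Usage
for my $eng ('cause to be red', 'make red', 'make up') {
    my $mcs = causative_classification($eng);
    print "$eng: ", ($mcs ? "@$mcs" : 'no classification'), "\n";
}
```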

You might be tempted to unify the “make X” and “cause to X” patterns to avoid creating divergent equivalent expressions. Doing this is not part of our usual practice, because of the risks of error and loss of information. Converting “make” to “cause to” would require a rule deciding when the “make” translation is really causative, and a rule choosing between the “cause to X” and “cause to be X” patterns. Converting “cause” to “make” would erase the meaningful distinction between “cause to X” and “cause to be X”. So our practice is to allow these patterns as we find them to be expressions, classify their meanings as causative when we reliably can, and defer any unification of expressions until post-importation.

Statives

The PanLex practice for stativity differs from that for inchoativity and causativity. For stativity, we extract simple expressions from complex translations, preserve the complex translations as definitions, and, where we reliably can, implement rules that add denotation classifications.

In English, “be X” is a conventional verbal representation of stativity for “X”. When we credibly and reliably can, we perform the following treatment of such a translation:

  1. Parenthesize “be”, yielding “(be) X”.
  2. Classify the other language variety’s expression as a verb, with a denotation classification whose superclass expression is art-303:PartOfSpeechProperty and whose class expression is art-303:Verbal.
  3. When we serialize the file, activate the exdftag script so “(be) X” becomes a definition and “X” becomes an expression.

Duratives

The PanLex practice for durativity is a variation on that for stativity. It differs in that we also add meaning classifications when we reliably can.

In English, “stay X” and “remain X” are conventional verbal representations of durativity for “X”. When we credibly and reliably can, we perform the following treatment of such a translation:

  1. Parenthesize “stay” or “remain”, yielding “(stay) X” or “(remain) X”.
  2. Classify the other language variety’s expression as a verb, with a denotation classification whose superclass expression is art-303:PartOfSpeechProperty and whose class expression is art-303:Verbal.
  3. Classify the meaning as durative, with a classification whose superclass expression is art-303:AspectProperty and whose class expression is art-303:DurativeAspect.
  4. When we serialize the file, activate the exdftag script so “(stay) X” or “(remain) X” becomes a definition and “X” becomes an expression.
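
Steps 1 to 3 above can be sketched as a single function. The function name treat_durative and the hash keys (text, dcs, mcs) are hypothetical, chosen only for this illustration. Step 4 is left to serialization, as described.

```perl
use strict;
use warnings;

# Hypothetical helper: applies steps 1-3 to an English translation.
# Returns nothing when the translation is not of the form "stay X"
# or "remain X".
sub treat_durative {
    my ($eng) = @_;
    return unless $eng =~ /^(stay|remain) (\S.*)$/;
    my ($verb, $x) = ($1, $2);

    my %result = (
        # Step 1: parenthesize the durative verb.
        text => "($verb) $x",
        # Step 2: denotation classification for the other language
        # variety's expression.
        dcs  => [['art-303:PartOfSpeechProperty', 'art-303:Verbal']],
        # Step 3: meaning classification.
        mcs  => [['art-303:AspectProperty', 'art-303:DurativeAspect']],
    );
    return \%result;
}

# Usage
my $result = treat_durative('remain silent');
print "$result->{text}\n" if $result;
```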

With this practice, we are implying that durativity makes enough of a difference to justify classifying meanings as durative, but not enough to disqualify an elementary expression as a translation of a durative expression.

Stative alternates

Some translations combine stative with other quasi-expressions, such as “be or stay X” or “be or become X”.

The PanLex practice, where reasonably feasible, is to unpack these into pairs of expressions, such as “be X” plus “stay X”, or “be X” plus “become X”, and then split the entries into pairs of distinct meanings. The “be” meaning then gets the stative treatment, and the “stay” or “become” meaning gets the durative or inchoative treatment, respectively.
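
The unpacking step might be sketched like this. The function name is hypothetical, and including “remain” alongside “stay” is an assumption made by analogy with the durative patterns above. Each returned text would then go into its own meaning and receive the corresponding treatment.

```perl
use strict;
use warnings;

# Hypothetical helper: unpacks "be or VERB X" into ("be X", "VERB X").
# Any other text is returned unchanged as a one-element list.
sub unpack_stative_alternates {
    my ($eng) = @_;
    if ($eng =~ /^be or (stay|remain|become) (\S.*)$/) {
        return ("be $2", "$1 $2");
    }
    return ($eng);
}

# Usage
print join(' | ', unpack_stative_alternates('be or become red')), "\n";
```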

Superordinatives

If a complex expression is detectable as a conventional representation of a superordinative in its language variety, PanLex has adopted a practice similar to that for statives. We extract simple expressions from complex translations, preserve the complex translations as definitions, and, where we reliably can, implement rules that add meaning classifications.

In English, “type of X”, “kind of X”, “species of X”, and abbreviations of these are often found in sources as conventional representations of superordinativity for “X”. In such phrases, “X” is a hypernym, i.e. an expression naming a superordinate. When we credibly and reliably can, we perform the following treatment of such a translation:

  1. Parenthesize “type of”, “kind of”, or “species of”, yielding “(type of) X”, “(kind of) X”, or “(species of) X”.
  2. Create a meaning classification whose superclass expression is art-300:IsA and whose class expression is eng-000:X.
  3. When serializing the file, activate the exdftag script so “(type of) X”, “(kind of) X”, or “(species of) X” becomes a definition and “X” becomes an expression.
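
Steps 1 and 2 might be sketched as follows. The function name and hash keys are hypothetical. The abbreviations mentioned above are not handled, because their forms vary from source to source.

```perl
use strict;
use warnings;

# Hypothetical helper: parenthesizes the superordinative phrase and records
# a meaning classification as [superclass expression, class expression].
# Returns nothing when no such phrase is found.
sub treat_superordinative {
    my ($eng) = @_;
    return unless $eng =~ /^((?:type|kind|species) of) (\S.*)$/;

    my %result = (
        text => "($1) $2",                       # step 1
        mcs  => [['art-300:IsA', "eng-000:$2"]], # step 2
    );
    return \%result;
}

# Usage
my $result = treat_superordinative('kind of fish');
print "$result->{text}\n" if $result;
```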

Treatment of complements

Choosing the lemmatic form of an expression is often problematic when the meaning of a word depends on its complements. Speakers of English know that “look at her mother”, “look to her mother”, “look after her mother”, “look out for her mother”, and “look like her mother” have distinct meanings. Most traditional dictionaries define “look at”, “look to”, “look after”, “look out for”, and “look like” within a single “look” entry. Should they and “look” be six distinct expressions in PanLex?

Including complements in expressions may require choosing a canonical member of a paradigm. Suppose your source translates a word into “off … rocker”. Should you leave this phrase in that form? Or should you convert it to “off somebody’s rocker”, “off one’s rocker”, “off rocker”, or something else? PanLex editors have been trying to make reasonably consistent decisions. Among their usual practices for dealing with complements in English are:

  • Allow prepositional complements. E.g., allow “believe in” and “with respect to”.
  • Use “one” to represent sets of obligatory subject or reflexive alternatives. E.g., use “off one’s rocker”, “behave oneself”, and “as far as one knows”.
  • Allow questionably acceptable expressions that omit usually embedded members of sets of non-subject, non-reflexive complements or determiners. E.g., allow “call bluff” and “kick upstairs”.

Striving toward consistency ameliorates the proliferation of equivalent expressions. For “call bluff”, for example, we have seen these forms in sources:

call his bluff
call one’s bluff
call somebody’s bluff
call your bluff
calling someone’s bluff
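
For entries that you have already judged to contain a non-subject possessor, the third practice above could be applied with a sketch like the following. The function name is hypothetical and the word list contains only the possessors seen in the forms above. The sketch must not be applied blindly: it would also strip the “one’s” that the second practice requires in “off one’s rocker”, and it does not convert “calling” to “call”.

```perl
use strict;
use warnings;
use utf8;

# Possessors seen in the forms listed above. Illustrative, not complete.
my %POSSESSOR = map { $_ => 1 } qw(his your one’s someone’s somebody’s);

# Hypothetical helper: drops possessor words from a phrase.
sub drop_possessor {
    my ($text) = @_;
    return join ' ', grep { !$POSSESSOR{ lc($_) } } split / /, $text;
}

# Usage
binmode(STDOUT, ':encoding(UTF-8)');
print drop_possessor('call somebody’s bluff'), "\n";
```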

Parenthetical definitions

Some methods shown above involve inserting parentheses into expression texts, to be specially treated later during serialization. When you do this, you are contributing to a future improvement in the quality of editing. Here’s how. Your partly parenthesized string becomes a definition, and the unparenthesized part of it most likely becomes an expression. You have thereby contributed to PanLex’s corpus of partly parenthesized definitions. This corpus will become part of the future normalization process. Suppose you have contributed “(large red) squirrel” as a definition, and a future editor whose source contains “large red squirrel” finds it impractical to parenthesize it. By then the normalize script will have incorporated definitional inference. When it is given “large red squirrel” to evaluate, it will examine all the definitions in English to determine whether any of them, with all parentheses removed, matches it. It will find your “(large red) squirrel” and confirm that it matches. Then the script will check the unparenthesized part of it (“squirrel”) and find it well-attested. On that basis, it will treat “large red squirrel” as if it were “(large red) squirrel”, extracting an expression that otherwise would be lost. This near-term enhancement of the normalize script may be followed by more intelligent ones, which, for example, can infer “(small yellow) squirrel” from a corpus that contains only “(large yellow) squirrel” and “(small red) squirrel”. We cannot know exactly what beneficial side effects your parenthesizing work will have, but we expect it to have some.
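
The matching step of definitional inference can be sketched in a few lines. This is an illustration of the idea only, not the normalize script: the function name, the list of known definitions, and the hash of well-attested expressions are all stand-ins invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: given an unparenthesized text, a list of known
# partly parenthesized definitions, and a hash of well-attested
# expressions, returns the matching definition, or nothing.
sub infer_definition {
    my ($text, $known_definitions, $attested) = @_;

    for my $df (@$known_definitions) {
        # Remove only the parenthesis characters and compare.
        (my $flat = $df) =~ tr/()//d;
        next unless $flat eq $text;

        # Remove the parenthesized parts to get the candidate expression.
        (my $ex = $df) =~ s/\([^()]*\)//g;
        $ex =~ s/^\s+|\s+$//g;
        $ex =~ s/\s+/ /g;

        return $df if $attested->{$ex};
    }
    return;
}

# Usage
my @definitions = ('(large red) squirrel');
my %attested    = (squirrel => 1);
my $match = infer_definition('large red squirrel', \@definitions, \%attested);
print "$match\n" if defined $match;
```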

The insertion of parentheses to extract expressions while creating definitions is not a panacea. One of its disadvantages is excessive generality. Suppose, for example, a source translates a word into “fear of being kidnapped”, and you have figured out a way to analyze the source that automatically parenthesizes “of being kidnapped” in this case. You have a well-attested expression, “fear”, and a clear definition. But you could have qualms about this result. You could judge that “fear” is simply too general to be a satisfactory translation of the highly specific word. When you wind up with very general expressions such as “thing”, “tree”, or “person” after parenthesizing everything else, you may be helping users translate among more language combinations, but the resulting translations may be far from synonymous. You may decide that it’s better to exempt some general words from this process. In some cases, you may be able to provide specific synonyms yourself, such as “apagophobia” for “fear of being kidnapped”, though this is labor-intensive interpretation rather than automated analysis.

When you add parentheses to strings in a source, you may interact with parentheses that already exist there. In many sources we find them being used in the same way as we use them, but not always. Sometimes they contain domain information (e.g., “(physics)”), grammatical information (e.g., “(v.t.)”), or mixtures of these, instead of or in addition to definitional information. You can choose to omit all parenthesized substrings, treat them all as if they were definitional, or handle the most common cases with specific actions. For example, if a source has many cases of “(v.)” in a column of expressions (column 0, say), you could deal with those cases with a statement such as:

$col[0] =~ s/ \(v\.\)$/⫷wc⫸verb/;

Since the word-class code provided here (i.e. verb) is one of the standard PanLex word-class codes, it would not be necessary to run the wcretag script during serialization.

Affixes

Many PanLex expressions are affixes. Affixes often appear in dictionaries as entry headwords, and in PanLex sources an affix in one language is often translated into a word in another language.

As with other expressions, affixes sometimes have competing equivalent forms. In German, for example, some sources terminate prefixes with the hyphen-minus character (“-”), while others use the ellipsis character (“…”); and some initially capitalize prefixes that are normally attached to nouns, while others don’t. Our general principle is to convert deviant forms to a single standard form if it is practical to do so. If you leave affixes as you find them and you use the normalize script during serialization, some deviant forms (those differing only in punctuation, spacing, capitalization, and diacritics) can be converted to the prevailing forms. But there is a risk that affixes will be normalized to non-affixes. So, if you see a common pattern and can normalize it automatically during tabularization, that may be wiser than relying on normalize.
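
As an illustration, if you decided that the hyphen-minus form was the standard for a German source, a tabularization rule might look like the sketch below. The choice of standard form, the function name, and the handling of three ASCII periods are all assumptions. The rule is safe only in a column known to contain prefixes, since it rewrites any text that ends in an ellipsis.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper: converts a trailing ellipsis (the single character
# or three periods) to a hyphen-minus. Other texts are left unchanged.
sub normalize_prefix_marker {
    my ($text) = @_;
    $text =~ s/\s*(?:…|\.{3})\z/-/;
    return $text;
}

# Usage
binmode(STDOUT, ':encoding(UTF-8)');
print normalize_prefix_marker('vor…'), "\n";
```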