PanLex data model

PanLex data are lexical, so the data model underlying the PanLex database is partly the one found in lexicography. However, PanLex isn’t just a large multilingual dictionary, and some of the terms that we use to describe the components of the PanLex database are specific to PanLex. Conversely, some common lexicographic concepts, such as “entry” and “headword”, don’t apply directly to PanLex.

The main concepts in PanLex are the following:

  • A language is the set of linguistic varieties designated by a single ISO 639 three-letter (“alpha-3”) code. This code is the language code.
  • A language variety is a particular variety of a language, designated by a three-digit variety code. Variety codes are specific to PanLex, starting with 000, then 001, etc.
  • Each unique language variety is designated with a uniform identifier (UID) consisting of the language code, a hyphen, and the variety code. For example, eng-000 designates English.
  • Criteria for creating new variety codes include (but are not limited to) regional variation and different writing systems. For example, eng-004 designates American English and rus-001 designates romanized Russian. The most widely spoken variety of a language is often variety 000, but there is no systematic rule for the assignment of variety codes.
  • An expression is a string of characters in a language variety, representing a lexeme or lexicalized phrase. The string is, where possible, the standard written form of the lexeme (its lemma) or phrase. Sometimes two or more lexemes in a language variety, such as “tear” (rip) and “tear” (teardrop) in English, share the same written form, and so are a single expression in PanLex.
  • A resource is anything (for example, a print dictionary or a website) that documents equivalences between or among expressions. Most commonly, it translates expressions from one language variety to another, but it may document synonyms within a single language variety instead (for example, a thesaurus).
  • A source is a resource, or part of a resource, as represented in PanLex. Sometimes a resource contains more than one PanLex source, such as a bilingual dictionary that has translations in both directions. Each source has a label of the form xxx-yyy:Zzz, where xxx and yyy are language codes and Zzz is a brief source identifier, generally the first author’s surname. For example, the source label eng-ben:Burford indicates that the source contains translations between varieties of English (eng) and Bengali (ben), and that Burford is the principal author. When a source contains more than two or three languages, the designation mul, meaning “multiple languages”, may be used in lieu of two or more of its languages.
  • A meaning is an arbitrary number assigned to each set of intertranslated expressions in a source. For example, an English-German dictionary may contain an entry “head — Kopf; Chef”, intending to communicate that “head” has two senses with distinct German translations. PanLex recognizes these as distinct meanings, assigning one to the set of “head” and “Kopf”, and the other to the set of “head” and “Chef”. Since each meaning belongs to some source, the equivalence that another source may assert between “head” and “Kopf” is a distinct meaning.
  • A denotation is an arbitrary number assigned to each pairing of an expression with a meaning in a source. In the previous example, there would be four denotations: one for the assignment of “head” to the first meaning, one for the assignment of “Kopf” to the first meaning, one for the assignment of “head” to the second meaning, and one for the assignment of “Chef” to the second meaning.
  • A translation is an expression that is arguably equivalent to another expression. If it is a distance-1 translation, the equivalence is attested by a source. If it is a distance-n translation, where n > 1, the equivalence is inferred from chains of distance-1 translations, with n being the length of the shortest chain.
  • A definition is an explanatory string in some language variety, describing a meaning. Definitions are typically found in PanLex when a source does not provide a translated expression or when it annotates the expression in some way. For example, “kind of white crab found on the beach” and “tear (rip)” would be eng-000 definitions. In the second case, the expression “tear” could easily be extracted and included in the meaning, along with the definition.
  • A classification is a way of indicating a category or relation. Currently only meanings and denotations can be classified, but in principle sources, languages, and language varieties could be classified as well. Each classification is specified with a superclass expression (when possible, taken from a recognized standard that has been added to PanLex as an artificial language variety) and a class expression (sometimes taken from a recognized standard, sometimes from a spoken language variety). Example uses of classifications are to indicate the part of speech of a denotation (noun), the speech register of a denotation (formal), the semantic domain of a meaning (physics), and the semantic relation between a meaning and an expression (for example, a meaning containing English “blacken” could be classified as the inchoative of English “black”).
  • A property is similar to a classification but is specified with an attribute expression (similar to a classification’s superclass expression) and an arbitrary string. Example uses of properties are to indicate a denotation’s phonetic transcription and a meaning’s identifier (if a resource assigns unique identifiers to its meanings).
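
The identifier formats described in the list above (language variety UIDs such as eng-000 and source labels such as eng-ben:Burford) can be illustrated with a short Python sketch. This is not PanLex code: the function names, the tuple types, and the regular expressions are assumptions made for illustration, and the patterns cover only the forms described here (three lowercase letters, a hyphen, and three digits for a UID; xxx-yyy:Zzz for a source label).

```python
import re
from typing import NamedTuple

# Assumed patterns, derived only from the prose description above.
UID_PATTERN = re.compile(r"([a-z]{3})-(\d{3})")
LABEL_PATTERN = re.compile(r"([a-z]{3})-([a-z]{3}):(.+)")


class VarietyUid(NamedTuple):
    language_code: str  # ISO 639 alpha-3 code, e.g. "eng"
    variety_code: str   # PanLex-specific three-digit code, e.g. "000"


class SourceLabel(NamedTuple):
    first_language: str   # language code, or "mul" for multiple languages
    second_language: str  # language code, or "mul" for multiple languages
    identifier: str       # brief source identifier, e.g. "Burford"


def parse_uid(uid: str) -> VarietyUid:
    """Split a language variety UID into language code and variety code."""
    match = UID_PATTERN.fullmatch(uid)
    if match is None:
        raise ValueError(f"not a language variety UID: {uid!r}")
    return VarietyUid(*match.groups())


def parse_source_label(label: str) -> SourceLabel:
    """Split a source label of the form xxx-yyy:Zzz into its three parts."""
    match = LABEL_PATTERN.fullmatch(label)
    if match is None:
        raise ValueError(f"not a source label of the form xxx-yyy:Zzz: {label!r}")
    return SourceLabel(*match.groups())


english = parse_uid("eng-000")
burford = parse_source_label("eng-ben:Burford")
print(english.language_code, english.variety_code)
print(burford.first_language, burford.second_language, burford.identifier)
```

Keeping the variety code as a string rather than an integer preserves the leading zeros, so the UID can be rebuilt by joining the two parts with a hyphen.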

For a more technical description of these and other concepts and how they are implemented, you can read about the design of the PanLex database.
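
As an informal illustration of how meanings, denotations, and distance-n translations relate, the following Python sketch models the “head”, “Kopf”, “Chef” example in memory. It is a minimal sketch under stated assumptions, not the PanLex database design: the class and method names, the source labels eng-deu:Example and deu-fra:Example, and the use of deu-000 and fra-000 as the UIDs for German and French are all hypothetical. Translation distance is computed with a breadth-first search over expressions that share a meaning.

```python
from collections import defaultdict, deque
from itertools import count
from typing import NamedTuple


class Expression(NamedTuple):
    uid: str   # language variety UID, e.g. "eng-000"
    text: str  # written form of the lexeme or phrase


class LexicalStore:
    """Hypothetical in-memory model of sources, meanings, and denotations."""

    def __init__(self):
        self._meaning_ids = count(1)
        self._denotation_ids = count(1)
        self.meanings = {}      # meaning id -> label of the source it belongs to
        self.denotations = {}   # denotation id -> (meaning id, expression)

    def add_meaning(self, source_label, expressions):
        """Record one set of intertranslated expressions from a source."""
        meaning_id = next(self._meaning_ids)
        self.meanings[meaning_id] = source_label
        for expression in expressions:
            # One denotation per pairing of an expression with a meaning.
            self.denotations[next(self._denotation_ids)] = (meaning_id, expression)
        return meaning_id

    def _distance_1_graph(self):
        """Link every pair of expressions that share a meaning."""
        by_meaning = defaultdict(set)
        for meaning_id, expression in self.denotations.values():
            by_meaning[meaning_id].add(expression)
        graph = defaultdict(set)
        for members in by_meaning.values():
            for expression in members:
                graph[expression] |= members - {expression}
        return graph

    def translation_distance(self, start, goal):
        """Length of the shortest chain of distance-1 translations, or None."""
        graph = self._distance_1_graph()
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            current, distance = queue.popleft()
            if current == goal:
                return distance  # 0 only when start and goal are identical
            for neighbor in graph[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, distance + 1))
        return None


head = Expression("eng-000", "head")
kopf = Expression("deu-000", "Kopf")
chef = Expression("deu-000", "Chef")
tete = Expression("fra-000", "tête")

store = LexicalStore()
store.add_meaning("eng-deu:Example", [head, kopf])  # first sense of "head"
store.add_meaning("eng-deu:Example", [head, chef])  # second sense of "head"
denotations_in_first_source = len(store.denotations)

store.add_meaning("deu-fra:Example", [kopf, tete])
print(denotations_in_first_source)
print(store.translation_distance(head, kopf))
print(store.translation_distance(head, tete))
```

In this sketch “Kopf” and “Chef” come out as distance-2 translations of each other through “head”, even though they express different senses. That is one way to see why a translation is described as only arguably equivalent and why inferred translations carry less certainty than attested ones.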