This page covers the current design of the PanLex database. It also includes comments on potential modifications. If you are new to PanLex, it would be useful to first read the overview of the PanLex data model.
- Design and implementation issues
- Language varieties
- Classifications and properties
- Meaning classifications and properties
- Denotation classifications and properties
- Source language varieties
- Exemplar characters
- Typical characters
- Source editors
- Language variety editors
- Obsolete elements
Design and implementation issues
The design of the database has been motivated principally by the purpose of recording attestations of denotation translations in such a way as to permit inference from attested to unattested translations. Design decisions have generally been compromises. For some purposes, different decisions would be superior. Design elements have occasionally been changed and may be changed in the future.
Tables in the database are defined with constraints to enforce applicable rules. The constraints are shown in the slides. Constraint types in the database include:
- Type (column value must be integer, character(2), text, etc.)
- Not null (column value must not be null)
- Default (column’s default value)
- Primary key (column or set of columns must not be null and must be unique)
- Unique (column or set of columns must be null or, if not null, unique)
- Foreign key (column or set of columns must match the referenced column or columns of a row in another table)
- Trigger (modification of column causes execution of a function)
Procedures that mediate user access to the database, or that perform routine operations on the database, enforce some additional constraints.
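The constraint types listed above can be tried out in a small in-memory database. The following is only a sketch: it uses SQLite because it ships with Python (an assumption of the example, not a statement about the PanLex database engine), the two tables are drastically simplified stand-ins for lv and ex, the trigger body (simple lowercasing) is a placeholder for the real derivation of degraded text, and the helper violates is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request

# Simplified, illustrative tables; not the real PanLex definitions.
conn.executescript("""
CREATE TABLE lv (
    lv INTEGER PRIMARY KEY,                  -- primary key
    lc TEXT NOT NULL,                        -- type + not null
    vc INTEGER NOT NULL DEFAULT 0,           -- default
    UNIQUE (lc, vc)                          -- unique on a set of columns
);
CREATE TABLE ex (
    ex INTEGER PRIMARY KEY,
    lv INTEGER NOT NULL REFERENCES lv (lv),  -- foreign key
    tt TEXT NOT NULL,
    td TEXT
);
-- Trigger: inserting a row causes a function-like action to run.
CREATE TRIGGER ex_td AFTER INSERT ON ex
BEGIN
    UPDATE ex SET td = lower(NEW.tt) WHERE ex = NEW.ex;
END;
""")


def violates(sql, params=()):
    """Return True if a constraint rejects the statement (hypothetical helper)."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
    except sqlite3.IntegrityError:
        return True
    return False


conn.execute("INSERT INTO lv (lv, lc) VALUES (1, 'eng')")  # vc takes its default, 0
conn.commit()
```

With this setup, a second row with the same (lc, vc) pair, a row with a null lc, or an ex row pointing at a nonexistent lv is rejected, while a valid ex row is accepted and gets its td filled in by the trigger.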
Users, recorded in the us table, are entities (representing individual persons) that can be used by procedures to control some aspects of access to the database. Each user must be either approved or not, and each user must either be a super-user or not. Any source or language variety may optionally have associated users, and procedures can grant some access rights on it to only those users.
The ap table documents authoritative provenances. These are the sources to which the meanings in the database are attributed. We call them “sources” for convenience. Sources usually represent publications, but may instead represent portions of publications, or manuscripts, or individuals, or organizations.
PanLex recognizes a set of language codes, which are all ISO 639-2 collective language codes, ISO 639-3 codes, and ISO 639-5 codes. They are documented by the ISO 639-3 registration authority. There are about 8,100 of them. Each code consists of a string of 3 letters in the a–z range. When the registration authority for any of these codes changes a code, which happens about once a year, we respond by bringing our language codes into conformity with these standards.
The lc table stores information about language codes. For each code, it records the:
- Code itself
- Type (ISO 639-3 individual language, ISO 639-3 macrolanguage, ISO 639-2 collective language, ISO 639-5 language group, or other)
- Latitude according to Glottolog
- Longitude according to Glottolog
- Glottocode (Glottolog code)
In cases of ISO 639 language codes that Glottolog does not recognize, we have sometimes substituted a latitude and longitude from another source or used Glottolog’s latitude and longitude of the nearest Glottolog language variety, and we also sometimes record the Glottocode of the most closely related Glottolog language variety. Such approximations are often close, since Glottolog recognizes about 4,400 dialects in addition to about 7,900 languages. We have found ISO 639 codes missing from Glottolog’s mapping mostly when the ISO 639 codes have been created within the last 2 years. If Glottolog adds a previously omitted mapping, we can then replace approximations with exact data.
As of November 02016 there are about 400 languages without recorded Glottocodes and about 650 languages without recorded latitudes and longitudes. One task for database developers is to find and record these missing data.
Language varieties
PanLex recognizes a set of language varieties. As of November 02016, there are about 11,500 of them (compared with 8,100 language codes). Each language variety has a language code, so there are language codes with multiple varieties. This happens for various reasons:
- Human languages usually have multiple dialects.
- Languages or dialects sometimes have multiple writing systems, and we treat them as distinct varieties.
- Some languages don’t have their own ISO 639 codes, but they do belong to groups of languages for which an ISO 639 code exists. We can use a group code for any languages in the group that don’t have their own.
- Languages as they have existed in various eras have been documented as distinct varieties.
- Special-purpose artificial or controlled versions of languages are sometimes invented.
- Standard lists of codes or labels for concepts are sometimes created. When we find these useful, we treat them as varieties of the art language, which is the group of artificial languages that don’t have their own language codes.
A variety code, consisting of a small integer in the range 0 to 32,767, distinguishes the language varieties having the same language code from each other. The language codes used in the PanLex database range from having no varieties at all (because we do not yet have any expressions in, or any sources documenting, any variety of them) to having about 400 varieties.
Since the variety codes have not yet come close to requiring 4 digits, we have also given each language variety a 7-character uniform identifier (UID) in the format abc-023, concatenating its language code and its variety code, delimited by a hyphen-minus.
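The UID format can be illustrated with a pair of small helpers. This is a minimal sketch: the function names make_uid and parse_uid are hypothetical, and the sketch assumes that a 7-character UID can hold only variety codes up to 999, so it rejects anything larger.

```python
import re

_LC = re.compile(r"[a-z]{3}")
_UID = re.compile(r"([a-z]{3})-(\d{3})")


def make_uid(lc: str, vc: int) -> str:
    """Build a 7-character UID such as 'abc-023' from a language code and a variety code."""
    if not _LC.fullmatch(lc):
        raise ValueError(f"language code must be 3 letters a-z: {lc!r}")
    if not 0 <= vc <= 999:
        raise ValueError(f"variety code does not fit in 3 digits: {vc}")
    return f"{lc}-{vc:03d}"


def parse_uid(uid: str) -> tuple[str, int]:
    """Split a UID back into (language code, variety code)."""
    match = _UID.fullmatch(uid)
    if match is None:
        raise ValueError(f"not a well-formed UID: {uid!r}")
    return match.group(1), int(match.group(2))


print(make_uid("abc", 23))   # abc-023
print(parse_uid("art-274"))  # ('art', 274)
```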
The lv table stores information about language varieties. For each, it records:
- A unique integer ID
- A language code
- A variety code
- Whether the language variety is mutable
- The ID of the expression that serves as the language variety’s default name
- The ID of the expression that identifies the language variety’s script
Thus, each language variety has not just one but two distinct unique identifiers: the integer ID and its UID (the combination of its language code and its variety code). These two identifiers are therefore redundant at any particular time. Still, we use both of them. The unique integer ID is more durable, because the language code and/or the variety code of a language variety can change. But the UID is more usable, being easier to remember.
Some language varieties are defined to be immutable. They are controlled by various sources, and changes in their sets of expressions (additions, deletions, or text changes) are restricted. If you try to assimilate new expressions or any definitions in them into the database, you will not be permitted to do that. Modifying their sets of expressions requires temporarily reclassifying those language varieties as mutable, something that only authorized editors can do.
Each language variety has one default name, which is an expression. The default name is not a unique identifier of a language variety, because multiple language varieties (even of the same language) can have the same name. We try, however, to avoid assigning the same default name to more than one variety of the same language. To the extent possible, we choose as a language variety’s default name the usual name of the language variety in the language variety itself, i.e., its autoglossonym.
Inspection of the list of PanLex language varieties makes it clear that we have not yet found autoglossonyms for all of them, and in some cases (such as extinct languages) it seems impossible to do so. However, we presume that many other missing autoglossonyms can be found. One task for database developers is to find them and replace the existing default names with them.
Each language variety is written in a single script, as defined by ISO 15924. This is stored as the ID of the PanLex expression in language variety art-262 (ISO 15924 codes) containing the four-letter script code.
Additional facts about language varieties are recorded in other tables:
- av: the sources that document them
- cp: the characters that typically occur in their expressions’ texts
- cu: the characters that the Unicode Standard considers exemplar characters in them
- lu: their editors
- ex: the expressions in them
- df: the definitions in them
UIDs as expressions
Since there are about 11,500 language varieties and each has a unique UID, there are about 11,500 UIDs. As mentioned above, some language varieties have standard sets of labels or codes as their expressions. We have created one of these for UIDs. It has the default name “ISO 639-PanLex” and the UID art-274, and it is immutable. Thus, it contains about 11,500 expressions, whose texts are the UIDs. We have also created a special source to record data useful for PanLex that no other source contains. Its title is “PanLex meta-source”, and its unique label is art:PanLex. This source has created a meaning for every language variety, with two denotations. One of the denotations assigns the meaning to the art-274 expression corresponding to the UID. The other denotation assigns the meaning to the expression acting as the language variety’s default name. The database automatically enforces this relationship between the lv table and the meanings of source
As a PanLex developer, you can create sources and add their data to the database. You yourself can be a source (for example, developer Kim Mar can have a source with the label mul:Mar and the title “Lexical Translations”). Thus, if you know things about a language variety, you can add a meaning of your source to the database, assign that meaning to the language variety’s UID expression in art-274, and then create other meaning details (definitions, denotations, classifications, and properties) containing what you know. This can include denotations identifying names of the language variety in various language varieties. Your meaning’s details can help developers and users understand what that language variety represents.
By convention, any language variety with variety code 0 is deemed the dominant, standard, or unmarked variety among all the varieties of its language.
By convention, we also make variety codes in any language consecutive, starting with 0.
If new knowledge persuades us that any variety codes should be changed, we do so.
Declaring the language varieties documented by a source sometimes requires adding a new language variety to the database. When you decide to do that, PanLem automatically assigns the next available variety code within the language to it.
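The automatic assignment can be pictured as follows. This is a sketch under an assumption: "next available" is taken to mean one more than the highest variety code already in use for the language. The function name next_variety_code is hypothetical and is not part of PanLem.

```python
def next_variety_code(existing_codes):
    """Return the variety code a new variety would receive, given the codes
    already used for its language (assumed rule: highest code in use plus one)."""
    return max(existing_codes, default=-1) + 1


print(next_variety_code([]))         # 0: the first variety of a language
print(next_variety_code([0, 1, 2]))  # 3
```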
There are other systems of language-variety classification with more motivated labels than our sequential integers. We use our automatic method so that we don’t need to expend great effort trying to understand the place of each language variety in its ecosystem, something that is not part of our core mission and that others specialize in. We are, however, researching the problem of linking the PanLex system with others.
Some artificial language varieties are designed to represent the concepts in particular domains. We call these concepticons. They are also called ontologies, thesauri, concept inventories, standards, terminologies, etc. Their expressions may be identical to, or derived from, expressions in natural languages, or may be formed out of digits or non-literal symbols. In PanLex, we use them to avoid ambiguity and synonymy and to consolidate meanings.
If you are registering a source and it uses a set of identifiers that are not merely idiosyncratic to that source but are drawn from a published standard, and if that set of identifiers is not yet a language variety in PanLex, you should register it as a language variety and declare it as one of the language varieties documented by the source. But should you make the new language variety immutable, even though it does not yet have any expressions? Yes, and, when an editor assimilates the data from that source, the editor will temporarily make it mutable. This will protect the language variety from having expressions incorrectly created in it before the source is consulted.
An expression, recorded in the ex table, is an object that lexically expresses a meaning in a language variety. Each expression has four properties: an ID, the ID of its language variety, a text, and a degraded text. These properties are the values of the fields in the
The “expression” concept differs from some similar concepts:
- word: An expression’s text can consist of multiple words (e.g., “side effect”)
- lexeme: An expression can be ambiguous (e.g., “lead” meaning “be ahead” and “the element Pb”)
- lemma: An expression’s text is generally a lemma, but not necessarily (e.g., “in any event”, which might be found under “event” in a dictionary rather than given an entry of its own)
When there is no lexicalization of a meaning in a language variety, there is no expression in that variety with that meaning. There may still be an explanation of the meaning, but that is classified as a definition, rather than an expression. This classification is a matter of judgment, made by editors. Definitions typically consist of four or more words and express meanings compositionally.
A trigger function (td) automatically derives the degraded text of an expression from the text whenever a new expression is created or an existing expression is modified. The degraded text could be omitted from the expression record and computed whenever needed, but it is precomputed and stored for efficiency. Users can search for all expressions with particular degraded texts or parts thereof. It would be inefficient to compute all degraded texts for each such search.
The motivation for making degraded texts available is that sources of lexical data and users specifying the texts of expressions are not always exact or consistent in their specification of the texts. Texts differing from one another in many ways (such as texts with and without hyphens, texts written as one word and as two words, texts differing only in letter case, or texts with and without diacritical marks) can be perceptually the same. With degraded texts, a search can retrieve all expressions whose degraded texts are the same as the degraded text that is specified in the search query. However, the PanLex algorithm for text degradation fails to capture some similarities (such as “color” versus “colour”, or “pant” versus “pants”). Thus, it is only one of several possible algorithms for making text matching behave more intuitively.
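To make the idea concrete, here is a deliberately simple degradation function. It is not the PanLex algorithm, which is documented separately. It is only an assumed approximation that handles the four kinds of variation named above (hyphens, word spacing, letter case, and diacritical marks) by decomposing the text, dropping everything that is not a letter or digit, and case-folding the rest. The name degrade is hypothetical.

```python
import unicodedata


def degrade(text: str) -> str:
    """Illustrative degradation: decompose, keep only letters and digits, case-fold."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Combining marks, hyphens, spaces, and punctuation are not alphanumeric,
    # so they are dropped here.
    kept = "".join(ch for ch in decomposed if ch.isalnum())
    return kept.casefold()


print(degrade("Side-effect"))  # sideeffect
print(degrade("side effect"))  # sideeffect
print(degrade("café"))         # cafe
```

Like the real algorithm, this sketch does nothing for pairs such as “color” and “colour”, whose degraded texts remain different.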
Degradation was initially implemented for expressions, but has now been applied to definitions as well.
We consider possible improvements to the degradation algorithm as we receive suggestions. Proposals for the degradation of the Arabic script in general and for Urdu and for confusable-based degradation were published by Anshuman Pandey in May 02015. In November 02015 he commented on the degradation of Tibetan.
There is more detailed documentation on the degradation algorithm.
Expressions in the database can have meanings, recorded in the mn table. When two or more expressions have the same meaning, they can be considered translations or synonyms of one another. Each meaning is attributed to exactly one source.
A denotation, recorded in the dn table, is the assignment of a meaning to an expression. Therefore, each denotation is attributed to the source to which its meaning is attributed. Denotations embody the most fundamental facts in the database: the facts of lexical translations. When a user, acting as editor, assimilates a source and finds in it authority for the assertion that some two or more expressions are translations or synonyms of one another, the user records that assertion in the database by creating a meaning, associating the meaning with the source, and creating denotations which assign that meaning to the expressions (one denotation per expression).
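The relationship among sources, meanings, and denotations can be sketched with a toy in-memory model. Everything here is illustrative: the class MiniPanLex, its method names, the source label, and the representation of an expression as a (UID, text) pair are assumptions of the example, not the database’s actual interface.

```python
from itertools import count


class MiniPanLex:
    """Toy model: one meaning per attested translation set, one denotation per expression."""

    def __init__(self):
        self._meaning_ids = count(1)
        self.mn = {}   # meaning ID -> source label (each meaning has exactly one source)
        self.dn = []   # denotations as (meaning ID, expression) pairs

    def record_translation(self, source, expressions):
        """Create a meaning for the source and one denotation per expression."""
        meaning = next(self._meaning_ids)
        self.mn[meaning] = source
        for expression in expressions:
            self.dn.append((meaning, expression))
        return meaning

    def translations(self, expression):
        """Return all other expressions that share a meaning with the given one."""
        meanings = {m for m, e in self.dn if e == expression}
        return {e for m, e in self.dn if m in meanings and e != expression}


db = MiniPanLex()
db.record_translation(
    "hypothetical:Src",
    [("eng-000", "dog"), ("fra-000", "chien"), ("deu-000", "Hund")],
)
print(sorted(db.translations(("eng-000", "dog"))))
```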
Any meaning can have zero or more definitions, recorded in the df table. Each definition is an arbitrary text value combined with a language variety. The intent is to assert that the text is a statement in that language variety, and that the statement explicates the meaning. Being morphologically and syntactically ignorant, the database cannot and thus does not enforce the claim that a definition is in a particular language variety.
Immutable language varieties, too, are eligible for definitions. For example, if an immutable variety contains expressions “Parent” and “Child” and some meaning’s extension is the union of all parents and all children, a definition with the text “Parent∪Child” might be construed as being in that variety.
Classifications and properties
From 02013 to 02015 some design changes were proposed that consolidated some columns and tables into a scheme of classifications and properties. In particular, meanings, denotations, and sources could all have classifications and properties. In this section we describe how classifications and properties work in general. After that, the following sections detail their application to meanings, denotations, and sources, respectively.
Meaning and denotation classifications and properties were implemented in 02015. As of June 02016, source classifications and properties were not yet implemented. This documentation is written as if they are implemented, so that, if they become implemented, extensive editing of the documentation will not be necessary.
Classifications and properties offer a method for describing optional facts about meanings, denotations, and sources.
The records of these types contain some facts, generally those that every record must have. For example, every denotation must have a meaning and an expression, so they are part of the denotation’s record in the
In addition, meanings, denotations, and sources can have optional facts of many kinds. Classifications and properties describe such facts. For example, if a denotation assigns meaning 12345 to the English expression “fair”, the source may have described that expression as an adjective, thereby distinguishing it from “fair” used as a noun, or may have labeled that meaning as “J7866”, thereby providing a reference to its entry in the source.
To describe such optional facts, it is only natural to use expressions, so, to the extent possible, that is what we do. However, for any fact one can imagine thousands of expressions, expressing it in thousands of different language varieties. If one editor uses “Adjektiv” and another editor uses “Eigenschaftswort”, both in the same language variety, or one uses “adjective” and another uses “ọ̀rọ̀ àpéjúwe”, in two different language varieties, it can be difficult to ascertain that the facts are identical.
Therefore, PanLex editors adhere to the practice of using concepticons as their preferred sources of expressions in classifications and properties. Specifically, we choose particular preferred concepticons for facts in particular domains, and, when we don’t find the right expressions there, we use or create expressions in a concepticon created by us for the purpose.
A classification is represented by either one or two expressions. We call these classifications unary and binary, respectively. We refer to the expression representing a unary classification as a class expression. A binary classification is represented by both a class expression and a superclass expression.
Classes and superclasses in a binary classification perform various functions. For example:
- Class: “noun”; superclass: “part of speech”
- Class: “masculine”; superclass: “gender”
- Class: “Argentina”; superclass: “capital city”
- Class: “airplane”; superclass: “part”
- Class: “Creative Commons”; superclass: “license”
In one way or another, the class expression names something related to the thing being classified, and the superclass expression specifies the way in which the thing belongs or relates to it.
Superclass expressions sometimes represent symmetric (e.g., A resembles B) or asymmetric (e.g., A was published by B) relations. If asymmetric, the directionality may be crucial. For example, if the superclass expression is “part”, it may be crucial to know whether A is a part of B or B is a part of A. One way to assure this is to choose superclass expressions that have meanings with definitions making directionality clear. Another is to use expressions designed to make directionality unambiguous, such as “IsPartOf”. Such expressions exist in some artificial language varieties (concepticons).
A unary classification represents a property or category but doesn’t further specify how it relates to the thing being classified. It says “A has property C” or “A belongs to category C”.
As a practice, we prefer binary classifications. For example, to say that a denotation has an expression whose gender is masculine, we prefer to use a classification with both “gender” superclass and “masculine” class expressions, rather than leaving “gender” to be inferred from “masculine” alone.
Binary classifications, though they give more information than unary ones, do not convey all the information that may exist about a classification. The concepticons from which classes and superclasses are usually drawn sometimes contain multi-level hierarchies. When you interpret something in a source as a classification, you may need to judge whether to choose the class’s parent or some more remote ancestor as the superclass. For example, if you choose “IndefiniteArticle” in GOLD 2010 as a denotation’s class, you may choose its parent, “Article”, or that parent’s parent, “Determiner”, or that parent’s parent, “PartOfSpeechProperty”, as the superclass. If you skip over the parent, you may inadvertently introduce ambiguity, because a concepticon might have more than one of the same expression with the same ancestor but different parents (GOLD 2010 doesn’t exhibit this problem). Whenever the PanLex team develops uniform practices in the making of such choices, we document them in the relevant specific sections of this site.
The failure of PanLex’s classification scheme to represent entire taxonomies is deliberate. Classifications are not among the main things that PanLex documents. They provide supplementary information, often less refined than what is found in sources. We use them mainly when we can easily and automatically apply them during assimilation.
Classifications of any particular object type, whether unary or binary, are stored in a single table. If the classification is unary, the superclass expression is null. This is because it is normally reasonable to interpret the superclass expression (e.g., “gender”) as the one that is implicit in a unary classification.
A property, like a classification, is a complex fact, but its components differ in part. A property always includes an expression (an attribute), but it also includes a text string (a value). We use properties when it is unreasonable to expect that there can be an expression describing a class (e.g., where the value could be any number or any sequence of symbols).
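The difference between the two kinds of fact can be shown with two small record types. This is a sketch only: the class and field names are hypothetical, and expressions are represented by their texts rather than by expression IDs as they would be in the database.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Classification:
    """A class expression plus an optional superclass expression."""
    class_ex: str
    superclass_ex: Optional[str] = None  # None (null) makes the classification unary

    @property
    def is_binary(self) -> bool:
        return self.superclass_ex is not None


@dataclass(frozen=True)
class Property:
    """An attribute expression plus a free-text value."""
    attribute_ex: str
    value: str


binary = Classification(class_ex="masculine", superclass_ex="gender")
unary = Classification(class_ex="masculine")
label = Property(attribute_ex="identifier", value="J7866")

print(binary.is_binary, unary.is_binary)  # True False
```

Storing both kinds of classification in one record type, with a nullable superclass, mirrors the single-table arrangement described above.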
You can choose any expression as a superclass expression, class expression, or attribute. Sometimes there is no reasonable limit to the expressions that might be chosen. For example, if the superclass expression is “nominalization”, meaning that expression A is a nominalization of the class expression, then the latter is whatever expression A is a nominalization of (such as “continue” where A is “continuation”). But in other cases it is useful to limit the expressions from which we draw. Expressions often represent concepts, and it is useful to choose the same expression whenever we wish to represent a particular concept. Thus, in practice we rely on concepticons as the sources of expressions where possible. For expressions of a particular type (such as denotation superclass expressions), we sometimes rely on two concepticons: one that we have found, and another that we have created to accommodate needed expressions missing from the one we have found.
When you submit a final source file for importation and any of its classifications or properties specifies an expression that doesn’t exist, this may or may not be an error. If the language variety is mutable, PanLem creates the expression. If the language variety is immutable, you get an error message back.
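The rule in the preceding paragraph can be expressed as a short function. This is a sketch, not PanLem’s code: the data layout (a dictionary of varieties keyed by UID), the function name resolve_expression, the exception ImmutableVarietyError, and the sample UIDs and texts are all assumptions made for illustration.

```python
class ImmutableVarietyError(Exception):
    """Raised when an import names a missing expression in an immutable variety."""


def resolve_expression(varieties, uid, text):
    """Return the expression text, creating it first if the variety is mutable."""
    variety = varieties[uid]
    if text not in variety["expressions"]:
        if not variety["mutable"]:
            raise ImmutableVarietyError(f"{text!r} does not exist in immutable variety {uid}")
        variety["expressions"].add(text)
    return text


# Hypothetical sample data.
varieties = {
    "eng-000": {"mutable": True, "expressions": {"noun"}},
    "art-300": {"mutable": False, "expressions": {"IsA"}},
}

resolve_expression(varieties, "eng-000", "adjective")  # silently created
resolve_expression(varieties, "art-300", "IsA")        # already exists
try:
    resolve_expression(varieties, "art-300", "IsNotA")
except ImmutableVarietyError as error:
    print(error)
```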
Meaning classifications and properties
Above we discuss classifications and properties in general. Additional details specific to meaning classifications and meaning properties are discussed here.
In March 02015 we received a suggestion by Michael Ellsworth of the FrameNet project that PanLex add semantic relations to the database. This and related proposals have been accommodated with meaning classifications and properties. They generalize two object types previously in the database:
- meaning identifiers
- domain specifications
Meaning classifications are stored in table mcs. It has the structure:
mcs   serial primary key
mn    integer not null
ex0   integer
ex1   integer not null
mcs absorbed table dm (domain specifications).
Meaning properties are stored in table mpp. It has the structure:
mpp   serial primary key
mn    integer not null
ex    integer not null
tt    text not null
mpp absorbed table mi (meaning identifiers).
You can use meaning classifications and properties to preserve semantic information found in sources. It is common, for example, for sources to indicate that a thing is a kind, type, or species of another thing. We see this when a source translates expression A as “type of B”. When you see this pattern, you can extract “B” as the translation while also attaching a meaning classification whose superclass expression is “IsA” in ConceptNet 5 (art-300) and whose class expression is B. Before the existence of classifications, our usual practice was to parenthesize “type of”, causing the exdftag serialization script to make “(type of) B” a definition and “B” an expression. The meaning classification is similar to such a definition, but is more tractable, since it unifies what would otherwise be many definition patterns in many languages.
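The “type of B” pattern might be handled during source preparation roughly as follows. This is a sketch for English glosses only. The regular expression, the function name split_type_of, and the dictionary used to represent the meaning classification are assumptions of the example and are unrelated to the actual serialization scripts.

```python
import re

# Matches glosses such as "type of fish", "a kind of tree", "species of bird".
_TYPE_OF = re.compile(r"(?:an?\s+)?(?:type|kind|species)\s+of\s+(.+)", re.IGNORECASE)


def split_type_of(gloss):
    """Return (translation, classification); classification is None if no pattern is found."""
    gloss = gloss.strip()
    match = _TYPE_OF.fullmatch(gloss)
    if match is None:
        return gloss, None
    head = match.group(1).strip()
    classification = {
        "superclass": ("art-300", "IsA"),  # "IsA" in ConceptNet 5
        "class": head,
    }
    return head, classification


print(split_type_of("type of fish"))
print(split_type_of("fish"))
```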
Denotation classifications and properties
Above we discuss classifications and properties in general. Additional details specific to denotation classifications and denotation properties are discussed here.
In 02013 to 02015 we received a suggestion by PanLex intern Amandalynne Paullada that PanLex might beneficially modify the design of metadata to make variables expressions instead of strings. This and related proposals have been accommodated with denotation classifications and properties. They generalize two object types previously in the database:
- word classifications
Denotation classifications are stored in table dcs. It has the structure:
dcs   serial primary key
dn    integer not null
ex0   integer
ex1   integer not null
dcs absorbed table wc (word classifications) and part of table
Denotation properties are stored in table dpp. It has the structure:
dpp   serial primary key
dn    integer not null
ex    integer not null
tt    text not null
dpp absorbed part of table
You can use denotation classifications and properties to record information about expressions other than their lemmas. Word class (noun, verb, etc.), gender, register (formal, vulgar, etc.), tone patterns, and other information can be recorded.
In many sources, there is no formal structure to accommodate such extra information, and authors may prepend or append it to lemmas, like “Frage n.f.” or “bluenose (slang, 1920s)”. One task of PanLex editors is to extract lemmas from these strings and decide whether the remainder merits treatment as denotation classifications or properties.
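A first pass at separating lemmas from such annotations could look like the sketch below. The two patterns (a trailing parenthesized note, and a trailing run of dotted abbreviations) are heuristics invented for this example, as is the function name split_annotations. Real sources need source-specific rules and editorial judgment about what the extracted notes mean.

```python
import re

_TRAILING_PAREN = re.compile(r"\s*\(([^()]*)\)\s*$")
_TRAILING_ABBREV = re.compile(r"\s+((?:[A-Za-z]{1,4}\.)+)$")


def split_annotations(raw):
    """Return (lemma, notes) for strings like 'Frage n.f.' or 'bluenose (slang, 1920s)'."""
    text = raw.strip()
    notes = []
    paren = _TRAILING_PAREN.search(text)
    if paren:
        # Split a parenthesized note on commas into separate annotations.
        notes.extend(part.strip() for part in paren.group(1).split(",") if part.strip())
        text = text[:paren.start()]
    abbrev = _TRAILING_ABBREV.search(text)
    if abbrev:
        notes.insert(0, abbrev.group(1))
        text = text[:abbrev.start()]
    return text.strip(), notes


print(split_annotations("Frage n.f."))               # ('Frage', ['n.f.'])
print(split_annotations("bluenose (slang, 1920s)"))  # ('bluenose', ['slang', '1920s'])
```

The extracted notes would then be candidates for denotation classifications (for example, a word class or register) or denotation properties.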
Source language varieties
Sources may be associated with language varieties. Such an association, recorded in the av table, records the fact that the source documents the language variety. Such a record is redundant if the database contains all of the denotations that the source provides evidence for. But that is not true for sources that have not yet been fully assimilated.
The character encoding of the database is UTF-8. Any Unicode character can be included in a text value of any column in any table. Procedures that add and modify data further restrict text values, limiting them to values that comply with Unicode Normalization Form NFC. Values in the td column of the ex table, containing degraded expression texts, are subject to additional restrictions.
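A check for compliance with Normalization Form NFC can be written with Python’s standard unicodedata module. The helper names below are hypothetical. The sketch shows only what the restriction means, not how the PanLex procedures implement it.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return the NFC-normalized form of the text."""
    return unicodedata.normalize("NFC", text)


def is_nfc(text: str) -> bool:
    """True if the text is already in Normalization Form NFC."""
    return to_nfc(text) == text


decomposed = "e\u0301"   # "e" followed by a combining acute accent
composed = "\u00e9"      # the single precomposed character

print(is_nfc(decomposed))              # False
print(to_nfc(decomposed) == composed)  # True
```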
Exemplar characters
For the development of quality-control procedures, and for editorial inspection, it can be useful to classify characters in terms of whether they are normal for particular language varieties. The database maintains records, in the cu table, identifying the statuses of characters as Unicode exemplar characters in language varieties.
Typical characters
Editors of the database adhere to some conventions with respect to characters in the texts of expressions, by standardizing the orthographic forms of expressions. To record decisions that they make about such standards, editors can record, in the cp table, classifications of particular characters as typical for particular language varieties.
Source editors
Each source can be associated, in the au table, with 0 or more users. Procedures can require that particular operations on sources be performed only by users associated with them. Such users can be considered those sources’ editors.
Language variety editors
Each language variety can be associated, in the lu table, with 0 or more users. Procedures can require that particular operations on language varieties be performed only by users associated with them. Such users can be considered those language varieties’ editors.
Obsolete elements
Synonymy and ambiguity
Language varieties until May 02015 possessed properties of synonymy (column sy) and ambiguity (column am). Their values for most language varieties were true. This meant that those language varieties permitted both synonymy and ambiguity: Multiple expressions could have the same meaning (synonymy), and a source could assign multiple meanings to the same expression (ambiguity). For some language varieties, the value of one or the other or both of these was false. If synonymy was false, PanLem deleted a denotation if a source created another denotation with the same meaning and with any other expression in that language variety. If ambiguity was false, PanLem deleted a denotation if a source created another denotation with the same expression and another meaning.
We retired these properties from the database design after concluding that they had no significant value for PanLex. They had been made false for controlled language varieties that purportedly eschewed ambiguity and synonymy, but, because of the lack of semantic isomorphism between language varieties, expressions in even such varieties could reasonably be translated into multiple, semantically distinct expressions in other varieties, including other controlled varieties, thus giving rise to ambiguity and synonymy.
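The retired behavior can be paraphrased in code. This is a sketch of the rule as described above, not PanLem’s former implementation: the function name add_denotation is hypothetical, and the denotations are assumed to be the (meaning, expression) pairs of a single source in a single language variety.

```python
def add_denotation(denotations, new, synonymy=True, ambiguity=True):
    """Return the denotation list after adding `new`, deleting older denotations
    that the variety's synonymy or ambiguity setting forbade."""
    meaning, expression = new
    kept = []
    for old_meaning, old_expression in denotations:
        if not synonymy and old_meaning == meaning and old_expression != expression:
            continue  # same meaning, other expression: synonymy not permitted
        if not ambiguity and old_expression == expression and old_meaning != meaning:
            continue  # same expression, other meaning: ambiguity not permitted
        kept.append((old_meaning, old_expression))
    if new not in kept:
        kept.append(new)
    return kept


existing = [(1, "A"), (2, "B")]
print(add_denotation(existing, (1, "C")))                  # all three kept
print(add_denotation(existing, (1, "C"), synonymy=False))  # (1, "A") deleted
print(add_denotation(existing, (3, "B"), ambiguity=False)) # (2, "B") deleted
```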
The language varieties that, before this retirement, had false values on ambiguity, synonymy, or both were:
  lv  | lc  | vc  | sy | am |    ex    | tt
------+-----+-----+----+----+----------+------------------------------------------------------
 1127 | art |   0 | f  | f  | 18584009 | PanLem
 1622 | art |   1 | f  | f  | 18584011 | ISO 639-3
 1623 | art |   2 | f  | f  | 18584013 | ISO 639-2/T
 1624 | art |   3 | f  | f  | 18584015 | ISO 639-2/B
 1625 | art |   4 | f  | f  | 18584017 | ISO 639-1
   41 | art |   5 | f  | f  | 18584019 | ISO 639
 1242 | art |   6 | t  | f  | 21038398 | ISO 3166 alpha
 1785 | art |  12 | f  | f  | 18584033 | Swadesh 207
 1945 | art |  15 | f  | f  | 18584039 | ISO 639-5
 5560 | art | 245 | f  | f  | 18584497 | Swadesh 100
 6708 | art | 253 | f  | f  |  5600054 | Universal Networking Language
 6712 | art | 254 | f  | f  | 18584515 | U+
 6713 | art | 255 | f  | f  | 18584517 | ISO 80000-9 Annex A
 6714 | art | 256 | f  | f  | 18584519 | SI
 6719 | art | 257 | f  | f  | 18597680 | LWT Code
 6889 | art | 260 | f  | f  | 18584526 | Swadesh 200
 6913 | art | 261 | f  | f  | 21069318 | SILCAWL
 6845 | art | 262 | f  | f  | 18584530 | ISO 15924
 6915 | art | 263 | f  | f  | 18584532 | STAR-IEML
 6920 | art | 264 | f  | f  | 18584534 | ISO/IEC 2382-14
 6953 | art | 266 | f  | f  | 18584538 | Swadesh-Gudschinsky 200
 6955 | art | 267 | f  | f  | 21069319 | ALCAM 120
 6992 | art | 268 | f  | f  | 21069320 | ABVD 210
 6993 | art | 269 | f  | f  | 21069321 | ℤ
 7017 | art | 270 | f  | f  | 18584546 | LEGO Concepticon
 7219 | art | 273 | f  | f  | 18584552 | PanLex Union Concepticon
 7257 | art | 274 | f  | f  | 18584554 | ISO 639-PanLex
 9070 | art | 276 | f  | f  | 18813575 | New International File Names for the Atomic Elements
 9092 | art | 277 | f  | f  | 21069322 | Swadesh-Yakhontov 110
 9148 | art | 280 | f  | f  |   603660 | New Basic Holle List
 9312 | art | 282 | f  | f  |  7313149 | PanLex Empirical Concepticon A
 9314 | art | 283 | f  | f  |  7313150 | PanLex Empirical Concepticon B
 1621 | art | 289 | f  | f  | 18587366 | ISO 639-3 Reference Names
 1626 | art | 290 | t  | f  | 21014105 | ISO 639-3 Print Names
(34 rows)
Any meaning can optionally have a meaning identifier, recorded in the mi table. No meaning can have more than one meaning identifier. A meaning identifier is usually a code that the meaning’s source uses to identify a particular translation.
Meaning identifiers are now represented as meaning properties with the attribute expression “identifier” in variety
A meaning may optionally have domain specifications, recorded in the dm table. They are expressions. Attaching a domain specification to a meaning asserts that the meaning is within the domain described by the expression.
In earlier versions of PanLex, domain specifications were structured identically to definitions. Specifically, each was an arbitrary string specified as being in a language variety. Experience showed that the texts of domain specifications were usually identical to the texts of expressions. The design was modified to require domain specifications to be expressions. Now, in the table of domain specifications, the column identifying the specification has as its value the ID of the expression serving as the specification.
This constraint on domains is advocated by Gerard de Melo and Gerhard Weikum in their 02010 article, Towards Universal Multilingual Knowledge Bases. They say that “some knowledge bases rely on a separate vocabulary of domain labels …. We instead advocate following WordNet in using identifiers already present in the knowledge base …. This has the advantage of extensive information about the domains being readily available ….”
Because domain specifications are expressions, we consider it a good practice to select lemmatic forms as domain specifications, rather than adding expressions to PanLex for the purpose of having them act as domain specifications. For example, when assimilating a resource that uses “mammals” as an English domain specification, we use “mammal” instead.
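The requirement that a domain specification be an expression can be sketched as a foreign-key constraint. The following is a minimal illustration using Python's built-in sqlite3 module rather than the actual PanLex database; the column names, the IDs, and the helper `domain_texts` are hypothetical and chosen only for the example.

```python
import sqlite3

# In-memory database standing in for the real one (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE ex (              -- expressions (hypothetical columns)
    ex INTEGER PRIMARY KEY,    -- expression ID
    lv INTEGER NOT NULL,       -- language variety ID
    tt TEXT NOT NULL           -- expression text
);
CREATE TABLE dm (              -- domain specifications (hypothetical columns)
    dm INTEGER PRIMARY KEY,    -- domain specification ID
    mn INTEGER NOT NULL,       -- meaning ID
    ex INTEGER NOT NULL REFERENCES ex (ex)  -- the expression serving as the specification
);
""")

# The lemmatic form "mammal" is an expression; meaning 100 is placed in that domain.
conn.execute("INSERT INTO ex (ex, lv, tt) VALUES (1, 1, 'mammal')")
conn.execute("INSERT INTO dm (dm, mn, ex) VALUES (1, 100, 1)")


def domain_texts(meaning_id):
    """Return the texts of the expressions that act as domains of a meaning."""
    rows = conn.execute(
        "SELECT ex.tt FROM dm JOIN ex ON ex.ex = dm.ex "
        "WHERE dm.mn = ? ORDER BY ex.tt",
        (meaning_id,),
    ).fetchall()
    return [row[0] for row in rows]


# A domain specification that is not an existing expression is rejected.
try:
    conn.execute("INSERT INTO dm (dm, mn, ex) VALUES (2, 100, 999)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

Because the domain is an ordinary expression, everything recorded about that expression (its language variety, its other denotations) is available for the domain without extra tables.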
Domain specifications are now represented as meaning classifications with the superclass expression “HasContext” in variety
Parts of speech (noun, verb, etc.) are usually attributed to words and phrases in syntactic contexts. A corresponding attribution in the database is the word classification, recorded in the wc table. Any denotation can have 0 or more word classifications. Each word classification assigns 1 word class, selected from a finite set of word classes, recorded in the wcex table, to a denotation.
Word classifications as properties of denotations are dissimilar to the uses of parts of speech by some sources. Wordnets, for example, attribute parts of speech to synsets, which correspond to PanLex meanings. Wordnets do not permit words and phrases within a synset to have distinct parts of speech, but PanLex permits expressions that share a meaning to have distinct word classifications, because each expression has its own denotation and each denotation can have its own word classification(s).
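The difference from a synset-level part of speech can be illustrated with a small data-structure sketch. The class `Denotation`, the function `word_classes_by_expression`, the IDs, and the word-class labels below are hypothetical; they are not the PanLex schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Denotation:
    """One expression paired with one meaning (hypothetical structure)."""
    meaning_id: int
    expression: str
    word_classes: List[str] = field(default_factory=list)  # 0 or more per denotation


def word_classes_by_expression(
    denotations: List[Denotation], meaning_id: int
) -> Dict[str, List[str]]:
    """Collect the word classes of each expression that denotes a given meaning."""
    return {
        d.expression: sorted(d.word_classes)
        for d in denotations
        if d.meaning_id == meaning_id
    }


# Two expressions share meaning 1 but carry different word classes,
# which a synset-level part of speech could not express.
denotations = [
    Denotation(meaning_id=1, expression="swim", word_classes=["verb"]),
    Denotation(meaning_id=1, expression="swimming", word_classes=["noun"]),
    Denotation(meaning_id=2, expression="water"),  # no word classification
]
shared = word_classes_by_expression(denotations, 1)
```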
Word classifications are now represented as denotation classifications, generally with the superclass expression “PartOfSpeechProperty” in variety
Lexicographic documentation includes the assignment of various properties to words and phrases beyond parts of speech, such as tones, gender, inflectional paradigm, illocutionary force, register, etymology, and era or period. In order to permit such information to be recorded in the database, the database permits metadata, recorded in the md table, to be assigned to denotations. A metadatum is a combination of 2 text values. One is called the variable, and the other is called the value. Unlike definitions, these values are not associated with language varieties. Editors may, at their discretion, adopt conventions in their choices of the values.
Metadata are now represented as denotation classifications and properties. Commonly used vl strings were mapped to expressions, chosen from concepticons if practical. Wherever vl was mapped to an expression, the md record was converted to a dcs record. Otherwise it was converted to a dpp record. In the latter case, if its vb string had not been mapped to an expression, that string and the “=” symbol were prefixed to the vl string to produce a dpp value, and “LinguisticProperty” in art-303 (GOLD 2010) was used as the dpp attribute expression.
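The conversion rule for metadata can be sketched as a small function. This is an illustration under stated assumptions, not the script that was actually run: the two mapping tables and their contents are invented for the example, the returned dictionaries are a hypothetical record format, and the branch in which vb was mapped to an expression (its expression becomes the dpp attribute and vl is kept as the value) is an inference from the text rather than something it states.

```python
from typing import Dict

# Hypothetical mappings from md strings to expressions (illustrative values only).
VL_TO_EXPRESSION: Dict[str, str] = {"masculine": "MasculineGender"}
VB_TO_EXPRESSION: Dict[str, str] = {"gender": "GenderProperty"}

# Fallback attribute expression named in the text: "LinguisticProperty" in art-303.
FALLBACK_ATTRIBUTE = "LinguisticProperty"


def convert_md(vb: str, vl: str) -> Dict[str, str]:
    """Convert one md record (variable vb, value vl) to a dcs or dpp record."""
    if vl in VL_TO_EXPRESSION:
        # The value was mapped to an expression: denotation classification.
        return {"table": "dcs", "class": VL_TO_EXPRESSION[vl]}
    if vb in VB_TO_EXPRESSION:
        # Assumed branch: the mapped variable serves as the attribute expression.
        return {"table": "dpp", "attribute": VB_TO_EXPRESSION[vb], "value": vl}
    # Neither string was mapped: keep both strings in the value, joined by "=".
    return {"table": "dpp", "attribute": FALLBACK_ATTRIBUTE, "value": f"{vb}={vl}"}
```

For example, under these invented mappings `convert_md("etymology", "from Latin")` yields a dpp record whose value is the string "etymology=from Latin", so no information from the original md record is lost.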