Database design

Introduction

This page covers the current design of the PanLex database. It also includes comments on potential modifications. If you are new to PanLex, it would be useful to first read the overview of the PanLex data model.

Older versions of the design were described in 02014 by Kamholz et al. and in a set of slides.

Design and implementation issues

The design of the database has been motivated principally by the purpose of recording attestations of denotation translations in such a way as to permit inference from attested to unattested translations. Design decisions have generally been compromises. For some purposes, different decisions would be superior. Design elements have occasionally been changed and may be changed in the future.

Constraints

Tables in the database are defined with constraints to enforce applicable rules. The constraints are shown in the slides. Constraint types in the database include:

  • Type (column value must be integer, character(2), text, etc.)
  • Not null (column value must not be null)
  • Default (column’s default value)
  • Primary key (column or set of columns must not be null and must be unique)
  • Unique (column or set of columns must be null or, if not null, unique)
  • Foreign key (column or set of columns must match the referenced column or columns of a row in another table)
  • Trigger (modification of column causes execution of a function)

Procedures that mediate user access to the database, or that perform routine operations on the database, enforce some additional constraints.
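The constraint types listed above can be tried out in a small sketch. It uses SQLite only because it ships with Python, which is an assumption of the sketch and not a statement about the engine behind PanLex. The tables toy_lc and toy_lv, their columns, and the trigger are hypothetical and are not the actual lc and lv definitions. Because SQLite is loosely typed, a CHECK clause stands in for the type constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
CREATE TABLE toy_lc (
    id INTEGER PRIMARY KEY,                              -- primary key
    lc TEXT NOT NULL UNIQUE CHECK (length(lc) = 3)       -- not null, unique, type stand-in
);
CREATE TABLE toy_lv (
    id    INTEGER PRIMARY KEY,
    lc_id INTEGER NOT NULL REFERENCES toy_lc(id),        -- foreign key
    vc    INTEGER NOT NULL DEFAULT 0,                    -- default
    uid   TEXT,
    UNIQUE (lc_id, vc)                                   -- unique over a set of columns
);
-- Trigger: inserting a row causes a derived value to be stored.
CREATE TRIGGER toy_lv_uid AFTER INSERT ON toy_lv
BEGIN
    UPDATE toy_lv
    SET uid = (SELECT lc FROM toy_lc WHERE id = NEW.lc_id)
              || '-' || printf('%03d', NEW.vc)
    WHERE id = NEW.id;
END;
""")
conn.execute("INSERT INTO toy_lc (id, lc) VALUES (1, 'art')")
conn.execute("INSERT INTO toy_lv (lc_id) VALUES (1)")  # vc falls back to its default


def violates(sql):
    """Return True if the statement is rejected by a constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False
```

Calling violates() with an insert that repeats an existing (lc_id, vc) pair, names a nonexistent lc_id, or supplies a null or over-long code returns True in each case.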

Users

Users, recorded in the us table, are entities (representing individual persons) that can be used by procedures to control some aspects of access to the database. Each user must be either approved or not, and each user must either be a super-user or not. Any source or language variety may optionally have associated users, and procedures can grant some access rights on it to only those users.

Sources

The ap table documents authoritative provenances. These are the sources to which the meanings in the database are attributed. We call them “sources” for convenience. Sources usually represent publications, but may instead represent portions of publications, or manuscripts, or individuals, or organizations.

Languages

PanLex recognizes a set of language codes, which are all ISO 639-2 collective language codes, ISO 639-3 codes, and ISO 639-5 codes. They are documented by the ISO 639-3 registration authority. There are about 8,100 of them. Each code consists of a string of 3 letters in the a–z range. When the registration authority for any of these codes changes a code, which happens about once a year, we respond by bringing our language codes into conformity with these standards.

Table lc stores information about language codes. For each, it records the:

  • Code itself
  • Type (ISO 639-3 individual language, ISO 639-3 macrolanguage, ISO 639-2 collective language, ISO 639-5 language group, or other)
  • Latitude according to Glottolog
  • Longitude according to Glottolog
  • Glottocode (Glottolog code)

In cases of ISO 639 language codes that Glottolog does not recognize, we have sometimes substituted a latitude and longitude from another source or used Glottolog’s latitude and longitude of the nearest Glottolog language variety, and we also sometimes record the Glottocode of the most closely related Glottolog language variety. Such approximations are often close, since Glottolog recognizes about 4,400 dialects in addition to about 7,900 languages. We have found ISO 639 codes missing from Glottolog’s mapping mostly when the ISO 639 codes have been created within the last 2 years. If Glottolog adds a previously omitted mapping, we can then replace approximations with exact data.

As of November 02016 there are about 400 languages without recorded Glottocodes and about 650 languages without recorded latitudes and longitudes. One task for database developers is to find and record these missing data.

Language varieties

PanLex recognizes a set of language varieties. As of November 02016, there are about 11,500 of them (compared with 8,100 language codes). Each language variety has a language code, so there are language codes with multiple varieties. This happens for various reasons:

  • Human languages usually have multiple dialects.
  • Languages or dialects sometimes have multiple writing systems, and we treat them as distinct varieties.
  • Some languages don’t have their own ISO 639 codes, but they do belong to groups of languages for which an ISO 639 code exists. We can use a group code for any languages in the group that don’t have their own.
  • Languages as they have existed in various eras have been documented as distinct varieties.
  • Special-purpose artificial or controlled versions of languages are sometimes invented.
  • Standard lists of codes or labels for concepts are sometimes created. When we find these useful, we treat them as varieties of the art language, which is the group of artificial languages that don’t have their own language codes.

Identifiers

A variety code, consisting of a small integer in the range 0 to 32,767, distinguishes the language varieties having the same language code from each other. The language codes used in the PanLex database range from having no varieties at all (because we do not yet have any expressions in, or any sources documenting, any variety of them) to having about 400 varieties.

Since the variety codes have not yet come close to requiring 4 digits, we have also given each language variety a 7-character uniform identifier (UID) in the format abc-023, concatenating its language code and its variety code, delimited by a hyphen-minus.
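The UID format described above can be sketched as a pair of helper functions. The function names are hypothetical. The sketch assumes that a variety code above 999 cannot be represented in the 7-character format and therefore rejects it, although the variety code itself may range up to 32,767.

```python
import re

# Three lowercase letters, a hyphen-minus, and exactly three ASCII digits.
_UID = re.compile(r"([a-z]{3})-([0-9]{3})")


def make_uid(lang_code, var_code):
    """Build a UID such as 'abc-023' from a language code and a variety code."""
    if not re.fullmatch(r"[a-z]{3}", lang_code):
        raise ValueError("language code must be 3 letters in the a-z range")
    if not 0 <= var_code <= 999:
        raise ValueError("variety code does not fit in 3 digits")
    return f"{lang_code}-{var_code:03d}"


def parse_uid(uid):
    """Split a UID into (language code, variety code as an integer)."""
    match = _UID.fullmatch(uid)
    if match is None:
        raise ValueError(f"not a well-formed UID: {uid!r}")
    return match.group(1), int(match.group(2))
```

For example, make_uid("abc", 23) yields "abc-023", and parse_uid("art-274") yields the pair ("art", 274).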

Table lv

Table lv stores information about language varieties. For each, it records:

  • A unique integer ID
  • A language code
  • A variety code
  • Whether the language variety is mutable
  • The ID of the expression that serves as the language variety’s default name

Thus, each language variety has not just one but two distinct unique identifiers: the integer ID and its UID (the combination of its language code and its variety code). These two identifiers are therefore redundant at any particular time. Still, we use both of them. The unique integer ID is more durable, because the language code and/or the variety code of a language variety can change. But the UID is more usable, being easier to remember.

Some language varieties are defined to be immutable. They are controlled by various sources, and changes in their sets of expressions (additions, deletions, or text changes) are restricted. If you try to assimilate new expressions, or any definitions, in such a variety into the database, the attempt is rejected. Modifying their sets of expressions requires temporarily reclassifying those language varieties as mutable, something that only authorized editors can do.

Each language variety has one default name, which is an expression. The default name is not a unique identifier of a language variety, because multiple language varieties (even of the same language) can have the same name. We try, however, to avoid assigning the same default name to more than one variety of the same language. To the extent possible, we choose as a language variety’s default name the usual name of the language variety in the language variety itself, i.e., its autoglossonym.

Inspection of the list of PanLex language varieties makes it clear that we have not yet found autoglossonyms for all of them, and in some cases (such as extinct languages) it seems impossible to do so. However, we presume that many other missing autoglossonyms can be found. One task for database developers is to find them and replace the existing default names with them.

Other tables

Additional facts about language varieties are recorded in other tables:

  • av: the sources that document them
  • cp: the characters that standardly occur in their expressions’ texts
  • cu: the characters that the Unicode Standard considers exemplar characters in them
  • lu: their editors
  • lvsc: the scripts that they are standardly written in
  • ex: the expressions in them
  • df: the definitions in them

Details

UIDs as expressions

Since there are about 11,500 language varieties and each has a unique UID, there are about 11,500 UIDs. As mentioned above, some language varieties have standard sets of labels or codes as their expressions. We have created one of these for UIDs. It has the default name “ISO 639-PanLex” and the UID art-274, and it is immutable. Thus, it contains about 11,500 expressions, whose texts are the UIDs. We have also created a special source to record data useful for PanLex that no other source contains. Its title is “PanLex meta-source”, and its unique label is art:PanLex. This source has created a meaning for every language variety, with two denotations. One of the denotations assigns the meaning to the art-274 expression corresponding to the UID. The other denotation assigns the meaning to the expression acting as the language variety’s default name. The database automatically enforces this relationship between the lv table and the meanings of source art:PanLex.

As a PanLex developer, you can create sources and add their data to the database. You yourself can be a source (for example, developer Kim Mar can have a source with the label mul:Mar and the title “Lexical Translations”). Thus, if you know things about a language variety, you can add a meaning of your source to the database, assign that meaning to the language variety’s UID expression in art-274, and then create other meaning details (definitions, denotations, classifications, and properties) containing what you know. This can include denotations identifying names of the language variety in various language varieties. Your meaning’s details can help developers and users understand what that language variety represents.

Variety codes

By convention, any language variety with variety code 0 is deemed the dominant, standard, or unmarked variety among all the varieties of its language.

By convention, we also make variety codes in any language consecutive, starting with 0.

If new knowledge persuades us that any variety codes should be changed, we do so.

Declaring the language varieties documented by a source sometimes requires adding a new language variety to the database. When you decide to do that, PanLem automatically assigns the next available variety code within the language to it.
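A minimal sketch of such an assignment follows. How PanLem actually computes the next available code is not documented here, so the rule used below (the smallest non-negative integer not yet in use within the language) is an assumption, and the function name is hypothetical. When the existing codes are consecutive from 0, as our convention requires, this is simply the highest existing code plus 1.

```python
def next_variety_code(existing_codes):
    """Return the smallest non-negative integer absent from existing_codes.

    existing_codes: the variety codes already used under one language code.
    """
    used = set(existing_codes)
    code = 0
    while code in used:
        code += 1
    return code
```

A language with no varieties yet gets code 0, and a language whose varieties have codes 0, 1, and 2 gets code 3.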

There are other systems of language-variety classification with more motivated labels than our sequential integers. We use our automatic method so that we don’t need to expend great effort trying to understand the place of each language variety in its ecosystem, something that is not part of our core mission and that others specialize in. We are, however, researching the problem of linking the PanLex system with others.

Concepticons

Some artificial language varieties are designed to represent the concepts in particular domains. We call these concepticons. They are also called ontologies, thesauri, concept inventories, standards, terminologies, etc. Their expressions may be identical to, or derived from, expressions in natural languages, or may be formed out of digits or non-literal symbols. In PanLex, we use them to avoid ambiguity and synonymy and to consolidate meanings.

If you are registering a source and it uses a set of identifiers that are not merely idiosyncratic to that source but are drawn from a published standard, and if that set of identifiers is not yet a language variety in PanLex, you should register it as a language variety and declare it as one of the language varieties documented by the source. But should you make the new language variety immutable, even though it does not yet have any expressions? Yes, and, when an editor assimilates the data from that source, the editor will temporarily make it mutable. This will protect the language variety from having expressions incorrectly created in it before the source is consulted.

Expressions

Introduction

An expression, recorded in the ex table, is an object that lexically expresses a meaning in a language variety. Each expression has four properties: an ID, the ID of its language variety, a text, and a degraded text. These properties are the values of the fields in the ex table.

Distinctions

The “expression” concept differs from some similar concepts:

  • word: An expression’s text can consist of multiple words (e.g., “side effect”)
  • lexeme: An expression can be ambiguous (e.g., “lead” meaning “be ahead” and “the element Pb”)
  • lemma: An expression’s text is generally a lemma, but not necessarily (e.g., “in any event”, which might be found under “event” in a dictionary rather than given an entry of its own)

When there is no lexicalization of a meaning in a language variety, there is no expression in that variety with that meaning. There may still be an explanation of the meaning, but that is classified as a definition, rather than an expression. This classification is a matter of judgment, made by editors. Definitions typically consist of four or more words and express meanings compositionally.

Degraded texts

A trigger function (td) automatically derives the degraded text of an expression from the text whenever a new expression is created or an existing expression is modified. The degraded text could be omitted from the expression record and computed whenever needed, but it is precomputed and stored for efficiency. Users can search for all expressions with particular degraded texts or parts thereof. It would be inefficient to compute all degraded texts for each such search.

The motivation for making degraded texts available is that sources of lexical data and users specifying the texts of expressions are not always exact or consistent in their specification of the texts. Texts differing from one another in many ways (such as texts with and without hyphens, texts written as one word and as two words, texts differing only in letter case, or texts with and without diacritical marks) can be perceptually the same. With degraded texts, a search can retrieve all expressions whose degraded texts are the same as the degraded text that is specified in the search query. However, the PanLex algorithm for text degradation fails to capture some similarities (such as “color” versus “colour”, or “pant” versus “pants”). It is thus only one of several possible algorithms for making text matching agree better with intuition.
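To make the idea concrete, here is a deliberately simplified sketch of text degradation. It is not the PanLex algorithm, which is documented separately. It only illustrates the kinds of differences mentioned above (hyphens, word spacing, letter case, diacritical marks), and the function name degrade is hypothetical.

```python
import unicodedata


def degrade(text):
    """Return a simplified 'degraded' form of text for loose matching.

    Steps (all assumptions of this sketch):
      1. decompose characters so that diacritics become separate marks,
      2. drop the combining marks,
      3. fold letter case,
      4. keep only letters and digits (drops spaces, hyphens, punctuation).
    """
    decomposed = unicodedata.normalize("NFKD", text)
    unmarked = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    folded = unmarked.casefold()
    return "".join(ch for ch in folded if ch.isalnum())
```

With this sketch, “Side-Effect” and “side effect” degrade to the same string, while “color” and “colour” remain distinct, mirroring the limitation noted above.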

Degradation was initially implemented for expressions, but has now been applied to definitions as well.

We consider possible improvements to the degradation algorithm as we receive suggestions. Proposals for the degradation of the Arabic script in general and for Urdu and for confusable-based degradation were published by Anshuman Pandey in May 02015. In November 02015 he commented on the degradation of Tibetan.

There is more detailed documentation on the degradation algorithm.

Meanings

Expressions in the database can have meanings, recorded in the mn table. When two or more expressions have the same meaning, they can be considered translations or synonyms of one another. Each meaning is attributed to exactly one source.

Denotations

A denotation, recorded in the dn table, is the assignment of a meaning to an expression. Therefore, each denotation is attributed to the source to which its meaning is attributed. Denotations embody the most fundamental facts in the database: the facts of lexical translations. When a user, acting as editor, assimilates a source and finds in it authority for the assertion that some two or more expressions are translations or synonyms of one another, the user records that assertion in the database by creating a meaning, associating the meaning with the source, and creating denotations which assign that meaning to the expressions (one denotation per expression).
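The procedure just described can be sketched with toy versions of the ex, mn, and dn tables. SQLite is used for convenience. The column names ap (source) and lv, the numeric IDs, and both function names are assumptions made for illustration and are not the actual table definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ex (ex INTEGER PRIMARY KEY, lv INTEGER NOT NULL, tt TEXT NOT NULL);
CREATE TABLE mn (mn INTEGER PRIMARY KEY, ap INTEGER NOT NULL);
CREATE TABLE dn (
    dn INTEGER PRIMARY KEY,
    mn INTEGER NOT NULL REFERENCES mn(mn),
    ex INTEGER NOT NULL REFERENCES ex(ex),
    UNIQUE (mn, ex)
);
""")


def add_translation_set(source_id, expression_ids):
    """Create one meaning for the source and one denotation per expression."""
    cur = conn.execute("INSERT INTO mn (ap) VALUES (?)", (source_id,))
    meaning_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO dn (mn, ex) VALUES (?, ?)",
        [(meaning_id, ex_id) for ex_id in expression_ids],
    )
    return meaning_id


def translations(ex_id):
    """Return IDs of expressions that share at least one meaning with ex_id."""
    rows = conn.execute(
        """SELECT DISTINCT other.ex
           FROM dn AS this JOIN dn AS other ON other.mn = this.mn
           WHERE this.ex = ? AND other.ex <> ?
           ORDER BY other.ex""",
        (ex_id, ex_id),
    )
    return [row[0] for row in rows]


# Hypothetical expressions in three hypothetical language varieties.
conn.executemany(
    "INSERT INTO ex (ex, lv, tt) VALUES (?, ?, ?)",
    [(1, 10, "dog"), (2, 20, "perro"), (3, 30, "chien")],
)
add_translation_set(100, [1, 2])  # source 100 attests dog = perro
add_translation_set(101, [2, 3])  # source 101 attests perro = chien
```

Here translations(2) returns both other expressions, whereas translations(1) returns only expression 2. The pair “dog” and “chien” is unattested and would have to be inferred by following a second step through “perro”.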

Definitions

Any meaning can have zero or more definitions, recorded in the df table. Each definition is an arbitrary text value combined with a language variety. The intent is to assert that the text is a statement in that language variety, and that the statement explicates the meaning. Being morphologically and syntactically ignorant, the database cannot and thus does not enforce the claim that a definition is in a particular language variety.

Immutable language varieties, too, are eligible for definitions. For example, if an immutable variety contains expressions “Parent” and “Child” and some meaning’s extension is the union of all parents and all children, a definition with the text “Parent∪Child” might be construed as being in that variety.

Classifications and properties

Introduction

From 02013 to 02015 some design changes were proposed that consolidated some columns and tables into a scheme of classifications and properties. In particular, meanings, denotations, and sources could all have classifications and properties. In this section we describe how classifications and properties work in general. After that, the following sections detail their application to meanings, denotations, and sources, respectively.

Meaning and denotation classifications and properties were implemented in 02015. As of June 02016, source classifications and properties were not yet implemented. This documentation is written as if they are implemented, so that, if they become implemented, extensive editing of the documentation will not be necessary.

Description

Classifications and properties offer a method for describing optional facts about meanings, denotations, and sources.

The records of these types contain some facts, generally those that every record must have. For example, every denotation must have a meaning and an expression, so they are part of the denotation’s record in the dn table.

In addition, meanings, denotations, and sources can have optional facts of many kinds. Classifications and properties describe such facts. For example, if a denotation assigns meaning 12345 to the English expression “fair”, the source may have described that expression as an adjective, thereby distinguishing it from “fair” used as a noun, or may have labeled that meaning as “J7866”, thereby providing a reference to its entry in the source.

To describe such optional facts, it is only natural to use expressions, so, to the extent possible, that is what we do. However, for any fact one can imagine thousands of expressions, expressing it in thousands of different language varieties. If one editor uses “Adjektiv” and another editor uses “Eigenschaftswort”, both in the same language variety, or one uses “adjective” and another uses “ọ̀rọ̀ àpéjúwe”, in two different language varieties, it can be difficult to ascertain that the facts are identical.

Therefore, PanLex editors adhere to the practice of using concepticons as their preferred sources of expressions in classifications and properties. Specifically, we choose particular preferred concepticons for facts in particular domains, and, when we don’t find the right expressions there, we use or create expressions in a concepticon created by us for the purpose.

Classifications

A classification is represented by either one or two expressions. We call these classifications unary and binary, respectively. We refer to the expression representing a unary classification as a class expression. A binary classification is represented by both a class expression and a superclass expression.

Classes and superclasses in a binary classification perform various functions. For example:

  • Class: “noun”; superclass: “part of speech”
  • Class: “masculine”; superclass: “gender”
  • Class: “Argentina”; superclass: “capital city”
  • Class: “airplane”; superclass: “part”
  • Class: “Creative Commons”; superclass: “license”

In one way or another, the class expression names something related to the thing being classified, and the superclass expression specifies the way in which the thing belongs or relates to it.

Superclass expressions sometimes represent symmetric (e.g., A resembles B) or asymmetric (e.g., A was published by B) relations. If asymmetric, the directionality may be crucial. For example, if the superclass expression is “part”, it may be crucial to know whether A is a part of B or B is a part of A. One way to assure this is to choose superclass expressions that have meanings with definitions making directionality clear. Another is to use expressions designed to make directionality unambiguous, such as “IsPartOf”. Such expressions exist in some artificial language varieties (concepticons).

A unary classification represents a property or category but doesn’t further specify how it relates to the thing being classified. It says “A has property C” or “A belongs to category C”.

As a practice, we prefer binary classifications. For example, to say that a denotation has an expression whose gender is masculine, we prefer to use a classification with both “gender” superclass and “masculine” class expressions, rather than leaving “gender” to be inferred from “masculine” alone.

Binary classifications, though they give more information than unary ones, do not convey all the information that may exist about a classification. The concepticons from which classes and superclasses are usually drawn sometimes contain multi-level hierarchies. When you interpret something in a source as a classification, you may need to judge whether to choose the class’s parent or some more remote ancestor as the superclass. For example, if you choose “IndefiniteArticle” in GOLD 2010 as a denotation’s class, you may choose its parent, “Article”, or that parent’s parent, “Determiner”, or that parent’s parent, “PartOfSpeechProperty”, as the superclass. If you skip over the parent, you may inadvertently introduce ambiguity, because a concepticon might have more than one of the same expression with the same ancestor but different parents (GOLD 2010 doesn’t exhibit this problem). Whenever the PanLex team develops uniform practices in the making of such choices, we document them in the relevant specific sections of this site.

The failure of PanLex’s classification scheme to represent entire taxonomies is deliberate. Classifications are not among the main things that PanLex documents. They provide supplementary information, often less refined than what is found in sources. We use them mainly when we can easily and automatically apply them during assimilation.

Classifications of any particular object type, whether unary or binary, are stored in a single table. If the classification is unary, the superclass expression is null. The superclass, rather than the class, is the expression left null because it is normally the one that can reasonably be inferred (e.g., “gender” from “masculine”) when a classification gives only one expression.

Properties

A property, like a classification, is a complex fact, but its components differ in part. A property always includes an expression (an attribute), but it also includes a text string (a value). We use properties when it is unreasonable to expect that there can be an expression describing a class (e.g., where the value could be any number or any sequence of symbols).

You can choose any expression as a superclass expression, class expression, or attribute. Sometimes there is no reasonable limit to the expressions that might be chosen. For example, if the superclass expression is “nominalization”, meaning that expression A is a nominalization of the class expression, then the latter is whatever expression A is a nominalization of (such as “continue” where A is “continuation”). But in other cases it is useful to limit the expressions from which we draw. Expressions often represent concepts, and it is useful to choose the same expression whenever we wish to represent a particular concept. Thus, in practice we rely on concepticons as the sources of expressions where possible. For expressions of a particular type (such as denotation superclass expressions), we sometimes rely on two concepticons: one that we have found, and another that we have created to accommodate needed expressions missing from the one we have found.

Importation

When you submit a final source file for importation and any of its classifications or properties specifies an expression that doesn’t exist, this may or may not be an error. If the language variety is mutable, PanLem creates the expression. If the language variety is immutable, you get an error message back.
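The rule can be sketched as follows. The class and function names are hypothetical, and the sketch models a language variety as nothing more than a UID, a mutability flag, and a set of expression texts. It is not PanLem’s code.

```python
from dataclasses import dataclass, field


class ImmutableVarietyError(Exception):
    """Raised when a missing expression would have to be created in an immutable variety."""


@dataclass
class Variety:
    uid: str
    mutable: bool
    expressions: set = field(default_factory=set)


def ensure_expression(variety, text):
    """Make sure text exists as an expression in variety.

    Returns True if the expression was created and False if it already existed.
    Raises ImmutableVarietyError if it is missing and the variety is immutable.
    """
    if text in variety.expressions:
        return False
    if not variety.mutable:
        raise ImmutableVarietyError(
            f"{text!r} does not exist in immutable variety {variety.uid}"
        )
    variety.expressions.add(text)
    return True
```

An expression that already exists is accepted in either kind of variety. Only a missing expression in an immutable variety produces the error.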

Meaning classifications and properties

Introduction

Above we discuss classifications and properties in general. Additional details specific to meaning classifications and meaning properties are discussed here.

In March 02015 we received a suggestion by Michael Ellsworth of the FrameNet project that PanLex add semantic relations to the database. This and related proposals have been accommodated with meaning classifications and properties. They generalize two object types previously in the database:

  • meaning identifiers
  • domain specifications

Tables

Classifications

Meaning classifications are stored in table mcs. It has the structure:

mcs serial primary key
mn integer not null
ex0 integer
ex1 integer not null

Constraints:

  • mn+ex0+ex1 unique
  • mn references mn(mn)
  • ex0 references ex(ex)
  • ex1 references ex(ex)

Table mcs absorbed table dm (domain specifications).
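The following sketch recreates the mcs structure in SQLite to show how unary and binary classifications share one table. Reading ex0 as the superclass expression and ex1 as the class expression is an inference from the fact that only ex0 may be null. The IDs are hypothetical, and the foreign keys are omitted because the referenced tables are not created here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mcs (
    mcs INTEGER PRIMARY KEY,   -- stands in for "serial primary key"
    mn  INTEGER NOT NULL,      -- the meaning being classified
    ex0 INTEGER,               -- superclass expression; NULL when unary
    ex1 INTEGER NOT NULL,      -- class expression
    UNIQUE (mn, ex0, ex1)
);
""")
conn.execute("INSERT INTO mcs (mn, ex0, ex1) VALUES (12345, 501, 502)")   # binary
conn.execute("INSERT INTO mcs (mn, ex0, ex1) VALUES (12345, NULL, 503)")  # unary

rows = conn.execute("SELECT ex0, ex1 FROM mcs ORDER BY mcs").fetchall()
kinds = ["unary" if ex0 is None else "binary" for ex0, _ in rows]


def is_rejected(mn, ex0, ex1):
    """Return True if inserting the classification violates a constraint."""
    try:
        conn.execute(
            "INSERT INTO mcs (mn, ex0, ex1) VALUES (?, ?, ?)", (mn, ex0, ex1)
        )
    except sqlite3.IntegrityError:
        return True
    return False
```

Repeating the binary row is rejected by the unique constraint. Repeating the unary row is not, because SQLite treats nulls as distinct in unique constraints. Whether the PanLex database guards against duplicate unary classifications by other means is not stated here.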

Properties

Meaning properties are stored in table mpp. It has the structure:

mpp serial primary key
mn integer not null
ex integer not null
tt text not null

Constraints:

  • mn+ex+tt unique
  • mn references mn(mn)
  • ex references ex(ex)

Table mpp absorbed table mi (meaning identifiers).

Uses

You can use meaning classifications and properties to preserve semantic information found in sources. It is common, for example, for sources to indicate that a thing is a kind, type, or species of another thing. We see this when a source translates expression A as “type of B”. When you see this pattern, you can extract “B” as the translation while also attaching a meaning classification whose superclass expression is “IsA” in ConceptNet 5 (art-300) and whose class expression is B. Before the existence of classifications, our usual practice was to parenthesize “type of”, causing the exdftag serialization script to make “(type of) B” a definition and “B” an expression. The meaning classification is similar to such a definition, but is more tractable, since it unifies what would otherwise be many definition patterns in many languages.
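The “type of B” pattern can be sketched as a small text transformation. The regular expression covers only English phrasings, and the function name and return shape are hypothetical. The real work is done during source assimilation by editors and their scripts, which this sketch does not reproduce.

```python
import re

# Matches "type of B", "kind of B", "species of B", optionally preceded by "a".
_TYPE_OF = re.compile(r"(?:a\s+)?(?:type|kind|species)\s+of\s+(.+)", re.IGNORECASE)


def split_type_of(translation):
    """Split a translation into (expression text, classification or None).

    The classification is a (superclass, class) pair of texts, with
    "IsA" standing for the ConceptNet 5 (art-300) expression.
    """
    text = translation.strip()
    match = _TYPE_OF.fullmatch(text)
    if match is None:
        return text, None
    target = match.group(1).strip()
    return target, ("IsA", target)
```

For instance, split_type_of("type of fish") yields the expression text "fish" together with the pair ("IsA", "fish"), while a plain translation such as "fish" comes back with no classification.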

Denotation classifications and properties

Introduction

Above we discuss classifications and properties in general. Additional details specific to denotation classifications and denotation properties are discussed here.

Between 02013 and 02015 we received a suggestion by PanLex intern Amandalynne Paullada that PanLex might beneficially modify the design of metadata to make variables expressions instead of strings. This and related proposals have been accommodated with denotation classifications and properties. They generalize two object types previously in the database:

  • word classifications
  • metadata

Tables

Classifications

Denotation classifications are stored in table dcs. It has the structure:

dcs serial primary key
dn integer not null
ex0 integer
ex1 integer not null

Constraints:

  • dn+ex0+ex1 unique
  • dn references dn(dn)
  • ex0 references ex(ex)
  • ex1 references ex(ex)

Table dcs absorbed table wc (word classifications) and part of table md (metadata).

Properties

Denotation properties are stored in table dpp. It has the structure:

dpp serial primary key
dn integer not null
ex integer not null
tt text not null

Constraints:

  • dn+ex+tt unique
  • dn references dn(dn)
  • ex references ex(ex)

Table dpp absorbed part of table md (metadata).

Uses

You can use denotation classifications and properties to record information about expressions other than their lemmas. Word class (noun, verb, etc.), gender, register (formal, vulgar, etc.), tone patterns, and other information can be recorded.

In many sources, there is no formal structure to accommodate such extra information, and authors may prepend or append it to lemmas, like “Frage n.f.” or “bluenose (slang, 1920s)”. One task of PanLex editors is to extract lemmas from these strings and decide whether the remainder merits treatment as denotation classifications or properties.
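As a rough illustration of the extraction step, the sketch below handles only the two patterns shown above: a trailing parenthesized note and a trailing run of dotted abbreviations. Both patterns and the function name are assumptions. Real sources vary far more, and deciding what the extracted remainder means is left to the editor.

```python
import re

# "bluenose (slang, 1920s)" -> lemma "bluenose", notes "slang, 1920s"
_PAREN = re.compile(r"(?P<lemma>.*?)\s*\((?P<notes>[^()]*)\)\s*")
# "Frage n.f." -> lemma "Frage", tag "n.f."
_ABBREV = re.compile(r"(?P<lemma>.+?)\s+(?P<tag>(?:[A-Za-z]+\.)+)")


def split_annotation(raw):
    """Split a source string into (lemma, list of annotation strings)."""
    text = raw.strip()
    match = _PAREN.fullmatch(text)
    if match and match.group("lemma"):
        notes = [part.strip() for part in match.group("notes").split(",")]
        return match.group("lemma"), [note for note in notes if note]
    match = _ABBREV.fullmatch(text)
    if match:
        return match.group("lemma"), [match.group("tag")]
    return text, []
```

The annotations come back as raw strings. Mapping “n.f.” to a word-class classification and a gender classification, or “1920s” to a property, is a separate editorial decision.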

Source language varieties

Sources may be associated with language varieties. Such an association, recorded in the av table, records the fact that the source documents the language variety. Such a record is redundant if the database contains all of the denotations that the source provides evidence for. But that is not true for sources that have not yet been fully assimilated.

Exemplar characters

The character encoding of the database is UTF-8. Any Unicode character can be included in a text value of any column in any table. Procedures that add and modify data further restrict text values, limiting them to values that comply with Unicode Normalization Form NFC. Values in the td column of the ex table, containing degraded expression texts, are subject to additional restrictions.
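A check of this kind can be sketched with Python’s standard unicodedata module. The function names are hypothetical, and the sketch says nothing about how the PanLex procedures themselves are implemented.

```python
import unicodedata


def is_nfc(text):
    """Return True if text is already in Unicode Normalization Form NFC."""
    return unicodedata.normalize("NFC", text) == text


def to_nfc(text):
    """Return text converted to Normalization Form NFC."""
    return unicodedata.normalize("NFC", text)


# "e" followed by a combining acute accent (U+0301) is not in NFC;
# its NFC form is the single precomposed character U+00E9.
decomposed = "e\u0301"
composed = to_nfc(decomposed)
```

The decomposed and composed strings look identical when displayed but differ in length and in their UTF-8 encodings, which is why normalizing before storage keeps text comparisons reliable.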

For the development of quality-control procedures, and for editorial inspection, it can be useful to classify characters in terms of whether they are normal for particular language varieties. The database maintains records, in the cu table, identifying the statuses of characters as Unicode exemplar characters in language varieties.

Approved characters

Editors of the database adhere to some conventions with respect to characters in the texts of expressions, by standardizing the orthographic forms of expressions. To record decisions that they make about such standards, editors can record, in the cp table, classifications of particular characters as approved for particular language varieties.

Source editors

Each source can be associated, in the au table, with 0 or more users. Procedures can require that particular operations on sources be performed only by users associated with them. Such users can be considered those sources’ editors.

Language variety scripts

Each language variety can be associated, in the lvsc table, with 0 or more scripts. These are the scripts (Latin, Greek, Cyrillic, Katakana, etc.) that are considered normal for the variety. Our practice, not enforced by the database, is to assure that each language variety is associated with at least 1 script.

Language variety editors

Each language variety can be associated, in the lu table, with 0 or more users. Procedures can require that particular operations on language varieties be performed only by users associated with them. Such users can be considered those language varieties’ editors.

Obsolete elements

Synonymy and ambiguity

Language varieties until May 02015 possessed properties of synonymy (column sy) and ambiguity (column am). Their values for most language varieties were true. This meant that those language varieties permitted both synonymy and ambiguity: Multiple expressions could have the same meaning (synonymy), and a source could assign multiple meanings to the same expression (ambiguity). For some language varieties, the value of one or the other or both of these was false. If synonymy was false, PanLem deleted a denotation if a source created another denotation with the same meaning and with any other expression in that language variety. If ambiguity was false, PanLem deleted a denotation if a source created another denotation with the same expression and another meaning. We retired these properties from the database design after concluding that they had no significant value for PanLex. They had been made false for controlled language varieties that purportedly eschewed ambiguity and synonymy, but, because of the lack of semantic isomorphism between language varieties, expressions in even such varieties could reasonably be translated into multiple, semantically distinct expressions in other varieties, including other controlled varieties, thus giving rise to ambiguity and synonymy.

The language varieties that, before this retirement, had false values on ambiguity, synonymy, or both were:

  lv  | lc  | vc  | sy | am |    ex    |                          tt                          
------+-----+-----+----+----+----------+------------------------------------------------------
 1127 | art |   0 | f  | f  | 18584009 | PanLem
 1622 | art |   1 | f  | f  | 18584011 | ISO 639-3
 1623 | art |   2 | f  | f  | 18584013 | ISO 639-2/T
 1624 | art |   3 | f  | f  | 18584015 | ISO 639-2/B
 1625 | art |   4 | f  | f  | 18584017 | ISO 639-1
   41 | art |   5 | f  | f  | 18584019 | ISO 639
 1242 | art |   6 | t  | f  | 21038398 | ISO 3166 alpha
 1785 | art |  12 | f  | f  | 18584033 | Swadesh 207
 1945 | art |  15 | f  | f  | 18584039 | ISO 639-5
 5560 | art | 245 | f  | f  | 18584497 | Swadesh 100
 6708 | art | 253 | f  | f  |  5600054 | Universal Networking Language
 6712 | art | 254 | f  | f  | 18584515 | U+
 6713 | art | 255 | f  | f  | 18584517 | ISO 80000-9 Annex A
 6714 | art | 256 | f  | f  | 18584519 | SI
 6719 | art | 257 | f  | f  | 18597680 | LWT Code
 6889 | art | 260 | f  | f  | 18584526 | Swadesh 200
 6913 | art | 261 | f  | f  | 21069318 | SILCAWL
 6845 | art | 262 | f  | f  | 18584530 | ISO 15924
 6915 | art | 263 | f  | f  | 18584532 | STAR-IEML
 6920 | art | 264 | f  | f  | 18584534 | ISO/IEC 2382-14
 6953 | art | 266 | f  | f  | 18584538 | Swadesh-Gudschinsky 200
 6955 | art | 267 | f  | f  | 21069319 | ALCAM 120
 6992 | art | 268 | f  | f  | 21069320 | ABVD 210
 6993 | art | 269 | f  | f  | 21069321 | ℤ
 7017 | art | 270 | f  | f  | 18584546 | LEGO Concepticon
 7219 | art | 273 | f  | f  | 18584552 | PanLex Union Concepticon
 7257 | art | 274 | f  | f  | 18584554 | ISO 639-PanLex
 9070 | art | 276 | f  | f  | 18813575 | New International File Names for the Atomic Elements
 9092 | art | 277 | f  | f  | 21069322 | Swadesh-Yakhontov 110
 9148 | art | 280 | f  | f  |   603660 | New Basic Holle List
 9312 | art | 282 | f  | f  |  7313149 | PanLex Empirical Concepticon A
 9314 | art | 283 | f  | f  |  7313150 | PanLex Empirical Concepticon B
 1621 | art | 289 | f  | f  | 18587366 | ISO 639-3 Reference Names
 1626 | art | 290 | t  | f  | 21014105 | ISO 639-3 Print Names
(34 rows)

Meaning identifiers

Any meaning can optionally have a meaning identifier, recorded in the mi table. No meaning can have more than one meaning identifier. A meaning identifier is usually a code that the meaning’s source uses to identify a particular translation.

Meaning identifiers are now represented as meaning properties with the attribute expression “identifier” in variety art-301.

Domain specifications

A meaning may optionally have domain specifications, recorded in the dm table. They are expressions. Attaching a domain specification to a meaning asserts that the meaning is within the domain described by the expression.

In earlier versions of PanLex, domain specifications were structured identically to definitions. Specifically, each was an arbitrary string specified as being in a language variety. Experience showed that the texts of domain specifications were usually identical to the texts of expressions. The design was modified to require domain specifications to be expressions. Now, in the table of domain specifications, the column identifying the specification has as its value the ID of the expression serving as the specification.

This constraint on domains is advocated by Gerard de Melo and Gerhard Weikum in their 02010 article, Towards Universal Multilingual Knowledge Bases. They say that “some knowledge bases rely on a separate vocabulary of domain labels …. We instead advocate following WordNet in using identifiers already present in the knowledge base …. This has the advantage of extensive information about the domains being readily available ….”

Because domain specifications are expressions, we consider it a good practice to select lemmatic forms as domain specifications, rather than adding expressions to PanLex for the purpose of having them act as domain specifications. For example, when assimilating a resource that uses “mammals” as an English domain specification, we use “mammal” instead.

Domain specifications are now represented as meaning classifications with the superclass expression “HasContext” in variety art-300.

Word classifications

Parts of speech (noun, verb, etc.) are usually attributed to words and phrases in syntactic contexts. A corresponding attribution in the database is the word classification, recorded in the wc table. Any denotation can have 0 or more word classifications. Each word classification assigns 1 word class, selected from a finite set of word classes, recorded in the wcex table, to a denotation.

Word classifications as properties of denotations are dissimilar to the uses of parts of speech by some sources. Wordnets, for example, attribute parts of speech to synsets, which correspond to PanLex meanings. Wordnets do not permit words and phrases within a synset to have distinct parts of speech, but PanLex permits expressions that share a meaning to have distinct word classifications, because each expression has its own denotation and each denotation can have its own word classification(s).
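As a rough illustration of this difference, the following Python sketch attaches word classes to denotations rather than to meanings. The WordClassStore class and its method names are hypothetical. Its two attributes merely stand in for the wcex and wc tables and do not reflect the actual schema.

```python
from collections import defaultdict


class WordClassStore:
    """Hypothetical stand-in for the wcex and wc tables."""

    def __init__(self, allowed):
        self.allowed = set(allowed)            # finite set of word classes (wcex)
        self.by_denotation = defaultdict(set)  # denotation ID -> word classes (wc)

    def classify(self, denotation_id, word_class):
        """Assign one word class to a denotation."""
        if word_class not in self.allowed:
            raise ValueError(f"unknown word class: {word_class}")
        self.by_denotation[denotation_id].add(word_class)

    def classes(self, denotation_id):
        """Return the word classes of a denotation (possibly none)."""
        return set(self.by_denotation.get(denotation_id, ()))


# Two denotations that share one meaning can carry different word classes,
# which a synset-level part of speech would not allow.
store = WordClassStore({"noun", "verb"})
store.classify(1, "verb")  # denotation 1: one expression of the meaning
store.classify(2, "noun")  # denotation 2: another expression, same meaning
```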

Word classifications are now represented as denotation classifications, generally with the superclass expression “PartOfSpeechProperty” in variety art-303.

Metadata

Lexicographic documentation includes the assignment of various properties to words and phrases beyond parts of speech, such as tones, gender, inflectional paradigm, illocutionary force, register, etymology, and era or period. To allow such information to be recorded, the database permits metadata, recorded in the md table, to be assigned to denotations. A metadatum is a combination of 2 text values. One is called the variable (vb), and the other is called the value (vl). Unlike definitions, these values are not associated with language varieties. Editors may, at their discretion, adopt conventions in their choices of the values.

Metadata are now represented as denotation classifications and properties. Commonly used vb and vl strings were mapped to expressions, chosen from concepticons if practical. Wherever vl was mapped to an expression, the md record was converted to a dcs record. Otherwise it was converted to a dpp record. In the latter case, if its vb string had not been mapped to an expression, that string and the “=” symbol were prefixed to the vl string to produce a dpp value, and “LinguisticProperty” in art-303 (GOLD 2010) was used as the dpp attribute expression.
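The conversion rule can be sketched as a small Python function. This is an illustration under assumptions rather than the migration code that was actually run: the function name, the dictionary-based mappings, and the shape of the returned records are hypothetical. The text does not say how the superclass of a dcs record was chosen when vb had no mapped expression, so the sketch leaves it as None in that case.

```python
def convert_metadatum(vb, vl, vb_map, vl_map):
    """Convert one md record (vb, vl) to a dcs or dpp record.

    vb_map and vl_map are hypothetical dictionaries from commonly used
    vb and vl strings to the expressions they were mapped to.
    """
    if vl in vl_map:
        # The value was mapped to an expression: denotation classification.
        return {
            "table": "dcs",
            "superclass": vb_map.get(vb),  # None if vb was not mapped (unspecified case)
            "class": vl_map[vl],
        }
    if vb in vb_map:
        # Only the variable was mapped: denotation property with vl as its value.
        return {"table": "dpp", "attribute": vb_map[vb], "value": vl}
    # Neither was mapped: fall back to LinguisticProperty (art-303)
    # and keep the variable inside the value string.
    return {
        "table": "dpp",
        "attribute": "LinguisticProperty",
        "value": f"{vb}={vl}",
    }
```

For example, with no mappings at all, convert_metadatum("tone", "HL", {}, {}) yields a dpp record whose attribute is “LinguisticProperty” and whose value is “tone=HL”.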
