Tabularization principles | PanLex development

We try to deal consistently with commonly encountered issues in tabularization. Some principles have emerged from our work.

Structural analysis

In order to tabularize data, you must analyze their structure. There is an infinite variety of structures that you might encounter. Let’s start with a simple example.

Suppose you are tabularizing the data from a source that translates Russian into Ossetian and contains an entry like this:

идти, куда глаза глядят: уырдӕм цӕуын

What is the entry’s structure? You might hypothesize that it translates between two language varieties, separating the varieties’ expressions with “: ” and, within each variety, separating any synonyms with “, ”. Many sources have such a structure. Under this hypothesis, the entry describes one meaning with three denotations (two with Russian expressions and one with an Ossetian expression). Another possibility is that the “, ” isn’t a structural element of the entry, but is simply one of the characters in a single Russian expression. In that case, the entry describes one meaning with only two denotations. Investigation would reveal that the latter analysis is correct for this entry.
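The two hypotheses can be compared with a minimal Python sketch. The function name parse_entry and its split_synonyms flag are hypothetical, introduced only for illustration; they are not part of the PanLex tools.

```python
ENTRY = "идти, куда глаза глядят: уырдӕм цӕуын"


def parse_entry(entry, split_synonyms):
    """Return (source_expressions, target_expressions) for one entry.

    Assumes every entry contains the separator ": " between the two
    language varieties (an assumption about this source only).
    """
    source, target = entry.split(": ", 1)
    if split_synonyms:
        # Hypothesis 1: ", " separates synonyms within a variety.
        return source.split(", "), target.split(", ")
    # Hypothesis 2: ", " is just a character inside an expression.
    return [source], [target]


three_denotations = parse_entry(ENTRY, split_synonyms=True)
two_denotations = parse_entry(ENTRY, split_synonyms=False)
```

Under the first hypothesis the Russian side yields two expressions; under the second it yields one. Only the second matches what investigation of this entry shows.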

Does the entry translate between two expressions, or does it explain an expression in Ossetian with a definition in Russian? Here the former analysis is reasonable, since the Russian item is an idiom (meaning wander aimlessly or follow one’s nose, literally “go where the eyes look”).

If you are interpreting this source and you know both Russian and Ossetian, you can perform appropriate analyses entry by entry. If, instead, you are analyzing it, you have the task of defining rules that analyze all the entries. If the entries are inconsistently structured, as is often the case, you should define conservative rules, aiming to avoid adding incorrect data to the database (details in this section and in the section on tabularization methods). Alternatively, you might decide that the source is not analyzable and should be given to a qualified editor for interpretation.

Multiple meanings and synonyms

When a source translates an expression into multiple expressions, there is a fundamental decision to make about how to interpret this. It may mean that the original expression has multiple meanings, each with a distinct translation. Or it may mean that the original expression has only one meaning, but there are synonymous expressions of that meaning in the destination language variety. The PanLex database represents these facts differently, so you must tabularize a source file so as to make this distinction.

The methods for representing this distinction are detailed in the documentation on tabularization methods.

It is not always obvious whether two translations of the same expression represent different meanings. As editor, you use your judgment and knowledge and any guidance provided by the source itself.

If you are doing analysis, rather than interpretation, you make the distinction algorithmically, not by inspecting each entry. What do you do, then, if this is impossible? For example, what if the source doesn’t follow any reliable rule to distinguish between synonymous translations and multiple-meaning translations? In such a case, you degrade your algorithm to treat all multiple translations as if they expressed distinct meanings. This solution loses some synonymy information, but at least doesn’t create false synonyms.
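A minimal sketch of such a degraded algorithm follows. It assumes the entry has already been split into a headword and a list of translations, and it uses a hypothetical numeric meaning index to keep the meanings apart. The actual representation is the one described in the documentation on tabularization methods.

```python
def degrade_to_distinct_meanings(headword, translations):
    """Emit one (meaning_index, source, target) row per translation.

    Every translation gets its own meaning, so no two translations
    are ever recorded as synonyms of each other.
    """
    return [
        (index, headword, translation)
        for index, translation in enumerate(translations, start=1)
    ]


# "A", "X" and "Y" are placeholders, not real data.
rows = degrade_to_distinct_meanings("A", ["X", "Y"])
```

Here X and Y each translate A, but under different meaning indices, so they are never asserted to be synonyms of each other.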

However, in some cases you cannot even use such a degraded algorithm. Suppose an entry has this form:

A; B; C = W; X; Y; Z

Then you don’t know which expressions are translations of which other expressions, and it is safer to avoid recognizing any translations in this entry.
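One conservative rule for this case can be sketched as below: recognize translations only when at least one side of the entry has a single expression. The separators " = " and "; " are taken from the example form above, and the function name translation_pairs is hypothetical.

```python
def translation_pairs(entry):
    """Return (source, target) pairs, or [] if the entry is ambiguous."""
    left, right = entry.split(" = ", 1)
    sources = left.split("; ")
    targets = right.split("; ")
    if len(sources) > 1 and len(targets) > 1:
        # Many-to-many: we cannot tell which expression translates
        # which, so we recognize no translations at all.
        return []
    return [(s, t) for s in sources for t in targets]


ambiguous = translation_pairs("A; B; C = W; X; Y; Z")
one_to_many = translation_pairs("A = W; X")
```

The many-to-many entry yields no pairs, while the one-to-many entry yields a pair for each translation.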

PanLex versus dictionaries

Although PanLex generally follows lexicographic traditions, there are deviations.

One deviation that you must deal with concerns symmetry. When you analyze a source that translates a lemma in one language (A) into a long explanation in another language (B), do not consider the long explanation to be an expression. Instead, treat it as a definition in language B of the meaning that the word in language A has. If practical, you may also derive an expression from that definition and include it in the language-B expression column.
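If you are analyzing rather than interpreting, one rough way to separate long explanations from expressions is a length heuristic. The sketch below is only an illustration: the word-count threshold is a hypothetical value that would need tuning for each source and language, and it is not a PanLex standard.

```python
MAX_EXPRESSION_WORDS = 3  # hypothetical threshold; tune per source


def classify(text):
    """Label a translation string as 'expression' or 'definition'."""
    word_count = len(text.split())
    if word_count <= MAX_EXPRESSION_WORDS:
        return "expression"
    return "definition"


short_label = classify("wander aimlessly")
long_label = classify("to walk without any particular destination in mind")
```

A heuristic like this will misclassify long lexicalized phrases and short definitions, so its output is best spot-checked against the source.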

Lemmatic expressions and definitions

In its simplest form, a column of expressions represents the expressions as text strings. In principle, the texts are lemmas: the dictionary or citation forms of words or lexicalized phrases. Many languages have lexicographic traditions that define lemmatic forms. What’s lemmatic in English need not be lemmatic in some other language.

Columns of expressions do not have to be limited to expression texts. It is possible to annotate the expression texts with other information that will be processed during serialization. It is also possible to mix definitions in with expressions in the same column, if you intend to apply an appropriate rule during serialization that classifies the strings into expressions and definitions. You can append classifications and properties to expression texts, and you can parenthesize parts of a string to create a hybrid expression-plus-definition.

There is more detailed documentation on lemmatization.

Letter case

Some language varieties are written in scripts that have more than one letter case, and the orthographic conventions of these varieties include rules for case (e.g., in German all nouns are initially capitalized). Case distinctions in these languages can distinguish expressions (e.g., Polish/polish, Turkey/turkey, and March/march in English). Some sources, however, violate letter-case conventions, such as by capitalizing the initial letters of all headwords. PanLex editors correct such violations to the extent that they practically can.
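Before correcting case, it helps to check whether a source capitalizes headwords indiscriminately. The following sketch measures the share of headwords that begin with an uppercase letter. The function name is hypothetical, and interpreting the result still requires knowing the conventions of the language variety (a high share is expected in a list of German nouns, for example).

```python
def initial_capital_share(headwords):
    """Return the share of headwords whose first character is uppercase.

    Headwords that start with a caseless character (digits, or scripts
    without letter case) are ignored.
    """
    cased = [h for h in headwords if h and h[0].upper() != h[0].lower()]
    if not cased:
        return 0.0
    return sum(h[0].isupper() for h in cased) / len(cased)


all_capitalized = initial_capital_share(["Apple", "Run", "Blue"])
mixed = initial_capital_share(["apple", "Turkey"])
```

A share of 1.0 across verbs, adjectives and common nouns in a variety that capitalizes only proper nouns suggests that the source violates the case convention.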

Letter-case conventions are not always universally agreed upon. For example, authorities disagree on whether bird names should be capitalized in some languages. PanLex editors in such cases select and apply what they judge to be the prevailing convention.

Classifications and properties

We generally prefer using classifications to properties when possible, since they increase consistency and interoperability. However, some linguistic information can realistically only be captured as properties. Examples are phonetic transcriptions and language-specific grammatical categories such as “noun class 11”.

Superclass and attribute expressions, and some class expressions, are best selected from the expressions of immutable language varieties. This makes classifications and properties more uniform than they would otherwise be and leverages PanLex concepticon development to enable the translatability of classifications and properties. When we are unable to find an appropriately specific superclass or attribute expression, we generally use the expression “LinguisticProperty” from art-303 (GOLD 2010).

Existing practices for choosing classifications and properties are documented in the cspp directory of the PanLex tools. The file csppmap.txt contains mappings from various abbreviations to classifications and properties. The file doc.txt summarizes other classifications and properties that are not in csppmap.txt.

Immutable language varieties that have been providers of classification and property expressions in PanLex, in approximate order of authority, include:

art-303: GOLD 2010
art-301: DCMI Metadata Terms
art-001: ISO 639-3
art-302: Leipzig Glossing Rules List of Standard Abbreviations
art-306: List of Glossing Abbreviations
art-317: ISO 12620
art-316: FrameNet Relations
art-300: ConceptNet 5 Relations
art-286: GeoNames Feature Codes
art-006: ISO 3166 alpha country codes
art-324: ISO 639-3 Retired Code Elements
art-308: ISO 1087-1:2000
art-257: LWT Code
art-292: Semantic Domains
art-253: Universal Networking Language
art-321: UNL Relations
art-270: LEGO Concepticon

Introduction

The most complex stage of assimilation is usually tabularization. This tends to be the stage in which you make the most influential decisions about the data that will enter the database.

Some of the most common problems of assimilation, which you can solve in the tabularization stage, are documented in this section. By following certain principles, you can produce new data that relate well to our existing data and improve both the size and the quality of the database.

Subtasks

You can think of a tabularization task as involving two main subtasks.

The first subtask is row design. You are producing a tabular file, and it consists of rows, identical in their structure. Row design is the task of analyzing the structure of the source and deciding on an appropriate structure for the rows of your tabular file.

The second subtask is expression standardization. Within each row, some columns usually contain expressions. Putting expressions into their columns requires more than copying them from the source. Your job includes subjecting expressions to PanLex standards.