Acquisition principles | PanLex development

Introduction

PanLex, like any project that manages a collection of things, acquires them in accord with particular principles.

Lexical translations

What do we mainly acquire?

The answer is lexical translations. Let’s clarify that.

We collect data that translate words and word-like phrases, not whole sentences or paragraphs. We call these words and word-like phrases expressions.
Words and word-like phrases, in many languages, can occur in various inflections (for number, case, etc.) to comply with grammatical rules. We collect data on the translation of the standard form used in lists of words. This is called the “citation form”, the “dictionary form”, the “lemmatic form”, or the “lemma”. Roughly (but not entirely), one PanLex expression corresponds to one “lexeme”.
When we say “word-like phrases”, we include phrases that need their own translations because you can’t get a good translation by just translating the individual words. One can debate whether “orange juice” qualifies, but obviously “sweet potato” does.
The data “translate” expressions, but not necessarily into a different language. Therefore, in common parlance, PanLex data include both translations and synonyms.
Translations can be unilingual (synonyms), bilingual, or multilingual. Consider that, if you find a translation of an expression into 20 different languages, then you are really finding not 20, but rather 210 bilingual translations, because the 21 expressions involved form 21 × 20 / 2 = 210 pairs. Since PanLex aims to translate expressions from any language into any language, multilingual translations are the most valuable.
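The pair count above can be checked with a short sketch. This is only an illustration of the arithmetic, not PanLex code; the function name `bilingual_pairs` and the placeholder expression labels are assumptions made up for the example.

```python
from itertools import combinations
from math import comb


def bilingual_pairs(expressions):
    """Return every unordered pair of expressions in one multilingual translation set."""
    return list(combinations(expressions, 2))


# Hypothetical data: one expression plus its translations into 20 other
# languages, i.e. 21 mutually translating expressions in total.
translation_set = [f"lang{i:02d}:expr" for i in range(21)]

pairs = bilingual_pairs(translation_set)
pair_count = len(pairs)

# The count agrees with the closed form n * (n - 1) / 2 for n = 21.
assert pair_count == comb(21, 2)
print(pair_count)  # 210
```

The same function applied to a purely bilingual entry (two expressions) yields a single pair, which is why a multilingual resource of the same size contributes far more translations.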

Priorities

The PanLex database already contains more than a billion lexical translations (if you count each pair of expressions between which there is at least one attested translation).

But there is an enormous disparity among the world’s languages. The counts of expressions in the PanLex database per language variety range from about 2 million to zero. There are about 2,000 languages (recognized by the ISO 639-3 standard) with no expressions in the database. This disparity largely reflects the unequal statuses of languages in the world, with most languages being “low-density” (i.e. known and used by relatively few persons and endowed with relatively little linguistic documentation). Many of these are “threatened”, “endangered”, or “moribund”.

How, then, do we prioritize acquisition?

The main thing we aim for is coverage. We want to be able to translate any expression from any variety of any language in the world into at least one other language variety, and from there eventually into all other language varieties. So, the highest-priority data are translations in which at least one of the language varieties is currently undocumented or barely documented in our database.
A second priority is diversity. We make that concrete by taking note of the family groupings of varieties of languages, such as those classified by Glottolog and Ethnologue. Acquiring data on languages of poorly documented families has a high priority.
A third priority is laterality. Our system is designed to make use of translations among all language varieties in any combination. The quality of our data is boosted if there are multiple translation paths from any expression to any other. In some other project, a Laz–English dictionary may be more useful than a Laz–Georgian dictionary, but for PanLex both are useful, and, if we already have many Laz–English translations, then the Laz–Georgian dictionary is more useful.
We also value quality. Resources created by qualified scholars and works that have gone through editorial revision tend to be more useful than resources that are anonymous or are based on the testimony of a single speaker.
We can’t ignore efficiency, because of our limited funds. We prioritize data that can be assimilated automatically. Digital files in text formats, and printed pages that are easily converted to such files, have high priority. Sometimes we find a hard-to-convert resource, but the author agrees to give us a more tractable version of it.

Sources

We mostly don’t do our own translation; we collect translations attested by others.

Translations can be collected one at a time, or in sets. Up to now, almost all our acquisition has been in sets, and this site documents acquisition only of that kind. We call our sets of data “sources”.

We acquire sources by procuring resources. An example of a resource is a bilingual dictionary published as a PDF file, containing publication information, an introduction, a main section from language A into language B with detailed translations, and a section from language B into language A with only the lemmatic forms. We treat that resource as providing not one, but two sources: an A–B source and a B–A source.
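The resource-to-source relationship described above can be sketched as a small data structure. This is a minimal illustration under stated assumptions: the `Source` class, its fields, and the function `sources_from_bilingual_dictionary` are hypothetical names invented for this example and do not reflect the actual PanLex database schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """One directional set of translations drawn from a resource (hypothetical model)."""
    resource_title: str
    from_lang: str
    to_lang: str
    detail: str  # free-text note on what the section contains


def sources_from_bilingual_dictionary(title, lang_a, lang_b):
    """Split a two-section bilingual dictionary into its two directional sources."""
    return [
        Source(title, lang_a, lang_b, "detailed translations"),
        Source(title, lang_b, lang_a, "lemmatic forms only"),
    ]


# One procured resource (a PDF dictionary) provides two sources.
sources = sources_from_bilingual_dictionary("Example A-B dictionary (PDF)", "A", "B")
for s in sources:
    print(f"{s.from_lang}-{s.to_lang}: {s.detail}")
```

Keeping the resource title on each source is one simple way to preserve the link back to the procured document.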

When you procure a resource, you also document the sources that it provides. Your documentation goes into the database and is retrievable along with the data once the data have been assimilated. In this way, you are documenting the “provenance” of our data.

Ownership

When we procure copies of documentary resources attesting lexical translations, we store those copies for our own use. We protect our copies from public access. Do we need to obtain permission, or ownership of data, before we do this?

There is controversy, even litigation, about the ownership of information. PanLex acquires data amidst uncertainty and change. We believe that our acquisition principles are well-founded in the law of fair use of copyrighted works and otherwise, but nobody’s principles are universally approved. We take a similar position with respect to our assimilation of data that we find in acquired sources.