Assimilation principles

Introduction

The principles that guide our assimilation work are described below and in a list of common assimilation questions and their answers.

Publication

We assimilate data into the database in order to publish the data. We make the assimilated data freely available to others for research and development, under a CC0 license, similar to releasing the data into the public domain. Does this mean that we are republishing others’ work?

As we understand this activity, it does not republish. Rather, it consults the works of thousands of authors to discover, correct, interpret, assemble, and standardize selected lexical translations that those works (which we call “sources”) have attested. We thereby make it possible to combine translation facts based on diverse sources and enable automated inference of new translations that have never been attested, something that is not possible in practice when readers look up translations in ordinary dictionaries.

We also do not hide the provenance of the data in our database. We create and store in the database documentation on sources, including bibliographic metadata and information about the statements made in the originating resources about ownership, licensing, and use rights. The metadata are retrievable for the data in any source.

As explained in our discussion of acquisition principles, this arena is controversial. Opinions differ greatly. In our opinion, our activity conforms to applicable norms and laws relating to provenance disclosure and the fair use of intellectual property.

Source selection

Any batch of assimilation work begins with selecting the data to work on: unless you have been assigned a source, you begin by choosing one. You have many to choose from. Our acquisition work has assembled a collection of about 6,000 sources, and data derived from about 2,000 of these have already been assimilated into the database, leaving about 4,000 still to be assimilated. We call these the pending sources, and they are the ones you choose among.

You might ask why you need to choose, rather than just taking the “next” source in the list. Since we apply priorities in acquisition, can’t we assume that only high-priority data have been acquired and no further prioritization is necessary?

No, we can’t assume that the prioritization has already been completed, for three main reasons:

  • The priorities can’t be applied perfectly to acquisition. For example, we often procure data before we know what language varieties they document or how amenable the data will be to efficient assimilation. So, in reality, our thousands of pending sources merit diverse priorities for assimilation.
  • You have your own personal priorities for assimilation. You know certain languages and scripts. You have mastered certain assimilation methods. You are probably a specialist in a particular strategy, and each strategy is appropriate for only some sources. All this makes some sources attractive for you to work on and makes you unqualified (or overqualified) to work on some others.
  • Some sources belong to whole groups that organize and format their data alike. It can be optimal for you to work on such entire groups of sources in a single batch, or one after the other, reusing your methods on each.

So, your first step in assimilation is choosing among the pending sources, and the priorities applying to acquisition apply to assimilation, too. Instructions for executing this step are documented separately.

Content selection

Once you have selected a source to work on, you select parts of the content to deal with.

Primary

Expressions and their translations into other expressions, such as “organic = амылыг”, are the primary content to make use of.

Secondary

We also make use of information of some other kinds, insofar as we can efficiently do so, including:

  • Definitions, such as “etwas unbeabsichtigt offenbaren” (“to reveal something unintentionally”).
  • Other facts about meanings, such as “med.” (medicine) or “part of an automobile”.
  • Facts about expressions, such as “adj.” (adjective).
  • The source’s own identifiers of meanings, allowing them to be located in the original documents.

Excludable

We generally disregard:

  • Nonlemmatic inflections (plurals, tenses, etc.) accompanying lemmatic forms of expressions.
  • Facts about words in other languages from which expressions were derived (etymologies). However, we sometimes record the language from which expressions were derived, without indicating the detailed history of the word.
  • Examples of sentences using expressions.
  • Entries that are too complex for efficient assimilation and not of significant value.

Deferrable

When we consider it practical, we sometimes defer assimilating entries that are too complex for efficient assimilation but are too valuable to ignore.

Manipulation

Having selected a source and parts of the source’s content, you now manipulate those parts to produce data for the database. The following principles guide that work.

Efficiency

We care about both quality and efficiency, and these two values often conflict, so we must make tradeoffs.

Even when we decide to assimilate a source in the most efficient manner possible, it is often unclear what that manner is. There are, in principle, two main strategies: interpretation and analysis.

Symmetry

Symmetry is one of the fundamental features of the PanLex database, differentiating it from most dictionaries. While a typical bilingual dictionary entry contains translations from some expression in the source language into the target language, our database contains translations between expressions.

We encounter many asymmetric translations in the data we work on. An example is “nghèo đói: starving because of poverty”. If it is practical, we transform such translations so that both sides of each are expressions. In this case, we might change the English content to “starving” and also treat the entire phrase as a definition.
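
The transformation described above can be sketched as follows. This is only an illustration: the Translation record and the symmetrize function are hypothetical names, not part of any PanLex tool, and the shortened expression (“starving”) is assumed to be chosen by the assimilator rather than computed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Translation:
    """Hypothetical record: two expressions plus an optional definition."""
    source_expr: str
    target_expr: str
    definition: Optional[str] = None


def symmetrize(source_expr: str, gloss: str, target_expr: str) -> Translation:
    """Pair the source expression with an expression chosen by the assimilator.

    The original gloss is kept as a definition, so the information it carried
    is not lost. If the gloss already is the expression, no definition is stored.
    """
    definition = gloss if gloss != target_expr else None
    return Translation(source_expr, target_expr, definition)


# The example from the text: the explanatory phrase becomes a definition,
# and "starving" becomes the expression on the English side.
entry = symmetrize("nghèo đói", "starving because of poverty", "starving")
```

Keeping the gloss as a definition is the design choice that matters here: the translation becomes symmetric without discarding what the source actually said.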

Synonymy

We strive to avoid adding false synonyms to the database. There is a risk of doing so when we consider multiple translations of an expression to be synonyms. For example, if the source says “glass: bardak, cam” and you conclude that “bardak” and “cam” are synonyms, you may be committing an error. When it isn’t practical to avoid errors of this kind, we are willing to make the opposite error: assuming that all multiple translations of an expression express distinct meanings. That error loses some information, but we prefer that loss to the injection of misinformation into the database.
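
The difference between the two errors can be made concrete with a small sketch. It assumes, purely for illustration, that a meaning is modeled as a set of expressions that share it; the function names are hypothetical and do not describe how the database stores meanings.

```python
from typing import List, Set


def meanings_conservative(headword: str, translations: List[str]) -> List[Set[str]]:
    """One meaning per translation: no synonymy among translations is asserted."""
    return [{headword, t} for t in translations]


def meanings_merged(headword: str, translations: List[str]) -> List[Set[str]]:
    """A single shared meaning: every translation is asserted to be a synonym."""
    return [{headword, *translations}]


def asserted_synonyms(meanings: List[Set[str]], a: str, b: str) -> bool:
    """True if some meaning contains both expressions."""
    return any(a in m and b in m for m in meanings)


# The example from the text: "glass: bardak, cam".
conservative = meanings_conservative("glass", ["bardak", "cam"])
merged = meanings_merged("glass", ["bardak", "cam"])

# Merging claims that "bardak" and "cam" are synonyms, which may be false.
risky_claim = asserted_synonyms(merged, "bardak", "cam")
# The conservative split still links each word to "glass" but makes no such claim.
safe_claim = asserted_synonyms(conservative, "bardak", "cam")
```

Under the conservative policy, each translation remains linked to “glass”, but nothing is asserted about the relation between the translations themselves.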

Standardization

We standardize data while assimilating them into the database. The most important standards that we impose (insofar as practical) on data are:

  • Text compliant with the UTF-8 encoding form.
  • Expressions in lemmatic form.
  • Letter case conforming to prevailing lexicographic conventions.
  • Orthographic variants unified to conform to the most legitimate standard.

For more information, see text standardization.
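
A minimal sketch of part of this standardization follows, using only the Python standard library. Several things in it are assumptions rather than documented practice: the function name standardize, the use of Unicode NFC normalization, the collapsing of whitespace, and the variant table, whose single entry is purely illustrative. Lemmatization and letter case depend on language-specific knowledge and are not attempted here.

```python
import unicodedata
from typing import Dict


def standardize(raw: bytes, variants: Dict[str, str]) -> str:
    """Decode raw bytes as UTF-8 and apply a few mechanical normalizations."""
    # Invalid UTF-8 raises UnicodeDecodeError instead of being silently repaired.
    text = raw.decode("utf-8")
    # Assumption: compose characters (NFC) so that equivalent sequences compare equal.
    text = unicodedata.normalize("NFC", text)
    # Assumption: trim the ends and collapse internal runs of whitespace.
    text = " ".join(text.split())
    # Map a known orthographic variant to its preferred spelling, if listed.
    return variants.get(text, text)


# Hypothetical variant table; which spelling counts as standard is a per-language decision.
VARIANTS = {"encyclopaedia": "encyclopedia"}

composed = standardize("cafe\u0301".encode("utf-8"), VARIANTS)
unified = standardize(b"  encyclopaedia ", VARIANTS)
```

Failing loudly on invalid bytes is deliberate in this sketch: a source whose encoding has been misidentified is better detected than quietly converted.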