PanLex Swadesh Corpora

IntroductionUp

In November 02016, the Rosetta, PanLex, and Clock projects of The Long Now Foundation joined forces to produce a rugged super-compact memento, designed to last—and remain beautiful—for thousands of years: the Rosetta Wearable Disk.

Most of the disk’s content was an archive of documentation on languages and dialects (language varieties) of the world. The archive included the Preamble of the Universal Declaration of Human Rights in about 300 language varieties, and a set of words and phrases (expressions) in about 800 language varieties.

The collection of expressions was a special excerpt from the PanLex Swadesh Corpora. Below you can learn what the PanLex Swadesh Corpora are, and how PanLex developers brought them into being.

Structure

There are two PanLex Swadesh Corpora. One contains expressions in about 800 language varieties (that’s the one on the wearable disk), and the other contains expressions in about 2,000 language varieties. You can think of them as 800- and 2,000-language dictionaries.

They have, however, a different structure from most dictionaries. In bilingual dictionaries, expressions in one language are organized alphabetically, and each of them is accompanied by a translation—or multiple translations—into the other language.

But the PanLex Swadesh Corpora collect all the expressions in any particular language variety in one place. To get a translation, you go to the page containing the expressions in a language you know. You find the desired expression there. It’s accompanied by a number. You then find a language variety that you want to translate into. There you find an expression with the same number. That expression is the translation.

Here are some excerpts with the numbers 27 and 28:

Numbers

What are those numbers that accompany the expressions? Each number represents a distinct meaning or concept. In all the lists of expressions in different language varieties, the expressions with the same number have the same meaning. For example, all the expressions numbered 27 express the meaning “big”, those numbered 28 “long”, and those numbered 82 “knee”.

You might think of those numbers as keys or indexes. We PanLex developers think of them as expressions in a special language variety. Rather than calling 82 the index of “knee”, we call “082” and “knee” and “genou” and “膝” and about 800 more expressions translations of one another.

What language variety is “082” an expression in, then? Its name is “Swadesh 207”. It’s a language variety with exactly 207 expressions, from “001” to “207”. And those 207 expressions express 207 different meanings. None of them—at least in principle—expresses more than 1 meaning, and no 2 or more of them express the same meaning. So, it’s a language variety with a fixed set of expressions, and at least in principle with no ambiguity or synonymy.

Such a language variety—fixed, synonymy-free, and ambiguity-free—is sometimes called a concepticon.

Which ones?

We chose two widely used concepticons for our NLTK corpora: Swadesh 207 and Swadesh-Yakhontov 110. The Swadesh 207 corpus contains expressions for 207 meanings in about 800 language varieties. The Swadesh-Yakhontov 110 corpus contains expressions for 110 meanings in about 2,000 language varieties.

We chose these mainly because their expressions are widely translated. We wanted to make the corpora as multilingual as we practically could.

Expansion

Current

We used two methods to increase the sizes of the PanLex Swadesh Corpora.

First, we relaxed the criterion for inclusion of a language variety. Instead of including only language varieties with translations for all 207 or 110 concepticon expressions, we included any language variety into which there were translations of at least 75% of the concepticon expressions.

Second, we used not only attested translations, but also inferred ones, with our source art:Colowick (“A Union of Concepticons”) serving as the validator of distance-3 translations.

So, suppose we are looking for translations of the Swadesh 207 expression “060”. Source art:Colowick translates it into 7 other Concepticons. One of the translations is “08.51” in the concepticon LWT Code. Another source (mul:Usher) translates “08.51” into the Okobo (okb-000) expression “íbí”. We infer that “íbí” is a translation of “060”, although none of our sources attests that translation.

How much good does it do to map concepticon expressions to each other? Suppose we decide not to count multiple translations into the same language variety. Then, as of 19 November 02016, the PanLex database contained 52,515 attested translations of Swadesh 207 expressions into other language varieties. When we used inference from our inter-concepticon mapping, we were able to increase the count of translations to 490,600, thus multiplying the translation count by more than 9.

Without inferred translations, 213 language varieties achieved our 75% threshold for Swadesh 207. With inferred translation, we were able to increase this count of language varieties to 775, thus multiplying the count by 3.6.

Future

We are considering an enhancement to the expansion of these corpora, extending the criteria for admission of expressions to include distance-2 translations generally.

Access and use

You can freely download a copy of the latest edition of the PanLex Swadesh Corpora. The first edition was published in April 02016, and the second in January 02017.

When you expand the compressed archive, you will find the file organization and format documented in a file named README. The corpora are formatted for convenient automatic parsing. Their format differs from that of the Rosetta Wearable Disk. In the corpora, each language variety has its own file, named for the PanLex uniform identifier of the language variety (such as eng-000 for English). Within the file, the concepts are identified not by numbers, but rather by position. For example, the translation of “060” is always on line 60 of the file. Where we could not find a translation, the line is blank.

More information

The corpus format is documented in detail in the README file, which also contains information about authorship and licensing.

The Rosetta Wearable Disk’s PanLex Swadesh content, and additional information about the corpora from which it is derived, are available on an introduction page, PanLex Swadesh Lists, included on the disk.