Assimilation Q&A

Introduction

Assimilation may seem simple: You look at a source’s translations and put them into the database.

Sometimes it is indeed simple, but often it isn’t. Questions arise, and the best way to proceed isn’t clear. Here we address some of the main strategic questions.

Questions

  1. Which sources should I choose first?
  2. How should I deal with related sources?
  3. Should I change a source’s registration?
  4. What information in a source should I use?
  5. What information in a source should I generally disregard?
  6. Should I correct errors in sources?
  7. How should I treat multiple translations?
  8. How can I distinguish between expressions and definitions?

Answers

  1. Which sources should I choose first?
    • Simpler sources. They yield more translations per unit of effort.
    • Sources similar in structure to sources already assimilated. You can reuse tools on them.
    • Sources that document the least-documented language families, languages, and language varieties.
    • Sources that translate across rare combinations of languages.
    • Sources in languages and scripts that you are familiar with.
  2. Should I change a source’s registration?
    • When you analyze a source, you may discover errors in its registration, and, if so, you should correct the errors.
    • When you analyze a source, you are likely to find that its quality, its difficulty, and its language varieties differ from those in its registration.
  3. What information in a source should I use?
    1. Translations of expressions.
    2. Definitions of meanings.
    3. Other facts that distinguish meanings (e.g., describing their domains or relations to other meanings).
    4. Facts that distinguish expressions (e.g., identifying their grammatical classes or tone patterns).
    5. Unique identifiers that link PanLex data to specific locations within their sources.
  4. What information in a source should I generally disregard?
    • Nonlemmatic forms accompanying lemmatic forms of expressions.
    • Facts about words in other languages from which expressions were derived (etymologies). However, we sometimes record the language from which expressions were derived, without indicating the detailed history of the word.
    • Examples of sentences using expressions.
    • Some entire entries, if you are analyzing a source and they are relatively complex. Alternatively, you can preserve a complex entry, without analysis, as a meaning property, so that it can be interpreted later.
  5. Should I correct errors in sources?
    • If it’s easy, yes. We assimilate sources; we don’t blindly copy them.
    • If it’s laborious, correct errors only if you are doing source interpretation and entering meaning data item by item. If you are doing source analysis, look for important systematic errors and apply rules that correct them all at once. You can do this both during tabularization and, using the normalize script, during serialization. Leave the miscellaneous irregular errors for post-importation correction by others.
    • If you find that your source contains enough errors to deserve a lower quality value than it has, decrease the quality value in its registration.
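As a rough illustration of correcting systematic errors all at once, the sketch below applies a short list of regular-expression rules to every cell of a tabularized source. The rules, the names RULES, apply_rules, and correct_rows, and the sample data are hypothetical; this is not the PanLex normalize script, only an example of the rule-based approach.

```python
import re

# Hypothetical rules: each pairs a pattern for a systematic error with its fix.
# Rules are applied in order.
RULES = [
    (re.compile(r"\s+"), " "),          # collapse runs of whitespace
    (re.compile(r"\s+([,;])"), r"\1"),  # drop whitespace before a comma or semicolon
]


def apply_rules(text):
    """Apply every rule to one cell, then trim the ends."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text.strip()


def correct_rows(rows):
    """Apply the rules to every cell of every row of a tabularized source."""
    return [[apply_rules(cell) for cell in row] for row in rows]


if __name__ == "__main__":
    sample = [["chien ", "dog ,  hound"]]
    print(correct_rows(sample))
```

Irregular errors that no rule captures are left untouched by such a pass, which matches the advice to leave them for later correction.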
  6. How should I treat multiple translations?
    • If it’s practical, decide whether they are synonyms, and treat them accordingly.
    • If you are analyzing the source, determine whether you can define a rule for deciding whether they are synonyms; if you can’t, then treat all multiple translations in the source as non-synonyms.
    • If it appears that the source uses distinct punctuation for synonymous and non-synonymous multiple translations, such as commas for synonyms and semicolons for non-synonyms, check a sample of entries to verify that the punctuation is consistent. If it isn’t, then treat all the multiple translations as non-synonyms, so that you don’t wrongly classify numerous non-synonyms as synonyms.
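The punctuation check described above could be supported by a sketch like the following. It assumes, purely for illustration, a source that separates synonyms with commas and non-synonyms with semicolons. The names split_translations and sample_entries are hypothetical. The sample is meant for manual review; the code cannot itself decide whether the punctuation is used consistently.

```python
import random
import re


def split_translations(field, consistent=True):
    """Return a list of synonym groups; each group is a list of expressions.

    If the punctuation was judged consistent, semicolons separate groups and
    commas separate synonyms within a group. Otherwise every translation
    becomes its own group, i.e. all are treated as non-synonyms.
    """
    if consistent:
        groups = [group.split(",") for group in field.split(";")]
    else:
        groups = [[item] for item in re.split(r"[,;]", field)]
    cleaned = []
    for group in groups:
        items = [item.strip() for item in group if item.strip()]
        if items:
            cleaned.append(items)
    return cleaned


def sample_entries(entries, k=20, seed=0):
    """Draw a reproducible random sample of entries for manual checking."""
    rng = random.Random(seed)
    return rng.sample(entries, min(k, len(entries)))


if __name__ == "__main__":
    print(split_translations("big, large; important"))
    print(split_translations("big, large; important", consistent=False))
```

Defaulting to the non-synonym treatment when in doubt keeps non-synonyms from being wrongly recorded as synonyms.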
  7. How can I distinguish between expressions and definitions?
    • There is no set of fixed criteria.
    • Expressions tend to be short.
    • Expressions tend to be lexicalized (individually memorized by language speakers) and are often noncompositional (impossible to understand by examining only their parts).
    • Expressions usually don’t include parenthetical parts.
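Since there is no set of fixed criteria, any automated test can only be a first-pass heuristic. The sketch below flags a string as a likely expression when it is short and has no parenthetical part. The function name looks_like_expression and the three-word threshold are assumptions chosen for illustration, and the sketch does not attempt to judge lexicalization or compositionality, which require human knowledge of the language.

```python
def looks_like_expression(text, max_words=3):
    """Heuristic guess: short strings without parentheses are likely expressions.

    max_words is an arbitrary, hypothetical threshold; tune it per source.
    """
    has_parenthetical = "(" in text or ")" in text
    word_count = len(text.split())
    return 0 < word_count <= max_words and not has_parenthetical


if __name__ == "__main__":
    for candidate in ["dog", "kick the bucket", "a domesticated animal (Canis familiaris)"]:
        kind = "expression" if looks_like_expression(candidate) else "definition"
        print(candidate, "->", kind)
```

Results from such a heuristic are best treated as suggestions to be reviewed, not as final classifications.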