PanLex text standards

Introduction

PanLex aims to make data from thousands of sources interoperable. If two sources translate the same expression, it should be identifiable as the same expression. To make this happen (as much as is practical), we have defined text standards and we impose those on all text that we add to the PanLex database.

We apply these standards in two main contexts:

Basic standards

All text in PanLex complies with these basic Unicode standards:

  • Unicode character encoding
  • UTF-8 encoding form
  • NFC normalization form

Aspirational standards

We also try to make text in PanLex comply with other, less-well-defined standards:

  • Language-variety-specific standard scripts
  • Language-variety-specific standard orthographies

Details

Basic standards

Unicode character encoding

All characters must be represented with their Unicode encodings.

For example, the character “Ω” (Greek capital letter omega) has the Unicode encoding x3a9 (i.e. 3a9 as a hexadecimal number, conventionally written U+03A9), equivalent to 001110101001 in binary. Other encoding standards represent the same letter differently: the Windows 1253 encoding standard represents it as xd9, and the IBM 737 encoding standard represents it as x97. These are two of the many competing encoding standards. Some of them encode “Ω” more compactly because they are designed specifically for Greek, but such standards are not universal: the same xd9 that means “Ω” in Windows 1253 means “Щ” (Cyrillic capital letter shcha) in Windows 1251. PanLex requires “Ω” to be encoded only as x3a9.
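The ambiguity described above can be reproduced with a minimal Python sketch. It assumes only Python's built-in codecs, which name these encodings `cp1253`, `cp1251`, and `cp737`; the variable names are illustrative.

```python
# One byte, two meanings: the legacy encoding decides which letter 0xD9 is.
legacy_byte = b"\xd9"
as_greek = legacy_byte.decode("cp1253")      # Windows 1253 (Greek)
as_cyrillic = legacy_byte.decode("cp1251")   # Windows 1251 (Cyrillic)

# IBM 737 stores the same Greek letter as a different byte.
ibm_greek = b"\x97".decode("cp737")

# Once decoded, each character is identified by its Unicode value.
for label, ch in [("cp1253 0xD9", as_greek),
                  ("cp1251 0xD9", as_cyrillic),
                  ("cp737  0x97", ibm_greek)]:
    print(f"{label} -> U+{ord(ch):04X}")
```

Decoding a source with the wrong legacy encoding raises no error here; it silently yields a different letter, which is why the source's encoding has to be identified before the text is converted to Unicode.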

The Unicode Consortium continues to encode scripts. It prioritizes this work partly on the basis of need. PanLex is an associate member of the consortium, and we can help with the prioritization decisions by documenting the existence of sources in particular scripts and our need to have those scripts encoded. You should discuss any such scripts you discover with the PanLex staff, so we can decide whether to notify the Script Encoding Initiative and the Unicode Consortium about them.

UTF-8 encoding form

Characters encoded in compliance with the Unicode standard have values from 0x0 to 0x10ffff, and they can be stored in digital files in various ways (called encoding forms). One way to store them is to give each character 3 bytes, from 000000000000000000000000 to 000100001111111111111111 (that’s the entire range of possible Unicode character encodings, written in binary digits). But many common characters need only 1 or 2 bytes, so some encoding forms save space, at the cost of greater complexity, by storing characters in a variable number of bytes. The most common encoding form is UTF-8, which uses 1 to 4 bytes per character, and that is the one required by PanLex.
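As a rough illustration of the variable lengths, the following Python sketch counts the bytes that UTF-8 uses for four sample characters. The samples are chosen here for illustration and are not taken from any PanLex source.

```python
# Sample characters, written as escapes so the code is encoding-independent:
# Latin "A", Greek capital omega, a Canadian syllabics letter, and an emoji.
samples = ["A", "\u03a9", "\u1583", "\U0001F600"]

# Number of bytes each character occupies once encoded as UTF-8.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, utf8_lengths):
    print(f"U+{ord(ch):04X} takes {n} byte(s) in UTF-8")
```

Characters with small values take fewer bytes, so text in ASCII-range Latin letters is stored in one byte per character.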

If you are assimilating a source whose Unicode digital text is stored with another encoding form, such as UTF-16, you can convert it to UTF-8.
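One possible way to do such a conversion is sketched below in Python. The function name `utf16_to_utf8` is hypothetical, and the sketch assumes the input begins with a byte order mark (BOM), which Python's `utf-16` codec uses to detect the byte order.

```python
def utf16_to_utf8(data: bytes) -> bytes:
    """Re-encode UTF-16 bytes as UTF-8 (illustrative helper)."""
    # Decoding turns the stored bytes into abstract Unicode characters;
    # the "utf-16" codec consumes the BOM and honors the byte order it signals.
    text = data.decode("utf-16")
    # Encoding writes the same characters back out in the UTF-8 form.
    return text.encode("utf-8")


# Usage: a Greek capital omega stored as UTF-16, converted to UTF-8.
utf16_bytes = "\u03a9".encode("utf-16")
utf8_bytes = utf16_to_utf8(utf16_bytes)
print(utf8_bytes.hex())
```

The characters themselves are unchanged by the conversion; only their stored byte representation differs.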

NFC normalization form

The Unicode standard incorporates many compromises to improve compatibility with earlier standards. As a result, there are equivalences that could impair the efficacy of PanLex if we didn’t deal with them.

There are algorithms for converting characters or character sequences to other characters or character sequences considered equivalent. These algorithms are normalization forms, and the one adopted by PanLex is Normalization Form C (known as NFC). The most common thing that it does is convert sequences of a letter and some diacritical marks, such as “é”, to single characters that incorporate both the letter and the diacritical marks. They look the same, but they are encoded differently, so if they weren’t unified PanLex would include distinct forms of many expressions and its ability to translate would be impaired.
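The effect of NFC can be sketched with Python's standard `unicodedata` module. The helper name `to_nfc` is illustrative and is not part of any PanLex tool.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text converted to Normalization Form C (illustrative helper)."""
    return unicodedata.normalize("NFC", text)


# "e" followed by a combining acute accent: two characters.
decomposed = "e\u0301"
# NFC replaces the pair with the single precomposed character U+00E9.
composed = to_nfc(decomposed)

print(len(decomposed), len(composed))
print(decomposed == composed, to_nfc(decomposed) == to_nfc(composed))
```

The two strings look identical when displayed but compare as unequal until both are normalized, which is the mismatch that the NFC requirement prevents.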

Aspirational standards

Language-variety-specific standard scripts

Most of the world’s languages are not normally written and thus have no standard scripts. For those you should usually use whatever scripts your sources use.

Some languages that are commonly written have more than one script in use. For example, Inuktitut is commonly written in both the Latin script and Inuktitut syllabics (whose Inuktitut name is “ᖃᓂᐅᔮᖅᐸᐃᑦ”). We treat such script differences as a basis for the recognition of distinct language varieties.

Other languages have only a single common script, but are occasionally written in other scripts, for example in textbooks for foreign learners. We may choose not to use such data at all, or, if we do use it, we may define a distinct variety of the language for it. In simple cases it may also be possible to transliterate the nonstandard script into the language’s standard script.
