Applying basic PanLex text standards

Introduction

There are three basic PanLex text standards: Unicode character encoding, the UTF-8 encoding form, and Normalization Form C (NFC).

Unicode character encoding

The first basic PanLex text standard requires that all characters in the text be represented with their Unicode encodings.

If you are assimilating data from a source whose digital text is encoded in compliance with a different standard, you may be able to convert it to Unicode automatically. Some text editors and encoding converters (see below) can read text in various encodings, convert it, and write a new version of it in Unicode encoding forms.
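As a minimal sketch of such an automatic conversion, the Python snippet below decodes bytes from a known legacy encoding into Unicode text. The function name decode_legacy and the choice of Windows-1252 as the source encoding are assumptions made for illustration; your source may use a different encoding.

```python
def decode_legacy(raw, source_encoding):
    """Decode bytes in a known legacy encoding into a Unicode string."""
    # errors="strict" raises UnicodeDecodeError on bytes that are invalid
    # in the named encoding, which helps reveal a wrong guess early.
    return raw.decode(source_encoding, errors="strict")


# "Straße" as stored in Windows-1252, where ß is the single byte 0xDF.
sample = b"Stra\xdfe"
text = decode_legacy(sample, "cp1252")

# Once decoded, the text can be written out in any Unicode encoding form.
utf8_bytes = text.encode("utf-8")
```

Strict error handling is a deliberate choice here: silently replacing undecodable bytes would hide exactly the cases that need investigation.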

You are likely, however, to encounter some sources whose encoding does not fully comply with any major standard. Creators of digital sources often improvise. You may need to investigate particular characters or install non-Unicode fonts so you can visualize the text and then determine how it has been encoded. Encoding questions are particularly likely to arise when you extract text from a PDF file.

One method for investigating encodings is to generate a frequency table that counts the number of instances of each character in a source file and to examine the table.
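One possible way to build such a frequency table is sketched below in Python. The function name char_frequencies is hypothetical, and the sketch assumes the file has already been read into a string; if the encoding is still unknown, reading the file as Latin-1 maps every byte to one codepoint, so you can at least count byte values.

```python
from collections import Counter
import unicodedata


def char_frequencies(text):
    """Return (codepoint label, Unicode name, count) rows, most frequent first."""
    rows = []
    for ch, count in Counter(text).most_common():
        # Control characters and unassigned codepoints have no name.
        name = unicodedata.name(ch, "<unnamed>")
        rows.append(("U+%04X" % ord(ch), name, count))
    return rows


# Print a small table for a sample string.
for label, name, count in char_frequencies("Stra\u00dfe, Stra\u00dfen"):
    print(label, name, count, sep="\t")
```

Rare characters at the bottom of the table are often the ones that reveal an improvised encoding.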

Once you know how the text is encoded, if you cannot find an existing conversion tool, you can create your own, using global replacement of characters and character sequences. Regular expressions, hash tables in Perl, and dictionaries in Python are among the applicable tools. When including actual characters in your code is inconvenient, you can use escaped numeric representations of codepoints instead, such as \x{00df} in Perl.
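A custom converter of this kind might look like the following Python sketch, which combines a dictionary with a regular expression. The mapping in CUSTOM_MAP is entirely hypothetical (it does not describe any real source) and the function name convert_custom is an assumption. Codepoints are written as Python escapes such as "\u00df".

```python
import re

# Hypothetical mapping for an imagined improvised encoding.
# Keys are what the source contains; values are the intended Unicode text.
CUSTOM_MAP = {
    "n~": "\u00f1",      # the sequence "n~" stands for ñ
    "e'": "\u00e9",      # the sequence "e'" stands for é
    "\u00df": "\u014b",  # ß was used as a stand-in for ŋ
}

# Try longer keys first so a multi-character sequence wins over any
# shorter key that is a prefix of it.
_PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(CUSTOM_MAP, key=len, reverse=True))
)


def convert_custom(text):
    """Replace every mapped character or sequence in a single pass."""
    return _PATTERN.sub(lambda match: CUSTOM_MAP[match.group(0)], text)


converted = convert_custom("man~ana \u00dfa")
```

Doing all replacements in one pass avoids the risk that the output of one replacement is accidentally matched by a later one.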

UTF-8 encoding form

The second basic PanLex text standard requires that the text be stored in compliance with the UTF-8 encoding form. In fact, tools that convert non-Unicode encodings to Unicode must choose some encoding form to write the new file in, so your automatic conversion will probably deal with both these PanLex standards at the same time.

Tools that can convert from some non-Unicode encodings to Unicode and, simultaneously, UTF-8 include:

Such tools can also serve the limited purpose of converting a different Unicode encoding form, such as UTF-16, to UTF-8.
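For the limited UTF-16 case, a short Python sketch is shown below. The function name transcode_to_utf8 and its parameters are assumptions for illustration. Python's "utf-16" codec reads the byte order mark to determine byte order and removes it from the decoded text.

```python
def transcode_to_utf8(src_path, dst_path, source_encoding="utf-16"):
    """Read a text file in source_encoding and rewrite it as UTF-8."""
    # newline="" disables newline translation, so line endings are
    # carried over unchanged.
    with open(src_path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

The same function handles other source encodings if you pass a different source_encoding value, provided Python ships a codec for it.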

Some scripts have not yet been encoded in Unicode, so their characters have no Unicode encodings. You can’t use the non-Unicode encoding used by a source, so what can you do? If a standard conversion exists to Unicode characters that are used in lieu of the authentic characters, you can use it. If not, you can create and document a custom conversion. If and when the script is encoded in Unicode, your documentation can be the basis for a conversion from the custom encoding to the new standard encoding of the script.

The Unicode Consortium continues to encode scripts. It prioritizes this work partly on the basis of need. PanLex is an associate member of the consortium, and we can help with the prioritization decisions by documenting the existence of sources in particular scripts and our need to have those scripts encoded. You should discuss any such scripts you discover with the PanLex staff, so we can decide whether to notify the Script Encoding Initiative and the Unicode Consortium about them.

NFC normalization form

The third basic PanLex text standard requires that the text comply with Normalization Form C.

Some programming libraries can perform this normalization. In Perl, for example, there is Unicode::Normalize.

You may be able to ignore this standard when assimilating data. If your data consist of Unicode text that is not NFC, for example if the character “ü” is encoded as two characters (“u” followed by the combining diaeresis U+0308) rather than one (U+00FC), PanLem corrects this during importation.
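If you prefer to check or normalize text yourself, Python's standard unicodedata module offers a counterpart to the Perl library. The sketch below uses the “ü” case as its example; the function name to_nfc is an assumption, and unicodedata.is_normalized requires Python 3.8 or later.

```python
import unicodedata


def to_nfc(text):
    """Return text converted to Normalization Form C."""
    return unicodedata.normalize("NFC", text)


# "ü" written as two codepoints: u + COMBINING DIAERESIS.
decomposed = "u\u0308"
composed = to_nfc(decomposed)

# Check whether a string is already in NFC without rewriting it.
was_nfc = unicodedata.is_normalized("NFC", decomposed)
is_nfc_now = unicodedata.is_normalized("NFC", composed)
```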
