Standard software for assimilation

Introduction

The standard software that you need for assimilation depends on the strategy.

Minimal software

If you are doing interpretation, you need a web browser and a text editor, both of which must correctly display, and permit you to enter, arbitrary Unicode characters, including mixtures of left-to-right and right-to-left characters. For that purpose, you also need fonts that cover the range of Unicode characters.

If you are doing analysis, you need the same software, and more.

Web browsers

In any kind of assimilation, you sometimes need access to the PanLex database via its expert web interface, PanLem. Most current web browsers have the necessary features. If you find that a browser has trouble with PanLem, trying another browser is likely to resolve the problem.

Text editors

You may be able to use your preferred text editor for assimilation, if it supports Unicode and bidirectional text.

Some editors that we have found mostly compliant with these requirements are (“$” = non-free):

Bidirectional support

Both web browsers and text editors should support bidirectional text if they are used for assimilation. You can manage without such support only if the source you are assimilating contains no right-to-left characters. A source does contain them if any of its text is written in the Arabic, Hebrew, or one of several other scripts.

Bidirectional support is far from universal. Consider, for example, the following line from a source, displayed in 6 different text applications on an OS X host (1 = Safari, 2 = LibreOffice Writer, 3 = TextWrangler, 4 = TextEdit, 5 = Terminal, 6 = Bluefish). Of these, only TextEdit seems correct. The line begins with Arabic letters, so it should begin on the right. The Arabic letters should not appear separated by spaces. The braces should be balanced.

[Image: bidi text in 6 editors]
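
Before choosing tools for a particular source, it can help to find out whether the source contains any right-to-left characters at all. The following is a minimal sketch, not a PanLex tool: it assumes the source has already been decoded into Python strings, and the function name rtl_lines and the sample lines are invented for illustration. It relies on the bidirectional class that the Unicode Character Database assigns to each character.

```python
import unicodedata

# Bidi classes of strongly right-to-left characters:
# "R" (for example Hebrew letters) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}

def rtl_lines(lines):
    """Return (line_number, line) pairs for lines containing RTL characters."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in line):
            hits.append((number, line))
    return hits

# Hypothetical sample lines; the non-ASCII text is written with escapes
# so that this listing does not itself depend on bidirectional display.
sample = [
    "water",
    "\u0645\u0627\u0621 {noun}",   # Arabic letters followed by braces
    "\u05DE\u05D9\u05DD",          # Hebrew letters
]

for number, line in rtl_lines(sample):
    # ascii() prints escapes, so the output is unambiguous in any terminal.
    print(number, ascii(line))
```

If the function returns no lines for a whole source, bidirectional support is probably not needed for that source.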

Complex script support

Web browsers and text editors should, for our purposes, also support complex scripts. Script complexity takes various forms, but common manifestations include:

  • Letters that appear in a different order from their logical order.
  • Letters with various shapes that depend on context.
  • Diacritical marks that appear in different locations that depend on what letters they are attached to.

If your sources are written in complex scripts, you may find that some popular browsers and editors fail to support them properly.
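
When you suspect that an application is displaying a complex script incorrectly, one way to check is to list the characters in their logical (stored) order and compare that with what you see. The sketch below is only an illustration; the function name describe is invented, and the Devanagari example was chosen because its vowel sign is stored after the consonant but drawn before it.

```python
import unicodedata

def describe(text):
    """List each character of text in logical (stored) order."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))
        for ch in text
    ]

# Devanagari "ki": the vowel sign I is stored after the letter KA,
# although a compliant renderer draws it to the left of the letter.
for row in describe("\u0915\u093F"):
    print(*row)
```

Categories beginning with "M" identify combining marks, which are the characters most likely to be misplaced by software without complex script support.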

Fonts

PanLex data can contain most Unicode characters, but some computer operating systems, as delivered, do not display some characters because of the limited repertoires of the fonts installed on them. For assimilation, you may need to install additional fonts. The most useful of these are the fonts in the Noto suite.

When you assimilate a source that has a font-based pre-Unicode encoding, it can be useful to find a copy of the (non-Unicode) font on which its encoding is based and install that font. That can help you see the characters as they should appear and determine how to map each codepoint in the source file to its proper Unicode codepoint, if there isn’t an encoding converter capable of doing that for you.
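
Once you have worked out such a mapping, applying it can be simple. The following sketch assumes a purely hypothetical legacy font in which three Latin-1 codepoints were repurposed for other letters; the table contents and the function name convert are invented, and a real table must be derived by inspecting the source in its original font.

```python
# Hypothetical mapping for an imaginary legacy font. Each key is the
# character that the source file appears to contain; each value is the
# Unicode character that the legacy font actually displayed in its place.
LEGACY_TO_UNICODE = {
    "\u00E3": "\u0103",  # slot of a-tilde shows a-breve
    "\u00FE": "\u021B",  # slot of thorn shows t-comma-below
    "\u00BA": "\u0219",  # slot of masculine ordinal shows s-comma-below
}

# str.maketrans builds a translation table usable by str.translate.
TABLE = str.maketrans(LEGACY_TO_UNICODE)

def convert(text):
    """Replace legacy codepoints with their Unicode equivalents."""
    return text.translate(TABLE)

print(ascii(convert("cas\u00E3")))
```

Characters that are absent from the table pass through unchanged, so it is worth reviewing the converted text for any remaining unexpected characters.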

Advanced software

Analysis requires more than the minimal standard software. The additional software depends on the source and on you.

Programming languages

Analysts use programming languages to design rules and apply them to source data, mainly during tabularization.

The programming languages most often used by PanLex analysts have been Perl, Python, and JavaScript. Other languages can be effective, too, if they offer similarly extensive support for the Unicode standard and for regular expressions, and if they have libraries allowing you to parse HTML/XML/SGML data.
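
As an illustration of how these capabilities combine, the sketch below uses only the Python standard library to turn a small HTML fragment into tab-delimited rows. The markup pattern (a bold headword followed by glosses separated by semicolons or commas), the class name EntryParser, and the function name tabularize are all assumptions made for this example; real sources vary widely and need their own rules.

```python
import re
from html.parser import HTMLParser

class EntryParser(HTMLParser):
    """Collect (headword, rest) pairs from <p><b>headword</b> rest</p>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_b = False
        self._head = []
        self._rest = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # A new entry starts: reset the collectors.
            self._head, self._rest = [], []
        elif tag == "b":
            self._in_b = True

    def handle_endtag(self, tag):
        if tag == "b":
            self._in_b = False
        elif tag == "p":
            head = "".join(self._head).strip()
            rest = "".join(self._rest).strip()
            self.rows.append((head, rest))

    def handle_data(self, data):
        # Text inside <b> is the headword; other text is the gloss list.
        (self._head if self._in_b else self._rest).append(data)

def tabularize(html):
    """Return one tab-delimited row per entry: headword, then each gloss."""
    parser = EntryParser()
    parser.feed(html)
    parser.close()
    rows = []
    for head, rest in parser.rows:
        # In Python 3, str patterns are Unicode-aware by default.
        glosses = [g for g in re.split(r"\s*[;,]\s*", rest) if g]
        rows.append("\t".join([head] + glosses))
    return rows

# Hypothetical fragment with an Arabic and a Latin-script headword.
html = "<p><b>\u0645\u0627\u0621</b> water; aqua</p><p><b>caf\u00e9</b> coffee</p>"
for row in tabularize(html):
    print(ascii(row))
```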

A programming language that you want to use may already be installed as a standard component of your computer’s operating system. Even so, the installed version may be obsolete and may fail to support some Unicode characters that a later version supports, so you may need to update it.
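
If the language is Python, for example, you can check which version of the Unicode Character Database the installed interpreter includes. This is a small sketch; the function names unicode_support and is_assigned are invented for illustration.

```python
import sys
import unicodedata

def unicode_support():
    """Report the interpreter version and its Unicode data version."""
    return {
        "python": ".".join(map(str, sys.version_info[:3])),
        "unicode": unicodedata.unidata_version,
    }

def is_assigned(ch):
    """True if this interpreter's Unicode data treats ch as assigned.

    Category "Cn" means "not assigned" in the installed data, which is
    what an outdated interpreter reports for characters added later.
    """
    return unicodedata.category(ch) != "Cn"

print(unicode_support())
```

Calling is_assigned on a character from your source that you know to be valid shows whether the installed version is recent enough to recognize it.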

Database clients

If you are an advanced editor and wish to get more information from the database than PanLem allows, you may want to get staff authorization to query the database directly with SQL. The PanLex database is maintained in PostgreSQL, and the most common PostgreSQL client is psql.

To install psql:

  • On OS X, the psql command-line client should already be installed.
  • On Windows, run Cygwin’s setup.exe. When prompted for which packages to install, select postgresql-client from the Database category. Once the packages have finished downloading and installing, you can run psql from the Cygwin Terminal.
  • You can also access the PanLex database with pgAdmin and DBVisualizer, graphical applications available for both OS X and Windows.

When querying the database, you may find it useful to consult summary reference documents, including:
