Resource organization | PanLex development

Introduction

Resources take various forms, but so far most PanLex resources either have been digital files or have been converted to digital files. Managing resources therefore mainly means managing files. As of June 02016, we are managing a collection of about 70,000 resource files. Since we have about 6,000 resources, there are, on average, about 12 files per resource.

We collect the files of any resource into a directory (folder), and often there is at least one subdirectory inside it. We also give most of the directories and files standard names, explained below.

Contents

Example
Subdirectories
File types
Names
- Generic names
  - Directories
  - Files
- Specific names

Example

Suppose there is a German–Igbo dictionary on the web, titled Wörterbuch Deutsch–Ibo, by Georg Smith and Anuli Jones. It has a page with the dictionary content and two pages describing the dictionary. It is a unidirectional dictionary, so this resource yields only one source. We have downloaded the pages, registered the source, and later assimilated the relevant data from the content page into our database. Now we have the original three files and also some files that we produced while assimilating the data.

Here is how we organize the files (directory names are shown with a trailing slash):

deu-ibo-Smith/
- deu-ibo-Smith-0.html
- deu-ibo-Smith-0to1.pl
- deu-ibo-Smith-1.txt
- deu-ibo-Smith-2.txt
- deu-ibo-Smith-3.txt
- deu-ibo-Smith-4.txt
- deu-ibo-Smith-final.txt
- secondary/
  - deu-ibo-Smith-doc-a.html
  - deu-ibo-Smith-doc-b.html
  - deu-ibo-Smith-notes.txt
- serialize.pl

There is a directory for all the files of the resource (deu-ibo-Smith), and within it there is also a secondary subdirectory containing the descriptive files and a file of notes we have made. The original source content file is deu-ibo-Smith-0.html, and the other files outside the secondary subdirectory were produced during the assimilation process.

Subdirectories

The example above contains only one subdirectory, secondary. We routinely create such a subdirectory when we initially organize the files of a resource.

More complex resources may deserve additional subdirectories. If the resource has many files, or files with a natural grouping, you can create subdirectories reflecting this.

What if a resource contains data for two or more sources, such as in the case of a bidirectional dictionary? If the data for each source are in a distinct file or set of files, then we don’t treat this as a single resource. Instead, we distribute the files into two or more directories, one per source, and we consider these directories to belong to distinct resources. If, however, the data for multiple sources are joined in the same file(s), such as in a PDF file containing a bidirectional dictionary, we leave the file unchanged rather than trying to split it. Later, when we assimilate data from the resource, we usually do this one source at a time. At that point it may be helpful to create source-specific subdirectories for the files generated in the assimilation process, so we don’t confuse them and their names don’t collide.

We just mentioned the possibility of distributing files into two or more single-source resources when that is practical. You may find that the content files are source-specific, but the descriptive files describe all the sources. In that case we usually duplicate the descriptive files, so each resource’s directory contains a copy of them.

File types

As illustrated above, the files of a resource are a mixture of functional types. We don’t formally classify the functions performed by files, but for discussion purposes here is an intuitive classification:

Files at the root level of the directory
- Content files (the original data)
- Assimilation files (files we produce while selecting and assimilating data)
  - Scripts (programs performing source analysis)
    - Tabularization scripts
    - Serialization script
    - Utility scripts (standard programs called by our custom scripts)
  - Utility data (tables of information used by scripts)
  - Version files (revisions of content files produced in assimilation)
    - Intermediate version files (all version files except the last)
    - Final source file (the version file that we import into the database)
Files in the secondary subdirectory
- Documentation files (descriptive files that accompanied the content files)
- Font files (fonts that accompanied the content files)
- Note files (our notes about the resource and our assimilation of its data)
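
To make the classification above concrete, here is a minimal Python sketch that guesses a file's type from its name. The patterns are inferred from the deu-ibo-Smith example in this document; the function name classify and the returned labels are illustrative assumptions, not part of any PanLex tool. Font files, utility scripts, and utility data are not recognized.

```python
import re


def classify(file_name: str, in_secondary: bool = False) -> str:
    """Guess the functional type of a resource file from its name.

    Hypothetical helper: the patterns come from the example directory,
    not from a formal PanLex specification.
    """
    if in_secondary:
        # Descriptive files end in -doc or -doc-<letter> before the extension.
        if re.search(r"-doc(-[a-z])?\.[^.]+$", file_name):
            return "documentation file"
        if re.search(r"-notes\.[^.]+$", file_name):
            return "note file"
        return "other secondary file"
    if file_name == "serialize.pl":
        return "serialization script"
    # Tabularization scripts end in <from>to<to>, such as -0to1.pl.
    if re.search(r"-\d+to\d+\.[^.]+$", file_name):
        return "tabularization script"
    if re.search(r"-final\.[^.]+$", file_name):
        return "final source file"
    # Version 0 is the content file; higher numbers are intermediate versions.
    match = re.search(r"-(\d+)\.[^.]+$", file_name)
    if match:
        if int(match.group(1)) == 0:
            return "content file"
        return "intermediate version file"
    return "unclassified"


print(classify("deu-ibo-Smith-0.html"))     # content file
print(classify("deu-ibo-Smith-0to1.pl"))    # tabularization script
print(classify("deu-ibo-Smith-final.txt"))  # final source file
```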

Names

Every resource directory and all of its subdirectories and files have names, in most cases chosen according to standard naming practices.

Generic names

Some of the subdirectories and files have generic names. A generic name says something about the kind of directory or file that bears it, but does not identify the particular resource.

Directories

The secondary subdirectory has a generic name. If you decide to create other subdirectories, they, too, may have generic names, such as content and html.

Files

Some files used during assimilation have generic names:

- The main serialization script’s name is serialize.pl.
- Utility scripts have generic names, such as leftright.pl and column_heuristic.pl.
- Files of utility data have generic names, such as csppmap.txt and mcsmap.txt.

Specific names

Any directory or file without a generic name has a specific name. That is a name that, among other things, identifies the particular resource or one of its particular sources.

Specific names are composed of segments, delimited by hyphen-minus characters. (That character, U+002D, is the one you get by typing “-” on a normal keyboard. There are other, specialized characters named “hyphen” (U+2010) and “minus sign” (U+2212).) A specific name begins with at least one language segment, followed by a differentiating segment.

Language segments

Each language segment in a specific name consists of a PanLex-compliant ISO 639 code. Language segments generally appear in the same order as the languages appear in a resource. For example, if a resource contains one source and it translates “from” Korean “into” Swahili, then any specific name will contain the substring kor-swh.

For a bidirectional resource, one language code usually appears as both the first and last language segment. For example, the language segments for a Russian–Pilipino/Pilipino–Russian resource would be rus-fil-rus.

If you are naming directories and files for a resource that covers two or more varieties of the same language, don’t repeat its language code. As illustrated, language segments are only that: they consist of language codes, without any variety codes.

Some resources contain data for dozens, hundreds, or even thousands of languages. Names would be unwieldy if all the languages’ codes were included. Moreover, for compact display, the name of a resource directory is limited to 60 characters. When many languages are involved, you can collapse some or most of them, or even all of them, into the code mul. You could, for example, use spa-mul for a resource that translates Spanish into 25 different languages.
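
As a rough illustration of the language-segment rules above, the following Python sketch joins language codes with hyphen-minus characters and collapses the tail into mul when there are many codes. The function name language_segments and the parameters keep and limit are hypothetical; the text above sets no fixed threshold, so limit=3 is only an assumption for the example.

```python
def language_segments(codes, keep=1, limit=3):
    """Join language codes into the language part of a specific name.

    Hypothetical helper. If there are more than `limit` codes, the first
    `keep` codes are retained and all the others collapse into "mul".
    """
    codes = list(codes)
    if len(codes) <= limit:
        return "-".join(codes)
    return "-".join(codes[:keep] + ["mul"])


print(language_segments(["kor", "swh"]))         # kor-swh
print(language_segments(["rus", "fil", "rus"]))  # rus-fil-rus
print(language_segments(["spa", "eng", "fra", "deu", "ita"]))  # spa-mul
```

Which codes to keep and which to collapse remains a judgment call; the sketch only shows the mechanics.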

Differentiating segment

After the language segment(s) in a specific name, the last segment is a differentiating segment. It differentiates any resource from all others whose names have the same sequence of language segments.

We choose the differentiating segment according to a priority schedule:

1. Surname of the principal author
   - If it contains multiple words separated by spaces, collapse them with no space between, or choose the main word alone.
2. Initial letters of the main words in the title
   - This is chosen if there is no author.
   - Example: If the title is Wörterbuch Deutsch–Ibo and there is no known author, the differentiating segment is WDI.
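
The priority schedule above can be sketched in Python as follows. This is only an illustration: the function name differentiating_segment is hypothetical, it always collapses a multi-word surname rather than choosing the main word, and it treats every word of the title as a main word, which will not always match a human choice.

```python
import re


def differentiating_segment(surname=None, title=None):
    """Choose a differentiating segment: surname first, title initials second.

    Hypothetical helper illustrating the priority schedule.
    """
    if surname:
        # Collapse a multi-word surname by removing the spaces.
        return "".join(surname.split())
    if title:
        # Assumption: every word in the title counts as a main word.
        words = re.findall(r"\w+", title)
        return "".join(word[0].upper() for word in words)
    raise ValueError("either a surname or a title is required")


print(differentiating_segment(surname="Smith"))                 # Smith
print(differentiating_segment(title="Wörterbuch Deutsch–Ibo"))  # WDI
```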

Uniqueness

Each resource directory (such as deu-ibo-Smith in the above example) must have a unique name. The directories are stored together within the same parent directory, so no two directories can have the same name.

What, then, if there is more than one resource documenting the same language varieties in the same order, with principal authors having the same surname? In such a case, make the differentiating segment more differentiating. If “Smith” isn’t enough, for example, because the same authors wrote both Biologisches Wörterbuch Deutsch–Ibo and Medizinisches Lexikon Deutsch–Ibo, you can use SmithMLDI as the differentiating segment for the latter. Or, if Smith and Jones wrote one resource and Smith and Bauer wrote the other, you can use SmithB as the latter’s differentiating segment.
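
Before creating a resource directory, you could check a proposed name against the two constraints mentioned in this document: uniqueness within the parent directory and the 60-character limit. The Python sketch below assumes you already have the existing directory names as a collection of strings; the function name name_problems and the wording of its messages are hypothetical.

```python
def name_problems(name, existing_names, max_length=60):
    """Return a list of problems with a proposed resource directory name.

    Hypothetical helper; an empty list means no problem was found.
    """
    problems = []
    if len(name) > max_length:
        problems.append("too long")
    if name in existing_names:
        problems.append("already in use")
    return problems


existing = {"deu-ibo-Smith"}
print(name_problems("deu-ibo-Smith", existing))      # ['already in use']
print(name_problems("deu-ibo-SmithMLDI", existing))  # []
```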

Resource directory names

We give specific names, as described above, to resource directories.

Specific subdirectory names

Except for subdirectories with generic names, we give appropriate specific names to subdirectories. For example, if a resource contains a bidirectional PDF file but also numerous secondary files that are direction-specific, you can create 2 subdirectories within the secondary directory, such as rus-fil-Андреев and fil-rus-Андреев for the 2 sets of files.

Content file names

General case

A digital file name conventionally contains a base and an extension. The extension indicates the standard format of the file. Examples of extensions are .pdf for PDF files, .html for HTML files, and .txt for text files.

If a resource contains a single content file, the base of that file’s name consists of the directory name followed by -0, indicating that it is version 0 of the content file. Version 0 is the version that we acquire, before we begin assimilating its data. Thereafter we produce other versions, with bases ending in -1, -2, etc.

A resource may contain multiple content files, each containing a chapter, or the data for words starting with a single letter of the alphabet, etc. If there are only a few, you can rename them according to our standard, and also differentiate them with letters before the version numbers, as in:

deu-ibo-Smith-a-0.pdf
deu-ibo-Smith-b-0.pdf

However, if there are many content files, as happens when the words starting with each letter are published in a distinct file, such renaming is cumbersome, and you may leave the files named as they are, located in a subdirectory.
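
The naming pattern for renamed content files and their later versions can be sketched in Python like this. The function name content_file_name and its parameters are hypothetical; the sketch simply joins the directory name, an optional differentiating letter, and the version number with hyphen-minus characters.

```python
def content_file_name(directory, extension, version=0, part=None):
    """Build a content or version file name from its components.

    Hypothetical helper. `part` is the optional differentiating letter
    used when a resource has a few content files.
    """
    segments = [directory]
    if part is not None:
        segments.append(part)
    segments.append(str(version))
    return "-".join(segments) + extension


print(content_file_name("deu-ibo-Smith", ".html"))           # deu-ibo-Smith-0.html
print(content_file_name("deu-ibo-Smith", ".pdf", part="a"))  # deu-ibo-Smith-a-0.pdf
print(content_file_name("deu-ibo-Smith", ".txt", version=3)) # deu-ibo-Smith-3.txt
```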

Resources often exist in multiple formats, either when we procure them or later. The file extensions usually indicate the formats. For example, if an attempt has been made to recognize the text in a PDF image file named deu-ibo-Smith-0.pdf with ABBYY FineReader, that program might give one of its output files the name deu-ibo-Smith-0_ABBYY_Basic.xml. If you wish, you can leave that name unchanged, since it conveys the relevant information.

Internet Archive case

Resources procured in partnership with the Internet Archive have specially named files, and we leave the names given to them by the Internet Archive unchanged. We keep 4 files from each such resource, named as in this example:

klaravortaroespe00abra.pdf
klaravortaroespe00abra_abbyy.gz
klaravortaroespe00abra_djvu.txt
klaravortaroespe00abra_djvu.xml

When assimilating data from such a resource, we can decide which of these files to use and designate it as version 0 by appending -0 to the base of its name.
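
Appending -0 to the base of a name, while leaving the extension alone, can be sketched in Python with the standard pathlib module. The function name designate_version_zero is hypothetical, and the sketch only computes the new name; it does not rename any file.

```python
from pathlib import PurePath


def designate_version_zero(file_name):
    """Return the file name with -0 appended to its base.

    Hypothetical helper: the base is everything before the last extension.
    """
    path = PurePath(file_name)
    return f"{path.stem}-0{path.suffix}"


print(designate_version_zero("klaravortaroespe00abra_djvu.txt"))
# klaravortaroespe00abra_djvu-0.txt
```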

Program file names

A tabularization program generally reads in one version of a file and writes out the next version. Its name shows this: the base ends with the input version number, then “to”, then the output version number. A Python tabularization script that works on version 2 and creates version 3 of the above-mentioned dictionary would have the name deu-ibo-Smith-2to3.py.
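
Here is a minimal Python sketch of that naming rule. The function name tabularization_script_name is hypothetical, and the default of writing version N+1 reflects only the general case described above; pass to_version explicitly if a script skips versions.

```python
def tabularization_script_name(directory, from_version, extension,
                               to_version=None):
    """Build the name of a script that turns one version into another.

    Hypothetical helper. By default the output version is the input
    version plus one.
    """
    if to_version is None:
        to_version = from_version + 1
    return f"{directory}-{from_version}to{to_version}{extension}"


print(tabularization_script_name("deu-ibo-Smith", 2, ".py"))  # deu-ibo-Smith-2to3.py
print(tabularization_script_name("deu-ibo-Smith", 0, ".pl"))  # deu-ibo-Smith-0to1.pl
```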

Secondary file names

Descriptive files are usually named with the directory name followed by -doc, and then, if necessary, a differentiating letter, as illustrated in the example.

deu-ibo-Smith-doc-a.html
deu-ibo-Smith-doc-b.html

Files containing our own notes are usually named with the directory name followed by -notes, and then an extension, as shown in the example.
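
The secondary file names can be generated in the same spirit. In this Python sketch the function name secondary_file_names is hypothetical, the differentiating letter is added only when there is more than one descriptive file, and the notes file is assumed to be a .txt file unless you say otherwise.

```python
import string


def secondary_file_names(directory, doc_extensions, notes_extension=".txt"):
    """List names for the descriptive files and the notes file of a resource.

    Hypothetical helper. `doc_extensions` holds one extension per
    descriptive file; at most 26 files get a differentiating letter.
    """
    if len(doc_extensions) == 1:
        docs = [f"{directory}-doc{doc_extensions[0]}"]
    else:
        docs = [
            f"{directory}-doc-{letter}{extension}"
            for letter, extension in zip(string.ascii_lowercase, doc_extensions)
        ]
    return docs + [f"{directory}-notes{notes_extension}"]


for name in secondary_file_names("deu-ibo-Smith", [".html", ".html"]):
    print(name)
# deu-ibo-Smith-doc-a.html
# deu-ibo-Smith-doc-b.html
# deu-ibo-Smith-notes.txt
```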