Introduction
Resources take various forms, but so far most PanLex resources have either consisted of digital files or been converted to digital files. Managing resources therefore mainly involves managing files. As of June 02016, we are managing a collection of about 70,000 resource files. Since we have about 6,000 resources, there are, on average, about 12 files per resource.
We collect the files of any resource into a directory (folder), and often there is at least one subdirectory inside it. We also give most of the directories and files standard names, explained below.
Example
Suppose there is a German–Igbo dictionary on the web, titled Wörterbuch Deutsch–Ibo, by Georg Smith and Anuli Jones. It has a page with the dictionary content and two pages describing the dictionary. It is a unidirectional dictionary, so this resource yields only one source. We have downloaded the pages, registered the source, and later assimilated the relevant data from the content page into our database. Now we have the original three files and also some files that we produced while assimilating the data.
Here is how we organize the files (directories are marked with a trailing slash, and indentation shows nesting):
deu-ibo-Smith/
    deu-ibo-Smith-0.html
    deu-ibo-Smith-0to1.pl
    deu-ibo-Smith-1.txt
    deu-ibo-Smith-2.txt
    deu-ibo-Smith-3.txt
    deu-ibo-Smith-4.txt
    deu-ibo-Smith-final.txt
    secondary/
        deu-ibo-Smith-doc-a.html
        deu-ibo-Smith-doc-b.html
        deu-ibo-Smith-notes.txt
    serialize.pl
There is a directory for all the files of the resource (deu-ibo-Smith), and within it there is also a secondary subdirectory containing the descriptive files and a file of notes we have made. The original source content file is deu-ibo-Smith-0.html, and the other files outside the secondary subdirectory were produced during the assimilation process.
Subdirectories
The example above contains only one subdirectory, secondary. We routinely create such a subdirectory when we initially organize the files of a resource.
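The routine setup just described could be scripted along the following lines. This is a minimal sketch, not PanLex's actual tooling: the function name create_resource_directory and the idea of passing a parent path are assumptions made for illustration; only the subdirectory name secondary comes from the text.

```python
from pathlib import Path


def create_resource_directory(parent, name):
    """Create a resource directory with an empty 'secondary' subdirectory.

    Hypothetical helper for illustration. Raises FileExistsError if a
    directory with this name already exists under the parent.
    """
    resource_dir = Path(parent) / name
    resource_dir.mkdir(parents=True)      # fails if the name is already taken
    (resource_dir / "secondary").mkdir()  # descriptive files and notes go here
    return resource_dir
```

Calling create_resource_directory("resources", "deu-ibo-Smith") would produce the outer directory and its secondary subdirectory from the example, ready to receive the downloaded files.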
More complex resources may deserve additional subdirectories. If the resource has many files, or files with a natural grouping, you can create subdirectories reflecting this.
What if a resource contains data for two or more sources, such as in the case of a bidirectional dictionary? If the data for each source are in a distinct file or set of files, then we don’t treat this as a single resource. Instead, we distribute the files into two or more directories, one per source, and we consider these directories to belong to distinct resources. If, however, the data for multiple sources are joined in the same file(s), such as in a PDF file containing a bidirectional dictionary, we leave the file unchanged rather than trying to split it. Later, when we assimilate data from the resource, we usually do this one source at a time. At that point it may be helpful to create source-specific subdirectories for the files generated in the assimilation process, so we don’t confuse them and their names don’t collide.
We just mentioned the possibility of distributing files into two or more single-source resources when that is practical. You may find that the content files are source-specific, but the descriptive files describe all the sources. In that case we usually duplicate the descriptive files, so each resource’s directory contains a copy of them.
File types
As illustrated above, the files of a resource are a mixture of functional types. We don’t formally classify the functions performed by files, but for discussion purposes here is an intuitive classification:
- Files at the root level of the directory
  - Content files (the original data)
  - Assimilation files (files we produce while selecting and assimilating data)
    - Scripts (programs performing source analysis)
      - Tabularization scripts
      - Serialization script
      - Utility scripts (standard programs called by our custom scripts)
    - Utility data (tables of information used by scripts)
    - Version files (revisions of content files produced in assimilation)
      - Intermediate version files (all version files except the last)
      - Final source file (the version file that we import into the database)
- Files in the secondary subdirectory
  - Documentation files (descriptive files that accompanied the content files)
  - Font files (fonts that accompanied the content files)
  - Note files (our notes about the resource and our assimilation of its data)
Names
Every resource directory and all of its subdirectories and files have names, in most cases chosen according to standard naming practices.
Generic names
Some of the subdirectories and files have generic names. A generic name says something about the kind of directory or file that bears it, but does not identify the particular resource.
Directories
The secondary subdirectory has a generic name. If you decide to create other subdirectories, they, too, may have generic names, such as content and html.
Files
Some files used during assimilation have generic names:
- The main serialization script’s name is serialize.pl.
- Utility scripts have generic names, such as leftright.pl and column_heuristic.pl.
- Files of utility data have generic names, such as csppmap.txt and mcsmap.txt.
Specific names
Any directory or file without a generic name has a specific name. That is a name that, among other things, identifies the particular resource or one of its particular sources.
Specific names are composed of segments, delimited by hyphen-minus characters. (That character is the one you get by typing “-” on a normal keyboard. There are other specialized characters named “hyphen” and “minus”.) A specific name begins with at least one language segment, followed by a differentiating segment.
Language segments
Each language segment in a specific name consists of a PanLex-compliant ISO 639 code. Language segments generally appear in the same order as the languages appear in a resource. For example, if a resource contains one source and it translates “from” Korean “into” Swahili, then any specific name will contain the substring kor-swh.
For a bidirectional resource, one language code usually appears as both the first and last language segment. For example, the language segments for a Russian–Pilipino/Pilipino–Russian resource would be rus-fil-rus.
If you are naming directories and files for a resource that covers two or more varieties of the same language, don’t repeat its language code. As illustrated, language segments are only that: they consist of language codes, without any variety codes.
Some resources contain data for dozens, hundreds, or even thousands of languages. Names would be unwieldy if all the languages’ codes were included. Moreover, for compact display, the name of a resource directory is limited to 60 characters. When many languages are involved, you can collapse some or most of them, or even all of them, into the code mul. You could, for example, use spa-mul for a resource that translates Spanish into 25 different languages.
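The language-segment rules can be illustrated with a short sketch. It assumes the caller supplies the language codes in the order they should appear; the function name language_segments and the threshold max_codes are hypothetical, since the text leaves the decision of when to collapse codes into mul to the namer's judgment.

```python
def language_segments(codes, max_codes=3):
    """Join language codes with hyphen-minus, collapsing a long tail into 'mul'.

    Hypothetical helper: max_codes is an assumed threshold, not a PanLex rule.
    """
    if len(codes) <= max_codes:
        return "-".join(codes)
    # Keep the first language and collapse all the others into 'mul'.
    return "-".join([codes[0], "mul"])
```

With this sketch, ["kor", "swh"] yields kor-swh, ["rus", "fil", "rus"] yields rus-fil-rus, and a list with spa followed by 25 other codes yields spa-mul.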
Differentiating segment
After the language segment(s) in a specific name, the last segment is a differentiating segment. It differentiates any resource from all others whose names have the same sequence of language segments.
We choose the differentiating segment according to a priority schedule:
- Surname of the principal author
  - If it contains multiple words separated by spaces, collapse them with no space between, or choose the main word alone.
- Initial letters of the main words in the title
  - This is chosen if there is no author.
  - Example: If the title is Wörterbuch Deutsch–Ibo and there is no known author, the differentiating segment is WDI.
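The priority schedule above can be sketched as a small function. This is an illustration under stated assumptions, not an official PanLex tool: the name differentiating_segment is hypothetical, every word of the title is treated as a main word (the text does not say which words count as main), and multi-word surnames are always collapsed rather than reduced to one word.

```python
import re


def differentiating_segment(surname=None, title=None):
    """Pick a differentiating segment: surname first, else title initials.

    Hypothetical helper; treats every word in the title as a "main" word.
    """
    if surname:
        # Collapse a multi-word surname, e.g. "De La Cruz" -> "DeLaCruz".
        return "".join(surname.split())
    if title:
        # \w+ matches runs of letters and digits, so dashes separate words.
        words = re.findall(r"\w+", title)
        return "".join(word[0].upper() for word in words)
    raise ValueError("need a surname or a title")
```

For the example in the list, differentiating_segment(title="Wörterbuch Deutsch–Ibo") gives WDI, and differentiating_segment(surname="Smith") gives Smith.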
Uniqueness
Each resource directory (such as deu-ibo-Smith in the above example) must have a unique name. The directories are stored together within the same parent directory, so no two directories can have the same name.
What, then, if there is more than one resource documenting the same language varieties in the same order, with principal authors having the same surname? In such a case, make the differentiating segment more differentiating. If “Smith” isn’t enough, for example, because the same authors wrote both Biologisches Wörterbuch Deutsch–Ibo and Medizinisches Lexicon Deutsch–Ibo, you can use SmithMLDI as the differentiating segment for the latter. Or, if Smith and Jones wrote one resource and Smith and Bauer wrote the other, you can use SmithB as the latter’s differentiating segment.
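A candidate directory name can be checked against the two constraints mentioned so far, uniqueness and the 60-character display limit, before it is used. The sketch below is illustrative only: the function name directory_name_problems is hypothetical, and the set of existing names is assumed to be supplied by the caller (for example, by listing the parent directory).

```python
def directory_name_problems(name, existing_names, max_length=60):
    """Return a list of reasons why a candidate directory name is unusable.

    Hypothetical helper; an empty list means no problem was found.
    """
    problems = []
    if len(name) > max_length:
        problems.append(
            f"name has {len(name)} characters; the limit is {max_length}"
        )
    if name in existing_names:
        problems.append("name is already used by another resource directory")
    return problems
```

If deu-ibo-Smith is already taken, checking deu-ibo-Smith reports a collision, while deu-ibo-SmithMLDI passes.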
Resource directory names
We give specific names, as described above, to resource directories.
Specific subdirectory names
Except for subdirectories with generic names, we give appropriate specific names to subdirectories. For example, if a resource contains a bidirectional PDF file but also numerous secondary files that are direction-specific, you can create 2 subdirectories within the secondary directory, such as rus-fil-Андреев and fil-rus-Андреев for the 2 sets of files.
Content file names
General case
A digital file name conventionally contains a base and an extension, which indicates the standard format of the file. Examples of extensions are .pdf for PDF files, .html for HTML files, and .txt for text files.
If a resource contains a single content file, its base consists of the directory name followed by -0, indicating that it is version 0 of the content file. Version 0 is the version that we acquire, before we begin assimilating its data. Thereafter we produce other versions, with bases ending in -1, -2, etc.
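The version-numbering convention lends itself to a small helper that derives the name of the next version from the current one. This is a sketch under assumptions: the function name next_version_name is hypothetical, and it expects a bare file name whose base ends in a hyphen-minus followed by a decimal version number.

```python
from pathlib import PurePath


def next_version_name(filename, extension=None):
    """Return the file name of the next version of a content file.

    Hypothetical helper. Pass an extension (such as ".txt") if the next
    version is stored in a different format; otherwise it is kept.
    """
    path = PurePath(filename)
    stem, _, version = path.stem.rpartition("-")
    if not stem or not version.isdecimal():
        raise ValueError(f"no version number at the end of {path.stem!r}")
    new_extension = path.suffix if extension is None else extension
    return f"{stem}-{int(version) + 1}{new_extension}"
```

For instance, next_version_name("deu-ibo-Smith-0.html", ".txt") gives deu-ibo-Smith-1.txt, matching the step from the original HTML file to the first text version in the example. A name such as deu-ibo-Smith-final.txt has no version number and raises ValueError.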
A resource may contain multiple content files, each containing a chapter, or the data for words starting with a single letter of the alphabet, etc. If there are only a few, you can rename them according to our standard, and also differentiate them with letters before the version numbers, as in:
deu-ibo-Smith-a-0.pdf
deu-ibo-Smith-b-0.pdf
However, if there are many content files, as happens when the words starting with each letter are published in a distinct file, such renaming is cumbersome, and you may leave the files named as they are, located in a subdirectory.
Resources often exist in multiple formats, either when we procure them or later. Their extensions usually indicate their formats. For example, if an attempt has been made to recognize the text in a PDF image file named deu-ibo-Smith-0.pdf with Abbyy FineReader, that program might give one of its output files the name deu-ibo-Smith-0_ABBYY_Basic.xml. If you wish, you can leave that name unchanged, since it conveys the relevant information.
Internet Archive case
Resources procured in partnership with the Internet Archive have specially named files, unchanged from the names given to them by the Internet Archive. We keep 4 files from each such source and name them as in this example:
klaravortaroespe00abra.pdf
klaravortaroespe00abra_abbyy.gz
klaravortaroespe00abra_djvu.txt
klaravortaroespe00abra_djvu.xml
When assimilating data from such a resource, we can decide which of these files to use and designate it as version 0 by appending -0 to the base of its name.
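Designating an Internet Archive file as version 0 amounts to inserting -0 between the base and the extension. A minimal sketch follows; the function name designate_version_zero is hypothetical, and it works on a bare file name rather than renaming anything on disk.

```python
from pathlib import PurePath


def designate_version_zero(filename):
    """Return the file name with -0 appended to its base.

    Hypothetical helper; expects a bare file name without directories.
    """
    path = PurePath(filename)
    return f"{path.stem}-0{path.suffix}"
```

Applied to klaravortaroespe00abra_djvu.txt, this gives klaravortaroespe00abra_djvu-0.txt.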
Program file names
A tabularization program generally reads in a file and writes out the next version of the file. Its name shows this by sandwiching the two version numbers together, with “to” between them, at the end of the base. A Python tabularization script that works on version 2 and creates version 3 of the above-mentioned dictionary would have the name deu-ibo-Smith-2to3.py.
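The script-naming pattern can be expressed as a one-line formatter. The sketch assumes the usual case in which a script produces the version immediately after the one it reads; the function name tabularization_script_name is hypothetical.

```python
def tabularization_script_name(directory_name, from_version, extension=".py"):
    """Name a script that reads one version and writes the next.

    Hypothetical helper; assumes the output version is from_version + 1.
    """
    to_version = from_version + 1
    return f"{directory_name}-{from_version}to{to_version}{extension}"
```

Here tabularization_script_name("deu-ibo-Smith", 2) gives deu-ibo-Smith-2to3.py, and passing 0 with the extension ".pl" gives deu-ibo-Smith-0to1.pl, the Perl script in the opening example.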
Secondary file names
Descriptive files are usually named with the directory name followed by -doc, and then, if necessary, a differentiating letter, as illustrated in the example.
deu-ibo-Smith-doc-a.html
deu-ibo-Smith-doc-b.html
Files containing our own notes are usually named with the directory name followed by -notes, and then an extension, as shown in the example.
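The naming of secondary files can be sketched in the same way. Both function names below are hypothetical, and the sketch assumes that differentiating letters are assigned alphabetically in the order the descriptive files are listed, which the text does not specify.

```python
from string import ascii_lowercase


def documentation_names(directory_name, extensions):
    """Name descriptive files, adding a letter only when there are several.

    Hypothetical helper; takes one extension per descriptive file.
    """
    if len(extensions) == 1:
        return [f"{directory_name}-doc{extensions[0]}"]
    if len(extensions) > len(ascii_lowercase):
        raise ValueError("more descriptive files than available letters")
    return [
        f"{directory_name}-doc-{letter}{extension}"
        for letter, extension in zip(ascii_lowercase, extensions)
    ]


def notes_name(directory_name, extension=".txt"):
    """Name the file holding our own notes about the resource."""
    return f"{directory_name}-notes{extension}"
```

For the two descriptive HTML pages in the example, documentation_names("deu-ibo-Smith", [".html", ".html"]) gives deu-ibo-Smith-doc-a.html and deu-ibo-Smith-doc-b.html, and notes_name("deu-ibo-Smith") gives deu-ibo-Smith-notes.txt.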