Source registration | PanLex development

Introduction

The data in the PanLex database are based on sources. Each fact in the database can be traced to a particular source. Just as a library catalog contains records of its books, the PanLex database contains records of its sources. When you acquire a resource or assimilate data from a source, your work includes creating or amending one or more source records. We call this activity source registration.

You can use either PanLem or our acquisition management system, based on Wrike, to register sources. These systems provide distinct interfaces for the same purpose.

Registration context

If you are acquiring a source for PanLex, you or another team member should register it promptly, so it will be known that we have it. If it is a physical source and will be shipped, it should be registered when you make the acquisition decision; later, when it arrives, you should inspect it and, as necessary, correct or supplement the registration.

If you assimilate data from a source, you need to deal with its registration:

If you yourself have just found the source, you or another team member should register it before you assimilate it.
If the source has already been acquired and registered, you should make any needed corrections in its registration. These may include a corrected quality, complexity, or declared language variety.

Permissions

Any user of PanLem who is logged in may register a new source.

Any user whose administrative scope is 1 or 3 may edit the registration of an existing source. PanLem shows your administrative scope, labeled whole—edit, in your user profile, which you can see by selecting person: see on the main navigation page.

If your administrative scope is 0 or 2, you may edit the registration of any source of which you are a meaning editor. To become a meaning editor of a source, select source: see and then the source. Issue a request to edit. If your administrative scope is 0, the PanLex staff will act on the request; if 2, your request will be immediately granted by PanLem.

Conventions

Before you register sources, look at some existing registrations in PanLem and observe the conventions being followed. PanLem enforces some rules, but doesn’t enforce all our conventions. For high quality, you should generally apply the existing conventions when you register sources.

Registration form

To register a newly acquired source, select source: new on the main navigation page.

If the source is already registered and you have permission to edit its registration (see above), you may do so by selecting source: edit.

The source registration form has three parts: the public part, the editorial part, and the list part. The only fields that are compulsory are those highlighted in green, but it is desirable to complete as much of the form as you can.

Some of the names of registration fields differ between PanLem and the acquisition management system. The latter names are shown in brackets below, when different.

Public part

In the public part of the form are facts about the source that are accessible by the public.

name [label]

This is the source’s label. The instructions for making the label are almost identical to those for the specific name of the resource directory, so we don’t repeat them here. But there are two differences:

Between the last language segment and the differentiating segment, do not use a hyphen-minus. Instead, use a colon. For example, if the directory name is deu-ibo-Smith, the label is deu-ibo:Smith. Thus, deu-ibo:Smith refers to a source; but deu-ibo-Smith refers to the resource directory.
Shorten the segments as needed in order to fit the label into its 30-character limit, shorter than the 60-character limit for directory names. You can do this by:
- Truncating the surname, such as changing “Neubauer” to “Neub”.
- Collapsing two or more language codes to mul.

Each source label must be unique. If necessary, you can append an additional disambiguating identifier to the end of the label, as discussed in the documentation on specific names of resource directories.
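The first difference can be sketched in Python as follows. This is a minimal illustration, not a PanLem feature: the function name `label_from_directory` is hypothetical, and the sketch assumes that every language segment is exactly three lowercase letters and that the differentiating segment does not begin with such a segment. It only checks the 30-character limit; shortening is left to the person registering the source.

```python
import re

LABEL_MAX = 30  # label length limit stated above


def label_from_directory(directory_name: str) -> str:
    """Convert a directory name such as 'deu-ibo-Smith' into a label such as 'deu-ibo:Smith'."""
    segments = directory_name.split("-")
    # Assumption: leading segments of exactly three lowercase letters are language codes.
    count = 0
    for segment in segments:
        if re.fullmatch(r"[a-z]{3}", segment):
            count += 1
        else:
            break
    if count == 0 or count == len(segments):
        raise ValueError("expected language segments followed by a differentiating segment")
    languages = "-".join(segments[:count])
    differentiator = "-".join(segments[count:])
    # A colon, not a hyphen-minus, separates the two parts of a label.
    label = f"{languages}:{differentiator}"
    if len(label) > LABEL_MAX:
        raise ValueError(f"label exceeds {LABEL_MAX} characters; shorten its segments")
    return label


print(label_from_directory("deu-ibo-Smith"))
```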

World Wide Web [URL]

This is the URL (including http://, https://, or ftp://) where the source can be obtained. It can be the URL of the source itself, or of the page containing a link to the source. If it is a printed source, it can be the URL of a page where the publication can be ordered or a page that describes its location in a library catalog.

ISBN

This is the source’s International Standard Book Number, if any. Any hyphens and spaces are to be omitted. If the source has both a 10- and a 13-digit ISBN, the 13-digit one is preferred.
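As a rough sketch of this rule, the following Python strips hyphens and spaces and prefers a 13-digit ISBN when one is available. The function names are hypothetical, and the sketch only checks length; it does not verify the ISBN check digit.

```python
def normalize_isbn(raw: str) -> str:
    """Remove hyphens and spaces from an ISBN as printed in the source."""
    cleaned = raw.replace("-", "").replace(" ", "")
    if len(cleaned) not in (10, 13):
        raise ValueError(f"not a 10- or 13-character ISBN: {raw!r}")
    return cleaned


def preferred_isbn(candidates: list[str]) -> str:
    """Pick the 13-digit ISBN if the source gives one, else the first ISBN found."""
    normalized = [normalize_isbn(candidate) for candidate in candidates]
    thirteen = [isbn for isbn in normalized if len(isbn) == 13]
    return thirteen[0] if thirteen else normalized[0]


print(preferred_isbn(["3-16-148410-X", "978-3-16-148410-0"]))
```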

author

This is a list of the source’s authors, if known, separated with a semicolon and space (First Author; Second Author). Each author’s name is presented in its usual prose form, which may be in any script normally used for that author’s name. Thus, don’t transliterate a name into the Latin script if you have a more authentic rendition of it. Also, don’t invert a name. The surname may be first or last, whichever is customary, but don’t change “John Smith” to “Smith, John”.

What about author couples? If the title page shows two authors, apparently members of the same family, with the same surname appearing only once, split them into two fully named authors.

What if the authors don’t all fit? You can abbreviate forenames into their initials in order to include all the authors. Alternatively, you can include the most important authors and omit those whose names don’t fit.

In some cases, an author’s name is given in more than one language. In such a case you should include each version of the name, separated with a tilde (雲井昭善~Kumoi Shōzen).
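The two delimiters described above can be combined as in this small Python sketch. The function name `format_authors` and the nested-list input are assumptions made for illustration; each inner list holds the versions of one author’s name.

```python
def format_authors(authors: list[list[str]]) -> str:
    """Join name versions with a tilde and authors with a semicolon and space."""
    return "; ".join("~".join(versions) for versions in authors)


# One author with two versions of the name, followed by a second author.
print(format_authors([["雲井昭善", "Kumoi Shōzen"], ["John Smith"]]))
```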

title

This is the publication’s title, in its original language and script. Dictionaries often have more than one title, typically titles in more than one language. In such a case you should include each title, separated with a tilde (Twi–German Dictionary~Wörterbuch Twi–Deutsch).

It is our practice to impose some standardization on titles. Our usual conversions include:

En dashes (“–”) as the delimiters between language names (e.g., “Wörterbuch Twi — Deutsch” → “Wörterbuch Twi–Deutsch”).
Apostrophes that are standard for the language of the title.
Standard letter case for the language of the title.

publisher

This is the person or organization that publishes the source (or makes it available), or, if the source is an article in a serial, the serial title, volume, and pages of the article. The city of publication is not included. To handle multiple publishers or a publisher name in multiple languages, follow the model for “author” above: separate publishers with a semicolon and a space, and separate versions of a name with a tilde.

year

This is the year of publication. If a range of years is given, choose the final year, and put the range of years into the other field. If you can’t determine the year of publication, use the year in which you acquire the source, and indicate in the other field that this is what the year signifies. This helps us determine later whether the source has been revised since we last acquired it.

good—number [quality]

This is the estimated quality of the source, on a scale from 0 (worthless) to 9 (impeccable).

Why is quality important? Mainly because it helps estimate the qualities of translations that users obtain from PanLex. Translations attested by higher-quality sources deserve more trust. Source quality therefore affects the translation quality scores computed by PanLex.

We can’t give you a precise rule for estimating quality. But we have some rules of thumb. The quality of a source is usually higher if:

The author is a recognized, credentialed expert.
The source is recent, improving on earlier sources.
The publication is a final, edited version, rather than a preliminary or working draft.
The author does not warn the reader that there are probably many errors in this draft.
You yourself don’t see many errors (such as misspellings) in the source.
If the source is printed (or the source files contain images of printed pages), the print is dark and clear, so even small details, such as the distinction between a tilde and a macron (as in “ã” versus “ā”), can be ascertained.
The format and structure are consistent, so it is always clear what is what. For example, in a consistent source, you might see that:
- Commas always separate synonyms, and semicolons always separate translations of distinct meanings.
- Punctuation marks that separate expressions (we call these delimiters) are never used as parts of expressions.
- Punctuation marks that group or classify text, such as parentheses, brackets, braces, slashes, and hyphens, are always used for the same purposes.
- Fonts and faces (bold, italic, sans-serif, etc.) are used consistently.
- Abbreviations are spelled and located consistently.
The work adheres to prevailing lexicographic standards. For example:
- In English, verbs don’t start with “to” and nouns don’t start with “the”.
- In Hebrew and Arabic, vowels are unmarked.
- In Russian, stress is not marked.
- In tonal languages, the marking or non-marking of tone conforms to the standard practice.
- If there is a standard orthography for a language, the source uses it.
- In scripts with letter case, the author does not routinely start every expression with a capital letter (which would destroy the distinction between common words and proper names).
- Classifications (such as part of speech) are used normally.
The source almost always translates expressions into expressions, rather than only into definitions. This attribute may not indicate high quality for most lexicographers, because they want translations that describe an expression’s meaning fully. But PanLex needs translations into expressions, in order to achieve its mission. So for us this is an important element of quality.

A quality estimate is compulsory when you register a source. So, what if you haven’t even seen the source you are registering? Then use whatever information you have to estimate the quality. If you expect to need to revise the estimate after the source arrives, also mark the estimate as needing reconsideration.

source—primary [group]

In most cases, you should leave this blank. It will get filled in automatically.

This field exists because some sources are closely related to each other and are not really independent. Typical examples are bilingual dictionaries that contain two halves, translating in opposite directions. We treat those as two distinct sources, but they don’t deserve to be counted twice when they agree on facts, because it’s really the same author saying the same thing twice.

Group such sources together when you register them. Here’s how:

Register one of the sources first (usually it’s whichever source appears first, if they’re in the same book). When you do so, say in the other field where in the book the source is located, such as “content = pp. 13–144 of book”.
Copy that source (in PanLem) or duplicate the source task (in Wrike).
Correct the new source’s label. Usually, if the first source was aaa-bbb:Author, the second source will be bbb-aaa:Author.
Correct the page range in the other field.
In PanLem, the value of source—primary remains the ID of the first source. In Wrike, enter the ID of the first source for “group”. This indicates that these two sources belong to a single group.

There is more information about multi-source books below, under file—address.

other [note]

This is a field for miscellaneous facts. Among the facts that belong here are publisher’s ID facts (“A-374”), edition facts (“Third edition”) and the locations of the PanLex-related content (“content = pp. 24–69 of book”). In this last example, the “of book” or “of file” is meaningful, because a page range in the original book likely differs from the range of those pages in a PDF file that is made from the book.

permission—kind [license]

This field, although not mandatory, is important for some users of PanLex. Choose one of the two-letter codes shown below, depending on what the source says about permission to use it.

nr: in the public domain by explicit statement, or because it is old enough
cc: Creative Commons license or CC0 release
rp: permission available on request
gp: GNU General Public License
gl: GNU Lesser General Public License
gd: GNU Free Documentation License
mi: MIT License
co: copyright notice, when there is no basis for another code
pl: permission granted to PanLex to use the source in PanLex
zz: other permission or license statement
na: no permission or license statement or copyright notice

right—text [IP claim]

This field is for the text of any permission or license statement found in the source.

What if the statement merely states that the publisher has a copyright on the work, as of the publication year? In that case, you can leave it empty.

What if the statement is too long to fit? Then you may omit portions to make it fit. We don’t claim that this field contains the full exact text; those who want that can follow the URL to the original resource.

right—person [IP claimant]

This field is for the name of a person, if the source identifies a person who should be contacted about permission to use the source.

right—email—address [IP email]

This field is for the email address to which the source says inquiries about permissions should be directed.

Editorial part

The editorial part of the form contains facts that are accessible to PanLex editors, not to the public. This restriction is based only on the assumption that only PanLex editors would likely find these facts useful, not on any need to keep the facts confidential.

good—number—edit—necessary

This field should have the value 1 if you judge that the estimated quality (good—number) needs to be reconsidered (such as if you haven’t yet seen the resource), or 0 if not.

file—difficult [difficulty]

This is where you can estimate the difficulty of consulting the source, on a scale from 0 (takes almost no work) to 9 (one of the most difficult).

Why do we care about this estimate? It greatly helps us plan the assimilation of the source’s data. We need to decide who should analyze the source. If its difficulty is in the 0-to-2 range, we’ll usually choose a beginning programmer; if in the 3–5 range, an advanced programmer; if in the 7–9 range, nobody, because it should be interpreted, not analyzed. If it has a difficulty of 6, either style of assimilation may be better. (Those are approximate guides.)
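The approximate guide above can be written as a small lookup. This Python sketch is only an illustration of the ranges just described; the function name and the returned strings are hypothetical.

```python
def assimilation_plan(difficulty: int) -> str:
    """Map a 0-to-9 difficulty estimate to the approximate assimilation plan."""
    if not 0 <= difficulty <= 9:
        raise ValueError("difficulty must be between 0 and 9")
    if difficulty <= 2:
        return "analyze: beginning programmer"
    if difficulty <= 5:
        return "analyze: advanced programmer"
    if difficulty == 6:
        return "analyze or interpret"
    # 7 to 9: the source should be interpreted, not analyzed.
    return "interpret"


for level in (1, 4, 6, 8):
    print(level, assimilation_plan(level))
```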

When analysts review possible sources to consult, they can specify a difficulty range, and they will see only sources in that range. This speeds their selection.

But there are problems estimating a source’s difficulty:

It’s not easy. Human beings are notoriously bad at estimating how long it will take them to do something or how hard it will be.
Persons who acquire sources register them, but usually are not programmers, so it’s especially hard for them to estimate how hard a source will be for a programmer to analyze.

To deal with these problems, you have some options. When registering a source, you can:

Estimate its difficulty on the full 0-to-9 scale.
Estimate its difficulty on a smaller 7-to-9 scale, or leave it blank.
Refrain from estimating its difficulty (leave it blank).

If you can estimate on the full 0-to-9 scale, please do. If you can’t, we prefer that you determine whether the source is clearly impractical to analyze, and, if it is, then rate its difficulty as 7, 8, or 9, but otherwise leave it blank.

The distinction among difficulties 7, 8, and 9 is how easy or hard it will be to find somebody to interpret the source. This mainly means finding somebody who can select desired items in the source, standardize them, and enter them with a keyboard into a form or file. Ideally, this person should know the languages of the source, or, if not, then at least be able to read and type the scripts in which those languages are written. Here is a rough guide:

7 if the text is all in Latin script.
8 if the text is partly in relatively simple alphabetic scripts, such as Cyrillic, Greek, Georgian, and Armenian.
9 if the text is partly in complex or non-alphabetic scripts, such as Arabic, Devanagari, Hangul, Tibetan, Mongolian, or Han.

Here are some further rules of thumb for deciding whether a source is practical to analyze, or what its difficulty estimate should be:

If an analyst will need no more than an hour to analyze a source, it deserves a 0. Add 1 for each additional hour, until you reach 6.
Don’t ignore difficulties of text standardization, normalization, and lemmatization.
Treat inconsistency in source structure as a major cause of difficulty.
If the file format is PDF, do a quick test by selecting a portion, copying it, and pasting it into a plain text editor. Do the characters come out right? Is the column flow correct? If not, the source is difficult.
If the file format is PDF-image, JPEG, PNG, etc., estimate how accurately the characters can be automatically recognized. Latin, Cyrillic, Kana, and Han can be recognized fairly accurately; other scripts less so. Letters with diacritical marks are error-prone. Recognition suffers when the font is small or the contrast is poor. Also, if item classification depends on font faces and column alignments, difficulty increases, because those features often are lost or distorted in automatic recognition.
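Two of the rough guides above lend themselves to a short Python sketch: the hours-based rule for sources that can be analyzed, and the script-based rule for sources that must be interpreted. Both functions are hypothetical illustrations. The sketch reads “each additional hour” as each additional hour begun, and it treats any script not named in the simple alphabetic group as complex; both readings are assumptions.

```python
import math

# Scripts named above as relatively simple alphabetic scripts.
SIMPLE_ALPHABETIC = {"Cyrillic", "Greek", "Georgian", "Armenian"}


def analysis_difficulty(hours: float) -> int:
    """0 for up to one hour of analysis, plus 1 per additional hour begun, capped at 6."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return min(6, max(0, math.ceil(hours) - 1))


def interpretation_difficulty(scripts: set[str]) -> int:
    """7, 8, or 9 for a source that is impractical to analyze, based on its scripts."""
    if scripts <= {"Latin"}:
        return 7
    if scripts <= {"Latin"} | SIMPLE_ALPHABETIC:
        return 8
    # Assumption: anything else counts as complex or non-alphabetic.
    return 9


print(analysis_difficulty(3.5), interpretation_difficulty({"Latin", "Cyrillic"}))
```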

file—submit—necessary

This field has the value 1 if it is necessary to assimilate the source and submit data from it for importation into the database, and 0 if not. The initial value is always 1, because when you register a source it (presumably) hasn’t been assimilated yet. When you create a final source file out of it and submit that file for importation into the database, the value automatically becomes 0. You can make it 1 again if you determine that the source needs to be re-assimilated to make improvements or make use of a revised version. If you do that, you should also put a note into fact—other (see below) to let others know why it is to be re-assimilated.

denotation [denotation count estimate]

This field contains an estimate of the number of denotations in a source at the time of registration.

You can use any reasonably accurate method to estimate this number, but don’t take more than a few minutes. Generally you will use sampling: choose a sample of files, pages, or entries, count denotations in the sample, and multiply by the approximate ratio of the total to the sample.
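The sampling arithmetic is simple enough to sketch in a few lines of Python. The function name and the numbers in the example are made up for illustration; the unit can be files, pages, or entries, as long as the sample and the total use the same unit.

```python
def estimate_denotations(sample_count: int, sample_units: int, total_units: int) -> int:
    """Scale the denotations counted in a sample up to the whole source."""
    if sample_units <= 0:
        raise ValueError("the sample must contain at least one unit")
    return round(sample_count * total_units / sample_units)


# Hypothetical example: 220 denotations counted on 10 sampled pages of a 400-page source.
print(estimate_denotations(220, 10, 400))
```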

The denotation count may not be identical to the expression count, because an expression can have more than one meaning, and each assignment of a meaning to an expression is a denotation. For example, consider these entries:

awrava = nad, nahoře, přes, nahoru
axrânen = přestat, zanechat, upustit od
aye = ano

The count of denotations in these three entries is 11, if the commas delimit synonyms. If, instead, comma-delimited translations will be treated as expressing distinct meanings, then there are 16 denotations, because there are 8 meanings, each having 2 denotations.
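The two counts can be checked with a short Python sketch. The dictionary below simply restates the three entries above; the function names are hypothetical and the code is an illustration, not part of the tabularization rules.

```python
entries = {
    "awrava": ["nad", "nahoře", "přes", "nahoru"],
    "axrânen": ["přestat", "zanechat", "upustit od"],
    "aye": ["ano"],
}


def count_as_synonyms(entries: dict[str, list[str]]) -> int:
    """One meaning per entry: the headword and each translation are denotations of it."""
    return sum(1 + len(translations) for translations in entries.values())


def count_as_distinct_meanings(entries: dict[str, list[str]]) -> int:
    """One meaning per translation: each meaning has 2 denotations (headword and translation)."""
    return sum(2 * len(translations) for translations in entries.values())


print(count_as_synonyms(entries))           # 5 + 4 + 2 = 11
print(count_as_distinct_meanings(entries))  # 8 meanings x 2 = 16
```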

The rules that determine the count of denotations are part of the principles for tabularization. If you are registering a source during acquisition, you are not expected to master these rules. A reasonable guess is good enough.

file—address [directory]

This field contains the name of the directory containing the files of the resource, which we call the directory name. The directory name must be unique for each resource. If necessary, you can append an additional disambiguating identifier to the end of the directory name in the same manner as with the source label.

fact—other [note]

This field is for facts not to be made public but to be made available to PanLex editors. They can include notes about difficulties to be anticipated, such as “requires letter-case normalization” when the source capitalizes all expressions.

List part

language [language varieties]

Analyzing a source’s languages

Determining the languages that a source documents is sometimes difficult. You should generally try to ensure that you have made reasonable determinations about the most appropriate language codes.

Analyzing a source’s language varieties

If you are acquiring the source, you should make some effort to identify appropriate variety codes, but you don’t need to achieve perfection. For example, sources often contain transliterations in additional scripts; are these merely descriptions of pronunciations, or are they expressions in other varieties of the language? If it is difficult to make this determination, and you are acquiring the source, you can leave this decision to be made during assimilation and ignore the transliteration column.

If you are assimilating data from a source, you make the final decisions about the language varieties that it documents. These include decisions on whether multiple representations of expressions should be classified as distinct language varieties, or rather as a single language variety accompanied by additional information (such as pronunciation) about each denotation.

Declaring a source’s language varieties

The language list is a list of language varieties that you declare to be documented by the source. The list you see contains only the varieties of the languages that appear in the source’s file address, unless the address contains mul, in which case the list contains all varieties in PanLex.

If you are using the acquisition management system, you will not be presented with a list to choose from; instead, you will type in the list of the language varieties, separated by a comma. You should also record the source’s names for these language varieties (see below).

You may be registering a source that documents a language variety that isn’t yet in PanLex. If so, then before you finish registering the source you need to return to the main navigation page and select language: new, so you can add any missing language varieties to PanLex. After adding them, you can select source: edit on the main navigation page and edit the source’s registration. That will allow you to add the newly registered language varieties to the source’s language-variety list.

Creating a source’s language varieties

Some rules apply to this interlude, in which you add a language variety to PanLex before finishing the registration of a source. These rules govern how you specify the three elements of the language variety’s registration. Those elements are:

Language code: The language code must be PanLex-compliant. All ISO 639-2, -3, and -5 alpha-3 codes are PanLex-compliant. You should choose the code that you consider most appropriate for the language variety. Sometimes this is not obvious. Ethnologue and Glottolog often have information helpful with this. The source itself may contain documentation on the affiliations, locations, etc., of the language(s) that it documents, and this, too, may help. If a precise language code cannot be found, seek a code that pertains to the region or family. See examples here.
Default name: This is the official PanLex name of the variety. If possible, choose the autoglossonym, i.e. the name of the variety in the variety itself, written in the variety’s script. Autoglossonyms are often documented in the English, Russian, or Spanish edition of Wikipedia, and you can also check the source itself, which may have an entry for the language’s name. If you can’t find an autoglossonym, you may choose any other language variety’s name for this variety, but, if you do that, you must check the name—language—other box so you can specify what language variety the name is in. In the cases of some, especially artificial, languages, it is reasonable to assert that they have universal names, invariant across languages. If so, you can specify mul-000 as the language variety these names are in.
Script: This is the normal script in which expressions in the language variety are written. Usually PanLem detects it correctly from the autoglossonym. The permitted script names are those recognized by ISO 15924.

More information about the properties of language varieties is available in our documentation on the design of the database.

Recording a source’s language variety names

Most sources indicate in some way the names of the language varieties they contain. If you are using the acquisition management system, you can record the mappings between the source’s language variety names and PanLex UIDs, so that they can be stored in the database.

Mappings are recorded in square brackets following the appropriate UID in the language variety list. Within the square brackets you should put the UID of the language variety that the name is in, a colon delimiter, and the language variety name. For example, if the source documents spa-000 and refers to it as “Spanish” (its eng-000 name), you would put spa-000 [eng-000:Spanish]. If the source’s language variety name is an autoglossonym (is in the language variety itself), you may omit the UID and colon. For example, if the source documents spa-000 and refers to it as “español” (its spa-000 name), you can put spa-000 [español]. If the source refers to a language variety by more than one name, simply use more than one set of brackets, for example spa-000 [español] [eng-000:Spanish].
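To make the bracket notation concrete, here is a minimal Python sketch that reads one entry of the language variety list. It is not the parser used by the acquisition management system; the function name is hypothetical, and the sketch assumes that a UID always consists of three lowercase letters, a hyphen, and three digits, as in the examples above.

```python
import re

# Assumption: a UID looks like "spa-000" (three lowercase letters, hyphen, three digits).
ENTRY = re.compile(r"^\s*([a-z]{3}-\d{3})((?:\s*\[[^\]]*\])*)\s*$")
BRACKETED = re.compile(r"\[([^\]]*)\]")
UID_PREFIX = re.compile(r"^([a-z]{3}-\d{3}):(.*)$", re.S)


def parse_variety(entry: str) -> tuple[str, list[tuple[str, str]]]:
    """Return the documented UID and a list of (name's UID, name) pairs."""
    match = ENTRY.match(entry)
    if not match:
        raise ValueError(f"unrecognized entry: {entry!r}")
    uid = match.group(1)
    names = []
    for text in BRACKETED.findall(match.group(2)):
        prefixed = UID_PREFIX.match(text)
        if prefixed:
            names.append((prefixed.group(1), prefixed.group(2)))
        else:
            # No UID and colon: the name is an autoglossonym, in the variety itself.
            names.append((uid, text))
    return uid, names


print(parse_variety("spa-000 [español] [eng-000:Spanish]"))
```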

file—kind [formats]

This is a list of file types that the source includes. If the source is a printed book that will be donated to the Internet Archive, as is the case with most hard-copy printed books that PanLex purchases, its file type is mul@IA.

If you are using the acquisition management system, you can view the list of possible file types by opening PanLem, logging in by clicking the edit button, and clicking source: edit—fact—see. Under file—kind there is a list of file types, with brief explanatory text for some of them. If a source has multiple file types, you should separate them with a comma.

There are about 70 file types to choose among, and the correct type is sometimes unclear. If you encounter problems classifying a source’s file(s), discuss these with the PanLex staff. This can lead to the creation of new file types or the addition of better example strings to the display.

We register many sources with files in the Portable Document Format (PDF). For our purposes it is necessary to distinguish two main classes of PDF files when registering a source. Classify a PDF file as pdf-img if its pages are images of the pages of a printed publication or manuscript, such as those you would get if you took a photograph of each page. Classify it as pdf if it contains embedded text generated from the text of a file that the PDF file was based on, such as an HTML, Word, OpenOffice Writer, Excel, or plain-text file.

Do not classify a PDF file as pdf if it contains text produced by a text-recognition program that was used on an image file. Such a program almost always makes errors, and we can run such a program ourselves on any image file as part of consultation. Classifying such a file as pdf could mislead editors into trying inappropriate analysis methods.

You can usually see whether a PDF file’s text is original or recognized. When you select text on a page, recognized text is often selectable only erratically. Also, when you select, copy, and paste text, recognized text almost always contains errors, such as mistaking “rn” for “m”.

edit—person

This is a list of editors entitled to edit the source’s content with PanLem. You can grant editing permission to those who need it.

source—copy

This creates a new source identical to the current one, except for the label. If you have multiple sources to register that share many facts, this can save time.

file—submit

If you are analyzing a source and you need to register it before you can submit its final source file, this button takes you directly to file submission after the source is registered.

source—see

This changes the view of the source from editorial to display. The display view includes some information missing from the editorial view, including counts of the source’s meanings, definitions, and denotations.