Introduction
Most PanLex content has been ingested from final source files created in the process of source analysis. If an editor of a source decides to improve the source’s data, the editor usually revises the scripts that generated the final source file and re-executes them, producing a new final source file, which is then submitted as a replacement for the existing data of the source.
However, some improvements to the quality of PanLex data do not seem to be practical if they must be made via reingestion of final source files. This is true for quality defects that appear in many sources. To remedy these, it is more efficient to apply rules directly to the data in the database.
For example, suppose that we wish to extract expressions from all definitions in English that consist of a verb followed by “somebody” when their meanings don’t yet have denotations with those verbs as expressions. One action can do this for data from hundreds of sources at once, while doing it one by one in all the sources would be inefficient.
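A rule of this kind might be sketched as follows. This is only an illustration: the record layout, the pattern, and the function name are assumptions, not PanLex's schema or tooling, and the pattern cannot by itself tell a verb from any other word.

```python
import re

# Hypothetical pattern: one word followed by "somebody", e.g. "help somebody".
# It does not check that the word is a verb; a real rule would need that test.
WORD_SOMEBODY = re.compile(r"^([a-z]+) somebody$")

def candidate_expressions(meanings):
    """Yield (meaning_id, word) pairs for definitions that match the pattern
    and whose meaning has no denotation with that word as its expression."""
    for meaning_id, meaning in meanings.items():
        for definition in meaning["definitions"]:
            match = WORD_SOMEBODY.match(definition.strip().lower())
            if match and match.group(1) not in meaning["expressions"]:
                yield meaning_id, match.group(1)

# Made-up records standing in for data from several sources.
meanings = {
    1: {"definitions": ["help somebody"], "expressions": {"aid"}},
    2: {"definitions": ["greet somebody"], "expressions": {"greet"}},
    3: {"definitions": ["a kind of tree"], "expressions": {"oak"}},
}

print(list(candidate_expressions(meanings)))
```

Only meaning 1 is reported, because meaning 2 already has “greet” as an expression and meaning 3 has no matching definition.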
Efficiency in content curation is important, because there are many candidates for curation among the facts in the database. For example, different expressions in the same language variety that share a degraded form merit consideration for consolidation. As of December 02015, there were about a million such sets, about 160,000 of them in English.
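Finding such sets amounts to grouping expressions by language variety and degraded form. The sketch below assumes a stand-in degradation rule (case-folding and dropping everything except letters and digits); PanLex's actual degradation may differ, and the function names and sample data are illustrative only.

```python
from collections import defaultdict

def degrade(text):
    """Assumed stand-in for degradation: case-fold, keep only letters and digits."""
    return "".join(ch for ch in text.casefold() if ch.isalnum())

def consolidation_candidates(expressions):
    """Group (variety, text) pairs by variety and degraded form, and keep
    only the groups that contain more than one distinct expression."""
    groups = defaultdict(set)
    for variety, text in expressions:
        groups[(variety, degrade(text))].add(text)
    return {key: texts for key, texts in groups.items() if len(texts) > 1}

# Illustrative data; the variety code is used here only as a label.
sample = [
    ("eng-000", "decision maker"),
    ("eng-000", "decision-maker"),
    ("eng-000", "Decision Maker"),
    ("eng-000", "decide"),
]

for key, texts in consolidation_candidates(sample).items():
    print(key, sorted(texts))
```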
Caveat
Although content curation can significantly benefit the quality of PanLex data, it also creates a major problem to solve: durability. For data from most sources, durability is not a problem, because most sources have been published once and are unlikely ever to be revised. But many sources, such as Wiktionaries and Freelang dictionaries, are revised continuously, periodically, or occasionally. If we perform content curation, it may modify data from those sources. Then, when they are revised and we reanalyze them, the revised data replace the previous data, perhaps destroying the curated improvements.
We then have the problem of making durable content curation possible, despite the replacive reingestion of revised sources.
Solutions
One solution concept is normalization-mediated durability. It is the effect of content curation on future source analysis by means of the normalize and normalizedf serialization scripts. If this worked perfectly, it would be unnecessary to worry about durability, because curated changes in content would be self-replicating even in the ingestion of data from revised sources. If, for example, we merged the expression “decision maker” into the expression “decision-maker”, then, in future analyses of sources containing “decision maker”, the normalize script would convert it to “decision-maker”.
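The idea can be illustrated with a toy normalizer. This is not the actual normalize script: the degradation rule and the lookup table are assumptions made for the sketch. After curation only “decision-maker” remains among the known expressions, so an incoming expression with the same degraded form is rewritten to it.

```python
def degrade(text):
    """Assumed stand-in for degradation: case-fold, keep only letters and digits."""
    return "".join(ch for ch in text.casefold() if ch.isalnum())

def build_normalizer(curated_expressions):
    """Return a function that maps an incoming expression to the curated
    expression sharing its degraded form, or leaves it unchanged."""
    by_degraded = {degrade(expr): expr for expr in curated_expressions}

    def normalize(text):
        return by_degraded.get(degrade(text), text)

    return normalize

# After the merge, only "decision-maker" exists among the curated expressions.
normalize = build_normalizer(["decision-maker"])

print(normalize("decision maker"))
print(normalize("policy maker"))
```

The first call yields “decision-maker”; the second returns its input unchanged, because no curated expression shares its degraded form.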
Another solution concept is repetitive curation. It is a practice of writing routines to perform curation rather than curating editorially. A library of curation routines could be executed periodically or after triggering events (such as replacive ingestion of a source), to reinstate improvements that might have been lost.
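A library of this kind could be as simple as a registry of functions that is re-run after each triggering event. The following sketch assumes an in-memory set of expressions and hypothetical routine names; real routines would act on the database.

```python
CURATION_ROUTINES = []

def curation_routine(func):
    """Register a function in the (hypothetical) library of curation routines."""
    CURATION_ROUTINES.append(func)
    return func

@curation_routine
def merge_decision_maker(expressions):
    # Reinstates the merger of "decision maker" into "decision-maker".
    return {"decision-maker" if e == "decision maker" else e for e in expressions}

@curation_routine
def trim_whitespace(expressions):
    # Removes leading and trailing whitespace from every expression.
    return {e.strip() for e in expressions}

def run_all(expressions):
    """Apply every registered routine, in registration order."""
    for routine in CURATION_ROUTINES:
        expressions = routine(expressions)
    return expressions

def on_replacive_ingestion(source_expressions):
    """Triggering event: a revised source has replaced its earlier data."""
    return run_all(source_expressions)

print(sorted(on_replacive_ingestion({"decision maker", " tree "})))
```

Because each routine is idempotent, running the whole library again after an ingestion that did not disturb the curated data changes nothing.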
Normalization cannot mediate the durability of some curated improvements, unless the normalization scripts are amended to cover those improvements. As of 02015, for example, a curated change of the character “!” (exclamation mark) to “ǃ” (Latin letter retroflex click) in some contexts would not self-replicate, because otherwise identical words containing these two characters would not have the same degraded form. Likewise, no normalization script would replicate a correction of an expression’s language variety.
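One plausible reason for the mismatch is that Unicode classifies the first character as punctuation and the second as a letter. The sketch below prints the two characters' Unicode properties and then applies an assumed degradation rule (drop everything except letters and digits), which is a guess at the behavior rather than PanLex's documented rule.

```python
import unicodedata

# Print the code point, name, and general category of each character.
for ch in ("!", "\u01c3"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

def degrade(text):
    """Assumed stand-in for degradation: case-fold, keep only letters and digits."""
    return "".join(ch for ch in text.casefold() if ch.isalnum())

# Otherwise identical words: one with the exclamation mark, one with the click letter.
with_exclamation = degrade("!kung")
with_click = degrade("\u01c3kung")
print(with_exclamation == with_click)
```

Under this assumed rule the exclamation mark is dropped and the click letter is kept, so the two degraded forms differ and a lookup by degraded form would not connect them.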
How?
PanLex does not yet offer an interface useful for comprehensive content curation, nor any built-in guarantee of durability.
Some content curation can be performed with the PanLem interface, but any sophisticated conditional rules must be applied with an SQL client, such as psql.
A few user-defined procedures in the database support particular curation actions in an SQL client. Among them are exlvmd (integer, integer, text, text) and dnlvmd (integer, integer, text, text). They permit you to change the language variety of all expressions in a particular language variety that are in a particular script, except when doing so would violate a uniqueness constraint.