Classification and property extensions

Introduction

We added classifications and properties to the database design in 02015. They qualify meanings and denotations. They do not qualify definitions, languages, language varieties, characters, scripts, sources, or users. Should they? Some proposals for extending classifications and properties have been made, but none has yet been implemented.

Sources

Starting in 02015 we have discussed the idea of extending classifications and properties to sources. There have been two main motivations: making it easier to record facts about sources, and supporting linkage between PanLex source data and external bibliographic data and standards, such as OCLC, Library of Congress (possibly under the Z39.50 standard), and Open Library.

Pool’s proposal 1

In 02016 Jonathan Pool proposed creating a new table for source classifications and a new table for source properties. He proposed that the existing source attribute li (license) be converted to a source classification, with art-301:license as the superclass expression and expressions from art-298 and art-299 as the class expressions, and that the other optional attributes of sources be converted to source properties. His proposed app table differed from the existing mpp and dpp tables by including an sq column to identify the ordinal index of a property. This column would record the order of multiple like properties, such as authors. The attributes remaining in the ap table would be ID, registration date, estimated quality, and group ID.

Table app would absorb columns ur, bn, au, ti, pb, yr, ul, ip, co, and ad of table ap (sources). Their ex values would be the IDs of equivalent expressions in the DCMI Metadata Terms language variety, art-301, or the PanLex Source Classes and Properties language variety, art-305.

The sq column of table app would identify the sequential index of each property, permitting properties of the same source with the same attribute (e.g., multiple authors) to be ordered.
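
The ordering role of sq can be illustrated with a minimal sketch, written here with Python's sqlite3 module rather than against the production database. The proposal specifies only that app would have ex and sq columns; the value column name tt, the composite key, the 1-based numbering, the expression ID 1001, and the sample authors are assumptions made for illustration.

```python
import sqlite3

AUTHOR_EX = 1001  # hypothetical expression ID standing in for an "author" attribute


def make_app_table(conn):
    # Assumed layout: source ID, attribute expression ID, ordinal index, value.
    conn.execute("""
        CREATE TABLE app (
            ap INTEGER NOT NULL,
            ex INTEGER NOT NULL,
            sq INTEGER NOT NULL,
            tt TEXT NOT NULL,
            PRIMARY KEY (ap, ex, sq)
        )""")


def add_properties(conn, ap, ex, values):
    # Number like properties of one source in the order given (1-based, an assumption).
    conn.executemany(
        "INSERT INTO app (ap, ex, sq, tt) VALUES (?, ?, ?, ?)",
        [(ap, ex, sq, tt) for sq, tt in enumerate(values, start=1)],
    )


def get_properties(conn, ap, ex):
    # Return the values in their recorded order, not in alphabetical or insertion order.
    rows = conn.execute(
        "SELECT tt FROM app WHERE ap = ? AND ex = ? ORDER BY sq", (ap, ex))
    return [tt for (tt,) in rows]


conn = sqlite3.connect(":memory:")
make_app_table(conn)
add_properties(conn, 42, AUTHOR_EX, ["Zhang, Wei", "Adams, Ann"])
authors = get_properties(conn, 42, AUTHOR_EX)
```

Here the first author stays first even though an alphabetical sort would reverse the pair.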

Kamholz’s proposal 1

In June 02016 David Kamholz proposed creating a PanLex-sponsored “meta-source” (art-PanLex, ID 0) whose meanings would belong to sources. When the meta-source equipped a meaning with a classification, property, definition, or denotation, that detail would be associatable with a particular source, namely the source owning that meaning. The ownership of a meaning by a source would appear in the ap table, in a new mn column, constrained to reference a meaning. The same method would be applied to any other table (such as lv) that we wanted to make eligible for classifications and properties.

Restrictions on the meta-source would be introduced, so that its meanings would be protected.
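
As a rough sketch of the ownership link, the following Python sqlite3 example adds an mn column to ap and reads a source's properties through the meaning it owns. The table layouts, the IDs, the source label, and the publisher value are all hypothetical and much simpler than the real schema; the protective restrictions on the meta-source are not modeled.

```python
import sqlite3

PUBLISHER_EX = 2001  # hypothetical expression ID for a "publisher" attribute

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE ap (
    ap INTEGER PRIMARY KEY,
    tt TEXT NOT NULL,
    mn INTEGER REFERENCES mn (mn)      -- proposed new column: the owned meaning
);
CREATE TABLE mn (
    mn INTEGER PRIMARY KEY,
    ap INTEGER NOT NULL REFERENCES ap (ap)
);
CREATE TABLE mpp (
    mp INTEGER PRIMARY KEY,
    mn INTEGER NOT NULL REFERENCES mn (mn),
    ex INTEGER NOT NULL,
    tt TEXT NOT NULL
);
""")

# Source 0 is the meta-source; meaning 1 belongs to it and stands for source 7.
conn.execute("INSERT INTO ap (ap, tt) VALUES (0, 'art-PanLex')")
conn.execute("INSERT INTO mn (mn, ap) VALUES (1, 0)")
conn.execute("INSERT INTO ap (ap, tt, mn) VALUES (7, 'eng-fra:Example', 1)")
conn.execute(
    "INSERT INTO mpp (mn, ex, tt) VALUES (1, ?, 'Example Press')", (PUBLISHER_EX,))


def source_properties(conn, ap, ex):
    # Follow ap.mn to the meta-source meaning, then read that meaning's properties.
    rows = conn.execute(
        """SELECT mpp.tt FROM ap JOIN mpp ON mpp.mn = ap.mn
           WHERE ap.ap = ? AND mpp.ex = ?""", (ap, ex))
    return [tt for (tt,) in rows]


publishers = source_properties(conn, 7, PUBLISHER_EX)
```

The foreign key on ap.mn is what would keep a source from pointing at a nonexistent meaning.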

In the conversion of multivalued fields, such as ap.au (author), to meaning properties, the existing delimited strings, containing pseudo-lists in the desired orders, would be preserved.

References to tables and columns of tables that the new classifications and properties replaced would be amended to stop using those tables and columns. Thereafter, those tables and columns would be dropped. Kamholz did not specify them, but for sources they might be tables aped, af, fmapli, and av, and table ap’s columns ur, bn, au, ti, pb, yr, ul, li, ip, co, and ad.

Pool’s proposal 2

In October 02016 Jonathan Pool proposed a variant of Kamholz’s proposal 1 involving distributed control over source descriptions. It would be implemented with the following steps:

  1. Create an immutable language variety art-nnn with default name “PanLex Source Labels”.
  2. Generalize the title of source 0 from “PanLex meta-source” to “PanLex Lexical Translations” and revise its label from art:PanLex to art-mul:Panlex. (It will translate source labels into titles in various language varieties.)
  3. For each existing source, do the following:
    1. Create a meaning of art:PanLex and give it a denotation, whose expression is in art-nnn and has a text identical to the source’s label (i.e. the value of column ap.tt).
    2. Create a property of that meaning with attribute art-317:superseded and a value identical to the source’s ID (column ap.ap).
    3. Change the value in column ap.ap to the ID of the expression of the denotation of that meaning.
  4. Making use of the meanings created in the previous step, duplicate the values in the following columns and tables as follows:
    • ap except columns ap, dt, ex, yruq, and ui
      • ur: meaning properties.
      • bn: meaning properties.
      • au: meaning properties and/or meaning classifications.
      • ti: translations, meaning classifications, and/or meaning properties.
      • pb: meaning properties and/or meaning classifications.
      • ul: meaning properties and/or meaning classifications.
      • li: meaning properties and/or meaning classifications.
      • ip: meaning properties.
      • co: meaning properties and/or meaning classifications.
      • ad: meaning properties.
    • aped except columns cx, dn, and dnauto
      • q: meaning classifications.
      • im: meaning classifications.
      • fp: meaning properties.
      • etc: meaning properties and/or meaning classifications.
    • af: meaning classifications.
    • fm: meaning classifications.
    • apli: meaning classifications.
    • av: meaning classifications.
  5. Amend all subroutines, functions, and references that interrogate or modify the duplicated values to act on their new versions.
  6. Add a constraint to column ap.ap requiring its values to be values of column ex.ex.
  7. Drop the duplicated columns and tables.
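
Step 3 above can be sketched roughly as follows, again with Python's sqlite3 module and simplified tables. Every numeric ID, the two sample source labels, and the table layouts are hypothetical. The sketch also assumes that the meta-source itself is left unrelabeled and that new expression IDs never collide with existing source IDs; the proposal does not address either point.

```python
import sqlite3

LABEL_LV = 9000       # hypothetical ID of the proposed art-nnn variety
SUPERSEDED_EX = 9001  # hypothetical ID of the expression art-317:superseded
META_SOURCE = 0

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ap  (ap INTEGER PRIMARY KEY, tt TEXT NOT NULL);
CREATE TABLE ex  (ex INTEGER PRIMARY KEY, lv INTEGER NOT NULL, tt TEXT NOT NULL);
CREATE TABLE mn  (mn INTEGER PRIMARY KEY, ap INTEGER NOT NULL);
CREATE TABLE dn  (dn INTEGER PRIMARY KEY, mn INTEGER NOT NULL, ex INTEGER NOT NULL);
CREATE TABLE mpp (mp INTEGER PRIMARY KEY, mn INTEGER NOT NULL,
                  ex INTEGER NOT NULL, tt TEXT NOT NULL);
INSERT INTO ap VALUES (0, 'art:PanLex'), (5, 'eng-fra:Doe'), (8, 'deu-eng:Roe');
INSERT INTO ex VALUES (9001, 317, 'superseded');
""")


def relabel_sources(conn):
    # Read all sources first so that the updates below do not disturb the scan.
    sources = conn.execute(
        "SELECT ap, tt FROM ap WHERE ap <> ? ORDER BY ap", (META_SOURCE,)).fetchall()
    for old_ap, label in sources:
        # 3.1: a meta-source meaning with a denotation whose text is the label.
        mn = conn.execute(
            "INSERT INTO mn (ap) VALUES (?)", (META_SOURCE,)).lastrowid
        ex = conn.execute(
            "INSERT INTO ex (lv, tt) VALUES (?, ?)", (LABEL_LV, label)).lastrowid
        conn.execute("INSERT INTO dn (mn, ex) VALUES (?, ?)", (mn, ex))
        # 3.2: remember the old source ID as a "superseded" property.
        conn.execute("INSERT INTO mpp (mn, ex, tt) VALUES (?, ?, ?)",
                     (mn, SUPERSEDED_EX, str(old_ap)))
        # 3.3: the source ID becomes the ID of the label expression.
        conn.execute("UPDATE ap SET ap = ? WHERE ap = ?", (ex, old_ap))


def new_id_for(conn, old_ap):
    # Compatibility lookup: map a superseded source ID to its replacement.
    row = conn.execute(
        """SELECT dn.ex FROM mpp JOIN dn ON dn.mn = mpp.mn
           WHERE mpp.ex = ? AND mpp.tt = ?""",
        (SUPERSEDED_EX, str(old_ap))).fetchone()
    return row[0] if row else None


relabel_sources(conn)
relabeled = conn.execute("SELECT ap, tt FROM ap ORDER BY ap").fetchall()
```

After this step every relabeled ap.ap value is also an ex.ex value, which is the condition that the constraint in step 6 would enforce.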

Whenever a decision is made to convert any fact about a source to a class expression, it would be necessary to choose a language variety for it. This would be straightforward for the boolean values aped.q and aped.im, the language varieties in av.lv, and the closed classes of values of ap.li/apli.li and af.fm/fm.fm, but would require nontrivial work for the other values, i.e. authors (ap.au), titles (ap.ti), publishers (ap.pb), rights holders (ap.co), and miscellaneous facts (ap.ul and aped.etc). Editors could avoid this work by converting facts to property values instead of class expressions.

The new version of column ap.ti could make individual titles deemed titles of sources be translations of one another and of source labels (cf. “Shorter Oxford English Dictionary” = “Малый оксфордский английский словарь”, “Silent Spring” = “אביב דומם”). If a title is that of a resource that is deemed to contain a source but not to be the source, the title would be the class expression of a meaning classification or the value of a meaning property instead.

Where permitted, meaning classifications could be used for values existing as expressions (e.g., “Harvard University Press”). For other values, editors could use them as values of meaning properties, or could make them expressions and then use them as class expressions.

For any pseudolist values (including authors, titles, and publishers), in addition to the conversion of the individual pseudolist elements, meaning properties could be used for the conversion of the pseudolists themselves. These properties would preserve the order of the pseudolist elements. Being property values, they would not have language varieties, and they should not, because they can be composed of elements in diverse language varieties.

After the PanLex-sponsored conversion, all sources would, as usual, be entitled to assign their own meanings to the resulting denotation and class expressions. This could produce additional facts about sources, usable by interfaces at their discretion.

Tables aru, au, and aur would be retained.

Pool’s proposal 3

In October 02016 Jonathan Pool proposed a variant of Pool’s proposal 2 involving the incorporation of sequentiality directly into classifications and properties. This variant was based on a suggestion from David Kamholz. It would be implemented with the following steps:

  1. Add a column sq to each of tables mcs and mpp with type smallint. (This departs from Pool’s proposal 2.)
  2. Create an immutable language variety art-nnn with default name “PanLex Source Labels”.
  3. Generalize the title of source 0 from “PanLex meta-source” to “PanLex Lexical Translations” and revise its label from art:PanLex to art-mul:Panlex. (It will, among other things, translate source labels into titles in various language varieties.)
  4. For each existing source, do the following:
    1. Create a meaning of art:PanLex and give it a denotation, whose expression is in art-nnn and has a text identical to the source’s label (i.e. the value of column ap.tt).
    2. Create a property of that meaning with attribute art-317:superseded and a value identical to the source’s ID (column ap.ap).
    3. Change the value in column ap.ap to the ID of the expression of the denotation of that meaning.
  5. Making use of the meanings created in the previous step, duplicate the values in the following columns and tables as follows, using column sq to preserve the order in the case of any pseudolist value that you split into values of distinct classifications and/or properties:
    • ap except columns ap, dt, ex, yruq, and ui
      • ur: meaning properties.
      • bn: meaning properties.
      • au: meaning properties and/or meaning classifications.
      • ti: translations, meaning classifications, and/or meaning properties.
      • pb: meaning properties and/or meaning classifications.
      • ul: meaning properties and/or meaning classifications.
      • li: meaning properties and/or meaning classifications.
      • ip: meaning properties.
      • co: meaning properties and/or meaning classifications.
      • ad: meaning properties.
    • aped except columns cx, dn, and dnauto
      • q: meaning classifications.
      • im: meaning classifications.
      • fp: meaning properties.
      • etc: meaning properties and/or meaning classifications.
    • af: meaning classifications.
    • fm: meaning classifications.
    • apli: meaning classifications.
    • av: meaning classifications.
  6. Amend all subroutines, functions, and references that interrogate or modify the duplicated values to act on their new versions.
  7. Add a constraint to column ap.ap requiring its values to be values of column ex.ex.
  8. Drop the duplicated columns and tables.

Whenever a decision is made to convert any fact about a source to a class expression, it would be necessary to choose a language variety for it. This would be straightforward for the boolean values aped.q and aped.im, the language varieties in av.lv, and the closed classes of values of ap.li/apli.li and af.fm/fm.fm, but would require nontrivial work for the other values, i.e. authors (ap.au), titles (ap.ti), publishers (ap.pb), rights holders (ap.co), and miscellaneous facts (ap.ul and aped.etc). Editors could avoid this work by converting facts to property values instead of class expressions.

The new version of column ap.ti could make individual titles deemed titles of sources be translations of one another and of source labels (cf. “Shorter Oxford English Dictionary” = “Малый оксфордский английский словарь”, “Silent Spring” = “אביב דומם”). If a title is that of a resource that is deemed to contain a source but not to be the source, the title would be the class expression of a meaning classification or the value of a meaning property instead.

Where permitted, meaning classifications could be used for values existing as expressions (e.g., “Harvard University Press”). For other values, editors could use them as values of meaning properties, or could make them expressions and then use them as class expressions.

After the PanLex-sponsored conversion, all sources would, as usual, be entitled to assign their own meanings to the resulting denotation and class expressions. This could produce additional facts about sources, usable by interfaces at their discretion.

Tables aru, au, and aur would be retained.

The sq columns would preserve order across the boundary between classifications and properties. Serial IDs could not perform that function. The sq columns could also find uses in other contexts, such as permitting class and value sequences for complex classifications and properties (e.g., specifying both causees and results of complex causatives).
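
To illustrate why a shared sq column can do what serial IDs cannot, here is a small sketch in Python with sqlite3. It assumes simplified mcs and mpp layouts (the column names ex1 for the superclass expression and ex2 for the class expression are assumptions), hypothetical IDs, and an invented second publisher. One publisher is stored as a classification because it exists as an expression, the other as a property, and the rows are deliberately inserted out of order.

```python
import sqlite3

PUBLISHER_EX = 2001  # hypothetical ID of a "publisher" attribute expression
HUP_EX = 3001        # hypothetical ID of the expression "Harvard University Press"
SOURCE_MN = 1        # hypothetical meta-source meaning standing for one source

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ex  (ex INTEGER PRIMARY KEY, tt TEXT NOT NULL);
CREATE TABLE mcs (mc INTEGER PRIMARY KEY, mn INTEGER NOT NULL,
                  ex1 INTEGER NOT NULL, ex2 INTEGER NOT NULL,
                  sq SMALLINT NOT NULL);
CREATE TABLE mpp (mp INTEGER PRIMARY KEY, mn INTEGER NOT NULL,
                  ex INTEGER NOT NULL, tt TEXT NOT NULL,
                  sq SMALLINT NOT NULL);
""")
conn.execute("INSERT INTO ex VALUES (?, 'Harvard University Press')", (HUP_EX,))
# The second publisher is inserted first, so serial IDs would not reflect the order.
conn.execute(
    "INSERT INTO mpp (mn, ex, tt, sq) VALUES (?, ?, 'Example Local Press', 2)",
    (SOURCE_MN, PUBLISHER_EX))
conn.execute(
    "INSERT INTO mcs (mn, ex1, ex2, sq) VALUES (?, ?, ?, 1)",
    (SOURCE_MN, PUBLISHER_EX, HUP_EX))


def ordered_values(conn, mn, attribute_ex):
    # Merge classifications and properties of one attribute, ordered by sq.
    rows = conn.execute("""
        SELECT mcs.sq, 'classification', ex.tt
          FROM mcs JOIN ex ON ex.ex = mcs.ex2
         WHERE mcs.mn = ? AND mcs.ex1 = ?
        UNION ALL
        SELECT mpp.sq, 'property', mpp.tt
          FROM mpp
         WHERE mpp.mn = ? AND mpp.ex = ?
         ORDER BY 1
    """, (mn, attribute_ex, mn, attribute_ex))
    return rows.fetchall()


publishers = ordered_values(conn, SOURCE_MN, PUBLISHER_EX)
```

The serial IDs mc and mp come from independent sequences in two tables, so they cannot be compared with each other; the sq values can.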

Kamholz’s proposal 2

In October 02016, Kamholz introduced a new proposal responding to Pool’s proposal 3. The goal was to revise Pool’s proposal 3 to reduce the amount of disruption and engineering required and remedy certain defects, while maintaining the advantages of extending classifications and properties to sources and other database objects.

The proposal involves the following revisions to Pool’s proposal 3:

  1. The sq column in mcs and mpp should begin sequenced items at 1, in conformance with SQL norms.
  2. Keep the title of source 0 as “PanLex meta-source”, since not all meanings will necessarily contain translations, and the name “PanLex Lexical Translations” can too easily be misunderstood. The name “meta-source” may be opaque, but at least it doesn’t sound like something it is not. (The label of source 0 could, perhaps, still be made art-mul:Panlex, since some meanings may contain translations, while others will not. This is a minor detail.)
  3. Convert columns into classifications and properties only when there is a compelling reason to do so. Such reasons may include: (1) the column being optional or sometimes containing more than one value, (2) the column not being central to the database schema (for example, it is not a foreign key and no foreign key references it), (3) the column not being frequently referenced in existing code, (4) the column representing one of a potentially open-ended set of attributes that we may want to flexibly extend in the future, (5) the column containing values from a standard or other ontology that we can leverage with expressions in the database. These are judgment calls, and not always obvious.
  4. When a column is converted to classifications or properties, do so in a way that is minimally disruptive to existing interfaces (mainly PanLem and the PanLex bot). Make the column convertible to the new classification or property in a one-to-one manner with no edge cases. Specify conventions for serializing and deserializing items with multiple values. For example, alternative titles could be joined with “ ~ ” into a single PanLem text field. The field could then be deserialized by splitting on the same string.
  5. Applying the above guidelines, make the following additions, changes, and retentions to tables:
    • ap
      • ap: keep as is, since removing it (1) creates concurrency issues in the database, (2) requires various additional database changes (in foreign keys, stored functions, and other code) that are likely to be tricky to implement and debug, (3) adds complexity with the art-317:superseded compatibility layer, and (4) does not create any compelling benefits in return. Instead, add ex and mn columns that reference the source’s art-nnn expression and source 0 meaning (see below).
      • dt: keep as is.
      • tt: keep as is, for convenience (but synchronize with the associated ex text).
      • ur: replace with meaning property.
      • bn: replace with meaning property.
      • au: replace with meaning property.
      • ti: replace with meaning property. Some titles will be translated, but they aren’t prototypical PanLex expressions, and recording them as translations or a combination of translations and classifications creates unnecessary complexity and means there might be no one-to-one conversion procedure. By not using classifications or translations, we lose the ability to record the language variety of titles. On the other hand, recording it would require a fair amount of re-engineering of PanLem and the PanLex bot, and it’s not clear the benefit would be worth the cost. It makes more sense to convert to properties first, and later reconsider if we really need to record the language variety of titles.
      • pb: replace with meaning property.
      • yr: replace with meaning property. (Not all years are purely numeric.)
      • uq: keep as is.
      • ui: keep as is.
      • ul: replace with meaning property.
      • li: replace with meaning classification.
      • ip: replace with meaning property.
      • co: replace with meaning property.
      • ad: replace with meaning property.
      • ex: new column referencing art-nnn id.
      • mn: new column referencing source 0 meaning.
    • aped
      • ap: keep as is.
      • q: keep as is. Implementing boolean flags as classifications creates unnecessary complexity. The number of such flags is unlikely to grow very much over time, and in any case the current boolean flags are PanLex-specific and we are not likely to find good matches in ontologies for them.
      • cx: keep as is.
      • im: keep as is (same argument as for q).
      • fp: keep as is. This column is central to our workflow and I see no benefit in converting it to a property.
      • etc: replace with meaning property.
      • dn: keep as is. This column is central to our workflow and I see no benefit in converting it to a property.
      • dnauto: keep as is. This column is central to our workflow and I see no benefit in converting it to a property.
    • fm: could be replaced with meaning classifications, but if so, will the md column be left out? If not, how will it be represented, and in which source?
    • af: if fm is replaced with meaning classifications, replace with meaning classifications as well.
    • apli: replace with meaning classifications.
    • av: there are reasonable arguments for keeping as is or for replacing with meaning classifications. The pros and cons should probably be discussed further.
    • aru: keep as is.
    • au: keep as is. Add constraint prohibiting meaning editorship on source 0.
    • aur: keep as is.
  6. It may be worth considering whether to merge the ap and aped tables.
  7. It may be worth considering (ideally at some future point, following the extension of classifications and properties) whether to add an optional language variety attribute to properties. If we do so, definitions would become a kind of meaning property.
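
The serialization convention in item 4 and the 1-based numbering in item 1 could be sketched as follows. The function names are hypothetical, and the round-trip check is one possible way to guarantee a one-to-one conversion; the proposal itself gives only the “ ~ ” separator as an example.

```python
SEPARATOR = " ~ "  # the separator that the proposal offers as an example


def deserialize(field):
    """Split a text field into its values; an empty field means no values."""
    return field.split(SEPARATOR) if field else []


def serialize(values):
    """Join values into one text field, refusing any list that would not round-trip."""
    values = list(values)
    field = SEPARATOR.join(values)
    if deserialize(field) != values:
        raise ValueError(f"values cannot be serialized unambiguously: {values!r}")
    return field


def to_property_rows(field):
    """Pair each value with an sq index that starts at 1."""
    return list(enumerate(deserialize(field), start=1))


titles = ["Shorter Oxford English Dictionary", "Малый оксфордский английский словарь"]
field = serialize(titles)
rows = to_property_rows(field)
```

Checking the round trip inside serialize rejects empty values and values that contain or straddle the separator, so a stored field always splits back into exactly the values that were entered.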