Tabular files

Introduction

The output from tabularization is a tabular file: a file formatted as a simple table of rows and columns. At the intersection of each row and each column there is a cell. The specific requirements are described below.

A tabular file is ready for input into serialization, the next step in the workflow of assimilation.

Basic file format

In the instructions below, the term invariant means constant over the set of all rows. If “X” is invariant, then it is the same in every row.

A tabular file must have a format complying with these requirements:

  • Encoding: Unicode
  • Encoding form: UTF-8
  • Row delimiter: U+000A (<LF>)
  • Column delimiter: U+0009 (<TAB>)
  • Column count: invariant

The Unicode encoding and the UTF-8 encoding form are also part of the basic PanLex text standards, so a file that has been text-standardized will not need these format elements changed.

The “row delimiter” is also called a “line break”; it’s the character that ends each line in the file.

Since the column count is invariant, columns may be blank, but not missing.
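
The format requirements above can be checked mechanically. The following is a minimal sketch in Python, not part of the PanLex tools; the function name check_tabular is hypothetical, and the treatment of a final line break (as ending the last row rather than starting an empty one) is an assumption.

```python
def check_tabular(data: bytes) -> int:
    """Return the invariant column count of a tabular file's bytes.

    Raises ValueError if the bytes are not valid UTF-8 or if the rows
    do not all have the same number of columns.
    """
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as err:
        raise ValueError(f"not valid UTF-8: {err}") from err

    # Assumption: a trailing LF terminates the last row.
    if text.endswith("\n"):
        text = text[:-1]

    rows = text.split("\n")                          # row delimiter: U+000A
    counts = {len(row.split("\t")) for row in rows}  # column delimiter: U+0009
    if len(counts) != 1:
        raise ValueError(f"column count is not invariant: {sorted(counts)}")
    return counts.pop()
```

In this sketch a row such as "<TAB>c" counts as two columns, the first of them blank, which matches the rule that columns may be blank but not missing.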

Basic cell format

Each cell must contain a sequence of zero or more items of information. An item is a string of one or more characters, complying with the rules for the cell’s type. Between each pair of adjacent items must be an item delimiter: a string of one or more characters complying with the rules for the cell’s type.

Item delimiters can have types and are governed by a basic rule: Throughout any particular column, all item delimiters of a particular type must match one particular regular expression, and no other cell content may match that regular expression. Item delimiters are generally a single, invariant character.

If a cell may contain items that, in turn, contain two or more segments, then each segment must be a string of one or more characters, and between each pair of adjacent segments must be a segment delimiter: a single character. All the segment delimiters within any particular item must be the same. The segment delimiter may vary from one item to another, but this is usually not useful or necessary.
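
As an illustration of the item and segment structure, here is a minimal Python sketch that splits one cell into items and each item into segments. The function name split_cell and the delimiter characters used in the usage note are assumptions chosen for the example, not values prescribed by this document. The item delimiter is given as a regular expression, in line with the basic rule above; the sketch assumes that expression contains no capturing groups.

```python
import re
from typing import List, Optional


def split_cell(cell: str, item_delim: str,
               segment_delim: Optional[str] = None) -> List[List[str]]:
    """Split a cell into items, and each item into segments.

    item_delim is a regular expression (assumed to have no capturing
    groups); segment_delim is a single literal character, or None if
    the items in this column have only one segment.
    """
    if cell == "":
        return []  # a blank cell contains zero items
    items = re.split(item_delim, cell)
    if segment_delim is None:
        return [[item] for item in items]
    return [item.split(segment_delim) for item in items]
```

For example, with "‣" as a hypothetical item delimiter and ":" as the segment delimiter, split_cell("art-302:CAUS‣art-302:PASS", "‣", ":") yields two items of two segments each.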

Column and cell type

Each column has a type. Each cell in the column inherits the column type. There are six permitted column (and, therefore, cell) types. They are:

  • Definition
  • Meaning classification
  • Meaning property
  • Denotation
  • Denotation classification
  • Denotation property

We discuss elsewhere what these types are good for. Here we focus on the rules governing the formats of cells of each type.

Column-type rules

Item content

Denotation

The content is the text of an existing or new expression that is, or is to be created, in an invariant language variety. Each nonblank denotation item must contain 1 segment (the text of the expression). If the expression is to be created, the language variety must be mutable.

Example: アオカ

Definition

The content is the text of a definition that is to be classified as being in an invariant language variety. Each nonblank definition item must contain 1 segment (the text of the definition).

Example: выписывать (журнал или газету)

Unary classification

The content is 2 segments: (1) a UID and (2) the text of a class expression that is, or is to be created, in the language variety having that UID.

Example: art-302:CAUS

Binary classification

The content is 4 segments:

  1. a UID
  2. the text of a superclass expression that is, or is to be created, in the language variety having the UID of segment 1
  3. a UID
  4. the text of a class expression that is, or is to be created, in the language variety having the UID of segment 3

Segment 3 may be omitted if all class expressions in the column are in the same language variety (in which case it must be specified later during serialization as the column language variety).

Full example: art-300:MadeOf:eng-000:steel

Example with default class UID: art-300:MadeOf:steel
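
The two forms of a binary classification item can be handled as in the following Python sketch. It assumes that ":" is the segment delimiter, as in the examples above, and that no segment contains that character; the function name and the column_uid parameter are hypothetical.

```python
from typing import Optional, Tuple


def parse_binary_classification(
    item: str,
    column_uid: Optional[str] = None,
    segment_delim: str = ":",
) -> Tuple[str, str, str, str]:
    """Return (superclass UID, superclass text, class UID, class text).

    If segment 3 (the class UID) is omitted, column_uid supplies it,
    standing in for the column language variety given at serialization.
    """
    segments = item.split(segment_delim)
    if len(segments) == 4:
        super_uid, super_text, class_uid, class_text = segments
    elif len(segments) == 3:
        if column_uid is None:
            raise ValueError("class UID omitted and no column UID given")
        super_uid, super_text, class_text = segments
        class_uid = column_uid
    else:
        raise ValueError(f"expected 3 or 4 segments, got {len(segments)}")
    return (super_uid, super_text, class_uid, class_text)
```

Under these assumptions, the full example and the example with a default class UID (given column_uid="eng-000") parse to the same four values.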

Property

The content is 3 segments:

  1. a UID
  2. the text of an attribute expression that is, or is to be created, in the language variety having that UID
  3. the text of a property (i.e. any string of one or more characters)

Example: art-301:identifier:23044

Prefix drop

The content of each cell in any classification or property column may omit one or more leading segments if those segments are invariant. You must then add the omitted leading segments back as “prefixes” during serialization.

Example of unary classification: CAUS (prefix: art-302:)

Example of binary classification: Noun (prefix: art-303:PartOfSpeechProperty:art-303:)

Example of property: 23044 (prefix: art-301:identifier:)
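
Restoring a dropped prefix amounts to prepending it to every item in the cell. The Python sketch below only illustrates that idea; in practice the prefixes are added during serialization, and the function name restore_prefix and the default item delimiter "‣" are assumptions made for the example.

```python
def restore_prefix(cell: str, prefix: str, item_delim: str = "‣") -> str:
    """Prepend the dropped leading segments to each item of a cell.

    item_delim is treated as a literal string here (an assumption);
    a blank cell is returned unchanged.
    """
    if cell == "":
        return cell
    items = cell.split(item_delim)
    return item_delim.join(prefix + item for item in items)
```

With the examples above, restore_prefix("CAUS", "art-302:") gives "art-302:CAUS", and restore_prefix("23044", "art-301:identifier:") gives "art-301:identifier:23044".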

Column language variety

Each definition column and each denotation column must generally have one inherent language variety.

Each classification column may have one inherent language variety.

Column language varieties are specified as arguments to the out-full-0 serialization script.

Item count

The maximum item count of a definition cell is one. Cells of all other types may have unlimited item counts.

Item-delimiter types

Denotation columns may contain item delimiters of two types. They are named synonym delimiters and meaning delimiters.

Columns of all other types may contain item delimiters of only one type.
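
A denotation cell with both delimiter types can be split in two passes, as in this Python sketch. It rests on assumptions that this section does not state: that meaning delimiters separate groups of expressions, that synonym delimiters separate the expressions within a group, and that the delimiters are the single characters "⁋" and "‣". The function name is hypothetical.

```python
from typing import List


def split_denotation_cell(cell: str,
                          synonym_delim: str = "‣",
                          meaning_delim: str = "⁋") -> List[List[str]]:
    """Split a denotation cell into groups of synonymous expressions.

    Assumption: the outer split is on the meaning delimiter and the
    inner split is on the synonym delimiter.
    """
    if cell == "":
        return []  # a blank cell contains zero items
    return [group.split(synonym_delim) for group in cell.split(meaning_delim)]
```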

Column order

Denotation classifications and denotation properties are called details of their denotations. A denotation cell, immediately followed by all of its detail cells, with zero or more blank cells intermixed, is a denotation block.

All denotation cells and denotation detail cells must appear in denotation blocks. Thus, cells of other types, such as definition cells, may not interrupt a denotation block.

Except for the above requirement, cells may appear in any order.
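
The block requirement can be checked against the ordered list of column types, as in the following Python sketch. It applies the rule at the level of whole columns, which is an interpretation of the cell-level wording above; the function name and the lowercase type labels are assumptions.

```python
from typing import Sequence

# Column types that count as denotation details.
DETAIL_TYPES = {"denotation classification", "denotation property"}


def check_denotation_blocks(column_types: Sequence[str]) -> bool:
    """Return True if every detail column lies inside a denotation block.

    Raises ValueError at the first detail column that is not preceded
    by a denotation column with only detail columns in between.
    """
    in_block = False
    for index, col_type in enumerate(column_types):
        if col_type == "denotation":
            in_block = True       # a new block starts here
        elif col_type in DETAIL_TYPES:
            if not in_block:
                raise ValueError(
                    f"column {index} ({col_type}) is outside a denotation block"
                )
        else:
            in_block = False      # any other column type ends the block
    return True
```

For instance, the order denotation, denotation property, definition, denotation passes, while definition followed directly by denotation classification does not.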