Tabularization example | PanLex development

Introduction

Let’s look at a brief illustration of one small piece of tabularization, to make the concept concrete.

File before tabularization

Here is a piece (one entry) of a source file before you tabularize it:

\lex wit
\ct witma
\alt witt
\ps v
\ge collect
\gn सोहोर्नु
\lg P
\ety
\ed NG/VR/SS
\dt 21/Aug/2006
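
The entry above consists of backslash-prefixed field markers, one field per line. As a rough illustration only (this is not one of the PanLex tabularization tools), the following Python sketch reads such an entry into marker/value pairs. The function name parse_entry is hypothetical, and the sketch assumes that every field fits on a single line and that a space separates the marker from its value.

```python
def parse_entry(text):
    """Return a list of (marker, value) pairs for one backslash-coded entry.

    Assumes one field per line, with the marker and value separated by a space.
    """
    fields = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue  # skip blank or unmarked lines
        # Drop the leading backslash, then split at the first space.
        marker, _, value = line[1:].partition(" ")
        fields.append((marker, value.strip()))
    return fields


# The entry shown above, as a raw string so the backslashes stay literal.
entry = r"""\lex wit
\ct witma
\alt witt
\ps v
\ge collect
\gn सोहोर्नु
\lg P
\ety
\ed NG/VR/SS
\dt 21/Aug/2006
"""

for marker, value in parse_entry(entry):
    print(marker, repr(value))
```

A marker with no data, such as \ety here, comes out with an empty string as its value, so later steps can decide whether to skip it.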

Planning the work

You examine the information you have about this source: the source’s registration, this file, documentation files, and notes. Then you plan the structure of a tabular file.

In this case, you decide to create a tabular file with 4 columns:

Column 0: Expression(s) in language variety pum-000 (Puma)
Column 1: Grammatical class of the expression(s) in column 0
Column 2: Expression(s) in language variety eng-000 (English)
Column 3: Expression(s) in language variety npi-000 (Nepali)

Doing the analysis

Your next task is to analyze the source file so that a tabular file emerges. You have several PanLex tabularization tools to choose among as templates for your analysis. You choose one that fits this situation. Then you modify it to do what you need. After that, you test it on this file, debug it, inspect the tabular file that it outputs, revise the code as needed, and test again. When you are satisfied, you accept the tabular file that it produces as the final output of the tabularization stage. You are then ready to proceed to serialization.

File after tabularization

Your program generates a tabular file. One row of it, based on the entry shown above, looks like this:

wit+witma+witt     v      collect      सोहोर्नु

The wide spaces represent horizontal tab characters.
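
When you inspect the output, one simple check is that every row splits into the planned number of columns. The following Python sketch shows that check on the row above. The names EXPECTED_COLUMNS and check_row are hypothetical and not part of any PanLex tool.

```python
EXPECTED_COLUMNS = 4  # the number of columns planned for this tabular file


def check_row(row):
    """Split a tabular row on tab characters and confirm the column count."""
    columns = row.rstrip("\n").split("\t")
    if len(columns) != EXPECTED_COLUMNS:
        raise ValueError(
            f"expected {EXPECTED_COLUMNS} columns, got {len(columns)}: {row!r}"
        )
    return columns


# The row shown above, with the tab characters written explicitly as \t.
row = "wit+witma+witt\tv\tcollect\tसोहोर्नु"
columns = check_row(row)
print(columns)
```

A row with too few or too many tabs raises an error instead of passing silently into the next stage.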

What did you do?

You used some of the information in the original file, and it appears in the tabular file. You disregarded some of the original file’s information. You reformatted what you used.

You had already decided on the format for the tabular file. The main further decisions you made were to:

Treat \lex, \ct, and \alt as marking synonymous expressions in pum-000.
Use + as a synonym delimiter in column 0.
Disregard the data marked with \lg, \ety, \ed, and \dt.

These, like all tabularization decisions, may be questioned. You may later change some of the decisions that you made. Your tabularization program stays with the other source files. If you wish to modify the program and reanalyze the source, you can do so.
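
The three decisions listed above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the actual tabularization program: the function name entry_to_row is hypothetical, the entry is assumed to be already parsed into a dictionary with one value per marker, and the mapping of \ps, \ge, and \gn to columns 1 through 3 is inferred from the example row rather than stated in the source.

```python
# Decision 1: \lex, \ct, and \alt all mark synonymous pum-000 expressions.
SYNONYM_MARKERS = ("lex", "ct", "alt")
# Decision 2: synonyms in column 0 are joined with "+".
SYNONYM_DELIMITER = "+"
# Decision 3: \lg, \ety, \ed, and \dt are disregarded simply by never being read.


def entry_to_row(fields):
    """Build one tab-delimited row from a dict mapping markers to values."""
    # Collect the non-empty synonym fields in marker order.
    synonyms = [fields[m] for m in SYNONYM_MARKERS if fields.get(m)]
    columns = [
        SYNONYM_DELIMITER.join(synonyms),  # column 0: pum-000 expression(s)
        fields.get("ps", ""),              # column 1: grammatical class (inferred)
        fields.get("ge", ""),              # column 2: eng-000 expression (inferred)
        fields.get("gn", ""),              # column 3: npi-000 expression (inferred)
    ]
    return "\t".join(columns)


# The example entry, already parsed; one value per marker is assumed.
fields = {
    "lex": "wit", "ct": "witma", "alt": "witt",
    "ps": "v", "ge": "collect", "gn": "सोहोर्नु",
    "lg": "P", "ety": "", "ed": "NG/VR/SS", "dt": "21/Aug/2006",
}
row = entry_to_row(fields)
print(row)
```

Keeping the decisions as named constants at the top makes them easy to revisit: changing the delimiter or the set of synonym markers is a one-line edit, after which the source can be reanalyzed.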