Doing discovery

Introduction

After planning your discovery, how do you execute your plan? This page offers some suggestions.

Don’t just think “dictionary”

Data useful for PanLex often appear in bilingual and multilingual dictionaries, but dictionaries are not the only things that attest such data.

Useful data may appear in (or be produced by) resources of several kinds. These include:

  • publications
  • manuscripts
  • APIs
  • dynamic websites
  • other Internet-based services
  • text-mining programs
  • human minds

Among publications, it is tempting to presume that every resource must be called a dictionary, but in fact publications attesting lexical translations can have several different names. Among them (and these are only those in English) are:

  • dictionary
  • wiktionary
  • glossary
  • vocabulary
  • vocabulary database
  • lexical database
  • word list
  • lexicon
  • thesaurus
  • wordnet
  • standard

You may find lexical translations occupying an entire publication or only a part. Often a comprehensive work about a language, such as a sociolinguistic survey or a grammatical description, has a bilingual vocabulary in it as a chapter or appendix.

Don’t just think “PDF”

If you were assuming that dictionaries are the only sources to look for, you might also be assuming that digital ones are always PDF files. That, too, would be an oversimplification.

Here is a more adequate, though still incomplete, list of the formats in which digital files of documentary sources can be found:

  • Plain-text files (.txt)
  • Spreadsheets (.xls, .xlsx, .ods)
  • Files containing text markup (.html, .xml)
  • Database dumps in text format (.csv, .tsv)
  • Database dumps in RDF format (.ttl, .rdf, .nt, and others)
  • Database dumps in SQL format (.sql)
  • Rich-text files produced by word processors (.rtf, .doc, .docx, .odt)
  • Page-layout files with embedded text (.pdf)
  • Page-layout files containing one graphic image of each page (.pdf)
  • Graphic image files (.png, .tiff, .jpg, .bmp, and many others)
  • Printed publications
  • Typewritten manuscripts
  • Handwritten manuscripts
  • Sound files (.mp3, .wav, and many others)
  • Audiovisual files (.mp4, .avi, and many others)

Some resources have been packaged in multiple formats. When this is the case, we usually try to acquire a copy in each format, rather than trying to judge which single format will be most useful.

Formats listed earlier in the list above tend to be more tractable (easier to analyze) than those listed later.
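As a rough illustration of that ordering, the sketch below ranks file names by the position of their extension in the format list above. It is only a minimal sketch: the names FORMAT_GROUPS, tractability_rank, and sort_by_tractability are hypothetical, the paper-based formats are left out because they have no file extension, and a .pdf file is given a single rank even though its tractability depends on whether it has embedded text.

```python
from pathlib import PurePath

# Extension groups in the order of the format list above (paper formats
# omitted). A lower rank is assumed to mean a more tractable format.
FORMAT_GROUPS = [
    (".txt",),
    (".xls", ".xlsx", ".ods"),
    (".html", ".xml"),
    (".csv", ".tsv"),
    (".ttl", ".rdf", ".nt"),
    (".sql",),
    (".rtf", ".doc", ".docx", ".odt"),
    (".pdf",),  # embedded text or page images: inspect the file to find out
    (".png", ".tiff", ".jpg", ".bmp"),
    (".mp3", ".wav"),
    (".mp4", ".avi"),
]

RANK = {ext: rank for rank, group in enumerate(FORMAT_GROUPS) for ext in group}


def tractability_rank(filename):
    """Return the list position of the file's format, or None if unlisted."""
    return RANK.get(PurePath(filename).suffix.lower())


def sort_by_tractability(filenames):
    """Order file names from most to least tractable; unlisted formats last."""
    def key(name):
        rank = tractability_rank(name)
        return (rank is None, rank if rank is not None else 0)

    return sorted(filenames, key=key)


print(sort_by_tractability(["grammar.pdf", "wordlist.TXT", "scan.png", "dump.csv"]))
```

Such a ranking suggests only which copy to examine first. It is not a reason to skip acquiring the other formats.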

If you have reason to believe that a source exists in a more tractable format than the ones available to the public, you can inquire as to whether this is true and, if so, whether we can obtain a copy for use in PanLex. You should confer with the Source Acquisition Specialist about the etiquette for such inquiries before you begin contacting authors or publishers.

Cost

When you look for data, it is useful to consider the cost not only of procuring the data but also of assimilating them.

Converting data from paper or image to text adds to the cost that we would pay if the resource were already in a text format. Here are some common types of format conversion:

  • Re-encoding: the conversion of a text file’s character encoding to UTF-8. If the file has a standard encoding now, then its conversion to UTF-8 is probably easy. But it may be a rare or undocumented encoding, requiring careful inspection in order for us to determine what Unicode character corresponds to each distinct character in the file. A file could easily have a hundred distinct characters, and this might entail that many re-encoding decisions.
  • Image-to-text: the conversion of a graphic image of a page to a body of digital text. This is performed by layout-recognition and character-recognition software designed for what is usually called optical character recognition (OCR). It is safest to assume that a graphic image file is convertible to digital text only if the characters are all in the Latin script or all in the Cyrillic script, they are free of diacritical marks, the print is sharp, and the characters are separated by white space. Conversion might work under other conditions, but the output often contains too many errors to be worth the effort.
  • Human entry: the à-la-carte conversion of the format by a human editor, who reads the data and enters (by a keyboard or other input device) the converted data, mentally applying appropriate rules.
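The two re-encoding cases described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the "undocumented encoding" case is assumed to be a single-byte encoding, so that each distinct byte value corresponds to one re-encoding decision. The byte-to-character mapping shown is invented for the example.

```python
from collections import Counter


def reencode_to_utf8(raw, source_encoding):
    """Easy case: the file is in a known standard encoding."""
    return raw.decode(source_encoding).encode("utf-8")


def distinct_byte_values(raw):
    """Hard case, step 1: list each distinct byte value with its count.

    Assumes a single-byte encoding, so each entry is one decision to make.
    """
    return sorted(Counter(raw).items())


def apply_mapping(raw, mapping):
    """Hard case, step 2: apply the byte-to-character decisions made by hand."""
    return "".join(mapping[byte] for byte in raw)


# Easy case: a Windows-1252 file converted to UTF-8.
legacy = "señor".encode("cp1252")
print(reencode_to_utf8(legacy, "cp1252"))

# Hard case: inspect the inventory, decide on a mapping, then convert.
mystery = b"\x80a\x80"
for byte, count in distinct_byte_values(mystery):
    print(f"0x{byte:02X} occurs {count} time(s)")
decisions = {0x80: "ŋ", 0x61: "a"}  # hypothetical decisions for illustration
print(apply_mapping(mystery, decisions).encode("utf-8"))
```

The length of the inventory gives an early estimate of the re-encoding cost, since each entry may need to be checked against the source by eye.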

Where to look

The web is our main platform for discovery. General-purpose search engines are very helpful, even though, as pointed out above, there are many possible query terms to search with.
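One way to cope with the number of possible query terms is to generate the combinations mechanically. The sketch below pairs language names with the English resource names listed earlier on this page. The function name build_queries is hypothetical, and the double-quote phrase syntax is an assumption that holds for many, but not necessarily all, search engines.

```python
from itertools import product

# English resource names from the list earlier on this page.
RESOURCE_NAMES = [
    "dictionary", "wiktionary", "glossary", "vocabulary",
    "vocabulary database", "lexical database", "word list",
    "lexicon", "thesaurus", "wordnet", "standard",
]


def build_queries(language_names, resource_names=RESOURCE_NAMES):
    """Pair every language name with every resource name as a quoted query."""
    return [
        f'"{language}" "{resource}"'
        for language, resource in product(language_names, resource_names)
    ]


for query in build_queries(["Guarani"], ["glossary", "word list"]):
    print(query)
```

The resource names are only the English ones, so for sources published in other languages the list would need to be extended.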

But much work has already been done in the identification of sources of lexical data, so a general search may not be the best way to start. You should review the list of useful acquisition websites and learn how to use them.

You are ultimately looking for a place to acquire data from. The most common places have been the web in general, web-based book archives, archives of linguistic data, publishers of books, resellers of books, and libraries. The methods of acquisition vary among these. Typically, we:

  • Download data from the web in general.
  • Download data from web-based book archives.
  • Download data from archives of linguistic data.
  • Purchase books from book publishers and resellers.
  • Purchase photocopies of selected pages from libraries.

How to look

Search types

You can discover sources, among other ways, by:

  • Internet searches: use of general search engines and specialized websites mentioned above.
  • Library searches: use of library catalogs’ and aggregators’ search tools, which permit you to search for books held in libraries by subject headings and titles.
  • Bookseller searches: use of book publishers’, resellers’, and aggregators’ catalogs, which let you search for books for sale by subject headings and titles.

General search methods

One systematic method is to perform web searches with queries that are unlikely to match documents other than sources suitable for PanLex. Gintare Grigonyte and Timothy Baldwin published an 02014 paper, “Automatic Detection of Multilingual Dictionaries on the Web”, describing such a method and reporting on a test of it.

Library search methods

In order to search effectively in library catalog metadata, you should familiarize yourself with the formats and standard terms in subject headings. You may find, for example, that a Spanish–Guarani dictionary has the subject heading “dictionaries—Guarani—Spanish”. You may then need to read the search system’s documentation or experiment with query formats to discover formats that match such headings. We have found that in some systems “dictionaries Guarani” works for a heading like that, but different systems may have different rules.
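When experimenting with query formats, it can help to derive several candidates from a heading you have already seen. The following is a minimal sketch, not a description of any particular catalog: the function names are hypothetical, and the candidate formats (space-separated terms, terms joined by double hyphens, and the heading as written) are only guesses to try against a given search system.

```python
import re


def heading_terms(subject_heading):
    """Split a heading on long dashes or double hyphens into its terms."""
    parts = re.split(r"\s*(?:—|--)\s*", subject_heading.strip())
    return [part for part in parts if part]


def candidate_queries(subject_heading):
    """Return query formats to try, without duplicates, in a fixed order."""
    terms = heading_terms(subject_heading)
    candidates = [
        " ".join(terms[:2]),   # first two terms, e.g. "dictionaries Guarani"
        " ".join(terms),       # all terms separated by spaces
        "--".join(terms),      # all terms joined by double hyphens
        subject_heading,       # the heading exactly as it appears
    ]
    return list(dict.fromkeys(candidates))


for query in candidate_queries("dictionaries—Guarani—Spanish"):
    print(query)
```

Whichever candidate matches in one catalog may not match in another, so the list is a starting point for experiments rather than a rule.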

Library catalogs can be in various languages, so you may need to determine the right query terms in a given catalog language, too.