Procuring digital data

Introduction

Most of the sources acquired up to now have been digital, namely files stored in computer-tractable media. Their tractability is not always better than that of printed works, because some are merely pictures of the pages of printed works. But even so, procuring them can be instantaneous.

Simple cases

You can procure some digital sources easily by downloading them from websites. A digital source is often a single web page that can be saved to your local storage drive.
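
For example, either of these commands would save such a page to the current directory (the URL here is hypothetical, standing in for a real source):

curl -o source.html http://example.com/dictionary.html
wget http://example.com/dictionary.html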

Even then, the translations are often on one page and the metadata about the source are on one or more other pages. So you should look for the pages that give information such as author, title, publication date, licensing claims, institutional sponsorship, abbreviations, content formats, languages being documented, and usage instructions. When you find such pages, you should save them, too.
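
Such metadata pages can be saved with the same tools. For instance, this command (with hypothetical URLs) would save a source's about page and licensing page in one invocation:

curl -o about.html http://example.com/about.html -o licensing.html http://example.com/licensing.html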

Complex cases

Other digital sources are not so easy to procure and may require advanced methods.

If a source consists of many web pages, you may need tools like cURL or wget. A website may also hide its data from you until you register as a user, apply for access permission, or write a program to submit queries. If any of these methods is beyond your capabilities and it isn’t practical for you to learn it, you can note the requirements in your discovery documentation and leave the procurement for somebody else to conduct. The acquisition management system provides a way to record such notes.
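
As an illustration only, with a hypothetical host, credentials, and query parameters, a registered user might submit a query to such a site with a command like:

curl -u myname:mypassword -o results.xml "http://example.com/data/search?lang=eng&format=xml"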

Multiple-file procurement example

Here is an example of how to procure a multiple-file source.

Source

We procured the source that is now labeled eng-deu-fra:ISOcat as 6,185 files in dcif format. The files were all available under the URL http://www.isocat.org/rest/dc/.

Procurement with wget

To procure the files, we used the open-source wget program on a local computer. We invoked that program in such a way as to retrieve all and only the files that we wanted. First we launched a terminal program. Then we changed our working directory to a directory where we wanted the files to be written. Then we issued the following command:

wget -rl1 -A dcif http://www.isocat.org/rest/dc/

This command ran wget and told it to retrieve files recursively (-r) with a depth limit of 1 (-l1) from the specified URL, accepting only files whose names had the extension dcif (-A dcif). The depth limit of 1 meant that wget retrieved every qualifying file linked from the page at the specified URL, and followed no links any deeper.
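
By default, wget -r writes what it retrieves under a directory named after the host, here www.isocat.org/rest/dc/. One way to confirm that the retrieval was complete would be to count the downloaded files:

ls www.isocat.org/rest/dc/*.dcif | wc -l

If all the files arrived, this would report 6185.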

Procurement with cURL

An alternative method would have been to use cURL rather than wget. We could get the files with cURL by saving a copy of the html file at that URL, opening it in a text editor, deleting all the filenames except the .dcif ones, and then putting the remaining filenames into this format:

…
url=http://www.isocat.org/rest/dc/2318.dcif
-O
url=http://www.isocat.org/rest/dc/2319.dcif
-O
url=http://www.isocat.org/rest/dc/2320.dcif
-O
…

If we named our file isocaturl.txt, the cURL command would be curl -K isocaturl.txt, telling cURL to read its parameters from that file. Each pair of lines in the file would tell cURL to get the file at a particular URL and save it locally under the same name (-O).
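
Editing 6,185 filenames by hand would be laborious, so the parameter file could instead be generated with a short shell pipeline. The following is only a sketch; it assumes the saved page is named index.html and that each .dcif filename appears literally in its text:

grep -Eo '[0-9]+\.dcif' index.html | sort -u | while read name; do printf 'url=http://www.isocat.org/rest/dc/%s\n-O\n' "$name"; done > isocaturl.txt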

Alternatively, we could use cURL more simply, issuing just this one command:

curl -O http://www.isocat.org/rest/dc/[1-6872].dcif

This range of filenames, from 1.dcif to 6872.dcif, covers all 6,185 real files, but 687 of the names in the range correspond to no file. For each of those, this cURL command would nevertheless save a file containing the html content of a no-such-file error page. These spurious files would need to be deleted or disregarded later.
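
The spurious files could be found mechanically, since they contain html rather than dcif content. As a sketch, assuming that every error page contains an html tag and that no genuine dcif file does, this command would delete them:

grep -l '<html' *.dcif | xargs rm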

If we wanted to collect all the files into a single file rather than keep them in many separate files, we could change the above command to:

curl http://www.isocat.org/rest/dc/[1-6872].dcif > eng-deu-fra-ISOcat-0.html

This would concatenate all the downloaded files into eng-deu-fra-ISOcat-0.html in whatever directory is the current directory of our terminal session.
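
One caution: plain concatenation leaves no boundary between the files. If a visible separator were wanted, curl's -w option, which writes a given string to the output after each completed transfer, could supply one, for example:

curl -w '\n<!-- next file -->\n' http://www.isocat.org/rest/dc/[1-6872].dcif > eng-deu-fra-ISOcat-0.html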