Most of the sources acquired up to now have been digital, namely files stored in computer-tractable media. Their tractability is not always better than that of printed works, because in some cases they are merely pictures of the pages of printed works. But procuring them can still be instantaneous.
You can procure some digital sources easily by downloading them from websites. A digital source is often a single web page that can be saved to your local storage drive.
Even then, the translations are often on one page and the metadata about the source are on one or more other pages. So you should look for the pages that give information such as author, title, publication date, licensing claims, institutional sponsorship, abbreviations, content formats, languages being documented, and usage instructions. When you find such pages, you should save them, too.
Other digital sources are not so easy to procure and may require advanced methods.
If a source consists of many web pages, you may need to use tools like wget. A website may hide its data from you until you register as a user, apply for access permission, and/or write a program to submit queries. If any of these methods are beyond your capabilities and it isn't practical for you to learn how to use them, you can note the requirements in your discovery documentation and leave the procurement for somebody else to conduct. The acquisition management system provides a way to record such notes.
Multiple-file procurement example
Here is an example of how to procure a multiple-file source.
We procured the source that is now labeled eng-deu-fra:ISOcat as 6,185 files in dcif format. The files were all available at the URL http://www.isocat.org/rest/dc/.
Procurement with wget
To procure the files, we used the open-source wget program on a local computer. We invoked that program in such a way as to retrieve all and only the files that we wanted. First we launched a terminal program. Then we changed our working directory to the directory where we wanted the files to be written. Then we issued the following command:
wget -rl1 -A dcif http://www.isocat.org/rest/dc/
This command ran wget and told it to obtain files recursively (-r) with a depth limit of 1 (-l1) from the specified URL and to accept only files whose names had the extension dcif (-A dcif). A depth limit of 1 meant that wget should retrieve all and only the qualifying files to which there was a link on the page with the specified URL as its address.
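After the download finishes, it is worth confirming how many files actually arrived. The sketch below assumes the default -r layout, in which wget writes the retrieved files under a directory named for the host; the mkdir and touch lines merely fabricate a two-file stand-in tree so the check is runnable as shown.

```shell
# Stand-in for the directory tree that wget -r would create
mkdir -p www.isocat.org/rest/dc
touch www.isocat.org/rest/dc/2318.dcif www.isocat.org/rest/dc/2319.dcif

# Count the dcif files that were actually saved
find www.isocat.org -type f -name '*.dcif' | wc -l
```

For the real source, the count should come to 6,185; a shortfall would suggest interrupted or blocked requests.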
Procurement with cURL
An alternative method would have been to use cURL rather than wget. We could get the files with cURL by saving a copy of the html file from that URL, opening it in a text editor, editing it to leave out all the filenames except the .dcif ones, and then further editing it to put the filenames into this format:
…
url=http://www.isocat.org/rest/dc/2318.dcif
-O
url=http://www.isocat.org/rest/dc/2319.dcif
-O
url=http://www.isocat.org/rest/dc/2320.dcif
-O
…
If we named our file isocaturl.txt, the cURL command would be curl -K isocaturl.txt. It would tell cURL to get its parameters from that file. Each pair of lines in the file would tell curl to get the file at a particular URL and save it under the same name (-O).
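Rather than editing the saved html page by hand, one could generate the parameter file with a loop. This is a sketch, not the method described above: it assumes the filenames are simply the integers from 1 through 6872 with the dcif extension, and writes the same url/-O line pairs shown in the format above.

```shell
# Generate a url/-O line pair for every candidate filename in the range
for i in $(seq 1 6872); do
  printf 'url=http://www.isocat.org/rest/dc/%d.dcif\n-O\n' "$i"
done > isocaturl.txt
```

The resulting file would contain 13,744 lines (two per candidate file) and could be passed to curl with -K isocaturl.txt.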
We could use cURL more simply, instead, by just issuing the command:
curl -O http://www.isocat.org/rest/dc/[1-6872].dcif
This would cover the whole range of filenames, from 1.dcif to 6872.dcif, but there are in fact only 6,185 files. Each of the 687 requests for missing files would nevertheless save a file, containing the html content of a no-such-file error page. It would be necessary to delete or disregard these later.
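The deletion step could be sketched as follows, under the assumption that the error pages are html documents while genuine files are dcif XML. The two printf lines fabricate illustrative stand-ins (the filenames 1.dcif and 9999.dcif are hypothetical) so the cleanup is runnable as shown.

```shell
# Illustrative stand-ins: one plausible dcif (XML) file and one html error
# page that was saved under a .dcif name, mimicking a no-such-file response
printf '<?xml version="1.0"?>\n<dcif/>\n' > 1.dcif
printf '<html><body>No such file</body></html>\n' > 9999.dcif

# List the .dcif files whose content contains an html tag and delete them
grep -l -i '<html' *.dcif | xargs rm
```

Before deleting anything in earnest, one would want to run the grep alone and inspect the list it produces.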
If we wanted to collect all the files into a single file rather than keep them in many separate files, we could easily do that. We would change the above command to:
curl http://www.isocat.org/rest/dc/[1-6872].dcif > eng-deu-fra-ISOcat-0.html
This would concatenate all the downloaded files into eng-deu-fra-ISOcat-0.html in whatever directory is the current directory of our terminal session.