Introduction
This tutorial will show you how to parse an HTML file with cheerio and node.js. cheerio lets you select HTML elements with a syntax similar to jQuery.
1. Install node.js and cheerio
The easiest way to install node.js in a Unix environment, including OS X, is to use a package manager (for OS X, try homebrew) or the nvm script. On Windows, the best option is to download the installer from the node.js homepage. Make sure you also have npm (node package manager); it is sometimes packaged separately from node.js.
Once you have installed node.js, you will need to install the cheerio module. The command to install cheerio is:
npm install -g cheerio
The -g flag installs the module globally; without it, you must install the module separately for every script. You may need to run npm with root permissions (e.g., with sudo) if you do not have write access to the global node_modules directory.
To be able to load the cheerio module, be sure to set the NODE_PATH environment variable to your global node_modules directory. If you are not sure where it is, check the output of the npm command above. To set NODE_PATH from bash, use a command such as:
export NODE_PATH=/usr/local/share/npm/lib/node_modules
You can place this command in your .bashrc or .bash_profile so that it runs every time you open a shell. On Windows, see this page for instructions on setting environment variables.
2. Download the sample source document
Download ady-fra-Batouka.zip, and unzip it. You should get a directory containing ady-fra-Batouka.html, an Adyghe-French dictionary.
3. Test your node.js installation
Open a text editor, and make a new file in the ady-fra-Batouka directory called ady-fra-Batouka.js. This is the node.js script that we will use to parse the HTML file. Copy the following lines into the file:
var cheerio = require('cheerio');
var fs = require('fs');
The require() method loads a module and returns an object. We installed cheerio above; fs is included with node.js. See the node.js docs for more on how modules are loaded and what standard modules are available.
Now open a command line, change to the ady-fra-Batouka directory, and run your script with the following command:
$ node ady-fra-Batouka.js
If all is well, there should be no output. If you get an error, make sure that node is in your PATH and that NODE_PATH is set correctly.
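If the script fails with a "Cannot find module" error, a small diagnostic like the one below may help narrow down the cause. This is only a minimal sketch: the helper name canResolve is made up for illustration, while require.resolve and process.env.NODE_PATH are standard node.js features.

```javascript
// Hypothetical helper: reports whether node can locate a module by name.
function canResolve(name) {
  try {
    require.resolve(name); // throws if the module cannot be found
    return true;
  } catch (err) {
    return false;
  }
}

// Show the search path setting and whether each module can be located.
console.log('NODE_PATH =', process.env.NODE_PATH || '(not set)');
console.log('fs found:', canResolve('fs'));
console.log('cheerio found:', canResolve('cheerio'));
```

If fs is found but cheerio is not, the problem is most likely the NODE_PATH setting rather than the node.js installation itself.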
4. Load the HTML file into cheerio
The first thing we need to do is load ady-fra-Batouka.html with cheerio. Add the following code to ady-fra-Batouka.js:
var input = fs.readFileSync('ady-fra-Batouka.html', 'utf8');
var $ = cheerio.load(input);
The first line reads the HTML file into a variable; the second line loads it into cheerio and returns the cheerio object $. You may now manipulate $ in much the same way as you would with jQuery in the browser.
5. Parse the file
If you open ady-fra-Batouka.html in a browser and view the source, you will see that all Adyghe-French translations are inside the tag <tr class="transcriptTable">. The Adyghe word is inside this tag, in <td class="word_form">; the French translation is in <td class="translation">. To extract this data, we need to iterate over the relevant <tr> elements. Add the following code to ady-fra-Batouka.js:
var output = '';
$('tr.transcriptTable').each(function () {
  var ady = $(this).find('.word_form').text();
  var fra = $(this).find('.translation').text();
  output += ady + "\t" + fra + "\n";
});
The first line initializes an output variable, in which we will store the parsed data. The second line returns a cheerio object containing all elements of type tr whose class is transcriptTable. We then iterate over these elements with the each method, which calls the passed anonymous function once for each element, with the current element assigned to this. In recent versions of cheerio, as in jQuery, this is the underlying element rather than a cheerio object, so we wrap it with $(this) before calling cheerio methods on it.
The function body calls the cheerio find method twice for each tr element, extracting elements of class word_form and translation and returning their text content. (There is only one matching element of each class under each tr element, so each find will return one element.) Finally, we append a new tab-delimited line to the output variable.
6. Save the results
Now, all that is left to do is to write the generated output to a text file, by adding the following code:
fs.writeFileSync('ady-fra-Batouka-0.txt', output, 'utf8');
That’s it! Try running the script again. You should get an output file ady-fra-Batouka-0.txt which begins as follows:
be	beaucoup
pe	nez
pe	nez
pˀe	ta main
pəte	dur
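One way to sanity-check the result is to read the file back and split it into rows. The sketch below first writes two sample lines to a temporary file so that it runs on its own; the temporary file name and the parseTsv helper are hypothetical, and in practice you would read ady-fra-Batouka-0.txt instead.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// Hypothetical helper: turns tab-delimited text into an array of [ady, fra] pairs.
function parseTsv(text) {
  return text
    .split('\n')
    .filter(function (line) { return line.length > 0; }) // drop the trailing empty line
    .map(function (line) { return line.split('\t'); });
}

// Write a small sample to a temporary file, then read it back.
var samplePath = path.join(os.tmpdir(), 'cheerio-tutorial-sample.txt');
fs.writeFileSync(samplePath, 'be\tbeaucoup\npˀe\tta main\n', 'utf8');

var pairs = parseTsv(fs.readFileSync(samplePath, 'utf8'));
console.log(pairs.length + ' entries; first: ' + pairs[0].join(' = '));
fs.unlinkSync(samplePath); // clean up the temporary file
```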
For reference, here is the entire content of ady-fra-Batouka.js:
var cheerio = require('cheerio');
var fs = require('fs');

// Read the HTML file and load it into cheerio.
var input = fs.readFileSync('ady-fra-Batouka.html', 'utf8');
var $ = cheerio.load(input);

// Build one tab-delimited line per dictionary entry.
var output = '';
$('tr.transcriptTable').each(function () {
  var ady = $(this).find('.word_form').text();
  var fra = $(this).find('.translation').text();
  output += ady + "\t" + fra + "\n";
});

// Write the result to a text file.
fs.writeFileSync('ady-fra-Batouka-0.txt', output, 'utf8');
Tips on using cheerio
cheerio implements a significant subset of the jQuery API, but there are some important differences. The cheerio docs summarize the methods available. This section covers some differences that may not be immediately obvious.
- In jQuery, there is a distinction between jQuery objects, created with $, and DOM objects. In cheerio, the browser DOM API is not available; the nodes behind a cheerio object are plain parser objects. You must restrict yourself to the subset of the jQuery API implemented by cheerio. In practice, this often ends up being cleaner, as you don’t mix APIs.
- In jQuery, you can access DOM elements returned by a selector with array indices, e.g., $('a')[0] will return the first <a> element in the document. In cheerio, the same expression returns a node object from the underlying parser rather than a browser DOM element. If you want more direct access to the raw HTML structure, you can access all of the attributes provided by the htmlparser module, which cheerio uses internally:
  - the type attribute contains “text” for a text node and “tag” for an HTML tag (compare nodeType in the DOM)
  - the data attribute contains the value of a text node (compare nodeValue in the DOM)
  - the name attribute contains the tag name of a tag node (compare tagName in the DOM)
- The jQuery .size() method does not exist in cheerio. To get the number of found elements in a query, use the length attribute, e.g. $('a').length.
Additional documentation
- Consult the cheerio documentation for a list of available methods on cheerio objects, i.e., methods available on $('some selector'). If a method is not listed in the cheerio documentation, it is not available in cheerio.
- Consult the jQuery documentation for a description of the cheerio/jQuery selector syntax. The cheerio documentation does not include this, since its syntax is identical to jQuery’s. A few jQuery selectors are not currently supported by cheerio, but the documentation does not specify which ones.
- Consult the MDN JavaScript reference and MDN JavaScript guide for a general overview of JavaScript.