Regular expressions | PanLex development

Introduction

A regular expression (abbreviated as ‘regex’) is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings, i.e. “find” or “find and replace”-like operations. For example, regexes can be used to find all instances in a text of the word ‘the’ followed by a word beginning with the letter ‘p’, say, or all occurrences of a word that can be spelled in different ways (such as ‘email’, ‘e-mail’, ‘Email’, ‘E-mail’, etc.). In Perl and Python, functions can also be called to split a line of text wherever a regular expression matches, such as at tab characters or commas, or at something more complex.
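As a rough illustration of both uses, here is a minimal sketch using Python's built-in re module. The sample strings and variable names are invented for this example and do not come from any PanLex source.

```python
import re

# Invented sample text containing several spellings of the same word.
text = "Send an e-mail or an Email; email works too."

# [Ee] allows either capitalization; -? makes the hyphen optional.
variants = re.findall(r"[Ee]-?mail", text)
print(variants)

# Split a line on either a tab or a comma followed by optional white space.
row = "rojo\tadj, red"
fields = re.split(r"\t|,\s*", row)
print(fields)
```

Here re.findall returns every non-overlapping match in order, and re.split cuts the line wherever the pattern matches.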

There are many introductions to regular expressions available on the web, organized around the syntax of regex searches and applied to searching text files. A great resource to start with is Regular-Expressions.info.

Rather than regurgitate all this information here, we instead draw your attention to a few tips that might be particularly useful when extracting data from a PanLex source.

Metacharacters

\d matches a single character that is a digit
\w matches a “word character” (alphanumeric characters plus the underscore, not punctuation marks)
\s matches a white space character (includes tabs and line breaks)
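. matches any single character (including spaces and tabs; in most regex flavors, including Perl and Python, it does not match a line break unless a “dot matches all” option is set)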
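The metacharacters above can be tried out with a short sketch like the following, which assumes Python's built-in re module; the sample strings are made up for illustration. Note that in Python the dot skips line breaks unless the re.DOTALL flag is passed.

```python
import re

sample = "eng-000\t42 words"  # invented sample line

digits = re.findall(r"\d", sample)       # every single digit
word_chars = re.findall(r"\w", "a_b!")   # letters, digits, underscore
spaces = re.findall(r"\s", sample)       # the tab and the space

# By default the dot does not match a line break in Python ...
dots = re.findall(r".", "a b\n")
# ... unless the DOTALL flag is given.
dots_all = re.findall(r".", "a b\n", re.DOTALL)

print(digits, word_chars, spaces, dots, dots_all)
```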

Repetition

The + operator matches one or more occurrences of the preceding element of the regex (a character, a metacharacter, or a parenthesized group). The * operator matches zero or more occurrences of the preceding element. So, the regular expression \w+ will match each word in the text (i.e. each continuous run of word characters, which excludes white space and punctuation, regardless of capitalization).
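A small sketch of the difference between + and *, assuming Python's re module; the sample sentence is invented for illustration.

```python
import re

text = "The cat, the Dog; and 3 birds."  # invented sample sentence

# \w+ picks out each run of word characters, skipping punctuation and spaces.
words = re.findall(r"\w+", text)
print(words)

# b* allows zero occurrences of 'b', so a bare 'a' matches;
# b+ requires at least one 'b', so a bare 'a' does not.
star_matches = re.fullmatch(r"ab*", "a") is not None
plus_matches = re.fullmatch(r"ab+", "a") is not None
print(star_matches, plus_matches)
```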

Note that in general, these operators are greedy, in that they will consume as much of the string as possible in their match. For example, say that your raw data was the line of HTML mark-up below and you wished to extract the information that was associated with the bold tags.

This is in <b>boldface</b> text.

You might write the simple regex <.+> hoping to match each tag, but it matches the whole substring <b>boldface</b>, since the greedy + operator will include as many . (i.e. wildcard) characters as possible before the final >.

Instead you can instruct the + operator to be lazy – i.e. to include as few characters as possible – by adding a ? after the operator. So, the regex <.+?> run against the same string will match the two substrings <b> and </b>.
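The greedy and lazy behaviors can be compared directly. This is a minimal sketch assuming Python's re module, run against the same line of mark-up; the third pattern, which captures the text between the tags, is an illustrative addition rather than something from the original text.

```python
import re

html = "This is in <b>boldface</b> text."

# Greedy: .+ runs to the last '>' it can reach.
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the first '>' it can reach.
lazy = re.findall(r"<.+?>", html)

# Illustrative extra: capture only the text between the bold tags.
inner = re.search(r"<b>(.+?)</b>", html).group(1)

print(greedy, lazy, inner)
```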

Anchors

The anchors ^ and $ indicate the beginning and end of a string, respectively. So, the regular expression ^abc matches the pattern ‘abc’ at the beginning of a string, and xyz$ matches the pattern ‘xyz’ at the end of a string.
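A brief sketch of the anchors, assuming Python's re module and invented test strings. The last two lines rely on Python's re.MULTILINE flag, which makes ^ and $ also match at the start and end of each line; other languages spell this option differently.

```python
import re

starts = re.search(r"^abc", "abcdef") is not None      # 'abc' at the start
not_start = re.search(r"^abc", "xabc") is not None     # 'abc' is not at the start
ends = re.search(r"xyz$", "uvwxyz") is not None        # 'xyz' at the end
not_end = re.search(r"xyz$", "xyz!") is not None       # 'xyz' is not at the end

# With MULTILINE, ^ anchors at the start of every line, not just the string.
two_lines = "rojo adj\nazul adj"
first_only = re.findall(r"^\w+", two_lines)
each_line = re.findall(r"^\w+", two_lines, re.MULTILINE)

print(starts, not_start, ends, not_end, first_only, each_line)
```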

Parentheses

Parentheses serve two functions in regular expressions. First, they are used to indicate the scope of the operators. For example, the regular expression co+ will match the substrings ‘co’, ‘coo’, ‘cooo’, etc, whereas the regex with parentheses (co)+ will match the substrings ‘co’, ‘coco’, ‘cococo’, etc.
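The scoping effect of parentheses can be checked with a sketch like this one, assuming Python's re module. re.fullmatch succeeds only if the whole string fits the pattern.

```python
import re

# Without parentheses, + applies only to the 'o'.
co_plus_cooo = re.fullmatch(r"co+", "cooo") is not None
co_plus_coco = re.fullmatch(r"co+", "coco") is not None

# With parentheses, + applies to the whole group 'co'.
group_coco = re.fullmatch(r"(co)+", "coco") is not None
group_cooo = re.fullmatch(r"(co)+", "cooo") is not None

print(co_plus_cooo, co_plus_coco, group_coco, group_cooo)
```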

Parentheses are also used to create capturing groups. Say that a raw source file contained a line of data as follows:

rojo, adj, red

In this case, we notice that commas are being used to separate three types of information in the source. A simple way to deal with this would be to split the line based on the presence of a comma. However, a potentially more useful method would be to write a regular expression that matches a string which includes three words separated by a comma and a space, as follows:

pattern = "(\w+), (\w+), (\w+)"

By placing parentheses around each of the words matched by the regular expression, capturing groups have been created. This means that the component parts can subsequently be referenced directly by an index. In this case, the matched string with index 1 would be ‘rojo’, with index 2 would be ‘adj’ and with index 3 would be ‘red’. (Note: index 0 returns the entire matched string ‘rojo, adj, red’.) In some cases, this method may prove to be a more convenient way to parse an original raw data file than simply splitting the line of data based on the presence of a comma.
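In Python, the capturing groups described above might be read back as follows. This is a minimal sketch assuming the built-in re module; the variable names headword, word_class and translation are illustrative labels, not PanLex terminology. The pattern is written as a raw string (r"...") so that the backslashes reach the regex engine unchanged.

```python
import re

pattern = r"(\w+), (\w+), (\w+)"
line = "rojo, adj, red"

match = re.match(pattern, line)
if match:
    whole = match.group(0)        # the entire matched string
    headword = match.group(1)     # first capturing group
    word_class = match.group(2)   # second capturing group
    translation = match.group(3)  # third capturing group
    all_groups = match.groups()   # all groups as a tuple
    print(whole, headword, word_class, translation, all_groups)
```

Because \w+ matches a single word only, a line whose translation contains a space would need a looser pattern for the last group.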

Using regular expressions in Perl and Python

These pages provide more details of the syntax of regular expressions in Perl and Python.

Practice

Here is a selection of simple exercises for you to hone your skills writing regular expressions. Test your regular expressions at Rubular or Regular Expression Tester. Your solutions should not match anything other than the specified strings.

1. Match all sequences of one or more spaces.
2. Match a string of text beginning and ending with parentheses.
3. Modify the regular expression in exercise 2 so that it successively matches each block of parenthesized text in the string ‘there (is) more than (one way) to do (it)’.
4. Match all sequences of whitespace at the beginning or end of the string, with one regular expression.
5. Match any common punctuation marks used in English. Use a character class with a list of punctuation characters.
6. Match the string if it doesn’t contain any common punctuation mark used in English.
7. Match the string if it is in the format of a PanLex language uniform identifier (e.g., eng-000 for language code eng, variety code 0).
8. Match the string if it is an initially capitalized single word.
9. Match the string if it can be parsed as an integer in base 10 (i.e., match 0, 5, 13, -57, etc.).
10. Modify the regular expression in exercise 9 to include numbers with decimal places.
11. Match the string if it begins and ends with the same word.
12. Match all characters in the string preceding an exclamation mark, without matching the exclamation mark.
13. Match the string if it consists of a headword, an optional parenthetical word class abbreviation (match at least n., v., adj., and adv.), a translation, and an optional parenthetical note. For example: ramps (n.) Bärlauch (Allium ursinum).

You can find additional exercises at Regex One.