# Learn Regular Expressions


## What are regular expressions?

Regular expressions provide a concise language for describing patterns in strings, which makes finding information easier and more efficient: a single regular expression can often accomplish what would otherwise take dozens of lines of code. While you can get a long way with built-in string methods such as `strip()` and `replace()`, they have their limitations. For example, how do you extract emails or hashtags from a text? Or how do you strip HTML tags from the source code of a web page? Regular expressions fill this gap and are a powerful addition to any data scientist’s toolkit!

Warning

At first sight, regular expressions can look daunting but don’t be put off! In this building block, we break down the syntax into tangible pieces so that you can get up to speed step by step.

## Code

### Match Characters

The `re` Python library contains a variety of functions to identify, split, and replace strings. The `findall()` function returns all non-overlapping matches of a character pattern. Patterns are usually written as raw strings (prefix `r`, e.g., `r"\d"`) so that Python leaves the backslashes untouched, followed by one or more symbols. Ordinary characters can be chained together with these symbols. For example, `r"\d\ds"` matches two digits (`\d`) followed by a lowercase letter `s`. The R equivalent is the `str_match_all` function from the `stringr` library. In Stata, you can use the function `regexm`; however, it only returns 1 if the expression matches and 0 otherwise.

| Symbol (Python) | Symbol (R) | Symbol (Stata) | Definition | Example |
| --- | --- | --- | --- | --- |
| `\d` | `\\d` | `[0-9]` | digit | 0, 1… 9 |
| `\s` | `\\s` | (a space) | whitespace | (a space) |
| `\w` | `\\w` | `[a-zA-Z0-9_]` | single letter, number, or underscore | a, 4, _ |
| `.` | `.` | `.` | any character | b, 9, !, (a space) |

Tip

Want to match everything except a certain type of character? Use the capitalized symbol instead. For instance, `\\D` in R matches any character that is not a digit.
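To make the tip concrete, here is a minimal Python sketch of the capitalized symbols. The `sample` string and the variable names are made up for illustration.

```python
import re

# Hypothetical sample string, chosen only for illustration
sample = "a1 b_2!"

non_digits = re.findall(r"\D", sample)  # every character that is not a digit
non_spaces = re.findall(r"\S", sample)  # every character that is not whitespace
non_word = re.findall(r"\W", sample)    # everything but letters, digits, and underscores

print(non_digits)  # ['a', ' ', 'b', '_', '!']
print(non_spaces)  # ['a', '1', 'b', '_', '2', '!']
print(non_word)    # [' ', '!']
```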

Warning

In Stata, regular expressions are much less flexible, but they are still widely used for more specific purposes. We therefore provide separate examples of regex usage for Stata.

The four examples below illustrate how to combine these symbols to extract various characters from `my_string`. For Stata, we will extract different parts of a postal address.
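As a minimal Python sketch of how these symbols combine, the snippet below runs four patterns through `findall()`. The contents of `my_string` and the variable names are assumptions made up for illustration; only the symbols themselves come from the table above.

```python
import re

# Hypothetical sample text; any string works here
my_string = "Sven has 12 cousins, 3 cats and 1 dog."

# 1. Every digit on its own
single_digits = re.findall(r"\d", my_string)

# 2. Two consecutive digits
two_digits = re.findall(r"\d\d", my_string)

# 3. A digit, a whitespace character, and a word character
digit_space_word = re.findall(r"\d\s\w", my_string)

# 4. Any two characters followed by a lowercase "s"
ends_with_s = re.findall(r"..s", my_string)

print(single_digits)     # ['1', '2', '3', '1']
print(two_digits)        # ['12']
print(digit_space_word)  # ['2 c', '3 c', '1 d']
print(ends_with_s)       # ['has', 'ous', 'ins', 'ats']
```

Note that `findall()` scans from left to right and never reuses a character, which is why the fourth pattern returns non-overlapping three-character chunks.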

Tip

Need further information on Stata regular expressions? Check this page, this one or the Stata documentation.

### Quantifiers

In the examples above, we explicitly formulated a pattern of 1, 2, or 3 characters, but in many cases the length is not known in advance. Quantifiers then offer flexibility by defining a search pattern of 0, 1, or more occurrences of a character. Note that a quantifier always applies to the character (or symbol) preceding it. For example, `r"\d+"` means one or more digits. The following quantifiers work in Python, R, and Stata.

| Symbol | Definition | Example |
| --- | --- | --- |
| `?` | zero or one | `colou?r` (you want to capture both `color` and `colour`) |
| `*` | zero or more | `\d\.\d*` (a single digit followed by zero or more decimals, e.g., 5.34, 8., 3.1) |
| `+` | one or more | `#\w+` (e.g., Twitter hashtags) |
| `{n,m}` | between n and m (`{n}` means exactly n) | `\d{2}-\d{2}-\d{4}` (dates, e.g., 05-03-2021) |
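The quantifiers from the table can be tried out with a short Python sketch. The `text` string and the variable names below are made up for illustration.

```python
import re

# Hypothetical sample text, chosen so that each pattern has something to find
text = "The colour (or color) of #regex and #python posts was rated 5.34 on 05-03-2021"

spellings = re.findall(r"colou?r", text)        # "u" is optional
hashtags = re.findall(r"#\w+", text)            # "#" plus one or more word characters
decimals = re.findall(r"\d\.\d*", text)         # digit, dot, zero or more digits
dates = re.findall(r"\d{2}-\d{2}-\d{4}", text)  # exactly 2, 2, and 4 digits

print(spellings)  # ['colour', 'color']
print(hashtags)   # ['#regex', '#python']
print(decimals)   # ['5.34']
print(dates)      # ['05-03-2021']
```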

### Alternates

| Symbol | Definition | Example |
| --- | --- | --- |
| `a\|b\|c` | or | `color\|colour` |
| `[abc]` | one of | `[\$€]` (a dollar or euro currency sign) |
| `[^abc]` | anything but | `[^q]` (any character except q) |
| `[a-z]` | range | `[a-zA-Z]+` (i.e., one or more lower or upper case letters) or `[0-9]` (digits) |
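A minimal Python sketch of the alternates in the table follows. The `text` string and the variable names are made up for illustration.

```python
import re

# Hypothetical sample text
text = "Prices: $5, €7 and £9; color or colour?"

either_spelling = re.findall(r"color|colour", text)  # one alternative or the other
currency_signs = re.findall(r"[\$€]", text)          # a dollar or euro sign
words = re.findall(r"[a-zA-Z]+", text)               # runs of letters only
no_q = re.findall(r"[^q]", "quiz")                   # every character except "q"

print(either_spelling)  # ['color', 'colour']
print(currency_signs)   # ['$', '€']
print(words)            # ['Prices', 'and', 'color', 'or', 'colour']
print(no_q)             # ['u', 'i', 'z']
```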

### Groups

More often than not, you want to capture text elements and use them in follow-up analyses. However, not every element of a pattern may be needed. Groups, denoted by parentheses `()`, indicate the pieces that should be kept. In Python, `findall()` then returns only the grouped pieces, and when a pattern contains more than one group, each match is stored as a tuple in the returned list.
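The effect of groups on the output of `findall()` can be seen in the sketch below. The email addresses and variable names are made up for illustration, and the pattern is deliberately simplistic rather than a complete email validator.

```python
import re

# Hypothetical sample text
text = "Contact: anna@example.com, bob@example.org"

# No groups: the full matches are returned
full = re.findall(r"\w+@\w+\.\w+", text)

# One group: only the captured piece is returned
users = re.findall(r"(\w+)@\w+\.\w+", text)

# Two groups: each match becomes a tuple
pairs = re.findall(r"(\w+)@(\w+\.\w+)", text)

print(full)   # ['anna@example.com', 'bob@example.org']
print(users)  # ['anna', 'bob']
print(pairs)  # [('anna', 'example.com'), ('bob', 'example.org')]
```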

### Text in Between Characters

Regular expressions can also be used to extract text in between characters. For instance, using the same string as before, if we want to extract the number of cousins that Sven has we can employ the following code.
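A minimal sketch of this idea in Python is shown below. The contents of `my_string` are an assumption made up for illustration; the pattern anchors on the text before and after the number and keeps only the group in between.

```python
import re

# Hypothetical sample text
my_string = "Sven has 12 cousins, 3 cats and 1 dog."

# Keep only the digits that sit between "has " and " cousins"
cousins = re.findall(r"has (\d+) cousins", my_string)

print(cousins)  # ['12']
```

The result is a list of strings, so the value still has to be converted (e.g., with `int(cousins[0])`) before doing arithmetic with it.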

### Split & Replace Data

Regular expressions can also be used to split (`re.split()` in Python and `strsplit` in R) or replace (`re.sub()` in Python and `gsub` in R) characters. While the built-in `split()` method can split on a single fixed separator (e.g., `;`), it cannot handle several different separators at once. Likewise, Python’s `replace()` method only substitutes a fixed piece of text. The same goes for the analogous R functions.
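As a short Python sketch, the snippet below splits on two different separators at once and replaces every run of digits. The input strings and variable names are made up for illustration.

```python
import re

# Hypothetical input with mixed separators
raw = "apples; pears, plums;cherries"

# Split on a semicolon or comma, plus any whitespace that follows it
fruits = re.split(r"[;,]\s*", raw)

# Replace every run of digits with a placeholder
masked = re.sub(r"\d+", "#", "Order 66 shipped in 3 days")

print(fruits)  # ['apples', 'pears', 'plums', 'cherries']
print(masked)  # Order # shipped in # days
```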

### Web Scraping

In addition to Beautiful Soup, you can apply regular expressions to parse HTML source code. Say the source code of a webpage consists of a book title, description, and a table:

```python
html_code = '''
<h2>A Light in the Attic</h2>
<p>It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers.</p>
<table class="table table-striped">
<tr>
<th>UPC</th><td>a897fe39b1053632</td>
</tr>
<tr>
<th>Price (incl. tax)</th><td>£51.77</td>
</tr>
<tr>
<th>Availability</th>
<td>In stock (22 available)</td>
</tr>
<tr>
<th>Number of reviews</th>
<td>0</td>
</tr>
</table>
'''
```

Then, we can easily capture the text between two tags, a part of a row, or a specific section of the source code. Since the HTML code is split across multiple lines, the pattern `.+` does not work as expected: `.` does not match a newline, so a match can never extend beyond a single line. If you print `html_code` to the console, you will indeed find that the lines are separated by a newline character (`\n`). As a workaround, you can use the set `[\s\S]+`, which captures both whitespace (`\s`, including newlines) and non-whitespace (`\S` = letters, digits, etc.). Note that in the examples below, we rely on groups `()` to select only the elements we are after.
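The sketch below illustrates these points in Python. To keep the snippet self-contained it uses a shortened copy of `html_code`; the variable names are made up for illustration.

```python
import re

# Shortened copy of html_code so that this snippet runs on its own
html_code = '''
<h2>A Light in the Attic</h2>
<table class="table table-striped">
<tr>
<th>UPC</th><td>a897fe39b1053632</td>
</tr>
<tr>
<th>Availability</th>
<td>In stock (22 available)</td>
</tr>
</table>
'''

# Text between two tags on the same line
title = re.findall(r"<h2>(.+)</h2>", html_code)
upc = re.findall(r"<th>UPC</th><td>(.+)</td>", html_code)

# A part of a row: only the number of items in stock
stock = re.findall(r"In stock \((\d+) available\)", html_code)

# The table spans several lines, so "." alone cannot bridge the newlines
table_dot = re.findall(r"<table.+</table>", html_code)
table_set = re.findall(r"<table[\s\S]+</table>", html_code)

print(title)           # ['A Light in the Attic']
print(upc)             # ['a897fe39b1053632']
print(stock)           # ['22']
print(len(table_dot))  # 0
print(len(table_set))  # 1
```

An alternative to `[\s\S]` is passing `flags=re.DOTALL` to `findall()`, which makes `.` match newlines as well.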

## Advanced Use Cases

### Greedy vs non-greedy

By default, regular expressions are greedy, which means that they match as many characters as possible (i.e., they return the longest match found). Let's look at an example to see what this means in practice. Say that we want to extract the text content of `my_string` and thus remove the HTML tags.

To do so, we try to replace the two paragraph tags (`<p>` and `</p>`) with an empty string using the pattern `<.+>`. Surprisingly, this returns an empty string (`''`). Why is that? After all, we would expect to see `This is a paragraph enclosed by HTML tags.`

It turns out that the `>` in `<.+>` matches the closing `>` of the `</p>` tag (instead of that of `<p>`), because `.+` consumes as many characters as it can. As a result, the entire string is replaced by an empty string! Fortunately, you can force the expression to match as few characters as possible (a.k.a. the non-greedy or lazy approach) by adding a `?` after the `+` symbol.
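The difference can be reproduced with the short Python sketch below. The value of `my_string` is an assumption reconstructed from the expected output mentioned above, and the variable names are made up for illustration.

```python
import re

# Assumed input: the expected sentence wrapped in paragraph tags
my_string = "<p>This is a paragraph enclosed by HTML tags.</p>"

# Greedy: ".+" runs up to the very last ">" in the string
greedy = re.sub(r"<.+>", "", my_string)

# Lazy: ".+?" stops at the first ">" it encounters
lazy = re.sub(r"<.+?>", "", my_string)

print(repr(greedy))  # ''
print(lazy)          # This is a paragraph enclosed by HTML tags.
```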

## See Also

• As you may have figured out by now, formulating regular expressions is often a matter of trial and error. An online regex editor that interactively highlights the phrases your regular expression captures can therefore be extremely helpful.

• Frequent applications of regular expressions are extracting dates and emails (and checking for validity), parsing webpage source data, and natural language processing. This blog post demonstrates how you can implement these ideas.

• This building block deliberately only revolved around the most common regex operations, but there are many more symbols and variations. This tutorial provides a more comprehensive list of commands (incl. examples!).