Learn Regular Expressions

What are regular expressions?

Regular expressions provide a concise language for describing patterns in strings. They make finding information easier and more efficient: you can often accomplish with a single regular expression what would otherwise take dozens of lines of code. And while you can often get a long way with Python's built-in strip() and replace() string methods, they have their own limitations. For example, how do you extract emails or hashtags from a text? Or how do you strip away HTML tags from the source code of a web page? Regular expressions fill this void and are a powerful skill for any data scientist's toolkit!

Warning

At first sight, regular expressions can look daunting but don't be put off! In this building block, we break down the syntax into tangible pieces so that you can get up to speed step by step.

Code

Match Characters

The re Python library contains a variety of functions to find, split, and replace strings. The findall() function returns all non-overlapping matches of a character pattern. In Python, a pattern is usually written as a raw string (r"..."), so that backslashes reach the regex engine unchanged, and it consists of one or more symbols. Please note that "regular" characters can be chained together with these symbols. For example, r"\d\ds" refers to 2 digits (\d) followed by a lower case letter s. The equivalent in R is the str_match_all function from the stringr library; because R strings treat the backslash as an escape character, each backslash must be doubled (e.g., "\\d"). In Stata, the function regexm can be employed. However, it only returns 1 if the string matches the expression and 0 otherwise.

| Python | R | Stata | Definition | Example |
| ---------- | ----- | ----------- | ----------------------------------- | ------------------ |
| \d | \\d | [0-9] | digit | 0, 1... 9 |
| \s | \\s | (a space) | whitespace | (a space) |
| \w | \\w | [a-zA-Z0-9_] | single letter, number, or underscore | a, 4, _ |
| . | . | . | any character (except a newline) | b, 9, !, (a space) |

Tip

Want to find everything except a certain type of character? Use the capital letter version of the symbol instead. For instance, \D in Python (\\D in R) matches any single character that is not a digit.
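As a minimal sketch of this tip in Python, the snippet below contrasts each symbol with its capitalised counterpart. The sample text and variable names are made up for illustration.

```python
import re

# hypothetical sample text, chosen only for illustration
text = "Room 42, Floor 3"

# \d matches a digit; \D matches any character that is not a digit
digits = re.findall(r"\d", text)
non_digits = re.findall(r"\D+", text)
print(digits)      # ['4', '2', '3']
print(non_digits)  # ['Room ', ', Floor ']

# \s matches whitespace; \S+ matches runs of non-whitespace characters
chunks = re.findall(r"\S+", text)
print(chunks)      # ['Room', '42,', 'Floor', '3']

# \w matches letters, digits, and underscores; \W matches everything else
non_word = re.findall(r"\W", text)
print(non_word)    # [' ', ',', ' ', ' ']
```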

Warning

In Stata, regular expressions are much less flexible. However, they are still widely used for more specific purposes, which is why we provide separate examples of regex usage for Stata.

The four examples below illustrate how to combine these symbols to extract various characters from my_string. For Stata, we will extract different parts of a postal address.

python
R


import re
my_string = "The 80s music hits were much better than the 90s."

# a single digit: ['8', '0', '9', '0']
print(re.findall(r"\d", my_string))

# double digits: ['80', '90']
# hereafter we learn how to write this more concisely using quantifiers
print(re.findall(r"\d\d", my_string))

# combinations of 3 characters (even if it's not a complete word)
# ['The', '80s', 'mus', 'hit', 'wer', 'muc', 'bet', 'ter', 'tha', 'the', '90s']
print(re.findall(r"\w\w\w", my_string))

# combinations of 3 characters that start and end with a space (note: the first "The" is skipped!): [' 80s ', ' the ']
print(re.findall(r"\s\w\w\w\s", my_string))


library(stringr)
my_string = "The 80s music hits were much better than the 90s."

# a single digit: ['8', '0', '9', '0']
str_match_all(my_string, "\\d")

# double digits: ['80', '90']
# hereafter we learn how to write this more concisely using quantifiers
str_match_all(my_string, "\\d\\d")

# combinations of 3 characters (even if it's not a complete word)
# ['The', '80s', 'mus', 'hit', 'wer', 'muc', 'bet', 'ter', 'tha', 'the', '90s']
str_match_all(my_string, "\\w\\w\\w")

# combinations of 3 characters that start and end with a space (note: the first "The" is skipped!): [' 80s ', ' the ']
str_match_all(my_string, "\\s\\w\\w\\w\\s")

Tip

Need further information on Stata regular expressions? Check this page, this one or the Stata documentation.

Quantifiers

In the examples above, we explicitly formulated a pattern of 1, 2, or 3 characters, but in many cases the length is not known in advance. Quantifiers then offer flexibility in defining a search pattern of 0, 1, or more occurrences of a character. Note that a quantifier always refers to the character or symbol preceding it. For example, r"\d+" means one or more digits. The following quantifiers work in Python, R, and Stata.

| Symbol | Definition | Example |
|:---- | :---- | :---- |
| ? | zero or one | colou?r (you want to capture both color and colour) |
| * | zero or more | \d\.\d* (a single digit followed by a dot and zero or more decimals - e.g., 5.34, 8., 3.1) |
| + | one or more | #\w+ (e.g., Twitter hashtags) |
| {n} | exactly n | \d{2}-\d{2}-\d{4} (dates e.g., 05-03-2021) |
| {n,m} | between n and m (omit m for "n or more") | \d{2,4} (two to four digits) |

python
R


import re
my_string = "The 80s music hits were much better than the 90s."

# all words: ['The', '80s', 'music', 'hits', 'were', 'much', 'better', 'than', 'the', '90s']
print(re.findall(r"\w+", my_string))

# one or more digits followed by a s: ['80s', '90s']
print(re.findall(r"\d+s", my_string))

# words that are both preceded and followed by two or more whitespace characters (i.e., to identify unnecessary spaces)
# returns [] here because my_string only contains single spaces
print(re.findall(r"\s{2,}\w+\s{2,}", my_string))


library(stringr)
my_string = "The 80s music hits were much better than the 90s."

# all words: ['The', '80s', 'music', 'hits', 'were', 'much', 'better', 'than', 'the', '90s']
str_match_all(my_string, "\\w+")

# one or more digits followed by a s: ['80s', '90s']
str_match_all(my_string, "\\d+s")

# words that are both preceded and followed by two or more whitespace characters (i.e., to identify unnecessary spaces)
# returns no matches here because my_string only contains single spaces
str_match_all(my_string, "\\s{2,}\\w+\\s{2,}")
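The example patterns from the quantifier table can be tried out directly. The snippet below is a small sketch with made-up input strings; the last pattern additionally uses \b (a word boundary), which is not covered in the table, to keep longer numbers from matching partially.

```python
import re

# ? : zero or one "u", so both spellings match
spellings = re.findall(r"colou?r", "color and colour")
print(spellings)      # ['color', 'colour']

# * : a digit, a literal dot, then zero or more digits
decimals = re.findall(r"\d\.\d*", "5.34 8. 3.1")
print(decimals)       # ['5.34', '8.', '3.1']

# + : a hash sign followed by one or more word characters
hashtags = re.findall(r"#\w+", "Loving #regex and #python3!")
print(hashtags)       # ['#regex', '#python3']

# {n} : exactly n repetitions (dd-dd-dddd)
dates = re.findall(r"\d{2}-\d{2}-\d{4}", "Due 05-03-2021, not 5-3-21.")
print(dates)          # ['05-03-2021']

# {n,m} : between n and m repetitions, here whole numbers of 2 or 3 digits
short_numbers = re.findall(r"\b\d{2,3}\b", "7 42 365 2021")
print(short_numbers)  # ['42', '365']
```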

Alternates

| Symbol | Definition | Example |
|:---- | :---- | :---- |
| a\|b\|c | or | color\|colour |
| [abc] | one of | [$€] (a dollar or euro currency sign) |
| [^abc] | anything but | [^q] (any character except q) |
| [a-z] | range | [a-zA-Z]+ (i.e., one or more lower or upper case letters) or [0-9] (digits) |

python
R


import re
my_string = "The 80s music hits were much better than the 90s."

# words that consist solely of lower case letters
# [' music', ' hits', ' were', ' much', ' better', ' than', ' the']
print(re.findall(r"\s[a-z]+", my_string))

# one or more letters followed by a space, one or more digits, and a letter s: ['The 80s', 'the 90s']
print(re.findall(r"[a-zA-Z]+\s\d+s", my_string))


library(stringr)
my_string = "The 80s music hits were much better than the 90s."

# words that consist solely of lower case letters
# [' music', ' hits', ' were', ' much', ' better', ' than', ' the']
str_match_all(my_string, "\\s[a-z]+")

# one or more letters followed by a space, one or more digits, and a letter s: ['The 80s', 'the 90s']
str_match_all(my_string, "[a-zA-Z]+\\s\\d+s")
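To see each row of the alternates table in action, here is a short Python sketch. The sample strings are invented for illustration only.

```python
import re

# hypothetical sample text
text = "Prices: $5, €7 and £9 for colour or color samples."

# a|b : either spelling
spellings = re.findall(r"color|colour", text)
print(spellings)    # ['colour', 'color']

# [abc] : a dollar or euro sign followed by one or more digits
# (inside square brackets, $ is a literal character)
prices = re.findall(r"[$€]\d+", text)
print(prices)       # ['$5', '€7']

# [^abc] : remove every character that is not a digit
digits_only = re.sub(r"[^0-9]", "", "tel: 555-0100 ext. 12")
print(digits_only)  # '555010012'

# [a-z] : an upper case letter followed by one or more lower case letters
capitalised = re.findall(r"[A-Z][a-z]+", text)
print(capitalised)  # ['Prices']
```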

Groups

More often than not, you want to capture text elements and use them in follow-up analyses. However, not every element of a pattern may be necessary. Groups, denoted by parentheses (), indicate the pieces that should be kept. In Python, findall() then returns a list of tuples, with one tuple per match and one element per group.

python
R


import re
my_string = "Lara has 2 sisters who also study in Tilburg. Mehmet has 1 sister who was born last year. Sven has 19 cousins who are all older than him."

# store the person's name, the number, and the type of relative.
# [('Lara', '2', 'sisters'), ('Mehmet', '1', 'sister'), ('Sven', '19', 'cousins')]
family = re.findall(r"([a-zA-Z]+)\s\w+\s(\d+)\s(\w+)", my_string)

# next, you can reference elements like you're used to, for example:
print(family[0][2])  # gives 'sisters'


library(stringr)
my_string = "Lara has 2 sisters who also study in Tilburg. Mehmet has 1 sister who was born last year. Sven has 19 cousins who are all older than him."

# store the person's name, the number, and the type of relative.
# returns a list with one matrix: column 1 holds the full match, columns 2-4 hold the groups
family = str_match_all(my_string, "([a-zA-Z]+)\\s\\w+\\s(\\d+)\\s(\\w+)")

# next, you can reference elements by row and column, for example:
print(family[[1]][1, 4])  # gives "sisters"
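As an illustration of such a follow-up analysis in Python, the sketch below unpacks the tuples returned by findall() into a dictionary and converts the captured numbers from strings to integers. The names relatives and total are chosen here for illustration.

```python
import re

my_string = "Lara has 2 sisters who also study in Tilburg. Mehmet has 1 sister who was born last year. Sven has 19 cousins who are all older than him."
family = re.findall(r"([a-zA-Z]+)\s\w+\s(\d+)\s(\w+)", my_string)

# unpack each tuple; captured groups are always strings, so convert the count
relatives = {name: (int(count), kind) for name, count, kind in family}
print(relatives["Sven"])  # (19, 'cousins')

# with integers in place, the counts can be used in calculations
total = sum(count for count, _ in relatives.values())
print(total)  # 22
```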

Text in Between Characters

Regular expressions can also be used to extract text in between characters. For instance, using the same string as before, if we want to extract the number of cousins that Sven has we can employ the following code.

python
R


import re
my_string = "Lara has 2 sisters who also study in Tilburg. Mehmet has 1 sister who was born last year. Sven has 19 cousins who are all older than him."

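# the lazy group (.+?) captures everything between the two markers, surrounding spaces included
found = re.findall(r"Sven has(.+?)cousins", my_string)
print(found)  # [' 19 ']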


library(stringr)
my_string = "Lara has 2 sisters who also study in Tilburg. Mehmet has 1 sister who was born last year. Sven has 19 cousins who are all older than him."

found = str_match_all(my_string, "Sven has(.+?)cousins")
print(found[[1]][[1,2]])
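If you need this more than once, the pattern can be wrapped in a small helper. The function text_between below is a hypothetical name, not part of the re library; it is a sketch that assumes the markers should be matched literally (hence re.escape) and that surrounding whitespace can be stripped.

```python
import re

def text_between(text, start, end):
    """Return the stripped text found between each start and end marker."""
    # re.escape ensures the markers are treated as literal text, not as regex symbols
    pattern = re.escape(start) + r"(.+?)" + re.escape(end)
    return [match.strip() for match in re.findall(pattern, text)]

my_string = "Lara has 2 sisters who also study in Tilburg. Mehmet has 1 sister who was born last year. Sven has 19 cousins who are all older than him."

print(text_between(my_string, "Sven has", "cousins"))  # ['19']
print(text_between(my_string, "has", "who"))           # ['2 sisters', '1 sister', '19 cousins']
```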

Split & Replace Data

Regular expressions can also be used to split (re.split() in Python and strsplit() in R) or replace (re.sub() in Python and gsub() in R) characters. While Python's built-in split() method can only split on one fixed separator at a time (e.g., ;), re.split() accepts a pattern that covers several separators at once. The same holds for replace() versus re.sub(), and for the analogous R functions.

python
R


import re
my_string = 'Last year was our most profitable year thus far. Our year-on-year growth grew by 14% to $10B!'

# split on both "!" and ".":
# ['Last year was our most profitable year thus far',' Our year-on-year growth grew by 14% to $10B','']
re.split(r"[!.]", my_string)

# hide confidential data:
# 'Last year was our most profitable year thus far. Our year-on-year growth grew by X% to $XB!'
re.sub(r"\d+", "X", my_string)


library(stringr)
my_string = 'Last year was our most profitable year thus far. Our year-on-year growth grew by 14% to $10B!'

# split on both "!" and "." (unlike Python's re.split(), strsplit() drops the empty string after the final "!"):
# "Last year was our most profitable year thus far" " Our year-on-year growth grew by 14% to $10B"
strsplit(my_string, "[!.]")

# hide confidential data:
# 'Last year was our most profitable year thus far. Our year-on-year growth grew by X% to $XB!'
gsub("\\d+", "X", my_string)
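To make the difference with the built-in string methods concrete, here is a small Python sketch on a made-up, messily delimited record.

```python
import re

# hypothetical record that mixes two separators and stray spaces
record = "apples; pears,bananas ;  kiwis"

# str.split() handles one fixed separator at a time
naive = record.split(";")
print(naive)   # ['apples', ' pears,bananas ', '  kiwis']

# re.split() accepts a pattern: a ; or , with optional whitespace around it
fruits = re.split(r"\s*[;,]\s*", record)
print(fruits)  # ['apples', 'pears', 'bananas', 'kiwis']

# re.sub() can collapse any run of whitespace into a single space
tidy = re.sub(r"\s+", " ", "too   many    spaces")
print(tidy)    # 'too many spaces'
```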

Web Scraping

In addition to Beautiful Soup, you can apply regular expressions to parse HTML source code. Say the source code of a web page consists of a book title, a description, and a table:

python
html_code = '''
<h2>A Light in the Attic</h2>
<p>It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers.</p>
<table>
    <tr>
        <th>UPC</th><td>a897fe39b1053632</td>
    </tr>
    <tr>
        <th>Price (incl. tax)</th><td>£51.77</td>
    </tr>
    <tr>
        <th>Availability</th>
        <td>In stock (22 available)</td>
    </tr>
    <tr>
        <th>Number of reviews</th>
        <td>0</td>
    </tr>
</table>
'''

Then, we can easily capture the text between two tags, a part of a row, or a specific section of the source code. Since the HTML code is split across multiple lines, the pattern .+ does not work as expected: by default, the dot matches any character except a newline, so a match can never run past the end of a line. If you print the html_code to the console, you also find that each line is separated by a newline character (\n). As a workaround, you can use the set [\s\S]+, which captures both whitespace (\s, including newlines) and non-whitespace (\S = letters, digits, etc.). Note that in the examples below, we rely on groups () to only select the elements we are after.

python


import re

# title: ['A Light in the Attic']
re.findall(r"<h2>(.+)</h2>", html_code)

# availability: ['22']
re.findall(r"(\d+) available", html_code)

# <h2> and <p> sections
re.findall(r"[\s\S]+</p>", html_code)

# table section
re.findall(r"<table[\s\S]*", html_code)
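The multi-line behaviour described above can be checked on a tiny example. The sketch below uses a made-up two-line snippet and also shows the re.DOTALL flag from Python's re module, which is an alternative to [\s\S] that makes the dot match newlines as well.

```python
import re

# hypothetical snippet with a paragraph that spans two lines
snippet = "<p>first line\nsecond line</p>"

# by default, the dot stops at the newline, so nothing matches
default_match = re.findall(r"<p>(.+)</p>", snippet)
print(default_match)  # []

# [\s\S] matches whitespace (including \n) and non-whitespace alike
set_match = re.findall(r"<p>([\s\S]+)</p>", snippet)
print(set_match)      # ['first line\nsecond line']

# alternatively, re.DOTALL lets the dot match newlines too
dotall_match = re.findall(r"<p>(.+)</p>", snippet, flags=re.DOTALL)
print(dotall_match)   # ['first line\nsecond line']
```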

Advanced Use Cases

Greedy vs non-greedy
By default, regular expressions follow a greedy approach, which means that they match as many characters as possible (i.e., they return the longest match found). Let's have a look at an example to see what this means in practice. Say that we want to extract the contents of my_string and thus remove the HTML tags.

To do so, we replace the two paragraph tags (<p> and </p>) with an empty string. Surprisingly, the result is an empty string (''). Why is that? After all, we would expect to see: This is a paragraph enclosed by HTML tags.

It turns out that the > in <.+> matches the closing > of the </p> tag (instead of the one in <p>). As a result, the entire sentence is replaced by an empty string! Fortunately, you can force the expression to match as few characters as needed (a.k.a. the non-greedy or lazy approach) by adding a ? after the + symbol.

python
R


import re
my_string = '<p>This is a paragraph enclosed by HTML tags.</p>'

# greedy approach: ''
re.sub(r"<.+>", "", my_string)

# non-greedy approach: 'This is a paragraph enclosed by HTML tags.'
re.sub(r"<.+?>", "", my_string)


library(stringr)
my_string = '<p>This is a paragraph enclosed by HTML tags.</p>'

# greedy approach: ''
gsub("<.+>", "", my_string)

# non-greedy approach: 'This is a paragraph enclosed by HTML tags.'
gsub("<.+?>", "", my_string)
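The difference becomes even clearer with findall() on a string that contains several tags. The sketch below uses a made-up HTML fragment; the third pattern, a negated set [^>], is a common alternative to the lazy quantifier and is shown here only for comparison.

```python
import re

# hypothetical fragment with two pairs of tags
html = "<b>bold</b> and <i>italic</i>"

# greedy: .+ runs on to the last > in the string
greedy = re.findall(r"<.+>", html)
print(greedy)   # ['<b>bold</b> and <i>italic</i>']

# lazy: .+? stops at the first > it encounters
lazy = re.findall(r"<.+?>", html)
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']

# negated set: one or more characters that are not >
negated = re.findall(r"<[^>]+>", html)
print(negated)  # ['<b>', '</b>', '<i>', '</i>']

# removing the tags with the lazy pattern keeps the text in between
stripped = re.sub(r"<.+?>", "", html)
print(stripped) # 'bold and italic'
```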
