CSc 250: Lecture Notes: Regex

Earlier this semester, we learned about regular expressions. As we’ve already discussed, regular expressions are used to search for and match patterns of characters in text. In particular, we primarily used regular expressions with the grep program in bash. In these notes, we’ll do some basic regular expression review, and then learn how to use them in python.

RegEx Review

As we know by now, regular expressions are simply strings that represent a pattern to match in another string.

A regular expression string is composed of three types of characters:

If you feel in need of extra practice to get back up to speed with regex, go back through the RegexOne Tutorial.

For review, let’s iterate through a few examples on regexr.com, and in bash. For each of the below snippets of text, come up with a regex that matches the top section but not the bottom:

WE ARE GOING TO LEARN
THIS. IS. REGEX.
UPPER CASE EH?
-------
We would LIKE to learn
this. is. also. regex.
lower CASE EH
what is your Name?
one two three
SEVEN EIGHT nine
who are you?
is this real?
-------
why wait?
when can you stop by?
one two THREE FOUR five six
hello!
-1234
+45834959
-5672
-------
23952
not a number
+type
-number
www.gmail.com
www.espn.com
www.arizona.edu
benjdd.com
code.org
-------
we are farmers
the website ends with dot-com
google.au
gee mail dot com
world wide web
amazon.ca

RegEx in Python

To use regular expressions in a python program, we must import a module named re. As with sys and sqlite3, the re module is built in to python by default, so there is no need to download or install anything separately.

The steps to use a regular expression in python are:

The first step is easy:

import re

Next, we need to come up with a regular expression to use. In this very simple demonstration, we’ll use a regular expression that matches a capitalized word that is two letters or longer:

regex_str = r'[A-Z][a-z]+'

In the next step, we’ll use the compile() function from the re module to compile the regex so that it can be used to do searching and matching. Compiling a regex converts it from a string to a python object which has the logic to do matching.

pattern = re.compile(regex_str)

Let’s check the type of the pattern variable:

>>> type(pattern)
<class 're.Pattern'>

re.Pattern is a special type that is defined by the re module (python versions before 3.7 display it as _sre.SRE_Pattern instead). It represents a particular regex pattern, which we specified with the regex_str string. We can now use the pattern variable to do searching and matching on other python strings!

We will discuss three different functions that the pattern variable has that we can call to check for the specified pattern:

findall() is (arguably) the simplest and easiest to understand of these, so we’ll focus on this function first.

A Simple findall() Example

Perhaps we have another variable named sentence in the program, declared in this way:

sentence = "The boy's father, George, lived in France for 17 years."

Let’s try using findall() to search for capitalized words in this sentence. To do this, we’ll use the same regex_str and pattern used in the last section. When called, findall() will return a list of all strings that it finds within the input string that match the pattern. So, what do you think this snippet of code will print (first think, then try!):

regex_str = '[A-Z][a-z]+'
pattern = re.compile(regex_str)
result = pattern.findall(sentence)
for item in result:
    print(item)

Run it to find out the right answer! Since a list is returned, we could also just print the list directly:

regex_str = '[A-Z][a-z]+'
pattern = re.compile(regex_str)
result = pattern.findall(sentence)
print(result)
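
As a small aside, the re module also provides a module-level findall() function that accepts the regex string directly, which can be convenient for one-off searches. The sketch below uses a made-up sample string (the text variable is an assumption for illustration, not one of the examples in these notes) and checks that both forms return the same list:

import re

```python
import re

# Hypothetical sample string, chosen only for illustration
text = "Alice met Bob in Paris last spring."

# Compile first, then search (the approach used in these notes)
pattern = re.compile(r'[A-Z][a-z]+')
compiled_result = pattern.findall(text)

# Module-level shortcut: pass the regex string and the text together
direct_result = re.findall(r'[A-Z][a-z]+', text)

print(compiled_result)                   # ['Alice', 'Bob', 'Paris']
print(direct_result == compiled_result)  # True
```

Compiling once is usually the better choice when the same pattern is reused many times, since the pattern object can simply be kept in a variable.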

Open up a python terminal session, and try using the same pattern to find matches in a few other strings:

sentence = "One TWO three Four FIVE six Seven EIGHT nine"
sentence = "It was the best of times, It was the worst of times."
sentence = "James, John, Sam, and Alex drove to the Tucson mall together."

Does it work the way you expect it to?

Why Use re?

Now that we’ve used the findall() function a few times, we can start to get a feel for how it works. However, let’s take a step back to consider an important question: Why use re/regular-expressions in the first place? Couldn’t we just write our own code to search for these things? Let’s take a look.

As a recap, this is the sequence of code necessary to search for capitalized words in a string:

import re
regex_str = '[A-Z][a-z]+'
pattern = re.compile(regex_str)
result = pattern.findall(sentence)

Assuming that the code-writer has some experience with regular expressions, this is rather simple code. In contrast, here is some code that accomplishes the same thing without using re (or any other module, for that matter):

result = []
i = 0
while i < len(sentence)-1:
    # A match starts at an upper-case letter followed by a lower-case letter
    if 'A' <= sentence[i] <= 'Z' and 'a' <= sentence[i+1] <= 'z':
        j = i+1
        # Extend over the lower-case letters, stopping at the end of the string
        while j < len(sentence) and 'a' <= sentence[j] <= 'z':
            j += 1
        result.append(sentence[i:j])
    i += 1

Try running both of these to be sure that they give the same results. Which of these is easier to understand? This is an arguable point, but I think most would agree that the former is much cleaner and more concise.
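
One way to make that comparison is sketched below. The function names capitalized_words_manual and capitalized_words_re, and the last sample sentence, are made up for this illustration. The manual version includes a bounds check in its inner loop so that a sentence ending in a capitalized word does not run past the end of the string.

```python
import re

def capitalized_words_manual(sentence):
    """Loop-based search for an upper-case letter followed by lower-case letters."""
    result = []
    i = 0
    while i < len(sentence) - 1:
        if 'A' <= sentence[i] <= 'Z' and 'a' <= sentence[i + 1] <= 'z':
            j = i + 1
            # Stop at the first non-lower-case character or the end of the string
            while j < len(sentence) and 'a' <= sentence[j] <= 'z':
                j += 1
            result.append(sentence[i:j])
        i += 1
    return result

def capitalized_words_re(sentence):
    """The same search, expressed as a regular expression."""
    return re.compile(r'[A-Z][a-z]+').findall(sentence)

samples = [
    "The boy's father, George, lived in France for 17 years.",
    "One TWO three Four FIVE six Seven EIGHT nine",
    "Ends with a capitalized Word",  # hypothetical edge case: match at the very end
]
for s in samples:
    assert capitalized_words_manual(s) == capitalized_words_re(s)
print("both approaches agree on all samples")
```

Agreement on a handful of samples is not a proof that the two are equivalent, but it is a quick sanity check.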

The usefulness of the re module becomes even more obvious when we need to match multiple patterns throughout a program. Perhaps later in the same program, we need to match another pattern, like this:

regex_str = r'\d+ \w+'
pattern = re.compile(regex_str)
result = pattern.findall(sentence)

In this case the pattern specifies a number, followed by a single space, followed by a word. We only had to make minimal changes to the code used before (only the regex_str had to change, to describe the new kind of string). How can this match be accomplished without re? One way to do so:

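result = []
i = 0
while i < len(sentence):
    if sentence[i].isdigit():
        # Scan past the run of digits (the \d+ part)
        nb = i
        while nb < len(sentence) and sentence[nb].isdigit():
            nb += 1
        # Require a single space, then scan a run of word characters (the \w+ part)
        wb = nb + 1
        if nb < len(sentence) and sentence[nb] == ' ':
            while wb < len(sentence) and (sentence[wb].isalnum() or sentence[wb] == '_'):
                wb += 1
        if wb > nb + 1:
            # At least one word character was found, so this is a match
            result.append(sentence[i:wb])
            i = wb
        else:
            i = nb
    else:
        i += 1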

As should be obvious, the code to accomplish this manually is far from trivial, and it required significant changes just to match a slightly different pattern. If you are a programmer who intends to do any type of text searching and processing, the re library is an invaluable tool if used correctly. You can save yourself a lot of code-writing with a well-designed regex!
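
To make the "only the regex_str changes" point concrete, here is a minimal sketch. The helper name find_matches and the sample sentence are assumptions made up for this illustration. The same few lines serve both patterns discussed in these notes:

```python
import re

def find_matches(regex_str, text):
    """Compile regex_str and return every non-overlapping match found in text."""
    pattern = re.compile(regex_str)
    return pattern.findall(text)

# Hypothetical sample sentence, chosen only for illustration
sample = "Maria paid 3 dollars and 45 cents in Denver."

capitalized = find_matches(r'[A-Z][a-z]+', sample)
number_word = find_matches(r'\d+ \w+', sample)

print(capitalized)  # ['Maria', 'Denver']
print(number_word)  # ['3 dollars', '45 cents']
```

Supporting a third pattern would only mean passing a different regex string; the searching logic itself stays untouched.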