Earlier in this semester, we learned about regular expressions.
As we’ve already discussed, regular expressions are used to search and match patterns of characters in text.
In particular, we primarily used regular expressions with the grep program in bash.
In these notes, we’ll do some basic regular expression review, and then learn how to use regular expressions in python.
As we know by now, regular expressions are simply strings that represent a pattern to match in another string.
A regular expression string is composed of three types of characters:
A “normal” character. These are the characters retaining their literal meaning. The simplest type of regex consists only of a character set, with no meta-characters. These “regular expressions” represent and match exactly what was typed.
An Anchor. These designate (anchor) the position in the line of text that the regex is to match. ^ and $ are anchors.
Modifiers. These expand or narrow (modify) the range of text the RE is to match. Modifiers include the asterisk, period, brackets, and the backslash, etc. Modifiers do not match the character that they literally represent.
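As a small illustration of the three character types, here is a minimal sketch using python's re module (which these notes cover in detail further down). The sample strings and variable names are made up for this example.

```python
import re

# Hypothetical sample lines, chosen only for this illustration.
lines = ["cat", "concat", "catalog", "ct", "caat"]

# Normal characters only: matches the literal text "cat" anywhere in the line.
literal = [s for s in lines if re.search(r'cat', s)]

# Anchors: ^ pins the match to the start of the line, $ to the end.
anchored = [s for s in lines if re.search(r'^cat$', s)]

# Modifier: * means "zero or more of the preceding character".
modified = [s for s in lines if re.search(r'^ca*t$', s)]

print(literal)   # ['cat', 'concat', 'catalog']
print(anchored)  # ['cat']
print(modified)  # ['cat', 'ct', 'caat']
```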
If you feel in need of extra practice to get back up to speed with regex, go back through the RegexOne Tutorial.
For review, let’s iterate through a few examples on regexr.com, and in bash. For each of the below snippets of text, come up with a regex that matches the top section but not the bottom:
WE ARE GOING TO LEARN
THIS. IS. REGEX.
UPPER CASE EH?
-------
We would LIKE to learn
this. is. also. regex.
lower CASE EH
what is your Name?
one two three
SEVEN EIGHT nine
who are you?
is this real?
-------
why wait?
when can you stop by?
one two THREE FOUR five six
hello!
-1234
+45834959
-5672
-------
23952
not a number
+type
-number
www.gmail.com
www.espn.com
www.arizona.edu
benjdd.com
code.org
-------
we are farmers
the website ends with dot-com
google.au
gee mail dot com
world wide web
amazon.ca
To use regular expressions in a python program, we must import a module named re.
As with sys and sqlite3, the re module is built in to python by default, so there is no need to download or install anything separately.
The steps to use a regular expression in python are: import the re module, write a regex string, compile it, and use the compiled pattern to search or match.
The first step is easy:
import re
Next, we need to come up with a regular expression to use. In this very simple demonstration, we’ll use a regular expression that matches a capitalized word that is two letters or longer:
regex_str = r'[A-Z][a-z]+'
In the next step, we’ll use the compile() function from the re module to compile the regex so that it can be used to do searching and matching.
Compiling a regex converts it from a string to a python object which has the logic to do matching.
pattern = re.compile(regex_str)
Let’s check the type of the pattern variable:
>>> type(pattern)
<class '_sre.SRE_Pattern'>
_sre.SRE_Pattern is a special type that is defined internally in the re module (python 3.7 and later display it as re.Pattern instead).
It represents a particular regex pattern, which we specified with the regex_str string.
We can now use the pattern variable to do searching and matching!
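Putting the steps so far together, a minimal sketch might look like the following. It prints the type name instead of assuming one, since the exact name depends on the python version.

```python
import re                        # step 1: import the module

regex_str = r'[A-Z][a-z]+'       # step 2: write the regex (the r prefix keeps backslashes literal)
pattern = re.compile(regex_str)  # step 3: compile it into a pattern object

# The compiled object remembers the string it was built from.
print(pattern.pattern)           # [A-Z][a-z]+

# The type name varies between python versions, so print it rather than assume it.
print(type(pattern).__name__)
```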
Matching and searching can occur on other python strings.
We will discuss three different functions of the pattern variable that we can call to check a string for the specified pattern:
findall()
search()
match()
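Before looking at each in turn, here is a rough sketch of how the three differ: findall() returns every match as a list, search() returns a match object for the first match anywhere in the string (or None), and match() only succeeds if the pattern matches at the very start of the string. The text variable is a made-up example string.

```python
import re

pattern = re.compile(r'[A-Z][a-z]+')
text = "the city of Tucson is in Arizona"   # hypothetical example string

all_matches = pattern.findall(text)   # list of every matching substring
first = pattern.search(text)          # match object for the first match, or None
at_start = pattern.match(text)        # only matches at position 0, otherwise None

print(all_matches)     # ['Tucson', 'Arizona']
print(first.group())   # Tucson
print(at_start)        # None
```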
findall() is (arguably) the simplest and easiest to understand of these, so we’ll focus on this function first.
Perhaps we have another variable named sentence in the program, declared in this way:
sentence = "The boy's father, George, lived in France for 17 years."
Let’s try using findall() to search for capitalized words in this sentence.
To do this, we’ll use the same regex_str and pattern used in the last section.
When called, findall() will return a list of all strings that it finds within the input string that match the pattern.
So, what do you think this snippet of code will print (first think, then try!):
regex_str = '[A-Z][a-z]+'
pattern = re.compile(regex_str)
result = pattern.findall(sentence)
for item in result:
    print(item)
Run it to find out the right answer! Since a list is returned, we could also just print the list directly:
regex_str = '[A-Z][a-z]+'
pattern = re.compile(regex_str)
result = pattern.findall(sentence)
print(result)
Open up a python terminal session, and try using the same pattern to find matches in a few other strings:
sentence = "One TWO three Four FIVE six Seven EIGHT nine"
sentence = "It was the best of times, It was the worst of times."
sentence = "James, John, Sam, and Alex drove to the Tucson mall together."
Does it work the way you expect it to?
Now that we’ve used the findall() function a few times, we can start to get a feel for how it works.
However, let’s take a step back to consider an important question:
Why use re (and regular expressions) in the first place?
Couldn’t we just write our own code to search for these things?
Let’s take a look.
As a recap, this is the sequence of code necessary to search for upper-case words in a string:
import re
regex_str = '[A-Z][a-z]+'
pattern = re.compile(regex_str)
result = pattern.findall(sentence)
Assuming that the code-writer has some experience with regular expressions, this is rather simple code.
In contrast, here is some code that accomplishes the same thing without using re (or any other module, for that matter):
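result = []
i = 0
while i < len(sentence)-1:
    # an upper-case letter followed by a lower-case letter starts a match
    if sentence[i] >= 'A' and sentence[i] <= 'Z':
        if sentence[i+1] >= 'a' and sentence[i+1] <= 'z':
            j = i+1
            # extend over the lower-case run, without running past the end of the string
            while j < len(sentence) and sentence[j] >= 'a' and sentence[j] <= 'z':
                j += 1
            result.append(sentence[i:j])
    i += 1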
Try running both of these, to be sure that they give the same results. Which of these is easier to understand? This is an arguable point, but I think most would agree that the former is much cleaner and more concise.
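One way to check that the two approaches agree is to wrap each in a function and compare them on a few strings. This is only a sketch: the function names and the third sample string are made up for illustration, and the hand-written version includes a bounds check on j so that a capitalized word at the very end of a string does not run past it.

```python
import re

def find_with_re(sentence):
    """Capitalized words, found with a compiled regex."""
    return re.compile(r'[A-Z][a-z]+').findall(sentence)

def find_by_hand(sentence):
    """Capitalized words, found with manual index arithmetic."""
    result = []
    i = 0
    while i < len(sentence) - 1:
        if 'A' <= sentence[i] <= 'Z' and 'a' <= sentence[i + 1] <= 'z':
            j = i + 1
            # extend over the run of lower-case letters, staying inside the string
            while j < len(sentence) and 'a' <= sentence[j] <= 'z':
                j += 1
            result.append(sentence[i:j])
        i += 1
    return result

samples = [
    "The boy's father, George, lived in France for 17 years.",
    "One TWO three Four FIVE six Seven EIGHT nine",
    "Ends with a capitalized Word",
]
for s in samples:
    assert find_with_re(s) == find_by_hand(s), s
print("both approaches agree on every sample")
```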
The usefulness of the re module becomes even more obvious when we need to match multiple patterns throughout a program.
Perhaps later in the same program, we need to match another pattern, like this:
regex_str = r'\d+ \w+'
pattern = re.compile(regex_str)
result = pattern.findall(sentence)
In this case the pattern specifies a number, followed by a space and a word.
We only had to make minimal changes to the code used before (only the regex_str had to change, to support a new type of string).
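To make the point concrete, here is a small sketch in which one helper serves both patterns and only the regex string differs between calls. The find_all name is hypothetical, not part of the re module.

```python
import re

def find_all(regex_str, text):
    """Compile regex_str and return every non-overlapping match in text."""
    return re.compile(regex_str).findall(text)

sentence = "The boy's father, George, lived in France for 17 years."

capitalized = find_all(r'[A-Z][a-z]+', sentence)   # capitalized words
number_word = find_all(r'\d+ \w+', sentence)       # a number, a space, then a word

print(number_word)  # ['17 years']
```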
How can this match be accomplished without re?
One way to do so:
result = []
i = 0
while i < len(sentence):
    nb = i
    # consume a run of digits starting at i
    while nb < len(sentence) and sentence[nb].isdigit():
        nb += 1
    # need at least one digit, followed by a space
    if nb > i and nb < len(sentence) and sentence[nb] == ' ':
        wb = nb + 1
        # consume the word characters (letters, digits, underscore) after the space
        while wb < len(sentence) and (sentence[wb].isalnum() or sentence[wb] == '_'):
            wb += 1
        if wb > nb + 1:
            result.append(sentence[i:wb])
            i = wb
            continue
    i += 1
As should be obvious, the code to accomplish this manually is far from trivial, and it required significant changes to match only a slightly different pattern.
If you are a programmer who intends to do any type of text searching and processing, the re library is an invaluable tool if used correctly.
You can save yourself a lot of code-writing with a well-designed regex!