CSc 250: Lecture Notes: Finding Email Addresses with re

In these notes, we will learn how to write a Python application that extracts email addresses from a text file using the re module.

Finding Email Addresses

In this exercise, we are going to write a program that uses the re module to extract email addresses from a text file. For starters, we’ll get the program to extract email addresses from a hard-coded string; the file reading can be added later. For the time being, let’s assume that we want to extract the emails from the following string:

text = '''
  If you would like to get advising help, send a message to advising@cs.arizona.edu.
  Send email to webmaster@cs.arizona.edu if you'd like to report issues with the CS website.
  The best email to contact the instructor at is either:
    instructorname@email.arizona.edu or instructorname@cs.arizona.edu.
'''

Next step: what regular expression do we need?

What pattern should we be looking for to extract emails from text?

In English, what we’re looking for is: a sequence of letters and symbols, followed by an “@” sign, followed by another sequence of letters and symbols that has at least one dot in it, and ends with a letter. In regex, a first rough version of this (allowing only letters and digits, and a single dot after the “@”) would be:

[A-Za-z0-9]+@[A-Za-z0-9]+\.[A-Za-z0-9]+
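To see what each piece of the pattern does, it can help to write the same regex with re.VERBOSE, which lets us spread the pattern over several lines and comment each part. This is only a sketch: the name EMAIL_PARTS and the sample sentence are made up for illustration.

```python
import re

# The same pattern as above, split into commented parts with re.VERBOSE.
# (EMAIL_PARTS is a hypothetical name, not part of the exercise.)
EMAIL_PARTS = re.compile(r'''
    [A-Za-z0-9]+   # one or more letters or digits (before the @)
    @              # a literal @ sign
    [A-Za-z0-9]+   # one or more letters or digits (after the @)
    \.             # a literal dot (escaped, since a bare dot matches any character)
    [A-Za-z0-9]+   # one or more letters or digits (after the dot)
''', re.VERBOSE)

sample = 'Write to someone@example.com for details.'  # made-up sample text
match = EMAIL_PARTS.search(sample)
print(match.group())  # someone@example.com
```

In verbose mode, whitespace and anything after a # are ignored (outside of square brackets), so this compiles to the same pattern as the one-line version.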

Now that the regex string is defined, let’s put the code together to do the matching:

import re

text = '''
  If you would like to get advising help, send a message to advising@cs.arizona.edu.
  Send email to webmaster@cs.arizona.edu if you'd like to report issues with the CS website.
  The best email to contact the instructor at is either:
    instructorname@email.arizona.edu or instructorname@cs.arizona.edu.
'''

regex_str = r'[A-Za-z0-9]+@[A-Za-z0-9]+\.[A-Za-z0-9]+'
pattern = re.compile(regex_str)
result = pattern.findall(text)

for email in result:
    print(email)

Run the code and try it out… does it work?
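If you want to check your answer, the sketch below runs the pattern on two of the sentences and compares it with one possible variation. The variant (called multi_dot here, a made-up name) is an assumption about how the pattern could be extended, not the only way to do it: it wraps the "dot followed by letters or digits" part in a non-capturing group and lets that group repeat.

```python
import re

text = '''
  If you would like to get advising help, send a message to advising@cs.arizona.edu.
  Send email to webmaster@cs.arizona.edu if you'd like to report issues with the CS website.
'''

# Pattern from the notes: allows exactly one dot after the "@".
one_dot = re.compile(r'[A-Za-z0-9]+@[A-Za-z0-9]+\.[A-Za-z0-9]+')

# Hypothetical variant: (?:\.[A-Za-z0-9]+)+ repeats the "dot plus letters/digits" part.
multi_dot = re.compile(r'[A-Za-z0-9]+@[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)+')

print(one_dot.findall(text))    # ['advising@cs.arizona', 'webmaster@cs.arizona']
print(multi_dot.findall(text))  # ['advising@cs.arizona.edu', 'webmaster@cs.arizona.edu']
```

The first pattern stops after the first dot-separated piece, so the ".edu" is cut off. Neither pattern accepts dots, underscores, or dashes before the "@", so both are still rough approximations of real email addresses.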

Reading from a File

Let’s modify the program to search for emails in a file, rather than just a fixed string.

First, the program should ask the user for the name of the file to read and search through:

import sys
print('Enter a file to search through:')
input_file = sys.stdin.readline().strip()

Next, the program will read the file and append all of its contents to the text string, which will be searched with the regular expression later.

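text = ''
# Open the file; the with block closes it automatically when done
with open(input_file, 'r') as f:
    for line in f:
        # Strings have no append(), so use += to add each line
        text += line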
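As a variation on the loop above, the whole file can also be read in one call with read(). The sketch below wraps this in a helper that reports a file that cannot be opened; the helper name read_text and the temporary demo file are assumptions made for illustration only.

```python
import os
import tempfile

def read_text(file_name):
    """Return the whole contents of file_name as one string, or '' if it cannot be read."""
    try:
        # The with block closes the file even if an error occurs while reading.
        with open(file_name, 'r') as f:
            return f.read()
    except OSError as err:
        print('Could not read', file_name, '-', err)
        return ''

# Demo: write a small temporary file, read it back, then remove it.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('line one\nline two\n')
    demo_path = tmp.name

demo_text = read_text(demo_path)
os.remove(demo_path)
print(repr(demo_text))
```

Returning an empty string on failure keeps the rest of the program simple: the regular expression search just finds no matches.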

Then, do the regular expression search, just as before:

regex_str = r'[A-Za-z0-9]+@[A-Za-z0-9]+\.[A-Za-z0-9]+'
pattern = re.compile(regex_str)
result = pattern.findall(text)
for email in result:
    print(email)

Putting it all together:

import sys
import re

print('Enter a file to search through:')
input_file = sys.stdin.readline().strip()

text = ''
# The with block closes the file automatically when the loop finishes
with open(input_file, 'r') as f:
    for line in f:
        text += line

regex_str = r'[A-Za-z0-9]+@[A-Za-z0-9]+\.[A-Za-z0-9]+'
pattern = re.compile(regex_str)
result = pattern.findall(text)

for email in result:
    print(email)

Try running this script on this large text file, which contains several email addresses:

chapter-2.txt
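For a large file, one possible variation is to search one line at a time instead of first building one big string. Since an address does not continue across a line break, this should find the same matches. This is a sketch under that assumption; the function name find_emails is made up, and io.StringIO stands in for an open file so the demo runs without a file on disk.

```python
import io
import re

# Same pattern as in the notes above.
EMAIL_PATTERN = re.compile(r'[A-Za-z0-9]+@[A-Za-z0-9]+\.[A-Za-z0-9]+')

def find_emails(lines):
    """Return every pattern match found in an iterable of lines (e.g. an open file)."""
    found = []
    for line in lines:
        # findall returns a list of matches for this line; add them all to found
        found.extend(EMAIL_PATTERN.findall(line))
    return found

# io.StringIO behaves like an open text file: looping over it yields lines.
fake_file = io.StringIO('first line, no address\nwrite to user1@example.org today\n')
for email in find_emails(fake_file):
    print(email)  # user1@example.org
```

With a real file, the call would look like find_emails(f) inside a with open(input_file, 'r') as f: block.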