In these notes, we will learn how to write a Python application that extracts email addresses from a text file using the re module.
In this exercise, we are going to write a program that uses the re module to extract email addresses from a text file.
For starters, we’ll get the program to extract email addresses from a hard-coded string, and the file reading can be added later.
For the time being, let’s assume that we want to extract the emails from the following string:
text = '''
If you would like to get advising help, send a message to advising@cs.arizona.edu.
Send email to webmaster@cs.arizona.edu if you'd like to report issues with the CS website.
The best email to contact the instructor at is either:
instructorname@email.arizona.edu or instructorname@cs.arizona.edu.
'''
Next step: what regular expression do we need?
What pattern should we be looking for to extract emails from text?
In English, what we’re looking for is: a sequence of letters and symbols, followed by an “@” sign, followed by another sequence of letters and symbols that has at least one dot in it, and ends with a letter. In regex, this would (roughly) be:
[A-Za-z0-9]+@[A-Za-z0-9]+\.[A-Za-z0-9]+
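To make the structure of this pattern easier to see, the same expression can be written with re.VERBOSE, which lets a pattern span several lines and carry comments. This is only an annotated sketch of the pattern above; the names annotated and sample, and the sample string itself, are made up for illustration.

```python
import re

# The same pattern as above, spread out with re.VERBOSE so each piece
# can carry a comment. Whitespace and comments inside the pattern are ignored.
annotated = re.compile(r'''
    [A-Za-z0-9]+    # one or more letters or digits (the part before the @)
    @               # a literal @ sign
    [A-Za-z0-9]+    # one or more letters or digits (start of the domain)
    \.              # a literal dot (escaped, since a bare . matches any character)
    [A-Za-z0-9]+    # one or more letters or digits after the dot
''', re.VERBOSE)

# Hypothetical sample string, used only for illustration.
sample = 'Write to someone@example.com for details.'
matches = annotated.findall(sample)
print(matches)
```

Note that the dot has to be escaped as \. because an unescaped dot would match any character, not just a period.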
Now that the regex string is defined, let’s put the code together to do the matching:
import re
text = '''
If you would like to get advising help, send a message to advising@cs.arizona.edu.
Send email to webmaster@cs.arizona.edu if you'd like to report issues with the CS website.
The best email to contact the instructor at is either:
instructorname@email.arizona.edu or instructorname@cs.arizona.edu.
'''
regex_str = r'[A-Za-z0-9]+@[A-Za-z0-9]+\.[A-Za-z0-9]+'
pattern = re.compile(regex_str)
result = pattern.findall(text)
for email in result:
    print(email)
Run the code and try it out… does it work?
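One thing to check in the output is whether each address is printed in full. The pattern above allows only one dot after the @ sign, so for an address such as advising@cs.arizona.edu, findall stops after the second part of the domain. The sketch below compares it with one possible refinement. The pattern named refined is an assumption for illustration, not part of the exercise and not a complete email validator: it lets the "dot followed by a label" group repeat and allows a few extra symbols before the @.

```python
import re

sample = 'Send a message to advising@cs.arizona.edu.'

original = re.compile(r'[A-Za-z0-9]+@[A-Za-z0-9]+\.[A-Za-z0-9]+')
# Assumed refinement: (?:...)+ repeats the "dot followed by a label" group,
# and the first character class also accepts . _ % + and -.
refined = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+')

print(original.findall(sample))  # ['advising@cs.arizona']
print(refined.findall(sample))   # ['advising@cs.arizona.edu']
```

The group is written as (?:...) rather than (...) because findall returns only the captured groups when a pattern contains capturing parentheses; a non-capturing group keeps the result as the full matched address.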
Let’s modify the program to search for emails in a file, rather than just a fixed string.
First, the program should get a file name from a user that will be read and searched through:
import sys
print('Enter a file to search through:')
input_file = sys.stdin.readline().strip()
Next, the program will read the file and append all of the contents to the text string, which will be searched through with the regular expression later.
text = ''
f = open(input_file, 'r')
for line in f:
    text += line
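The same read step can also be sketched with a with block, which closes the file automatically once the block ends, and with read(), which returns the whole file as one string. The helper name read_text and the temporary file used to try it out are hypothetical and exist only for this illustration.

```python
import os
import tempfile

def read_text(path):
    """Return the full contents of the file at path as a single string."""
    # The with block closes the file even if reading raises an error.
    with open(path, 'r') as f:
        return f.read()

# Try it on a throwaway file (created here only for illustration).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('Contact webmaster@cs.arizona.edu\nfor website issues.\n')
    tmp_path = tmp.name

text = read_text(tmp_path)
os.remove(tmp_path)
print(text)
```

Either approach leaves the whole file in text, so the regular expression search that follows works the same way.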
Then, do the regular expression search, just as before:
regex_str = r'[A-Za-z0-9]+@[A-Za-z0-9]+\.[A-Za-z0-9]+'
pattern = re.compile(regex_str)
result = pattern.findall(text)
for email in result:
    print(email)
Putting it all together:
import sys
import re
print('Enter a file to search through:')
input_file = sys.stdin.readline().strip()
text = ''
f = open(input_file, 'r')
for line in f:
    text += line
regex_str = r'[A-Za-z0-9]+@[A-Za-z0-9]+\.[A-Za-z0-9]+'
pattern = re.compile(regex_str)
result = pattern.findall(text)
for email in result:
    print(email)
Try running this script on this large text file, which contains several emails: