This lab was designed to be completed during lab-time for cs250. This lab is not graded for correctness. However, you should work through all of the problems, as it will help you on assignments and exams. If you cannot complete the lab in the alloted time, finish at home.
This particular lab was designed to be completed on the Mac lab machines in GS 930
Did you really do problem 1?
Seriously?
Most of the example data sets we have encountered in class have been “clean” data sets. By that, I mean that we’ve been looking at data files (mostly CSV) that are correctly formatted, and the information within it is neatly structured. However, out in the real-world, data is not always this clean. Often, data scientists and programmers need to write programs that do additional processing on data to format it in such a way that it can be better used with other programs. These programs could be custom ones written by the programmer, or standard ones like pages, excel, and SQLite.
The next few problems will lead up to Should help you “warm-up” to the more difficult last problem, in which you will be writing a script to clean a CSV file. If you don’t make it to the last in lab time, don’t worry. Though, I do recommend you continue on your own.
Take a look at this line of python:
''.join(random.choice(string.ascii_letters) for i in range(12))
Open up a python terminal, run import random
and import string
, and then try running this line ~5 or so times.
What does it spit out on each run?
Is the result the same each time?
What this line of code does is generate a string of random characters of length 12.
The length can be easily changed, by changing the 12
at the end of the line of code.
You don’t need to fully understand how this works yet, because it uses some modules that we’ve not yet learned in class.
Create a file named lab13.py
, and in it write a function named get_random_csv_row
.
This function should take two arguments: The first is a number representing how many columns the row should have, and the second is the number of characters that should be in each value.
The function should create this as a string, and then return it.
Below is a python session in which I import lab13
and then test this function.
Your’s should behave similarly:
>>> import lab13
>>> lab13.get_random_csv_row(3,3)
'gqR, GMo, nhl'
>>> lab13.get_random_csv_row(3,9)
'ROhcQpUWs, FYLWIonXX, ebDINVOVW'
>>> lab13.get_random_csv_row(7,2)
'EG, Sa, mH, ZQ, ui, qY, eg'
>>> lab13.get_random_csv_row(20,2)
'Fw, ey, Dp, ZN, Ue, sY, ym, Cz, uh, gh, Vc, ql, hV, pZ, ij, IR, Ei, pd, Ur, pL'
We’ve encountered a lot of CSV files in this class. Below is a “randomly” selected example line from such a file:
1,Allen,Miss Elisabeth Walton,1st,29,63,female,1,1,76,19,Engineer
At other times, one can encounter lines in a CSV file with missing data points:
1,,Miss Elisabeth Walton,1st,,,female,1,1,,19,Engineer
Notice that the line has the same number of commas (which is good), but some commas have no “value” in-between them. Instead of leaving the value as blank, we may want to populate those with an indicator representing that i’s a missing or incomplete values.
Add a function named replace_empty_csv_vals
to lab13.py
that will do this replacement.
This function should take one argument (a string, which is a CSV line).
It will return a different string, with all of the empty values replaced by ‘UNK’ (for “unknown”).
Essentially, you want to replace each ,,
with ,UNK,
, but note that some comma pairs (,,
) could be overlapping, like ,,,,
.
Below is an example python session, demonstrating how it should work:
>>> import lab13
>>> lab13.replace_empty_csv_vals('LeBron,James,,23,,,Forward,,,2')
'LeBron,James,UNK,23,UNK,UNK,Forward,UNK,UNK,2'
>>> lab13.replace_empty_csv_vals(',,Jiminy,,Smith,,,')
'UNK,UNK,Jiminy,UNK,Smith,UNK,UNK,UNK'
>>> lab13.replace_empty_csv_vals(',,,')
'UNK,UNK,UNK,UNK'
>>> lab13.replace_empty_csv_vals('Sally,Sander,34,,,sally@gmail.com,,,,,,,13721,,,,520-904-1788,,,')
'Sally,Sander,34,UNK,UNK,sally@gmail.com,UNK,UNK,UNK,UNK,UNK,UNK,13721,UNK,UNK,UNK,520-904-1788,UNK,UNK,UNK'
Notice that if the input string begins with a comma, UNK
should be added to the beginning.
Similarly for the end of the string.
For this problem, let’s say you have a csv file named Titanic.csv
, similar to the Titanic data set from vincentarelbundock.github.io.
This csv file represents information about each person who ws on the Titanic before it sank.
The first few lines of this file look like this:
ID,Name,PClass,Age,Sex,Survived,SexCode
1,Allen, Miss Elisabeth Walton,1st,29,female,1,1,
2,Allison, Miss Helen Loraine,1st,2,female,0,1
3,Allison, Mr Hudson Joshua Creighton,1st,30,male,0,0
4,Allison, Mrs Hudson JC (Bessie Waldo Daniels),1st,25,female,0,1,
5,Allison, Master Hudson Trevor,1st,0.92,male,1,0
...
As we can see, the first row describes the name of each of the 7 columns. The first few rows below seem to be well formatted. But let’s take a look at the rest of the file:
ID,Name,PClass,Age,Sex,Survived,SexCode
1,Allen, Miss Elisabeth Walton,1st,29,female,1,1,
2,Allison, Miss Helen Loraine,1st,2,female,0,1
3,Allison, Mr Hudson Joshua Creighton,1st,30,male,0,0
4,Allison, Mrs Hudson JC (Bessie Waldo Daniels),1st,25,female,0,1,
5,Allison, Master Hudson Trevor,1st,0.92,male,1,0
6,Anderson, Mr Harry,1st,47,male,1,0
??
7,Andrews, Miss Kornelia Theodosia,1st,63,female,1,1 , , , , ,,,,,
8,Andrews, Mr Thomas, jr,1st,39,male,0,0
9,Appleton, Mrs Edward Dale (Charlotte Lamson),1st,58,female,1,1,,,,,,,,
10,Artagaveytia, Mr Ramon,1st,71,male,0,0
11,Astor, Colonel John Jacob,1st,47,male,0,0
12,Astor, Mrs John Jacob (Madeleine Talmadge Force),1st,19,female,1,1, A,B,C,D,E
13,Aubert, Mrs Leontine Pauline,1st,NA,female,1,1
14,Barkworth, Mr Algernon H,1st,NA,male,1,0 , ? , ? , ? , ?
??
15,Baumann, Mr John D,1st,NA,male,0,0
16,Baxter, Mrs James (Helene DeLaudeniere Chaput),1st,50,female,1,1
??
XXXXX
17,Baxter, Mr Quigg Edmond,1st,24,male,0,0
18,Beattie, Mr Thomson,1st,36,male,0,0
19,Beckwith, Mr Richard Leonard,1st,37,male,1,0
20,Beckwith, Mrs Richard Leonard (Sallie Monypeny),1st,47,female,1,1
ABC
XXXXXXXX
21,Behr, Mr Karl Howell,1st,26,male,1,0
22,Birnbaum, Mr Jakob,1st,25,male,0,0
23,Bishop, Mr Dickinson H,1st,25,male,1,0
24,Bishop, Mrs Dickinson H (Helen Walton),1st,19,female,1,1 , , , , , ,
XXX
25,Bjornstrm-Steffansson, Mr Mauritz Hakan,1st,28,male,1,0
XX
Yikes! Clearly, many of the rows further down are messed up. If you take a close look, you’ll see that there are a few problems:
??
and XXX
If you tried importing this into a table in SQLite 3, it would fail miserably.
In this problem, your task is to write a python script that will fix this file.
No fixing the file by hand! That’s cheating!
Your program should read the Titanic.csv
file in line-by-line.
For each line, determine if it has any of the above-described problems.
If so, fix it, then re-write it to a new file named CleanTitanic.csv
You’ll be able to tell your script is working correctly when you can successfully import CleanTitanic.csv
into a SQLite database without errors.