CSc 250 - Final Project - Data Exploration

In this project, you’ll be selecting, analyzing, and visualizing a large data set. This project will be somewhat self-guided, as you’ll see below.

Data Selection

The first step for this project is to select a data set that you want to work with. I am not providing you with a particular data set that you must use. Instead, you are tasked with finding your own. This can either be a data set that you own or create, or it can be something you find online. Ideally, you should choose a data set that pertains to something you are personally interested in. However, there are a few requirements that the data set must meet:

Below are links to places where you might be able to find some interesting data. You don't have to use these, but you may.

You can also combine more than one data set, if it makes sense to do so. If you find a data set that you want to work with and it does not fit into these requirements, ask your instructor for permission to use it. If you need help finding an interesting data set, also talk to your instructor.

Database Population

Once you have your data, you need to get this data into a SQLite database. You should write a script named load_data.py which will be responsible for this loading process. If you want, you can use the load_data.py script from the last assignment as a starting point. You can modify that script to work with your particular data set. Your script should create the schema, and then do all of the data loading.
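As a rough illustration of the shape load_data.py might take, here is a minimal sketch. It assumes a single CSV file with the columns category, year, and value; the table name measurements, the file names project.db and data.csv, and the function names are all hypothetical placeholders, so adapt them to your own data set and file format.

```python
import csv
import sqlite3


def create_schema(conn):
    """Create the table. The table and column names are placeholders."""
    # Dropping first lets the script be re-run without duplicating rows.
    conn.execute("DROP TABLE IF EXISTS measurements")
    conn.execute(
        """
        CREATE TABLE measurements (
            id INTEGER PRIMARY KEY,
            category TEXT NOT NULL,
            year INTEGER,
            value REAL
        )
        """
    )


def load_csv(conn, csv_path):
    """Insert every row of the CSV file and return the number of rows loaded."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # Convert each field to the type the schema expects.
        rows = [(r["category"], int(r["year"]), float(r["value"])) for r in reader]
    # Parameterized inserts avoid quoting problems in the raw data.
    conn.executemany(
        "INSERT INTO measurements (category, year, value) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)


def main(db_path="project.db", csv_path="data.csv"):
    """Create the schema, then do all of the data loading."""
    conn = sqlite3.connect(db_path)
    try:
        create_schema(conn)
        count = load_csv(conn, csv_path)
        print(f"Loaded {count} rows into {db_path}")
    finally:
        conn.close()
```

Calling main() at the bottom of the script would run the whole load. If your data is in JSON or XML rather than CSV, only load_csv would need to change.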

Establishing Questions to Answer

After you’ve gotten familiar with your data, you should establish several questions that you want to answer using this data. We did some simple examples of this kind of thing in lab and lecture. You should select 4 questions or hypotheses that your data could help you answer, but that would be difficult to answer without a supporting data set. Make sure to choose questions that are not obvious without the data.

Data Processing and Visualization

Next, you should write several Python programs. For each question you decided on, you will write one Python program. Each program should generate visualizations to help uncover the answer to the related question. The visualizations should be done in Python using the matplotlib library. Each of these programs should display at least 2 varieties of visualization, and you should use the matplotlib layout options to get them to show up in the same window. You should make sure each visualization is well-crafted and well-designed, based on the principles and examples we’ve gone over in class. You should also make sure all axes and plots are clearly labeled.
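One possible way to put two kinds of plot in the same window is plt.subplots, sketched below. This is only an illustration: it assumes a hypothetical table measurements(category, year, value) in your SQLite database, and the function names, titles, and axis labels are placeholders to replace with ones that fit your own question.

```python
import sqlite3

import matplotlib.pyplot as plt


def fetch_yearly_totals(conn):
    """Return (years, totals) from the hypothetical measurements table."""
    rows = conn.execute(
        "SELECT year, SUM(value) FROM measurements GROUP BY year ORDER BY year"
    ).fetchall()
    years = [r[0] for r in rows]
    totals = [r[1] for r in rows]
    return years, totals


def build_figure(years, totals):
    """Draw a line plot and a bar chart side by side in one figure."""
    # One row, two columns: both plots share a single window.
    fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

    ax_line.plot(years, totals, marker="o")
    ax_line.set_title("Trend over time")
    ax_line.set_xlabel("Year")
    ax_line.set_ylabel("Total value")

    # String labels make matplotlib treat the years as categories.
    ax_bar.bar([str(y) for y in years], totals)
    ax_bar.set_title("Total per year")
    ax_bar.set_xlabel("Year")
    ax_bar.set_ylabel("Total value")

    fig.tight_layout()  # keep titles and labels from overlapping
    return fig
```

A program built this way would connect to the database, call fetch_yearly_totals and build_figure, and then call plt.show() to open the window.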

Report

In addition to all of the data processing, question development, and programming, you also should write a report. The report accounts for 40% of your grade for this project. All of the other steps account for the other 60%.

The report should summarize and document the following:

The report should be 3+ pages (including the visualizations). You should submit this as a PDF named report.pdf. You can create it using your editor of choice (perhaps Word, Google Docs, LaTeX, etc).

Submission

This is due on 5/2/2018 (the Wednesday right before reading day) at 11:59 pm. Thus, you have around 2 weeks to work on it. It will all be due at one time, but I have some suggested deadlines:

You may not use any late days for this project! You must submit your raw data files (csv, xml, json, etc.), the load_data.py script, the data visualization scripts, and the report file to the dropbox.