Benford's Law

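Benford’s Law is a mathematical law that describes how often each of the digits 1 through 9 appears as the first digit of the numbers in many naturally occurring data sets: small digits such as 1 lead far more often than large digits such as 9. I recommend that you watch this video before proceeding to get an explanation: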

Benford’s Law is useful for distinguishing naturally occurring data from randomized or made-up data. It has been used in the real world to detect election fraud (for example, in the 2009 Iranian election). It has also been used as evidence in criminal cases in the US.

In this PA, you’ll be writing a program that reads in a data set and prints out a plot of the first digits. Then, you can look at the plot to determine if it conforms to the law or not! Name your file benfords_law.py. You should organize the code into several functions: main, one for loading the file, one for counting the occurrences, and one for printing the plot.

The Input File

Your program should ask the user for the name of an input file, which your program should expect to be formatted as CSV. If you don’t know what a CSV file is, or forgot, go watch the video quiz that covered it! Shown below is an example of the program prompting the user for a file name, the typed file name (places.csv), and then the printed plot.

Data file name:
places.csv

1 | ###############################
2 | ###############
3 | ##########
4 | ##########
5 | ##########
6 | #####
7 | #####
8 | #####
9 | #####

Follows Benford's Law

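After opening the input file, the program should search through the CSV data for numerical values and collect them into a list of floating-point numbers. Any value that is not a number, such as a column name or a place name, should be skipped. For example, suppose places.csv has the following contents: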

region,population
pima,1234
georgia,145
steele,10
tampa,1700
greece,1729
rome,1711
milan,219
tucson,231
tuscany,20001
florence,301
nigeria,3879
newyork,404
phoenix,40123
belgium,505
madrid,502
nogales,601
brussels,712
tempe,81231
anthem,91231

After loading this file, the list of numbers should be:

numbers = [1234.0, 145.0, 10.0, 1700.0, 1729.0, 1711.0, 219.0, 231.0, 20001.0, 301.0, 
           3879.0, 404.0, 40123.0, 505.0, 502.0, 601.0, 712.0, 81231.0, 91231.0]
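If it helps to see the loading step as code, here is a minimal sketch of one way to do it. The function name load_numbers and the choice to test each field with float() are my assumptions for illustration, not a required design. It also assumes a simple CSV with no quoted fields that contain commas.

```python
import math


def load_numbers(file_name):
    """Return every numeric field in a CSV file as a list of floats."""
    numbers = []
    with open(file_name) as data_file:
        for line in data_file:
            # Split the line on commas and look at each field on its own.
            for field in line.strip().split(","):
                try:
                    value = float(field)
                except ValueError:
                    # Not numeric (for example "region" or "pima"), so skip it.
                    continue
                # float() also accepts text such as "nan" and "inf"; skip those.
                if math.isfinite(value):
                    numbers.append(value)
    return numbers
```

With the places.csv contents shown above, calling load_numbers("places.csv") should produce the 19-element list of floats.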

The Plot

In order to create the plot, you will first have to loop through the numbers list and count how many times a number starts with the digit 1, the digit 2, the digit 3, and so on up to 9. I recommend that you use a dictionary for this counting. If you have a floating-point number x, you can get the first digit as an int by doing int(str(x)[0]). Based on the places.csv data shown earlier, the counts dictionary should be as follows after counting:

counts = {1: 6, 2: 3, 3: 2, 4: 2, 5: 2, 6: 1, 7: 1, 8: 1, 9: 1}
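The counting step could be sketched roughly like this. The function name count_first_digits is hypothetical, and the sketch assumes every number is positive. A value below 1, such as 0.5, has a leading 0 and is simply not counted here, which is my assumption rather than a rule from this PA.

```python
def count_first_digits(numbers):
    """Count how many numbers start with each of the digits 1 through 9."""
    # Start every digit at zero so that all nine keys always exist.
    counts = {digit: 0 for digit in range(1, 10)}
    for x in numbers:
        first_digit = int(str(x)[0])
        # A leading 0 (as in 0.5) is not one of the nine keys, so it is ignored.
        if first_digit in counts:
            counts[first_digit] += 1
    return counts


numbers = [1234.0, 145.0, 10.0, 1700.0, 1729.0, 1711.0, 219.0, 231.0,
           20001.0, 301.0, 3879.0, 404.0, 40123.0, 505.0, 502.0, 601.0,
           712.0, 81231.0, 91231.0]
print(count_first_digits(numbers))
# prints {1: 6, 2: 3, 3: 2, 4: 2, 5: 2, 6: 1, 7: 1, 8: 1, 9: 1}
```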

If you forgot how to use a dictionary to count things, go watch the video quiz where I showed how to do so! After counting, loop through the numbers 1 through 9 and figure out the percentage that each occurs. You will use these percentages both to print out the bar chart, and to check if the data follows the law. The way that you would calculate the percentage for a particular digit, as an integer, is:

int((count_for_digit / length_of_numbers_list) * 100)

The number of # characters printed for a digit should equal the integer percentage of the numbers in the data that start with that digit. For example, in the places.csv data, 3 numbers start with the digit 2 out of a total of 19 numbers in the data set, so the percentage is int((3 / 19) * 100) = 15. Thus, 15 hashtags for 2. For each row of the plot, print out the digit, a vertical bar, and then the hashtags. The plot that should print based on the places.csv example is:

1 | ###############################
2 | ###############
3 | ##########
4 | ##########
5 | ##########
6 | #####
7 | #####
8 | #####
9 | #####
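Here is a rough sketch of the percentage and plotting steps, using the places.csv counts. The names digit_percentages and plot_lines are made up for this example, and the sketch assumes the data set is not empty (otherwise the division would fail). Building the rows as a list of strings first is just one design choice; printing each row directly works too.

```python
def digit_percentages(counts, total):
    """Return the integer percentage for each digit 1 through 9."""
    return {digit: int((counts[digit] / total) * 100) for digit in range(1, 10)}


def plot_lines(percentages):
    """Build one row per digit: the digit, a vertical bar, then the hashtags."""
    return [str(digit) + " | " + "#" * percentages[digit] for digit in range(1, 10)]


counts = {1: 6, 2: 3, 3: 2, 4: 2, 5: 2, 6: 1, 7: 1, 8: 1, 9: 1}
percentages = digit_percentages(counts, 19)
for line in plot_lines(percentages):
    print(line)
```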

Does it follow the Law?

The other thing you should determine is whether the data follows Benford’s Law. For the purposes of this PA, a data set follows Benford’s Law if, for every digit, the percentage of occurrences is at most 10 percentage points above and at most 5 percentage points below the percentage listed for that digit in the following table.

digit	percent
1	30%
2	17%
3	12%
4	9%
5	7%
6	6%
7	5%
8	5%
9	4%

If every digit follows, then print out Follows Benford's Law. Otherwise, print out Does not follow Benford's Law.
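One possible way to write this check is sketched below. The names EXPECTED and follows_benford are hypothetical, and treating the two limits as inclusive (exactly 5 below or exactly 10 above still counts as following) is my assumption, since the rule above does not say either way.

```python
# Expected percentage for each first digit, taken from the table above.
EXPECTED = {1: 30, 2: 17, 3: 12, 4: 9, 5: 7, 6: 6, 7: 5, 8: 5, 9: 4}


def follows_benford(percentages):
    """Return True if every digit is within minus 5 / plus 10 of its expected value."""
    for digit, expected in EXPECTED.items():
        # Assumption: the limits themselves are allowed (inclusive comparison).
        if not (expected - 5 <= percentages[digit] <= expected + 10):
            return False
    return True


places = {1: 31, 2: 15, 3: 10, 4: 10, 5: 10, 6: 5, 7: 5, 8: 5, 9: 5}
if follows_benford(places):
    print("Follows Benford's Law")
else:
    print("Does not follow Benford's Law")
```

For the places.csv percentages, every digit falls inside its allowed range, so this prints Follows Benford's Law.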

Examples

Population Data

The populations.csv file contains population information from many countries across the map. If you download this file and run your code on it, you should get:

Data file name:
populations.csv

1 | ##################################
2 | ###############
3 | ###########
4 | ########
5 | ########
6 | ######
7 | ####
8 | #####
9 | ####

Follows Benford's Law

Stock Data

The stocks.csv file contains open, max, min, and closing prices for stocks traded on the NYSE on 10/7/2019. If you run the code with this data, you should get:

Data file name:
stocks.csv

1 | ##############################
2 | #########################
3 | ##########
4 | #######
5 | #######
6 | #####
7 | ####
8 | ###
9 | ####

Follows Benford's Law

Random Data

The random_numbers.csv file contains a bunch of randomly generated numbers. Due to this, we should NOT expect it to follow Benford’s law. When the code is run with these numbers, you should get:

Data file name:
random_numbers.csv

1 | ##########
2 | ###########
3 | ############
4 | ############
5 | ##########
6 | ############
7 | #########
8 | ##########
9 | ###########

Does not follow Benford's Law

Turning it In

This is due on April 20th, 2021 at 7pm on Gradescope. Make sure that you follow the style guide rules.

CSc 110 - Benford’s Law