In this assignment, you will be writing a program that makes music recommendations based on songs/albums that a user is known to like.
You should call your program recommender.py.
Recommendation systems are used in a variety of ways in the real world to suggest products to users, typically based on their past purchasing or browsing activity.
If you are interested, you can read more about them on Wikipedia.
This program will use real Amazon music reviews to do the recommendations. The data set that you will be using came from this website: http://jmcauley.ucsd.edu/data/amazon/. Here is a direct download:
This file has over 60k Amazon music reviews by various users. The reviews span roughly 2006-2014. After downloading, try opening it up in a text editor and scrolling through it. It is large! Have you ever worked directly with a data set this large?
I also provide some smaller versions of this file. You can use these while testing your program, before trying it out on the large data set:
This data is in JSON format, which we have not talked about much in this course.
However, you don’t need to worry about parsing it in recommender.py.
I have written a Python script which reads in this data and loads it into a SQLite database.
Download this Python script by clicking below:
To get all of the data loaded into a database, do the following:
Download music.json and load_data.py.
Create a directory named recommender and place both of these files in it.
Run load_data.py. It might take a few seconds.
You should then see a file named review.db in this directory.
This is a sqlite3 database file, with the data loaded into a table named review.
To use a particular data file, just re-run load_data.py with the file of choice.
The schema of the created database is:
CREATE TABLE review(
asin TEXT,
reviewer_id TEXT,
review_text TEXT);
If you connect to this database and run SELECT * FROM review; you should see a bunch of rows!
This is the data that you’ll use for the project.
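As a quick sanity check, the table can be inspected from Python with the standard sqlite3 module. This is only a minimal sketch and is not part of the starter code: the helper names count_reviews and peek_reviews are made up for illustration, and the usage lines assume review.db sits in the current directory.

```python
import sqlite3


def count_reviews(connection):
    """Return the total number of rows in the review table."""
    return connection.execute("SELECT COUNT(*) FROM review").fetchone()[0]


def peek_reviews(connection, limit=3):
    """Return up to `limit` rows as (asin, reviewer_id, review_text) tuples."""
    cursor = connection.execute(
        "SELECT asin, reviewer_id, review_text FROM review LIMIT ?", (limit,)
    )
    return cursor.fetchall()


# Example usage (assumes review.db was already created by load_data.py):
#   connection = sqlite3.connect("review.db")
#   print(count_reviews(connection))
#   for row in peek_reviews(connection):
#       print(row)
#   connection.close()
```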
There are three columns in this table.
The first is asin. This stands for “Amazon Standard Identification Number”. These numbers are used to identify Amazon products. In fact, you can navigate directly to a product’s page on Amazon by adding its asin to an Amazon URL. The template is as follows:
https://www.amazon.com/dp/PUT_ID_HERE
For example, the asin of a Nest Learning Thermostat is B0131RG6VK.
Thus, if we go to https://www.amazon.com/dp/B0131RG6VK, it will take us to that page.
reviewer_id is just the ID of the user that gave this review.
review_text is the text of the review!
The next step is to write the actual recommender. I am providing you with a little bit of starter code:
This just has a few things to help you get started, but you’ll need to add lots of code to get it working.
The idea with the recommender is as follows:
A user lists the asins of products that they like in a file such as likes.txt. There will be one asin per line.
The user will run recommender.py, and tell the program where their likes.txt file is.
Let’s say you are a person who really likes Linkin Park. You create a likes.txt file with only one asin in it:
B00008H2LB
(Go to the amazon page to see what the product is).
Then, you run the recommender with this file. The run should result in something like:
file with list of liked products: likes.txt
Computing recommendations . . .
Recommended music based on your preference(s):
https://www.amazon.com/dp/B00008H2LB (Score=65743)
https://www.amazon.com/dp/B003V5PPZG (Score=23450)
https://www.amazon.com/dp/B000BKSISA (Score=9392)
https://www.amazon.com/dp/B00JYKU6BK (Score=7610)
https://www.amazon.com/dp/B0007NFL18 (Score=5592)
https://www.amazon.com/dp/B000021YQV (Score=5485)
https://www.amazon.com/dp/B00004XOWM (Score=4819)
https://www.amazon.com/dp/B00005AAFJ (Score=4270)
https://www.amazon.com/dp/B0000AGWFX (Score=4145)
https://www.amazon.com/dp/B00006690F (Score=3994)
The result is a list of the top-10 products that the recommender found, based on your like. A score is also listed - more on that later.
If you go to these URLs, you’ll notice a few things:
Several additional examples will be shown at the end of the spec. Your program’s top-10 recommendations do not need to match mine exactly. However, at least 5 of the products my program recommends must show up in the top-10 of yours.
Going from one (or several) asins and a large data set to a list of 10 recommendations might seem challenging. It might be, but I’m going to provide you with an outline of what to do, so don’t fear.
The steps to accomplish the recommendation are roughly as follows.
First, read in the asins from the likes.txt file.
Once you have each of these, loop through them.
For each one of the asins, do a SELECT query in the database to get all of the reviews associated with it.
You should put each of the review texts into one list.
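One possible way to sketch this first step is shown below. The function names (parse_likes, read_likes, liked_review_texts) are hypothetical, and skipping blank lines in likes.txt is an assumption rather than a requirement of the spec.

```python
import sqlite3


def parse_likes(lines):
    """Return the asins from an iterable of lines, skipping blank lines."""
    return [line.strip() for line in lines if line.strip()]


def read_likes(path):
    """Read a likes file that holds one asin per line."""
    with open(path, encoding="utf-8") as likes_file:
        return parse_likes(likes_file)


def liked_review_texts(connection, asins):
    """Collect the review_text of every review for the given asins into one list."""
    texts = []
    for asin in asins:
        rows = connection.execute(
            "SELECT review_text FROM review WHERE asin = ?", (asin,)
        )
        texts.extend(row[0] for row in rows)
    return texts


# Example usage (assumes likes.txt and review.db exist in the current directory):
#   connection = sqlite3.connect("review.db")
#   texts = liked_review_texts(connection, read_likes("likes.txt"))
```

Passing the asin as a query parameter (the ? placeholder) avoids building the SQL string by hand.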
Once you have all of the reviews, the next step is to determine all of the two-word phrases in the reviews. We will use the two-word phrases to help determine what other reviews are similar. For example, if a review says:
This album was amazing!
There would be three total two-word pairs: This album, album was, and was amazing.
You should build a long list of all of the two-word phrases in the liked product reviews.
In the starter code, I include a list of words called disclude.
If a two-word phrase includes one of these words, don’t include it in the results.
Once you have these two-word pairs, count how many times each occurs, and remove all that appear fewer than 7 times.
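A rough sketch of this step follows. It assumes that words are separated by whitespace and that punctuation is stripped from the ends of each word, which matches the “This album was amazing!” example; the spec does not say how to handle capitalization, so this sketch leaves it alone. The function names are hypothetical, and the disclude parameter stands in for the list from the starter code.

```python
import string
from collections import Counter


def two_word_phrases(text, disclude):
    """Return the adjacent two-word phrases in text, skipping any that use a discluded word."""
    # Assumption: split on whitespace and trim punctuation from the ends of each word.
    words = [word.strip(string.punctuation) for word in text.split()]
    words = [word for word in words if word]
    phrases = []
    for first, second in zip(words, words[1:]):
        if first in disclude or second in disclude:
            continue
        phrases.append(first + " " + second)
    return phrases


def frequent_phrases(review_texts, disclude, min_count=7):
    """Count phrases across all reviews and keep those seen at least min_count times."""
    counts = Counter()
    for text in review_texts:
        counts.update(two_word_phrases(text, disclude))
    return {phrase: count for phrase, count in counts.items() if count >= min_count}
```

For instance, two_word_phrases("This album was amazing!", set()) returns ["This album", "album was", "was amazing"].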
The next step is to group all of the review text for all of the other reviews in the data set. Basically, you want to build a dictionary that maps an asin (string) to a single string, which is all of the review text for that asin concatenated together. Once you have this, you’ll combine the results from steps two and three to find the similarities.
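A minimal sketch of this grouping step, assuming the review table schema shown earlier. The function name review_text_by_asin is hypothetical, and joining reviews with a single space is an arbitrary choice. This version keeps every asin, including the liked ones, since the sample output also lists the liked product itself.

```python
import sqlite3


def review_text_by_asin(connection):
    """Map each asin to one string holding all of its review text joined together."""
    grouped = {}
    query = "SELECT asin, review_text FROM review"
    for asin, review_text in connection.execute(query):
        # Treat a missing review_text as an empty string (an assumption about the data).
        grouped.setdefault(asin, []).append(review_text or "")
    return {asin: " ".join(texts) for asin, texts in grouped.items()}
```

Collecting the pieces in a list and joining once at the end avoids repeatedly rebuilding long strings, which matters with tens of thousands of reviews.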
Here is where you combine the results of step two (the two-word pairs with their counts) with the results of step three (all the review text for every product). For each product in the data set, you can compute a score for how highly the product should be recommended. For each two-word pair, check if it exists in the review text. For each one that does, multiply the number of times the two-word pair was seen in the “liked” product reviews by how many times it appears in the product’s review text. Sum all of these values together for all two-word pairs, and that is the “score” for that product.
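As an illustration of the scoring rule, here is a small sketch. It counts occurrences with str.count, which finds non-overlapping substring matches; that is one simple reading of “how many times it appears”, and not necessarily the one used to produce the sample scores. The names score_product and score_all are hypothetical.

```python
def score_product(review_text, phrase_counts):
    """Sum liked_count * occurrences over every phrase found in review_text."""
    score = 0
    for phrase, liked_count in phrase_counts.items():
        occurrences = review_text.count(phrase)  # substring matches, non-overlapping
        score += liked_count * occurrences
    return score


def score_all(text_by_asin, phrase_counts):
    """Return a dictionary mapping each asin to its score."""
    return {
        asin: score_product(text, phrase_counts)
        for asin, text in text_by_asin.items()
    }
```

As a worked example: if “great album” was seen 7 times in the liked reviews and appears twice in a product’s combined review text, that phrase contributes 7 * 2 = 14 to the product’s score.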
Print out URLs for the 10 products with the highest scores!
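The final step could be sketched as follows, assuming the scores are held in a dictionary that maps each asin to its score. The helper names are hypothetical; the URL template and the “(Score=...)” layout are taken from the sample output shown earlier.

```python
def top_recommendations(scores, limit=10):
    """Return the (asin, score) pairs with the highest scores, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


def format_recommendation(asin, score):
    """Build one output line using the Amazon URL template for an asin."""
    return f"https://www.amazon.com/dp/{asin} (Score={score})"


# Example usage, given a scores dictionary:
#   for asin, score in top_recommendations(scores):
#       print(format_recommendation(asin, score))
```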
This assignment will be due Monday, April 16th at 5:00pm. Turn it in to the D2L dropbox.
Running the same tests with the single Linkin Park album in the likes.txt file, except with the small data set, should produce the following suggestions:
file with list of liked products: likes.txt
Computing recommendations . . .
Recommended music based on your preference(s):
https://www.amazon.com/dp/B00008H2LB (Score=470)
https://www.amazon.com/dp/B003V5PPZG (Score=233)
https://www.amazon.com/dp/B0000AGWFX (Score=144)
https://www.amazon.com/dp/B000JVSZIY (Score=72)
https://www.amazon.com/dp/B00002MZ2C (Score=68)
https://www.amazon.com/dp/B000003AEK (Score=63)
https://www.amazon.com/dp/B000BKSISA (Score=60)
https://www.amazon.com/dp/B000002P64 (Score=54)
https://www.amazon.com/dp/B0000013GT (Score=45)
https://www.amazon.com/dp/B00000054H (Score=36)
https://www.amazon.com/dp/B00JYKU6BK (Score=34)
https://www.amazon.com/dp/B000000W2Z (Score=27)
Notice that these are relatively similar to the results from the large data set!
This is another example using the small data set.
If the contents of the likes.txt file are:
B00000017R
B00000064F
B000000OPC
B000000OYQ
B000000Y2R
B000001AJI
B0000024S9
B0000025F7
B0000025YM
B0000025Z4
B0000026IC
B0000026UV
B0000029AN
B000002AP1
Then the results should be:
file with list of liked products: likes.txt
Computing recommendations . . .
Recommended music based on your preference(s):
https://www.amazon.com/dp/B0000025F7 (Score=301)
https://www.amazon.com/dp/B000002AP1 (Score=209)
https://www.amazon.com/dp/B0000026WD (Score=199)
https://www.amazon.com/dp/B0000025RI (Score=148)
https://www.amazon.com/dp/B00000269M (Score=115)
https://www.amazon.com/dp/B00005O54Q (Score=86)
https://www.amazon.com/dp/B00004T9UF (Score=78)
https://www.amazon.com/dp/B00065XJ52 (Score=69)
https://www.amazon.com/dp/B00006690F (Score=66)
https://www.amazon.com/dp/B0000025Z4 (Score=59)
https://www.amazon.com/dp/B000FPYNQW (Score=55)
https://www.amazon.com/dp/B00000JWQH (Score=54)