CSC 250 - Recommender

In this assignment, you will write a program that makes music recommendations based on songs/albums that a user is known to like. You should call your program recommender.py. Recommendation systems are used in a variety of ways in the real world to suggest products to users, typically based on their past purchasing or browsing activity. If you are interested, you can read more about them on Wikipedia.

The Data

This program will use real Amazon music reviews to make the recommendations. The data set that you will be using came from this website: http://jmcauley.ucsd.edu/data/amazon/. Here is a direct download:

music.json (~64k lines)

This file has over 60k Amazon music reviews by various users. The reviews span roughly 2006-2014. After downloading, try opening it up in a text editor and scrolling through it. It is large! Have you ever worked directly with a data set this large?

I also provide some smaller versions of this file. You can use these while testing your program, before trying it out on the large data set:

music_med.json (~21k lines)

music_small.json (~6k lines)

This data is in JSON format, which we have not talked about much in this course. However, you don’t need to worry about parsing it in recommender.py. I have written a Python script that reads in this data and loads it into a SQLite database. Download this Python script by clicking below:

load_data.py

To use a particular data file, just re-run load_data.py with the file of choice. The schema of the created database is:

CREATE TABLE review(
    asin TEXT,
    reviewer_id TEXT,
    review_text TEXT);

If you connect to this database and run SELECT * FROM review;, you should see a bunch of rows! This is the data that you’ll use for the project.
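
As a quick sanity check, here is a minimal sketch of connecting with Python’s sqlite3 module and looking at a few rows. The name of the database file that load_data.py creates is not stated here, so this sketch builds an in-memory stand-in with the same schema and two made-up rows. The helper names count_reviews and sample_reviews are hypothetical, not part of the starter code.

```python
import sqlite3

def count_reviews(conn):
    """Return the number of rows in the review table."""
    return conn.execute("SELECT COUNT(*) FROM review").fetchone()[0]

def sample_reviews(conn, limit=3):
    """Return up to `limit` rows as (asin, reviewer_id, review_text) tuples."""
    cursor = conn.execute(
        "SELECT asin, reviewer_id, review_text FROM review LIMIT ?", (limit,)
    )
    return cursor.fetchall()

# Stand-in database (assumption): same schema as above, made-up rows.
# With the real data, pass the file created by load_data.py to sqlite3.connect.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE review(asin TEXT, reviewer_id TEXT, review_text TEXT)")
conn.executemany(
    "INSERT INTO review VALUES (?, ?, ?)",
    [
        ("A1", "u1", "great album"),
        ("B2", "u2", "not for me"),
    ],
)

print(count_reviews(conn), "reviews")
for row in sample_reviews(conn):
    print(row)
```

On the full data set, printing every row of SELECT * is slow to read through, so a COUNT(*) and a small LIMIT are usually enough to confirm the load worked.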

What are the columns?

The first is asin. This stands for “Amazon Standard Identification Number”. These numbers are used to identify Amazon products. In fact, you can navigate directly to a product’s page on Amazon by adding its asin to an Amazon URL. The template is as follows:

https://www.amazon.com/dp/PUT_ID_HERE

For example, the asin of a Nest Learning Thermostat is B0131RG6VK. Thus, if we go to: https://www.amazon.com/dp/B0131RG6VK, it will take us to that page.
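
Since the recommender prints product links, the template can be wrapped in a small helper. This is only a sketch, and the function name product_url is hypothetical.

```python
def product_url(asin):
    """Build the Amazon product page URL for an asin using the template above."""
    return "https://www.amazon.com/dp/" + asin

print(product_url("B0131RG6VK"))
```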

reviewer_id is just the ID of the user that gave this review. review_text is the text of the review!

The Recommender

The next step is to write the actual recommender. I am providing you with a little bit of starter code:

recommender.py

This just has a few things to help you get started, but you’ll need to add lots of code to get it working.

An Example

Let’s say you are a person who really likes Linkin Park. You create a likes.txt file with only one asin in it:

B00008H2LB

Then, you run the recommender with this file. The run should result in something like:

file with list of liked products: likes.txt
Computing recommendations . . .
Recommended music based on your preference(s):
https://www.amazon.com/dp/B00008H2LB  (Score=65743)
https://www.amazon.com/dp/B003V5PPZG  (Score=23450)
https://www.amazon.com/dp/B000BKSISA  (Score=9392)
https://www.amazon.com/dp/B00JYKU6BK  (Score=7610)
https://www.amazon.com/dp/B0007NFL18  (Score=5592)
https://www.amazon.com/dp/B000021YQV  (Score=5485)
https://www.amazon.com/dp/B00004XOWM  (Score=4819)
https://www.amazon.com/dp/B00005AAFJ  (Score=4270)
https://www.amazon.com/dp/B0000AGWFX  (Score=4145)
https://www.amazon.com/dp/B00006690F  (Score=3994)

The result is a list of the top-10 products that the recommender found, based on your like. A score is also listed - more on that later.

Several additional examples are shown at the end of the spec. Your program’s top-10 recommendations do not need to match mine exactly. However, at least 5 of the products my program recommends must show up in the top-10 of yours.

How does it work?

Going from one (or several) asins and a large data set to a list of 10 recommendations might seem challenging. It might be, but I’m going to provide you with an outline of what to do, so don’t fear.

Step 1

First, read in the asins from the likes.txt file. Once you have them, loop through them. For each asin, do a SELECT query in the database to get all of the reviews associated with it. You should put all of the review texts into one list.
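
One possible shape for this step is sketched below. It assumes the likes file has one asin per line, and it uses a parameterized query (the ? placeholder) rather than building the SQL string by hand. The function names, the in-memory stand-in database, and the temporary likes file are all assumptions made so that the sketch runs on its own.

```python
import os
import sqlite3
import tempfile

def read_liked_asins(path):
    """Read one asin per line, skipping blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def reviews_for_asins(conn, asins):
    """Collect the review_text of every review of the given asins into one list."""
    texts = []
    for asin in asins:
        rows = conn.execute(
            "SELECT review_text FROM review WHERE asin = ?", (asin,)
        )
        texts.extend(row[0] for row in rows)
    return texts

# Stand-in data for illustration only (made-up rows, not the real data set).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE review(asin TEXT, reviewer_id TEXT, review_text TEXT)")
conn.executemany(
    "INSERT INTO review VALUES (?, ?, ?)",
    [
        ("A1", "u1", "great album"),
        ("A1", "u2", "loved it"),
        ("B2", "u1", "not for me"),
    ],
)

# Write a temporary likes file so the sketch does not depend on a real one.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("A1\n\n")
    likes_path = f.name

liked = read_liked_asins(likes_path)
liked_reviews = reviews_for_asins(conn, liked)
os.remove(likes_path)
print(liked, liked_reviews)
```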

Step 2

Once you have all of the reviews, the next step is to determine all of the two-word phrases in the reviews. We will use the two-word phrases to help determine what other reviews are similar. For example, if a review says:

This album was amazing!

There would be three total two-word phrases: This album, album was, and was amazing. You should build a long list of all of the two-word phrases in the liked product reviews. In the starter code, I include a list of words called disclude. If a two-word phrase includes one of these words, don’t include it in the results. Once you have these two-word phrases, count how many times each occurs, and remove all that appear fewer than 7 times.
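
The phrase extraction and counting could be sketched roughly as follows. Stripping punctuation from the ends of each word is an assumption, chosen so that the example sentence yields "was amazing" rather than "was amazing!". Whether to also lowercase the text is left open here. The function names are hypothetical, and the disclude argument stands in for the list from the starter code, whose contents are not shown in this spec.

```python
import string
from collections import Counter

def two_word_phrases(text, disclude):
    """Return the adjacent word pairs in text, skipping pairs with a discluded word."""
    # Assumption: punctuation is stripped from the ends of each word.
    words = [w.strip(string.punctuation) for w in text.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    pairs = []
    for first, second in zip(words, words[1:]):
        if first in disclude or second in disclude:
            continue
        pairs.append(first + " " + second)
    return pairs

def count_phrases(review_texts, disclude, min_count=7):
    """Count phrases over all reviews, keeping those seen at least min_count times."""
    counts = Counter()
    for text in review_texts:
        counts.update(two_word_phrases(text, disclude))
    return {phrase: n for phrase, n in counts.items() if n >= min_count}

print(two_word_phrases("This album was amazing!", set()))
```

The threshold is a parameter here only so that small test inputs are easy to try; the spec asks for 7.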

Step 3

The next step is to group all of the review text for all of the other reviews in the data set. Basically, you want to build a dictionary that maps an asin (string) to a single string, which is all of the review text for that asin concatenated together. Once you have this, you’ll combine the results from Steps 2 and 3 to find the similarities.
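
A minimal sketch of building that dictionary is below. Joining the reviews with a single space and ordering by rowid are assumptions, as are the function name and the made-up rows. This sketch groups every product in the table and does not filter out the liked ones, which appears consistent with the sample output above, where the liked album shows up in its own recommendations.

```python
import sqlite3

def group_review_text(conn):
    """Map each asin to all of its review text joined into one string."""
    grouped = {}
    rows = conn.execute("SELECT asin, review_text FROM review ORDER BY rowid")
    for asin, text in rows:
        # Treat a missing review_text as empty (assumption).
        grouped.setdefault(asin, []).append(text or "")
    return {asin: " ".join(texts) for asin, texts in grouped.items()}

# Stand-in data for illustration only (made-up rows, not the real data set).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE review(asin TEXT, reviewer_id TEXT, review_text TEXT)")
conn.executemany(
    "INSERT INTO review VALUES (?, ?, ?)",
    [
        ("A1", "u1", "great album"),
        ("B2", "u1", "not for me"),
        ("A1", "u2", "loved it"),
    ],
)

text_by_asin = group_review_text(conn)
print(text_by_asin)
```

Collecting the pieces in a list and joining once at the end avoids repeatedly copying a growing string, which matters on the large data set.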

Step 4

Here is where you combine the results of Step 2 (the two-word phrases with their counts) with the results of Step 3 (all the review text for every product). For each product in the data set, you can compute a score for how highly the product should be recommended. For each two-word phrase, check if it exists in the product’s review text. For each one that does, multiply the number of times the phrase was seen in the “liked” product reviews by how many times it appears in the product’s review text. Sum these values over all two-word phrases, and that is the “score” for that product.
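
The scoring rule can be sketched as below. Using str.count to count occurrences is an assumption. It counts non-overlapping substring matches, so "great album" also matches inside "great albums". The function names and the sample counts are hypothetical.

```python
def score_product(review_text, phrase_counts):
    """Sum (count in liked reviews) * (count in this product's text) over all phrases."""
    score = 0
    for phrase, liked_count in phrase_counts.items():
        occurrences = review_text.count(phrase)  # substring matches (assumption)
        score += liked_count * occurrences
    return score

def score_all(text_by_asin, phrase_counts):
    """Return a dictionary mapping each asin to its score."""
    return {
        asin: score_product(text, phrase_counts)
        for asin, text in text_by_asin.items()
    }

# Made-up inputs for illustration only.
phrase_counts = {"great album": 7, "linkin park": 10}
text_by_asin = {
    "A1": "great album by linkin park, a great album",
    "B2": "not for me",
}
scores = score_all(text_by_asin, phrase_counts)
print(scores)
```

For product A1, "great album" appears twice (7 * 2 = 14) and "linkin park" once (10 * 1 = 10), for a score of 24. Product B2 contains neither phrase, so its score is 0.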

Step 5

Submission

This assignment will be due Monday, April 16th at 5:00pm. Turn it in to the D2L dropbox.

Example B

Running the same tests with the single Linkin Park album in the likes.txt file, except with the small data set, should produce the following suggestions:

file with list of liked products: likes.txt
Computing recommendations . . .
Recommended music based on your preference(s):
https://www.amazon.com/dp/B00008H2LB  (Score=470)
https://www.amazon.com/dp/B003V5PPZG  (Score=233)
https://www.amazon.com/dp/B0000AGWFX  (Score=144)
https://www.amazon.com/dp/B000JVSZIY  (Score=72)
https://www.amazon.com/dp/B00002MZ2C  (Score=68)
https://www.amazon.com/dp/B000003AEK  (Score=63)
https://www.amazon.com/dp/B000BKSISA  (Score=60)
https://www.amazon.com/dp/B000002P64  (Score=54)
https://www.amazon.com/dp/B0000013GT  (Score=45)
https://www.amazon.com/dp/B00000054H  (Score=36)
https://www.amazon.com/dp/B00JYKU6BK  (Score=34)
https://www.amazon.com/dp/B000000W2Z  (Score=27)

Example C

This is another example using the small data set. The contents of the likes.txt file are listed first, followed by the output the run should produce:

B00000017R
B00000064F
B000000OPC
B000000OYQ
B000000Y2R
B000001AJI
B0000024S9
B0000025F7
B0000025YM
B0000025Z4
B0000026IC
B0000026UV
B0000029AN
B000002AP1

file with list of liked products: likes.txt
Computing recommendations . . .
Recommended music based on your preference(s):
https://www.amazon.com/dp/B0000025F7  (Score=301)
https://www.amazon.com/dp/B000002AP1  (Score=209)
https://www.amazon.com/dp/B0000026WD  (Score=199)
https://www.amazon.com/dp/B0000025RI  (Score=148)
https://www.amazon.com/dp/B00000269M  (Score=115)
https://www.amazon.com/dp/B00005O54Q  (Score=86)
https://www.amazon.com/dp/B00004T9UF  (Score=78)
https://www.amazon.com/dp/B00065XJ52  (Score=69)
https://www.amazon.com/dp/B00006690F  (Score=66)
https://www.amazon.com/dp/B0000025Z4  (Score=59)
https://www.amazon.com/dp/B000FPYNQW  (Score=55)
https://www.amazon.com/dp/B00000JWQH  (Score=54)
