Recommender systems are a very interesting application of data analysis. Amazon, Flipkart and many other websites use them to show you “Products you might like”. The system tries to work out which products a user might like to purchase based on his/her previous purchase history. This post explains how a Recommender system works and gives you an overview of building one using Python.
This post covers two kinds of collaborative-filtering Recommender systems – ‘User-Based Collaborative Filtering’ and ‘Item-Based Collaborative Filtering’. It covers the theoretical aspects as well as a practical implementation on a demo data set with Python. First, let's try to understand the basics of both –
1. User-Based Collaborative Filtering
- Build a matrix of things each user bought/viewed/rated
- Compute similarity scores between users
- Find users similar to you
- Recommend stuff they bought/viewed/rated that you haven’t yet.
Let me explain this with an example. Let's say there are two people, A and B. A has bought Puma shoes as well as a Puma T-shirt. B has bought only Puma shoes. We will say users A and B are similar because both have bought Puma shoes. Now, since A has also bought a Puma T-shirt, the T-shirt might be a good recommendation for B.
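The four steps above can be sketched on the Puma example. This is only a minimal illustration, not a production method: the third user C, the ‘Nike cap’ product, the Jaccard similarity measure and the helper names jaccard and recommend are all assumptions made for the sketch.

```python
import pandas as pd

# Hypothetical purchase matrix: rows are users, columns are products,
# 1 means "bought". User C and the cap are invented to provide a non-match.
purchases = pd.DataFrame(
    {
        "Puma shoes": [1, 1, 0],
        "Puma T-shirt": [1, 0, 0],
        "Nike cap": [0, 0, 1],
    },
    index=["A", "B", "C"],
)


def jaccard(u, v):
    """Share of the products bought by either user that both users bought."""
    both = ((u == 1) & (v == 1)).sum()
    either = ((u == 1) | (v == 1)).sum()
    return both / either if either else 0.0


def recommend(user, purchases):
    """Return the most similar other user and what they bought that `user` has not."""
    others = purchases.drop(index=user)
    sims = others.apply(lambda row: jaccard(purchases.loc[user], row), axis=1)
    nearest = sims.idxmax()
    mask = (purchases.loc[nearest] == 1) & (purchases.loc[user] == 0)
    return nearest, list(mask[mask].index)


nearest, items = recommend("B", purchases)
print(nearest, items)  # A ['Puma T-shirt']
```

A real system would usually combine several similar users rather than only the nearest one.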
But there are a few disadvantages to User-Based Collaborative Filtering –
- People are fickle; tastes change. Therefore, comparing people to people might not be a good idea.
- There are usually many more people than things. Therefore, comparing users takes more computation than comparing items.
2. Item-Based Collaborative Filtering
I will explain it with the example of a movie Recommender system such as the one used by Netflix –
- Find every pair of movies that were watched by the same person.
- Measure the similarity of their ratings across all users who watched both
- Sort by movie, then by similarity strength.
P.S.: This is just one way of doing it.
Let me explain this with an example. Let us say A has watched two movies, ‘3 idiots’ and ‘PK’, and rated both 5 stars. B has also watched the same two movies and also rated them 5 stars. Based on these two users, we would say there is some sort of relationship between ‘3 idiots’ and ‘PK’. Now, C has watched ‘3 idiots’ but has not watched ‘PK’. Since we have computed the relationship between ‘3 idiots’ and ‘PK’ from the tastes of A and B, and given that C liked ‘3 idiots’, there is a high probability that C will also like ‘PK’. Hence, here we looked at the relationship between items and not between people.
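Here is a small sketch of those steps on the ‘3 idiots’ and ‘PK’ example. The ratings are made up, user D and the third film are hypothetical additions for contrast, and cosine similarity is used here as a stand-in measure, since two users who both give 5 stars have no variance for a correlation to work with.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Made-up ratings: rows are users, columns are movies, NaN means "not watched".
ratings = pd.DataFrame(
    {
        "3 idiots": [5, 5, 5, 1],
        "PK": [5, 5, np.nan, 1],
        "Some other film": [1, 2, np.nan, 5],
    },
    index=["A", "B", "C", "D"],
)


def cosine(x, y):
    """Cosine similarity between two equally long rating vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


rows = []
for m1, m2 in combinations(ratings.columns, 2):
    both = ratings[[m1, m2]].dropna()  # users who rated both movies
    if len(both) < 2:
        continue  # too few shared raters to compare
    rows.append((m1, m2, cosine(both[m1].to_numpy(), both[m2].to_numpy())))

# Sort by movie, then by similarity strength
pairs = pd.DataFrame(rows, columns=["movie", "other", "similarity"])
pairs = pairs.sort_values(["movie", "similarity"], ascending=[True, False])
print(pairs)

# C has only watched '3 idiots': pick the most similar movie C has not seen
seen = ratings.loc["C"].dropna().index
candidates = pairs[(pairs["movie"] == "3 idiots") & (~pairs["other"].isin(seen))]
best = candidates.iloc[0]["other"]
print(best)
```

With these toy numbers, ‘PK’ comes out as the closest match to ‘3 idiots’, so it is what C would be shown.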
Now let’s come to the implementation of a Recommender System using Python. The prerequisites for this project are –
- Python with packages pandas and numpy.
- ‘MovieLens’ data set.
About the data set: we have the following two files –
- u.item – It consists of two columns, movie_id and movie_name. The total number of rows is 1682, hence 1682 movies.
- u.data – It consists of three columns, user_id, movie_id and rating. The total number of rows is 100000, hence 100000 ratings.
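To get a feel for how the two files fit together, here is a tiny sketch that reads both layouts from in-memory strings instead of the real files. The rating rows are invented for illustration, and a comma separator is assumed for the ratings.

```python
import io

import pandas as pd

# Stand-ins for the two files; the rating rows are invented for illustration.
item_text = "1|Toy Story (1995)\n2|GoldenEye (1995)\n"
data_text = "10,1,5\n10,2,3\n11,1,4\n"

movies = pd.read_csv(io.StringIO(item_text), sep="|", names=["movie_id", "title"])
ratings = pd.read_csv(io.StringIO(data_text), sep=",",
                      names=["user_id", "movie_id", "rating"])

# Attach the title to every rating, then build a user x movie matrix
merged = pd.merge(movies, ratings, on="movie_id")
matrix = merged.pivot_table(index="user_id", columns="title", values="rating")
print(matrix)
```

Each row of the matrix is one user and each column is one movie; a NaN marks a movie the user has not rated.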
First, I will build a simple Recommender System that recommends movies to a person who has rated Star Wars.
Here is the code:
    import pandas as pd
    import numpy as np

    # Load the ratings: user_id, movie_id, rating
    r_cols = ['user_id', 'movie_id', 'rating']
    ratings = pd.read_csv('u_data_new.csv', sep=',', names=r_cols, usecols=range(3), encoding='ISO-8859-1')

    # Load the movie titles: movie_id, title
    m_cols = ['movie_id', 'title']
    movies = pd.read_csv('u.item', sep='|', names=m_cols, usecols=range(2), encoding='ISO-8859-1')

    ratings = pd.merge(movies, ratings, on='movie_id')

    # One row per user, one column per movie; NaN where the user has not rated the movie
    movieRatings = ratings.pivot_table(index=['user_id'], columns=['title'], values='rating')

    # Correlate every movie's ratings with the Star Wars ratings
    starWarsRatings = movieRatings['Star Wars (1977)']
    similarMovies = movieRatings.corrwith(starWarsRatings)
    similarMovies = similarMovies.dropna()
    similarMovies = similarMovies.sort_values(ascending=False)
    #print(similarMovies)
    # The ranking above is not reliable on its own: a movie rated by only a few
    # users can correlate strongly with Star Wars by chance.
    # So keep only the movies that have at least 100 ratings.

    movieStats = ratings.groupby('title')['rating'].agg(['count', 'mean'])
    popularMovies = movieStats[movieStats['count'] >= 100]
    popularMovies = popularMovies.sort_values(by='mean', ascending=False)
    #print(popularMovies)

    # Join the rating statistics with the similarity scores
    df = pd.concat([popularMovies, similarMovies], axis=1)
    df = df.dropna()
    df.columns = ['rating_size', 'rating_mean', 'similarity']
    df = df.sort_values(by='similarity', ascending=False)
    print(df.head(11))
The df will hold every movie with at least 100 ratings, along with its similarity score compared with Star Wars. The output of the code (titles and the similarity column of df) is as follows:
Star Wars (1977) 1.000000
Empire Strikes Back, The (1980) 0.748353
Return of the Jedi (1983) 0.672556
Raiders of the Lost Ark (1981) 0.536117
Austin Powers: International Man of Mystery (1997) 0.377433
Sting, The (1973) 0.367538
Indiana Jones and the Last Crusade (1989) 0.350107
Pinocchio (1940) 0.347868
Frighteners, The (1996) 0.332729
L.A. Confidential (1997) 0.319065
Wag the Dog (1997) 0.318645
Dumbo (1941) 0.317656
This shows the movies most similar to Star Wars, which should be recommended to the user. The similarity score of Star Wars itself is 1 because Star Wars is perfectly similar to itself. Hence, the top movie to be recommended to any user who has rated Star Wars is “Empire Strikes Back”.
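The filter on the number of ratings in the code above matters more than it may seem. Here is a small sketch with invented ratings (the film names and the threshold of 3 are assumptions for the toy data) showing how a movie rated by only two users can look perfectly similar:

```python
import numpy as np
import pandas as pd

# Invented ratings from five users; "Obscure film" was rated by only two of them.
ratings = pd.DataFrame(
    {
        "Target": [5, 4, 1, 2, 5],
        "Popular film": [4, 5, 2, 1, 4],
        "Obscure film": [5, np.nan, 1, np.nan, np.nan],
    }
)

# Correlate every column with the target movie
similarity = ratings.corrwith(ratings["Target"])
print(similarity)

# Keep only the movies with enough ratings (3 here; the post uses 100)
counts = ratings.count()
reliable = similarity[counts >= 3].drop("Target")
print(reliable)
```

Two points always lie on a straight line, so the obscure film gets a correlation of 1 with the target even though that says very little. After the filter, only the popular film remains as a candidate.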
Here is the full item-based CF code:
    # Real item-based collaborative filtering
    import pandas as pd
    import numpy as np

    r_cols = ['user_id', 'movie_id', 'rating']
    ratings = pd.read_csv('u_data_new.csv', sep=',', names=r_cols, usecols=range(3), encoding='ISO-8859-1')
    m_cols = ['movie_id', 'title']
    movies = pd.read_csv('u.item', sep='|', names=m_cols, usecols=range(2), encoding='ISO-8859-1')
    #print(ratings.head())
    #print(movies.head())

    ratings = pd.merge(movies, ratings, on='movie_id')
    userRatings = ratings.pivot_table(index=['user_id'], columns=['title'], values='rating')
    #print(userRatings.head())

    # Correlate every movie with every other movie, ignoring pairs of movies
    # that fewer than 100 users have both rated
    corrMatrix = userRatings.corr(method='pearson', min_periods=100)
    #print(corrMatrix.head())

    # Ratings of the 1st user (the first row of the matrix)
    myRatings = userRatings.iloc[0].dropna()

    # Go through each movie rated by user1 and build up a list of possible
    # recommendations from the movies similar to it in the correlation matrix.
    # Each similarity is scaled by the rating user1 gave, so movies similar to
    # ones user1 liked count more than movies similar to ones user1 hated.
    simParts = []
    for title, rating in myRatings.items():
        print("Adding sims for " + title + "...")
        sims = corrMatrix[title].dropna()
        simParts.append(sims * rating)  # scale similarity by user1's rating
    simCandidates = pd.concat(simParts)

    print("Sorting...")
    # A movie can appear once per rated movie, so add up its scores
    simCandidates = simCandidates.groupby(simCandidates.index).sum()
    simCandidates = simCandidates.sort_values(ascending=False)
    print(simCandidates.head(10))

    # Drop the movies already watched (i.e. rated) by user1
    filteredSims = simCandidates.drop(myRatings.index, errors='ignore')
    print(filteredSims)
The output lists the movie titles along with their similarity scores, based on the movies rated by user1. The top 10 results are as follows:
Return of the Jedi (1983) 7.178172
Raiders of the Lost Ark (1981) 5.519700
Indiana Jones and the Last Crusade (1989) 3.488028
Bridge on the River Kwai, The (1957) 3.366616
Back to the Future (1985) 3.357941
Sting, The (1973) 3.329843
Cinderella (1950) 3.245412
Field of Dreams (1989) 3.222311
Wizard of Oz, The (1939) 3.200268
Dumbo (1941) 2.981645
This shows that, based on the movies rated by user1, the top movie to recommend to user1 is “Return of the Jedi”.
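To see where scores like these come from, here is the scoring step on its own with a tiny invented similarity matrix. The film names, the similarity values and the two ratings are all made up for the sketch.

```python
import pandas as pd

# Invented item-item similarities between four films
titles = ["Film A", "Film B", "Film C", "Film D"]
corr = pd.DataFrame(
    [
        [1.0, 0.2, 0.9, 0.1],
        [0.2, 1.0, 0.1, 0.9],
        [0.9, 0.1, 1.0, 0.3],
        [0.1, 0.9, 0.3, 1.0],
    ],
    index=titles,
    columns=titles,
)

# The user loved Film A and disliked Film B
my_ratings = pd.Series({"Film A": 5, "Film B": 1})

# Scale each rated film's similarities by the rating it received
parts = [corr[title] * rating for title, rating in my_ratings.items()]

# Add up the scores per film, drop what the user has already rated, and sort
scores = pd.concat(parts).groupby(level=0).sum()
scores = scores.drop(my_ratings.index).sort_values(ascending=False)
print(scores)
```

Film C scores 0.9 × 5 + 0.1 × 1 = 4.6 and Film D scores 0.1 × 5 + 0.9 × 1 = 1.4, so the film similar to the one the user loved wins. This also explains why the scores in the output above can be larger than 1: they are sums of scaled similarities, not correlations.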
Please note that this is just one way of doing it, and the results can certainly be improved a lot. I hope you got a basic idea of how a Recommender system works. If you want to discuss anything related to this topic, you can ping me or comment below.
If you liked this, share it with your friends and let them know the power of Python!