First Machine Learning Project

You might have heard a lot about machine learning and data analysis these days. If you are wondering what the benefits of Machine Learning are and how it is used in the real world, then this post is for you. Here I will present a very basic and interesting Machine Learning project which you can build yourself. I am sure this will give you a good start in Machine Learning with Python.

Prerequisites for this project – Python installed, along with the Pandas and Scikit-learn modules.

Let's first discuss the problem statement – given the mass, height, width and colour of a particular fruit, we want to predict the name of the fruit. For this, we need to already know the mass, height, width and colour of different fruits. Using that data we will build our Machine Learning system, which can then take any mass, height, width and colour and return the name of the fruit.

Here is the sample data –

fruit_label  fruit_name  fruit_subtype  mass  width  height  color_score
1            apple       granny_smith   192   8.4    7.3     0.55
1            apple       granny_smith   180   8.0    6.8     0.59
1            apple       granny_smith   176   7.4    7.2     0.60
2            mandarin    mandarin       86    6.2    4.7     0.80
2            mandarin    mandarin       84    6.0    4.6     0.79

We have a dataset of 60 such rows. You can download it from fruit_data_with_colors (right click and save the file).

Now we will use this dataset to train our system, which can then predict the name of the fruit from an input mass, width and height. This is called Supervised Learning.

In Supervised Learning, we split our dataset into two sets called Training data and Testing data in a 75%/25% ratio, i.e. 75% of our data will be used as Training data and the remaining 25% as Testing data. (Out of 60 rows, 45 rows will be used to train our system for prediction and the remaining 15 rows will be used to test the accuracy of the trained system.)

Step 1: Import modules and read the data

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split

fruits = pd.read_table('fruit_data_with_colors.txt')
print(fruits.head())
   fruit_label fruit_name fruit_subtype  mass  width  height  color_score
0            1      apple  granny_smith   192    8.4     7.3         0.55
1            1      apple  granny_smith   180    8.0     6.8         0.59
2            1      apple  granny_smith   176    7.4     7.2         0.60
3            2   mandarin      mandarin    86    6.2     4.7         0.80
4            2   mandarin      mandarin    84    6.0     4.6         0.79

Step 2: Split the data into Train and Test data. 'X' holds the input features from which the prediction is made. 'y' is the output, i.e. the fruit label to be predicted. The following splits the data into the default 75%/25% train/test ratio.

X = fruits[['mass', 'width', 'height']]
y = fruits['fruit_label']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
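
You can verify the split sizes – with 60 rows, this should show 45 training rows and 15 test rows:

print(X_train.shape, X_test.shape)   # expect (45, 3) and (15, 3)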

Step 3: This is the most important step. We will apply the K-Nearest Neighbours (KNN) classification algorithm. It's pretty easy to understand. Consider the picture described below.

Let's assume the red dots represent apples and the green dots represent oranges, plotted with mass on the x-axis and width on the y-axis. The red and green points belong to the Train dataset. Now say we get a new point, marked as a star, whose fruit we don't know. We only know its mass and width, so we plot it on the same axes. As per the KNN algorithm, the blue star is most similar to the red dots (apples) because it is nearest to them. KNN uses a parameter K which takes integer values: K=1 means only the single point closest to the blue star is used for the prediction, while K=3 means the three closest points are checked and the majority class among them wins.

Hence, the new data point (star) will be predicted as a red dot (apple).
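
The image isn't reproduced here, but since matplotlib is imported above, you can draw a similar picture yourself. Here is a minimal sketch (the file and column names are from the dataset above):

import matplotlib.pyplot as plt
import pandas as pd

fruits = pd.read_table('fruit_data_with_colors.txt')
# One colour per fruit; mass on the x-axis, width on the y-axis
for name, group in fruits.groupby('fruit_name'):
    plt.scatter(group['mass'], group['width'], label=name)
plt.xlabel('mass (g)')
plt.ylabel('width (cm)')
plt.legend()
plt.show()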

This can be done with just a few lines of Python:

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train, y_train)

We can check the accuracy of our model using the test data as follows.

print(knn.score(X_test, y_test))

It will give 0.53, which shows that our system is ~53% accurate.
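
53% is not great. One easy experiment is to try different values of K and see how the test accuracy changes. A small sketch, assuming the variables from the steps above are in scope:

from sklearn.neighbors import KNeighborsClassifier

# Try several values of K and print the test accuracy for each
for k in [1, 3, 5, 7, 9]:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(X_train, y_train)
    print(k, knn_k.score(X_test, y_test))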

Step 4: Let's now predict a fruit by giving some arbitrary mass, width and height as input. First example: a small fruit with mass 20 g, width 4.3 cm, height 5.5 cm. This is how we do the prediction using knn.

fruit_prediction = knn.predict([[20, 4.3, 5.5]])
print(fruit_prediction[0])

The output will be the digit 2, which is the fruit_label corresponding to the 'mandarin' fruit.
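
To turn that label back into a readable name, you can build a small lookup dictionary from the dataset (the complete code below uses the same trick):

# Map each fruit_label to its fruit_name, e.g. {1: 'apple', 2: 'mandarin', ...}
lookup_fruit_name = dict(zip(fruits.fruit_label.unique(), fruits.fruit_name.unique()))
print(lookup_fruit_name[fruit_prediction[0]])   # prints 'mandarin'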

Similarly, you can now predict a fruit by passing any mass, width and height to the knn.predict function.

This is the complete code:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split

fruits = pd.read_table('fruit_data_with_colors.txt')
print(fruits.head())

lookup_fruit_name = dict(zip(fruits.fruit_label.unique(), fruits.fruit_name.unique()))
print(lookup_fruit_name)

# For this example, we use the mass, width, and height features of each fruit instance
X = fruits[['mass', 'width', 'height']]
y = fruits['fruit_label']
# default is 75% / 25% train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))

# first example: a small fruit with mass 20g, width 4.3 cm, height 5.5 cm
fruit_prediction = knn.predict([[20, 4.3, 5.5]])
print(lookup_fruit_name[fruit_prediction[0]])

# second example: a larger, elongated fruit with mass 100g, width 6.3 cm, height 8.5 cm
fruit_prediction = knn.predict([[100, 6.3, 8.5]])
print(lookup_fruit_name[fruit_prediction[0]])

# plot_fruit_knn is a plotting helper defined in the separate
# adspy_shared_utilities file, which must sit alongside this script
from adspy_shared_utilities import plot_fruit_knn
plot_fruit_knn(X_train, y_train, 5, 'uniform') # we choose 5 nearest neighbors

I hope this post has given you a basic idea of what exactly Machine Learning is and how it is used in the real world.


Web Scraping Amazon Products with Python

This post is about basic web scraping with Python: scraping Amazon products' name, price and availability every day automatically and saving them in a CSV file. Now, you might ask why scrape Amazon product details? Well, you can do a lot of things with scraped Amazon product details –

  • Monitor an item for a change in price, stock, rating etc.
  • Analyse how a particular brand is selling on Amazon.
  • Get an email whenever the price of an item drops.
  • Or anything else you can think of.

Let’s begin with our first Python scraper.

First, you must know that all Amazon products are identified by an ASIN (Amazon Standard Identification Number), a unique 10-character code of letters and/or numbers which Amazon uses to keep track of products in its database. An ASIN looks like B01MG4G1N4.

An Amazon product page is identified by a link like this –> “http://www.amazon.in/dp/B01MG4G1N4/” (www.amazon.in/dp/<ASIN>).

After collecting the ASINs of the products you want to scrape, we will download the HTML of each product’s page and identify the XPaths for the data elements we need – e.g. product title, price, description etc.
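
If XPaths are new to you, here is a tiny self-contained example of how lxml evaluates them. The HTML snippet and its element ids are made up for illustration, not Amazon's actual markup:

from lxml import html

snippet = '<html><body><h1 id="title">Example Product</h1><span id="ourprice">Rs. 999</span></body></html>'
doc = html.fromstring(snippet)
print(doc.xpath('//h1[@id="title"]//text()'))                 # ['Example Product']
print(doc.xpath('//span[contains(@id,"ourprice")]/text()'))   # ['Rs. 999']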

PYTHON CODE PREREQUISITES:

  • Python 3
  • Python Requests module
  • Python LXML module

Here is the code, which scrapes each product's name, sale price, original price and availability. You can use this code directly and make changes as per your needs. This is the best way to learn.

from lxml import html
import os
import requests
from time import sleep
import pandas as pd

def Amazon_Parser(url):
    # A browser-like User-Agent helps avoid being served a captcha page
    headers = {'User-Agent': 'Mozilla/5.0'}
    page = requests.get(url, headers=headers)
    doc = html.fromstring(page.content)
    XPATH_NAME = '//h1[@id="title"]//text()'
    XPATH_SALE_PRICE = '//span[contains(@id,"ourprice") or contains(@id,"saleprice")]/text()'
    XPATH_ORIGINAL_PRICE = '//td[contains(text(),"List Price") or contains(text(),"M.R.P") or contains(text(),"Price")]/following-sibling::td/text()'
    XPATH_AVAILABILITY = '//div[@id="availability"]//text()'

    RAW_NAME = doc.xpath(XPATH_NAME)
    RAW_SALE_PRICE = doc.xpath(XPATH_SALE_PRICE)
    RAW_ORIGINAL_PRICE = doc.xpath(XPATH_ORIGINAL_PRICE)
    RAW_AVAILABILITY = doc.xpath(XPATH_AVAILABILITY)

    # Collapse the raw text nodes into clean single strings
    NAME = ' '.join(''.join(RAW_NAME).split()) if RAW_NAME else None
    SALE_PRICE = ' '.join(''.join(RAW_SALE_PRICE).split()).strip() if RAW_SALE_PRICE else None
    ORIGINAL_PRICE = ''.join(RAW_ORIGINAL_PRICE).strip() if RAW_ORIGINAL_PRICE else None
    AVAILABILITY = ''.join(RAW_AVAILABILITY).strip() if RAW_AVAILABILITY else None

    # If no separate list price is shown, fall back to the sale price
    if not ORIGINAL_PRICE:
        ORIGINAL_PRICE = SALE_PRICE

    #if page.status_code != 200:
    #    raise ValueError('captcha')
    data = {
            'NAME':[NAME],
            'SALE_PRICE':[SALE_PRICE],
            'ORIGINAL_PRICE':[ORIGINAL_PRICE],
            'AVAILABILITY':[AVAILABILITY],
            'URL':[url],
           }
    print(data)
    df = pd.DataFrame.from_dict(data)
    return df

def Read_Data():
    AsinList = ['B01MG4G1N4','B071XTFP66','B00RJU3RVS','B01F6WX6LQ']
    df = pd.DataFrame()
    for i in AsinList:
        url = "http://www.amazon.in/dp/"+i
        print("processing: "+url)
        data = Amazon_Parser(url)
        df = pd.concat([df, data], ignore_index=True)  # df.append() was removed in pandas 2.0
        sleep(2)  # be polite: pause between requests
    print(df)
    # Write the header row only when the file is first created
    file_exists = os.path.isfile('Gaming_Consoles_Data.csv')
    with open('Gaming_Consoles_Data.csv','a') as f:
        df.to_csv(f, header=not file_exists, index=False)

if __name__ == "__main__":
    Read_Data()

An output CSV file named “Gaming_Consoles_Data.csv” will be created in the same directory as the script. It will contain five columns named NAME, SALE_PRICE, ORIGINAL_PRICE, AVAILABILITY and URL, as defined by this section of the above code –

data = {
         'NAME':[NAME],
         'SALE_PRICE':[SALE_PRICE],
         'ORIGINAL_PRICE':[ORIGINAL_PRICE],
         'AVAILABILITY':[AVAILABILITY],
         'URL':[url],
        }

Note that whenever you run the above code, the product data is appended to the CSV file. This way, we can run it every day and the data will keep accumulating in the same file. Later, say after a month, we can analyse it for any price changes. But running the script manually every day is not a good idea, so we will automate running the script.
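
Once a few days of data have accumulated, a small sketch like this could read the file back and inspect one product's price history (assuming the CSV written by the script above, with the header row from its first run):

import pandas as pd

history = pd.read_csv('Gaming_Consoles_Data.csv')
# All the rows scraped over time for one product
one = history[history['URL'] == 'http://www.amazon.in/dp/B01MG4G1N4']
print(one[['NAME', 'SALE_PRICE', 'AVAILABILITY']])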

SCRIPT RUNNING AUTOMATION

Pre-Requisite :

  • Installed Linux/Ubuntu
  • basic knowledge of using terminal commands.
  • Installed pip (and a virtual environment for the script's modules)

I will be using the built-in cron feature of Ubuntu to automate running the script. Before that, let me tell you what cron is. CRON is a system daemon used to execute desired tasks (in the background) at designated times. A crontab file is a simple text file containing a list of commands meant to be run at specified times. It is edited using the crontab command discussed below.

To use cron for tasks meant to run only under your user profile, add entries to your own user’s crontab file. To edit the crontab file, open the terminal and enter:

crontab -e

We need to add a line which runs our Python script. Add the following at the bottom of the file:

00 22 * * * sh path/to/sh/file/filename.sh

The above line will run the .sh file every day at 10:00 pm. (A crontab entry has five time fields – minute, hour, day of month, month and day of week – so 00 22 means minute 0 of hour 22, i.e. 10 pm.) Now the question is: what is a .sh file, what does it contain, and how does it run a Python file?

A .sh file is a Linux shell script, an executable text file used to run terminal commands. In our case, the .sh file will contain the following commands:

#!/bin/bash
source path/to/pip/bin/activate
path/to/installedpython/python path/to/python/script/test.py

Now, what do the above commands do? The first line specifies which shell should run the script (bash here, since `source` is a bash built-in). The second command activates the virtual environment where the modules used by our Python script are installed. The third line runs the Python script named ‘test.py’.

So, to conclude: the .sh file contains the terminal commands used to run the Python script, and this .sh file is put on cron, which runs it daily at 10 pm.

I hope you enjoyed this post and got a basic idea of scraping product information from Amazon. If you are stuck anywhere or need any kind of help, feel free to comment below.

Thanks

Raghav Chopra


Recommender System with Python

This is a very interesting topic which uses data analysis. Recommender systems are used by Amazon, Flipkart and many other websites to show you “Products you might like”. The system tries to find out which products a user might like to purchase based on his/her previous purchase history. This post will explain recommender systems and give you an overview of building one using Python.

There are two kinds of recommender systems – ‘User-Based Collaborative Filtering‘ and ‘Item-Based Collaborative Filtering’. This blog post will cover the theoretical aspects as well as a practical implementation using a demo database with Python. First, let’s try to understand the basics of both –

1. User Based Collaborative Filtering

  • Build a matrix of things each user bought/viewed/rated
  • Compute similarity scores between users
  • Find users similar to you
  • Recommend stuff they bought/viewed/rated that you haven’t yet.

Let me explain this with an example. Say there are two people, A and B. A has bought Puma shoes as well as a Puma T-shirt. B has bought only Puma shoes. We will say users A and B are similar because both have bought Puma shoes. Now, since A has also bought a Puma T-shirt, the T-shirt might be a good recommendation for B.
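
As a toy sketch of the idea, you can represent purchases as a user-item matrix and compute a similarity score between users. Cosine similarity is used here as one common choice; the data matches the A/B example above:

import numpy as np
import pandas as pd

# 1 = bought, 0 = not bought
purchases = pd.DataFrame({'puma_shoes': [1, 1], 'puma_tshirt': [1, 0]},
                         index=['A', 'B'])

a = purchases.loc['A'].values
b = purchases.loc['B'].values
# Cosine similarity between the two users' purchase vectors
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(similarity)   # ~0.71 – A and B look similar, so recommend A's other item to B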

But there are a few disadvantages to User-Based Collaborative Filtering –

  • People are fickle; tastes change. Therefore, comparing people to people might not be a good idea.
  • There are usually many more people than things, so there is more computation to do.

2. Item-Based Collaborative Filtering

I will try to explain it with the example of a movie recommender system like Netflix’s –

  • Find every pair of movies that were watched by the same person.
  • Measure the similarity of their ratings across all users who watched both.
  • Sort by movie, then by similarity strength.

P.S.: This is just one way of doing it.

Let me explain this with an example. Let us say A has watched two movies, ‘3 idiots’ and ‘PK’, and rated them 5 stars. B has also watched the same two movies and also rated them 5 stars. Based on these two users, we would say there is some sort of relationship between ‘3 idiots’ and ‘PK’. Now, C has watched ‘3 idiots’ but not ‘PK’. Since we have computed the relationship between ‘3 idiots’ and ‘PK’ from A’s and B’s ratings, and C liked ‘3 idiots’, there is a high probability that C will also like ‘PK’. Hence, here we looked at relationships between items, not people.
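
A toy version of this with pandas, using made-up ratings for the three users in the example (varied slightly so the correlation is well defined; NaN means 'not watched'):

import numpy as np
import pandas as pd

ratings = pd.DataFrame({'3 idiots': [5, 4, 5], 'PK': [5, 4, np.nan]},
                       index=['A', 'B', 'C'])

# corr() uses only the users who rated both movies (A and B here)
print(ratings['3 idiots'].corr(ratings['PK']))   # 1.0 – strongly related

Since C liked ‘3 idiots’ and the two movies correlate strongly, ‘PK’ becomes a candidate recommendation for C.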

Now let’s come to the implementation of a Recommender System using Python. Following are the prerequisites for this project –

  • Python with packages pandas and numpy.
  • ‘Movie Lens’ Data Set.

About the Data Set –> We have the following two files:

  • u.item – It consists of two columns, movie_id and movie_name. The total number of rows is 1682; hence, 1682 movies.
  • u.data – It consists of three columns, user_id, movie_id and rating. The total number of rows is 100000; hence, 100000 ratings.

First, I will build a simple recommender system to recommend movies to a person who has rated Star Wars.

Here is the code:

import pandas as pd
import numpy as np

r_cols = ['user_id','movie_id','rating']
ratings = pd.read_csv('u_data_new.csv',sep=',',names=r_cols,usecols=range(3),encoding='ISO-8859-1')
m_cols = ['movie_id', 'title']
movies = pd.read_csv('u.item',sep='|',names=m_cols,usecols=range(2),encoding='ISO-8859-1')
ratings = pd.merge(movies,ratings,on='movie_id')

# One row per user, one column per movie; cells hold that user's rating
movieRatings = ratings.pivot_table(index=['user_id'],columns=['title'],values='rating')
starWarsRatings = movieRatings['Star Wars (1977)']

# Correlate every movie's rating column with the Star Wars column
similarMovies = movieRatings.corrwith(starWarsRatings)
similarMovies = similarMovies.dropna()
similarMovies = similarMovies.sort_values(ascending=False)
#print(similarMovies)
# But the raw correlations alone do not give sensible results: movies rated
# by only a handful of users can show spuriously perfect correlations.
# Fix: only keep movies with at least 100 ratings.
movieStats = ratings.groupby(['title']).agg(['count','mean'])
movieStats = movieStats['rating']
popularMovies = movieStats[movieStats['count'] >= 100]
popularMovies = popularMovies.sort_values(by='mean',ascending=False)
#print(popularMovies)
df = pd.concat([popularMovies,similarMovies],axis=1)
df.dropna(inplace=True)
df.columns = ['rating_size','rating_mean','similarity']
df = df.sort_values(by='similarity',ascending=False)
print(df.head(11))

The df will contain all the movies and their corresponding similarity scores compared with Star Wars. The output of the code (df) is as follows:

title                                                similarity
Star Wars (1977)                                     1.000000
Empire Strikes Back, The (1980)                      0.748353
Return of the Jedi (1983)                            0.672556
Raiders of the Lost Ark (1981)                       0.536117
Austin Powers: International Man of Mystery (1997)   0.377433
Sting, The (1973)                                    0.367538
Indiana Jones and the Last Crusade (1989)            0.350107
Pinocchio (1940)                                     0.347868
Frighteners, The (1996)                              0.332729
L.A. Confidential (1997)                             0.319065
Wag the Dog (1997)                                   0.318645
Dumbo (1941)                                         0.317656

This shows the most similar movies to Star Wars, which should be recommended to the user. The similarity score of Star Wars itself is obviously 1, because Star Wars is perfectly similar to itself. Hence, the top movie to recommend to any user who has rated Star Wars is “Empire Strikes Back”.

Here is the full item-based CF code:

#REAL ITEM BASED COLLABORATIVE FILTERING

import pandas as pd
import numpy as np

r_cols = ['user_id','movie_id','rating']
ratings = pd.read_csv('u_data_new.csv',sep=',',names=r_cols,usecols=range(3),encoding='ISO-8859-1')

m_cols = ['movie_id', 'title']
movies = pd.read_csv('u.item',sep='|',names=m_cols,usecols=range(2),encoding='ISO-8859-1')

#print(ratings.head())
#print(movies.head())

ratings = pd.merge(movies,ratings,on='movie_id')

userRatings = ratings.pivot_table(index=['user_id'],columns=['title'],values='rating')
#print(userRatings.head())

# Correlate every movie with every other movie, requiring at least 100 users
# who rated both so that rare movies don't produce spurious scores
corrMatrix = userRatings.corr(method='pearson',min_periods=100)
#print(corrMatrix.head())

myRatings = userRatings.loc[0].dropna() #ratings of the first user

#Now, let's go through each movie rated by user1 and build up a list of possible recommendations based on the movies similar to the ones rated by user1.
#So for each movie user1 rated, we retrieve the list of similar movies from our correlation matrix. We then scale those correlation scores by how well user1 rated the movie they are similar to, so movies similar to ones user1 liked count more than movies similar to ones user1 hated.

simCandidates = pd.Series(dtype='float64')
for i in range(0,len(myRatings.index)):
    print("Adding sims for "+myRatings.index[i]+"...")
    # Retrieve the movies similar to this one from the correlation matrix
    sims = corrMatrix[myRatings.index[i]].dropna()
    # Scale those similarity scores by how well user1 rated this movie
    sims = sims.map(lambda x: x*myRatings.iloc[i])
    # Add the scores to the list of similarity candidates
    simCandidates = pd.concat([simCandidates,sims])  # Series.append() was removed in pandas 2.0

print("Sorting...")
# The same movie can appear several times as a candidate; sum its scores
simCandidates = simCandidates.groupby(simCandidates.index).sum()
simCandidates = simCandidates.sort_values(ascending=False)
print(simCandidates.head(10))

#drop movies already watched (i.e. rated) by user1
filteredSims = simCandidates.drop(myRatings.index)
print(filteredSims)

The output will show all the movie titles along with their similarity scores, based on the movies watched by user1. The top 10 results are as follows:

Return of the Jedi (1983)                     7.178172
Raiders of the Lost Ark (1981)                5.519700
Indiana Jones and the Last Crusade (1989)     3.488028
Bridge on the River Kwai, The (1957)          3.366616
Back to the Future (1985)                     3.357941
Sting, The (1973)                             3.329843
Cinderella (1950)                             3.245412
Field of Dreams (1989)                        3.222311
Wizard of Oz, The (1939)                      3.200268
Dumbo (1941)                                  2.981645

This shows that, based on the movies rated by user1, the top movie to recommend to user1 is “Return of the Jedi”.

Please note that this is just one way of doing it; the results can certainly be improved a lot. I hope you got a basic idea of how recommender systems work. If you want to discuss anything related to this topic, feel free to ping me or comment below.

Thanks

Raghav Chopra
