Similarity analysis is a common task in Natural Language Processing (NLP). Services such as YouTube and Netflix use similar techniques to make recommendations: they analyze each customer's previous behavior and recommend similar material based on it. In this article, I will show how to develop a movie recommendation model using the scikit-learn library in Python, based on a similarity analysis technique. The technique involves a lot of complex mathematics, but scikit-learn has some great built-in functions that take care of most of the heavy lifting. I will explain how to use those functions and what they do as we move through the exercise.
I will use a movie dataset for this exercise. The link to the dataset is at the bottom of this page. Please feel free to download it and run all the code for a better understanding.
Here is the step-by-step implementation of the movie recommendation model:
1. Import the packages and the dataset.
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
df = pd.read_csv("movie_dataset.csv")
2. The dataset is too big to show in a screenshot, so here are its columns. They are self-explanatory: each column name describes its content.
Index(['index', 'budget', 'genres', 'homepage', 'id', 'keywords', 'original_language', 'original_title', 'overview', 'popularity', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'vote_average', 'vote_count', 'cast', 'crew', 'director'], dtype='object')
3. Choose the features to be used for the model. We do not need all the features, and some of them are not appropriate for this model. I chose these four:
features = ['keywords','cast','genres','director']
Please feel free to include more features or different features for the experiment. Now, combine those features and make one column out of those four columns.
def combine_features(row):
    return row['keywords']+" "+row['cast']+" "+row['genres']+" "+row['director']
4. Null values will cause problems later: a missing value is read as NaN, and concatenating it with a string raises an error. Fill the null values with empty strings, then apply the function to every row to create the combined column.
for feature in features:
    df[feature] = df[feature].fillna('')
df["combined_features"] = df.apply(combine_features, axis=1)
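To see what this step produces, here is a minimal sketch on a made-up two-row DataFrame. The data and the name `toy` are assumptions for illustration only; they are not taken from the movie dataset.

```python
import pandas as pd

# Made-up rows for illustration; None marks a missing value.
toy = pd.DataFrame({
    "keywords": ["space war", None],
    "cast": ["Actor A", "Actor B"],
    "genres": ["Action", "Drama"],
    "director": ["Director X", None],
})

features = ["keywords", "cast", "genres", "director"]

# Replace missing values with empty strings so the strings can be joined.
for feature in features:
    toy[feature] = toy[feature].fillna("")

# Join the four columns into one space-separated string per row.
toy["combined_features"] = toy.apply(
    lambda row: " ".join(row[f] for f in features), axis=1
)

print(toy["combined_features"].tolist())
```

The second row keeps a leading and a trailing space where the missing values were. That is harmless here, because the vectorizer in the next step only looks at the words.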
5. Fit the ‘CountVectorizer’ on the combined text and transform the text into a vector representation. When you pass text data through the ‘CountVectorizer’, it returns a matrix that holds the count of each word in each document (here, each movie).
cv = CountVectorizer()
count_matrix = cv.fit_transform(df["combined_features"])
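A small sketch may make the count matrix easier to picture. The two documents below are made up, and the names `docs`, `cv_toy`, `vocab` and `dense` are assumptions for this example only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up "combined_features" strings.
docs = ["space war hero", "space hero hero"]

cv_toy = CountVectorizer()
counts = cv_toy.fit_transform(docs)   # sparse matrix: one row per document

# vocabulary_ maps each word to its column; order the words by column number.
vocab = sorted(cv_toy.vocabulary_, key=cv_toy.vocabulary_.get)
dense = counts.toarray()

print(vocab)
print(dense)
```

Each row is a document and each column is a word from the vocabulary. With the default settings, the vectorizer lowercases the text and splits it into single words, so a name such as "Actor A" would not stay together as one token.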
6. Use ‘cosine_similarity’ to find the similarity. It measures the cosine of the angle between two vectors in a multi-dimensional space, so the size of the documents does not matter: two documents can be far apart by Euclidean distance and still point in nearly the same direction.
cosine_sim = cosine_similarity(count_matrix)
This ‘cosine_sim’ is a two-dimensional matrix of similarity coefficients: the entry in row i and column j is the similarity between movie i and movie j. I am not going to explain cosine similarity in detail here because that is beyond the scope of this article. I only want to show how to use it.
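Without going into the theory, a tiny sketch can show why document size does not matter. The vectors below are made-up word counts, and the helper `cosine` is a hypothetical function written for this example; it computes the dot product of two vectors divided by the product of their lengths.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up word-count vectors over a three-word vocabulary.
short_doc = np.array([1, 1, 0])
long_doc = np.array([2, 2, 0])    # same words, each used twice as often
other_doc = np.array([0, 0, 3])   # shares no words with the others

def cosine(u, v):
    """Cosine of the angle between u and v: (u . v) / (|u| * |v|)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

manual = cosine(short_doc, long_doc)
euclid = float(np.linalg.norm(short_doc - long_doc))

# cosine_similarity compares every row with every other row.
sim = cosine_similarity(np.vstack([short_doc, long_doc, other_doc]))

print(round(manual, 3), round(euclid, 3))
print(sim.round(3))
```

The short and the long document get a similarity of 1 (up to rounding) even though their Euclidean distance is about 1.41, while the document with no shared words gets 0.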
7. We need to define two functions. One of the functions returns the title from the index and the other one returns the index from the title. We will use both functions soon.
def find_title_from_index(index):
    return df[df.index == index]["title"].values[0]
def find_index_from_title(title):
    return df[df.title == title]["index"].values[0]
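A lookup by title can fail when the title is misspelled or not in the dataset. Here is one possible sketch of a more defensive lookup on a made-up DataFrame; `safe_index_from_title` is a hypothetical helper for illustration, not part of the model above.

```python
import pandas as pd

# Made-up data with the same 'index' and 'title' columns as the dataset.
toy = pd.DataFrame({"index": [0, 1, 2], "title": ["Alpha", "Beta", "Gamma"]})

def safe_index_from_title(frame, title):
    """Hypothetical variant: return a plain int, or None if the title is absent."""
    matches = frame.loc[frame["title"] == title, "index"].values
    return int(matches[0]) if len(matches) else None

print(safe_index_from_title(toy, "Beta"))
print(safe_index_from_title(toy, "Zeta"))
```

Returning a single integer rather than an array matters later, because that integer is used to pick one row out of the similarity matrix.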
8. Take a movie that our user likes. Let’s take ‘Star Wars’. Then find the index of this movie using the function above.
movie = "Star Wars"
movie_index = find_index_from_title(movie)
The index of ‘Star Wars’ is 2912. As I mentioned earlier, ‘cosine_sim’ from step 6 is a matrix of similarity coefficients. Row 2912 of that matrix holds the similarity coefficients between ‘Star Wars’ and every movie in the dataset. So, take row 2912 of the matrix ‘cosine_sim’.
similar_movies = list(enumerate(cosine_sim[movie_index]))
I am using ‘enumerate’ to get both the index and the coefficient. ‘similar_movies’ is a list of tuples, each containing an index and a coefficient. I love using ‘enumerate’ in Python. It comes in handy a lot of the time.
9. Sort the list ‘similar_movies’ by the coefficients in the reverse order. That way, the highest coefficients will be on top.
sorted_similar_movies = sorted(similar_movies, key=lambda x: x[1], reverse=True)[1:]
We are not taking the first one from the list because the top one in the list will be ‘Star Wars’ itself.
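The ranking step can be tried on its own with a made-up row of coefficients. In this sketch, `row` and `query_index` are assumptions for illustration. It also shows an alternative to dropping the first element: filtering the query movie out by its index, which may be safer when another movie has exactly the same features and ties at 1.0.

```python
# Made-up similarity row for the movie at position 1 (its own score is 1.0).
row = [0.2, 1.0, 0.7, 0.0, 0.7]
query_index = 1

# Pair every coefficient with its position: (index, coefficient).
pairs = list(enumerate(row))

# Sort by the coefficient (the second item of each tuple), highest first.
ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Drop the query movie by index instead of assuming it is first.
ranked = [pair for pair in ranked if pair[0] != query_index]

print(ranked)
```

Python's sort is stable, so movies with equal coefficients keep their original relative order.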
10. Use the function ‘find_title_from_index’ to get the titles of the movies most similar to ‘Star Wars’.
for element in sorted_similar_movies[:6]:
    print(find_title_from_index(element[0]))
The six movies most similar to ‘Star Wars’ are:
The Empire Strikes Back
Return of the Jedi
Star Wars: Episode II — Attack of the Clones
Star Wars: Episode III — Revenge of the Sith
Star Wars: Episode I — The Phantom Menace
The Helix… Loaded
Makes sense, right? So, that was the movie recommendation model. Please feel free to ask me if you have any questions.
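If you would like to experiment further, the steps above can be wrapped into a single function. This is only a sketch under some assumptions: the function `recommend`, the constant `FEATURES` and the three-movie DataFrame are made up for illustration, and the function uses row positions rather than the dataset's 'index' column.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

FEATURES = ["keywords", "cast", "genres", "director"]

def recommend(df, title, top_n=5):
    """Hypothetical helper: return the top_n titles most similar to `title`."""
    data = df.reset_index(drop=True)
    # Fill missing values, then join the feature columns into one string per row.
    combined = data[FEATURES].fillna("").apply(lambda row: " ".join(row), axis=1)
    sim = cosine_similarity(CountVectorizer().fit_transform(combined))
    matches = data.index[data["title"] == title]
    if len(matches) == 0:
        raise ValueError(f"title not found: {title!r}")
    query = matches[0]
    # Rank all movies by their similarity to the query, highest first.
    scores = sorted(enumerate(sim[query]), key=lambda pair: pair[1], reverse=True)
    return [data.loc[i, "title"] for i, _ in scores if i != query][:top_n]

# Made-up movies for illustration.
toy = pd.DataFrame({
    "title": ["Space Saga", "Space Saga II", "Quiet Drama"],
    "keywords": ["space rebellion", "space empire", "family loss"],
    "cast": ["alice bob", "alice bob", "carol"],
    "genres": ["action adventure", "action adventure", "drama"],
    "director": ["dana", "erin", None],
})

print(recommend(toy, "Space Saga", top_n=2))
```

On this toy data the sequel shares most of its words with the first movie, so it is ranked first.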