# Project: Movies Recommendation System 

<b>Recommender systems are a significant category of machine learning algorithms that offer users relevant suggestions. YouTube, Amazon, and Netflix all run systems that suggest videos or products based on your own past behavior (content-based filtering) or on the behavior and preferences of other users who share your interests (collaborative filtering).</b>

<b>Recommendation systems work on the similarity between either the content items or the users who access the content. There are several ways to measure the similarity between two items. A recommendation system uses the resulting similarity matrix to recommend the items most similar to the one the user liked.</b>

<b>In this project, we will build a machine learning model that recommends movies based on a movie the user likes. The model is based on cosine similarity.</b>


## Importing dependencies

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Loading the Data

In [2]:
movies = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv')

In [3]:
movies.head(5)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [4]:
credits.head(5)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [5]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

In [6]:
credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4803 non-null   int64 
 1   title     4803 non-null   object
 2   cast      4803 non-null   object
 3   crew      4803 non-null   object
dtypes: int64(1), object(3)
memory usage: 150.2+ KB


## Merging both dataframes : Movies & Credits

In [7]:
movies = movies.merge(credits,on='title')

In [8]:
movies.shape

(4809, 23)
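
The merged frame has 4809 rows even though each input has 4803. A likely cause, assuming a few titles occur more than once in both files, is that joining on `title` pairs every matching row on one side with every matching row on the other. The sketch below uses made-up toy frames (`toy_movies`, `toy_credits` are hypothetical names) to show the effect and one possible alternative, joining on the identifier columns instead.

```python
import pandas as pd

# Toy data (hypothetical): two different films share the title "Batman".
toy_movies = pd.DataFrame({"id": [1, 2, 3], "title": ["Batman", "Batman", "Avatar"]})
toy_credits = pd.DataFrame({"movie_id": [1, 2, 3], "title": ["Batman", "Batman", "Avatar"]})

# Joining on title: each "Batman" row matches both "Batman" rows on the other side.
by_title = toy_movies.merge(toy_credits, on="title")

# Joining on the identifier keeps exactly one row per film.
by_id = toy_movies.merge(toy_credits, left_on="id", right_on="movie_id")

print(len(by_title), len(by_id))
```

Joining on the identifier leaves two title columns (`title_x`, `title_y`), so one of them would need to be dropped or renamed afterwards.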

## Data Pre-Processing

## Important columns to be used in the recommendation system:

* genres
* movie_id
* keywords
* title
* overview
* cast
* crew

We keep only these columns and derive the features for the recommender from them.

In [9]:
movies = movies[['movie_id','title','overview','genres','cast','keywords','crew']]

In [10]:
movies.head(5)

Unnamed: 0,movie_id,title,overview,genres,cast,keywords,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


## Missing Values

In [11]:
#Checking for Missing Values
movies.isnull().sum()
     

movie_id    0
title       0
overview    3
genres      0
cast        0
keywords    0
crew        0
dtype: int64

In [12]:
#Dropping the missing values
movies.dropna(inplace=True)
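
One thing to keep in mind after dropping rows: `dropna` keeps the original index labels, so labels and row positions can diverge. This matters wherever a label is later used as a position, for example to index a NumPy array. A minimal sketch with made-up data (the frame `df` is hypothetical):

```python
import pandas as pd

# Toy frame (hypothetical): the middle row has a missing overview.
df = pd.DataFrame({"title": ["A", "B", "C"], "overview": ["x", None, "z"]})
df = df.dropna()

# The label of "C" is still 2, but it is now the second row (position 1).
label = df[df["title"] == "C"].index[0]
position = df["title"].tolist().index("C")
print(label, position)

# Resetting the index makes labels and positions agree again.
df = df.reset_index(drop=True)
print(list(df.index))
```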

In [13]:
#Checking again after dropping the missing values
movies.isnull().sum()

movie_id    0
title       0
overview    0
genres      0
cast        0
keywords    0
crew        0
dtype: int64

In [14]:
#Checking for any duplicated rows in the data
movies.duplicated().sum()

0

In [15]:
#Inspecting the raw genres value of the first row (index position 0)
movies.iloc[0].genres

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [16]:
#The ast module's literal_eval parses a string containing a Python literal
#(here, a list of dictionaries) into the corresponding Python object.
import ast

ast.literal_eval only accepts literals (strings, numbers, lists, dicts, and so on). It raises an exception for any other input instead of executing it, which makes it safer than eval.
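
As a small illustration of this behavior, the sketch below parses a shortened genres string (taken from the output of cell 15) and shows that a non-literal expression is rejected rather than run. The variable names are chosen for this example only.

```python
import ast

# A shortened raw value from the genres column.
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# literal_eval turns the string into a list of dictionaries.
parsed = ast.literal_eval(raw)
names = [item["name"] for item in parsed]
print(names)

# A function call is not a literal, so it is rejected instead of being executed.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
print(rejected)
```

Because these particular strings also happen to be valid JSON, `json.loads` would likely work as an alternative parser.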

### Function for extracting values from raw data for the creation of tags

In [17]:
#Extracting the 'name' values (genres, keywords) from the raw data for the creation of tags
#Creating a function convert

def convert(obj):
    L = []
    for i in ast.literal_eval(obj):
        L.append(i['name'])
    return L

### Extracting Genres

In [18]:
#Applying the convert function to genres column to extract the required data
movies['genres'] = movies['genres'].apply(convert)

### Extracting Keywords

In [19]:
#Applying the convert function to keyword column to extract the required data
movies['keywords'] = movies['keywords'].apply(convert)

### Function for extracting top 3 actors from the movie

In [20]:
# Creating a function for extracting top 3 actors from the movie 
    
def convert3(obj):
    L=[]
    counter=0
    for i in ast.literal_eval(obj):
        if counter !=3:
            L.append(i['name'])
            counter+=1
        else:
            break
    return L

In [21]:
#Applying the convert3 function to cast column to extract the required data
movies['cast'] = movies['cast'].apply(convert3)

### Function to fetch the director of movie from crew column

In [22]:
#Creating a function to fetch the director of movie from crew column
def fetch_director(obj):
    L=[]
    for i in ast.literal_eval(obj):
        if i['job'] == 'Director':
            L.append(i['name'])
            break
    return L

In [23]:
# Applying the fetch_director function to the crew column to extract the required data
movies['crew'] = movies['crew'].apply(fetch_director)
    

In [24]:
#Splitting the overview column into a list of words
movies['overview'] = movies['overview'].apply(lambda x:x.split())

Here, .apply(lambda x: x.split()) turns a whitespace-separated string such as case_1 case_2 case_3 into the list [case_1, case_2, case_3], so the overview can later be concatenated with the other list-valued columns.

### Checking the Data after extracting all the required values

In [25]:
# Checking the Final Data after extracting all the required values
movies.head(5)

Unnamed: 0,movie_id,title,overview,genres,cast,keywords,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, Science Fiction]","[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[culture clash, future, space war, space colon...",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[Johnny Depp, Orlando Bloom, Keira Knightley]","[ocean, drug abuse, exotic island, east india ...",[Gore Verbinski]
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[Daniel Craig, Christoph Waltz, Léa Seydoux]","[spy, based on novel, secret agent, sequel, mi...",[Sam Mendes]
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[Christian Bale, Michael Caine, Gary Oldman]","[dc comics, crime fighter, terrorist, secret i...",[Christopher Nolan]
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, Science Fiction]","[Taylor Kitsch, Lynn Collins, Samantha Morton]","[based on novel, mars, medallion, space travel...",[Andrew Stanton]


In [26]:
#Removing spaces inside multi-word entries so that each name or phrase becomes a single token (e.g. 'Science Fiction' -> 'ScienceFiction')

movies['genres'] = movies['genres'].apply(lambda x:[i.replace(" ","") for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x:[i.replace(" ","") for i in x])
movies['cast'] = movies['cast'].apply(lambda x:[i.replace(" ","") for i in x])
movies['crew'] = movies['crew'].apply(lambda x:[i.replace(" ","") for i in x])

In [27]:
# Combining all the columns into a single tags column, which the recommendation system is built on
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']
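
The two steps above (removing spaces, then concatenating the lists) can be sketched on a single made-up row. The example assumes a tokenizer that splits on whitespace. Without the space removal, two different people named Sam would contribute the same token.

```python
# Hypothetical single row, shaped like the list-valued columns built above.
overview = "A cryptic message from the past".split()
cast = ["Sam Worthington", "Sam Mendes"]

# With spaces kept, whitespace tokenization would later yield a shared token "Sam".
tokens_with_spaces = " ".join(cast).split()

# With spaces removed, each person stays one distinct token.
cast_joined = [name.replace(" ", "") for name in cast]

# List concatenation builds the tag list, as in the tags column.
tags = overview + cast_joined
print(tokens_with_spaces)
print(tags)
```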

In [28]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,cast,keywords,crew,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[SamWorthington, ZoeSaldana, SigourneyWeaver]","[cultureclash, future, spacewar, spacecolony, ...",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[JohnnyDepp, OrlandoBloom, KeiraKnightley]","[ocean, drugabuse, exoticisland, eastindiatrad...",[GoreVerbinski],"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[DanielCraig, ChristophWaltz, LéaSeydoux]","[spy, basedonnovel, secretagent, sequel, mi6, ...",[SamMendes],"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[ChristianBale, MichaelCaine, GaryOldman]","[dccomics, crimefighter, terrorist, secretiden...",[ChristopherNolan],"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, ScienceFiction]","[TaylorKitsch, LynnCollins, SamanthaMorton]","[basedonnovel, mars, medallion, spacetravel, p...",[AndrewStanton],"[John, Carter, is, a, war-weary,, former, mili..."


In [29]:
#Creating a new dataframe with 3 columns (copied so that later assignments do not act on a view of movies)
new_df = movies[['movie_id','title','tags']].copy()

In [30]:
# Suppressing the warning messages
import warnings
warnings.filterwarnings('ignore')

#Joining all the tags of a movie into a single string
new_df['tags'] = new_df['tags'].apply(lambda x:" ".join(x))

In [31]:
#Checking the new data
new_df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,John Carter,"John Carter is a war-weary, former military ca..."


In [32]:
#Checking one of the tags to check how the data looks
new_df['tags'][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver JamesCameron'

In [33]:
#Converting the tags data into lowercase
new_df['tags'] = new_df['tags'].apply(lambda x:x.lower())

In [34]:
#Checking again after applying the lower case function
new_df['tags'][0]

'in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver jamescameron'

## Text Vectorization

In [35]:
#Importing this module to convert a collection of text documents to a matrix of token counts.
from sklearn.feature_extraction.text import CountVectorizer
     

In [36]:
#Creating a variable cv to convert text to vector
cv = CountVectorizer(max_features=5000,stop_words='english')

In [37]:
# Transforming the data to vectors and storing as an array
vectors = cv.fit_transform(new_df['tags']).toarray()

In [38]:
# The 5000 most frequent words (the vocabulary)
# cv.get_feature_names_out()   # get_feature_names() in scikit-learn versions before 1.0
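
To make the idea of a token-count matrix concrete, here is a simplified pure-Python stand-in for what a count vectorizer does. It is only a sketch: it splits on whitespace, removes no stop words, and has no cap on the vocabulary size, so it is not equivalent to `CountVectorizer`. The three documents are made up.

```python
from collections import Counter

# Three made-up "tags" strings.
docs = [
    "action adventure space alien",
    "action crime drama",
    "space alien alien battle",
]

# Vocabulary: every distinct token in sorted order, one matrix column per token.
vocab = sorted({tok for doc in docs for tok in doc.split()})

def to_counts(doc):
    """Return the count of each vocabulary token in one document."""
    counts = Counter(doc.split())
    return [counts[tok] for tok in vocab]

# One row per document, one column per vocabulary token.
matrix = [to_counts(doc) for doc in docs]
print(vocab)
for row in matrix:
    print(row)
```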

## Applying Stemming Process

Stemming is a natural language processing technique that reduces inflected words to their root form, which helps normalize text during preprocessing. Simply put, it chops words down to their stems; for example, eating becomes eat. Several stemmers exist; this project uses the Porter stemmer, one of the most widely used.

The Porter stemmer is widely used in data mining and information retrieval. It applies to English only. It is based on the idea that English suffixes are built from combinations of smaller, simpler suffixes, and it is known for its simplicity and speed. It is often described as producing good output with a low error rate compared to other stemmers.

In [39]:
#Importing the NLTK library for stemming process
import nltk 

In [40]:
#From NLTK import PorterStemmer & then Creating a variable and storing PorterStemmer into it
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
     

In [41]:
#Defining the stemming function
def stem(text):
    y=[]
    for i in text.split():
        y.append(ps.stem(i))
    return " ".join(y)
     

In [42]:
#Checking on the sample text
stem('In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver JamesCameron')
     

'in the 22nd century, a parapleg marin is dispatch to the moon pandora on a uniqu mission, but becom torn between follow order and protect an alien civilization. action adventur fantasi sciencefict cultureclash futur spacewar spacecoloni societi spacetravel futurist romanc space alien tribe alienplanet cgi marin soldier battl loveaffair antiwar powerrel mindandsoul 3d samworthington zoesaldana sigourneyweav jamescameron'

In [43]:
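#Applying the stemming function to the tags column in our new data
#Note: `vectors` was computed in cell 37, before this step, so the stemmed tags only
#influence the similarity scores if the vectorization cell is re-run after stemming
new_df['tags'] = new_df['tags'].apply(stem)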

## Similarity Measures

In this case study, we use cosine similarity from scikit-learn as the metric for the similarity between two movies.

Cosine similarity measures how similar two items are. Mathematically, it is the cosine of the angle between two vectors in a multi-dimensional space. In general the value lies between -1 and 1; because the word-count vectors used here have no negative entries, it ranges from 0 to 1.

0 means no similarity (no shared words), whereas 1 means the two vectors point in exactly the same direction, i.e. the items are 100% similar.
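
The definition above corresponds to the dot product of the two vectors divided by the product of their lengths. A minimal pure-Python sketch follows; the function name `cosine` and the toy vectors are chosen for illustration, and the function assumes neither vector is all zeros.

```python
import math

def cosine(u, v):
    """Cosine of the angle between u and v (assumes both are non-zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1, 1, 0, 2]
b = [2, 2, 0, 4]   # same direction as a, only longer
c = [0, 0, 3, 0]   # shares no non-zero position with a

print(cosine(a, b))   # close to 1: same direction
print(cosine(a, c))   # 0: nothing in common
```

Note that the length of a vector does not matter, only its direction, so a movie with a long tag string is not favored over one with a short tag string.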



In [44]:
#importing the cosine similarity from sklearn
from sklearn.metrics.pairwise import cosine_similarity

In [45]:
#Creating a variable similarity and computing cosine_similarity of the vector
similarity = cosine_similarity(vectors)


## Making the recommendation function

In [46]:
#Creating the function for Movie Recommendation using cosine similarity
def recommend(movie):
    #Get the row position of the input movie. A position (not an index label) is needed
    #because rows were dropped earlier, so labels may no longer match positions in similarity
    movie_index = np.flatnonzero((new_df['title'] == movie).to_numpy())[0]
    #Similarity scores between the input movie and every movie in the dataset
    distances = similarity[movie_index]
    #Sort the (position, score) pairs by score in descending order (reverse=True),
    #skip the first entry (the movie itself) and keep the next 5
    movies_list = sorted(list(enumerate(distances)),reverse=True, key=lambda x:x[1])[1:6]

    #Print the titles of the 5 most similar movies
    for i in movies_list:
        print(new_df.iloc[i[0]].title)
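
For use outside the notebook it can be convenient to return the titles instead of printing them, and to handle unknown titles gracefully. The following is one possible sketch on toy data, not the function used above. `top_similar`, its parameters, and the 4x4 matrix are hypothetical.

```python
def top_similar(titles, similarity, movie, k=5):
    """Return up to k titles most similar to `movie`, or [] if the title is unknown."""
    if movie not in titles:
        return []
    idx = titles.index(movie)
    # Rank all (position, score) pairs by score, highest first.
    ranked = sorted(enumerate(similarity[idx]), key=lambda pair: pair[1], reverse=True)
    # Exclude the query by position rather than assuming it always sorts first.
    return [titles[i] for i, _ in ranked if i != idx][:k]

# Toy similarity matrix (made up): row i holds the scores of title i against all titles.
titles = ["A", "B", "C", "D"]
similarity = [
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
]

print(top_similar(titles, similarity, "A", k=2))
print(top_similar(titles, similarity, "Unknown"))
```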

## Recommendation

In [47]:
#Enter only movies that are in the dataset; any other title raises an error
recommend('Batman Begins')  

The Dark Knight
The Dark Knight Rises
Batman
Batman & Robin
Batman


In [48]:
new_df.iloc[1216]

movie_id                                                  440
title                             Aliens vs Predator: Requiem
tags        a sequel to 2004' alien vs. predator, the icon...
Name: 1216, dtype: object

## Exporting the Model

In [49]:
import pickle

In [50]:
with open('movies.pkl','wb') as f:
    pickle.dump(new_df,f)

In [51]:
with open('movie_dict.pkl','wb') as f:
    pickle.dump(new_df.to_dict(),f)
     

In [52]:
with open('similarity.pkl','wb') as f:
    pickle.dump(similarity,f)
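
The exported files can later be read back with `pickle.load`. The sketch below shows the round trip on a small made-up dictionary written to a temporary directory; the payload and path are for illustration only and are not the project's real files.

```python
import os
import pickle
import tempfile

# Made-up payload, shaped like the output of DataFrame.to_dict().
payload = {"title": {0: "Avatar", 1: "Spectre"}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "movie_dict.pkl")

    # Write the object in binary mode; the with-block closes the file afterwards.
    with open(path, "wb") as f:
        pickle.dump(payload, f)

    # Read it back; the restored object equals the original.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == payload)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.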
