SagarBapodara committed on
Commit
de5ec86
1 Parent(s): a2d2c1f

Added Files

MovieRecommendation_AppImage.jpg ADDED
MovieRecommendation_Model.ipynb ADDED
@@ -0,0 +1,1846 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "c5802a21",
6
+ "metadata": {},
7
+ "source": [
8
+ "# Project: Movies Recommendation System \n",
9
+ "\n",
10
+ "<b>Recommender systems are an important category of machine learning algorithms that offer users \"relevant\" suggestions. YouTube, Amazon, and Netflix all use systems that suggest videos or products based either on the attributes of items you have already engaged with (content-based filtering) or on the behavior and preferences of other users with similar interests (collaborative filtering).</b>\n",
11
+ "\n",
12
+ "<b>Recommendation systems work by measuring similarity, either between items of content or between the users who access that content. There are several ways to measure the similarity between two items. A recommendation system uses the resulting similarity matrix to recommend the items most similar to the ones a user already likes.</b>\n",
13
+ "\n",
14
+ "<b>In this project, we will build a machine learning model that recommends movies based on a movie the user likes. The model is based on cosine similarity.</b>\n"
15
+ ]
16
+ },
17
+ {
18
+ "cell_type": "markdown",
19
+ "id": "45208056",
20
+ "metadata": {},
21
+ "source": [
22
+ "## Importing dependencies"
23
+ ]
24
+ },
25
+ {
26
+ "cell_type": "code",
27
+ "execution_count": 1,
28
+ "id": "fb34fc04",
29
+ "metadata": {},
30
+ "outputs": [],
31
+ "source": [
32
+ "import os\n",
33
+ "import numpy as np\n",
34
+ "import pandas as pd\n",
35
+ "import matplotlib.pyplot as plt\n",
36
+ "import seaborn as sns"
37
+ ]
38
+ },
39
+ {
40
+ "cell_type": "markdown",
41
+ "id": "60a47b8f",
42
+ "metadata": {},
43
+ "source": [
44
+ "## Loading the Data"
45
+ ]
46
+ },
47
+ {
48
+ "cell_type": "code",
49
+ "execution_count": 2,
50
+ "id": "22801041",
51
+ "metadata": {},
52
+ "outputs": [],
53
+ "source": [
54
+ "movies = pd.read_csv('tmdb_5000_movies.csv')\n",
55
+ "credits = pd.read_csv('tmdb_5000_credits.csv')"
56
+ ]
57
+ },
58
+ {
59
+ "cell_type": "code",
60
+ "execution_count": 3,
61
+ "id": "365e7f1f",
62
+ "metadata": {},
63
+ "outputs": [
64
+ {
65
+ "data": {
66
+ "text/html": [
67
+ "<div>\n",
68
+ "<style scoped>\n",
69
+ " .dataframe tbody tr th:only-of-type {\n",
70
+ " vertical-align: middle;\n",
71
+ " }\n",
72
+ "\n",
73
+ " .dataframe tbody tr th {\n",
74
+ " vertical-align: top;\n",
75
+ " }\n",
76
+ "\n",
77
+ " .dataframe thead th {\n",
78
+ " text-align: right;\n",
79
+ " }\n",
80
+ "</style>\n",
81
+ "<table border=\"1\" class=\"dataframe\">\n",
82
+ " <thead>\n",
83
+ " <tr style=\"text-align: right;\">\n",
84
+ " <th></th>\n",
85
+ " <th>budget</th>\n",
86
+ " <th>genres</th>\n",
87
+ " <th>homepage</th>\n",
88
+ " <th>id</th>\n",
89
+ " <th>keywords</th>\n",
90
+ " <th>original_language</th>\n",
91
+ " <th>original_title</th>\n",
92
+ " <th>overview</th>\n",
93
+ " <th>popularity</th>\n",
94
+ " <th>production_companies</th>\n",
95
+ " <th>production_countries</th>\n",
96
+ " <th>release_date</th>\n",
97
+ " <th>revenue</th>\n",
98
+ " <th>runtime</th>\n",
99
+ " <th>spoken_languages</th>\n",
100
+ " <th>status</th>\n",
101
+ " <th>tagline</th>\n",
102
+ " <th>title</th>\n",
103
+ " <th>vote_average</th>\n",
104
+ " <th>vote_count</th>\n",
105
+ " </tr>\n",
106
+ " </thead>\n",
107
+ " <tbody>\n",
108
+ " <tr>\n",
109
+ " <th>0</th>\n",
110
+ " <td>237000000</td>\n",
111
+ " <td>[{\"id\": 28, \"name\": \"Action\"}, {\"id\": 12, \"nam...</td>\n",
112
+ " <td>http://www.avatarmovie.com/</td>\n",
113
+ " <td>19995</td>\n",
114
+ " <td>[{\"id\": 1463, \"name\": \"culture clash\"}, {\"id\":...</td>\n",
115
+ " <td>en</td>\n",
116
+ " <td>Avatar</td>\n",
117
+ " <td>In the 22nd century, a paraplegic Marine is di...</td>\n",
118
+ " <td>150.437577</td>\n",
119
+ " <td>[{\"name\": \"Ingenious Film Partners\", \"id\": 289...</td>\n",
120
+ " <td>[{\"iso_3166_1\": \"US\", \"name\": \"United States o...</td>\n",
121
+ " <td>2009-12-10</td>\n",
122
+ " <td>2787965087</td>\n",
123
+ " <td>162.0</td>\n",
124
+ " <td>[{\"iso_639_1\": \"en\", \"name\": \"English\"}, {\"iso...</td>\n",
125
+ " <td>Released</td>\n",
126
+ " <td>Enter the World of Pandora.</td>\n",
127
+ " <td>Avatar</td>\n",
128
+ " <td>7.2</td>\n",
129
+ " <td>11800</td>\n",
130
+ " </tr>\n",
131
+ " <tr>\n",
132
+ " <th>1</th>\n",
133
+ " <td>300000000</td>\n",
134
+ " <td>[{\"id\": 12, \"name\": \"Adventure\"}, {\"id\": 14, \"...</td>\n",
135
+ " <td>http://disney.go.com/disneypictures/pirates/</td>\n",
136
+ " <td>285</td>\n",
137
+ " <td>[{\"id\": 270, \"name\": \"ocean\"}, {\"id\": 726, \"na...</td>\n",
138
+ " <td>en</td>\n",
139
+ " <td>Pirates of the Caribbean: At World's End</td>\n",
140
+ " <td>Captain Barbossa, long believed to be dead, ha...</td>\n",
141
+ " <td>139.082615</td>\n",
142
+ " <td>[{\"name\": \"Walt Disney Pictures\", \"id\": 2}, {\"...</td>\n",
143
+ " <td>[{\"iso_3166_1\": \"US\", \"name\": \"United States o...</td>\n",
144
+ " <td>2007-05-19</td>\n",
145
+ " <td>961000000</td>\n",
146
+ " <td>169.0</td>\n",
147
+ " <td>[{\"iso_639_1\": \"en\", \"name\": \"English\"}]</td>\n",
148
+ " <td>Released</td>\n",
149
+ " <td>At the end of the world, the adventure begins.</td>\n",
150
+ " <td>Pirates of the Caribbean: At World's End</td>\n",
151
+ " <td>6.9</td>\n",
152
+ " <td>4500</td>\n",
153
+ " </tr>\n",
154
+ " <tr>\n",
155
+ " <th>2</th>\n",
156
+ " <td>245000000</td>\n",
157
+ " <td>[{\"id\": 28, \"name\": \"Action\"}, {\"id\": 12, \"nam...</td>\n",
158
+ " <td>http://www.sonypictures.com/movies/spectre/</td>\n",
159
+ " <td>206647</td>\n",
160
+ " <td>[{\"id\": 470, \"name\": \"spy\"}, {\"id\": 818, \"name...</td>\n",
161
+ " <td>en</td>\n",
162
+ " <td>Spectre</td>\n",
163
+ " <td>A cryptic message from Bond’s past sends him o...</td>\n",
164
+ " <td>107.376788</td>\n",
165
+ " <td>[{\"name\": \"Columbia Pictures\", \"id\": 5}, {\"nam...</td>\n",
166
+ " <td>[{\"iso_3166_1\": \"GB\", \"name\": \"United Kingdom\"...</td>\n",
167
+ " <td>2015-10-26</td>\n",
168
+ " <td>880674609</td>\n",
169
+ " <td>148.0</td>\n",
170
+ " <td>[{\"iso_639_1\": \"fr\", \"name\": \"Fran\\u00e7ais\"},...</td>\n",
171
+ " <td>Released</td>\n",
172
+ " <td>A Plan No One Escapes</td>\n",
173
+ " <td>Spectre</td>\n",
174
+ " <td>6.3</td>\n",
175
+ " <td>4466</td>\n",
176
+ " </tr>\n",
177
+ " <tr>\n",
178
+ " <th>3</th>\n",
179
+ " <td>250000000</td>\n",
180
+ " <td>[{\"id\": 28, \"name\": \"Action\"}, {\"id\": 80, \"nam...</td>\n",
181
+ " <td>http://www.thedarkknightrises.com/</td>\n",
182
+ " <td>49026</td>\n",
183
+ " <td>[{\"id\": 849, \"name\": \"dc comics\"}, {\"id\": 853,...</td>\n",
184
+ " <td>en</td>\n",
185
+ " <td>The Dark Knight Rises</td>\n",
186
+ " <td>Following the death of District Attorney Harve...</td>\n",
187
+ " <td>112.312950</td>\n",
188
+ " <td>[{\"name\": \"Legendary Pictures\", \"id\": 923}, {\"...</td>\n",
189
+ " <td>[{\"iso_3166_1\": \"US\", \"name\": \"United States o...</td>\n",
190
+ " <td>2012-07-16</td>\n",
191
+ " <td>1084939099</td>\n",
192
+ " <td>165.0</td>\n",
193
+ " <td>[{\"iso_639_1\": \"en\", \"name\": \"English\"}]</td>\n",
194
+ " <td>Released</td>\n",
195
+ " <td>The Legend Ends</td>\n",
196
+ " <td>The Dark Knight Rises</td>\n",
197
+ " <td>7.6</td>\n",
198
+ " <td>9106</td>\n",
199
+ " </tr>\n",
200
+ " <tr>\n",
201
+ " <th>4</th>\n",
202
+ " <td>260000000</td>\n",
203
+ " <td>[{\"id\": 28, \"name\": \"Action\"}, {\"id\": 12, \"nam...</td>\n",
204
+ " <td>http://movies.disney.com/john-carter</td>\n",
205
+ " <td>49529</td>\n",
206
+ " <td>[{\"id\": 818, \"name\": \"based on novel\"}, {\"id\":...</td>\n",
207
+ " <td>en</td>\n",
208
+ " <td>John Carter</td>\n",
209
+ " <td>John Carter is a war-weary, former military ca...</td>\n",
210
+ " <td>43.926995</td>\n",
211
+ " <td>[{\"name\": \"Walt Disney Pictures\", \"id\": 2}]</td>\n",
212
+ " <td>[{\"iso_3166_1\": \"US\", \"name\": \"United States o...</td>\n",
213
+ " <td>2012-03-07</td>\n",
214
+ " <td>284139100</td>\n",
215
+ " <td>132.0</td>\n",
216
+ " <td>[{\"iso_639_1\": \"en\", \"name\": \"English\"}]</td>\n",
217
+ " <td>Released</td>\n",
218
+ " <td>Lost in our world, found in another.</td>\n",
219
+ " <td>John Carter</td>\n",
220
+ " <td>6.1</td>\n",
221
+ " <td>2124</td>\n",
222
+ " </tr>\n",
223
+ " </tbody>\n",
224
+ "</table>\n",
225
+ "</div>"
226
+ ],
227
+ "text/plain": [
228
+ " budget genres \\\n",
229
+ "0 237000000 [{\"id\": 28, \"name\": \"Action\"}, {\"id\": 12, \"nam... \n",
230
+ "1 300000000 [{\"id\": 12, \"name\": \"Adventure\"}, {\"id\": 14, \"... \n",
231
+ "2 245000000 [{\"id\": 28, \"name\": \"Action\"}, {\"id\": 12, \"nam... \n",
232
+ "3 250000000 [{\"id\": 28, \"name\": \"Action\"}, {\"id\": 80, \"nam... \n",
233
+ "4 260000000 [{\"id\": 28, \"name\": \"Action\"}, {\"id\": 12, \"nam... \n",
234
+ "\n",
235
+ " homepage id \\\n",
236
+ "0 http://www.avatarmovie.com/ 19995 \n",
237
+ "1 http://disney.go.com/disneypictures/pirates/ 285 \n",
238
+ "2 http://www.sonypictures.com/movies/spectre/ 206647 \n",
239
+ "3 http://www.thedarkknightrises.com/ 49026 \n",
240
+ "4 http://movies.disney.com/john-carter 49529 \n",
241
+ "\n",
242
+ " keywords original_language \\\n",
243
+ "0 [{\"id\": 1463, \"name\": \"culture clash\"}, {\"id\":... en \n",
244
+ "1 [{\"id\": 270, \"name\": \"ocean\"}, {\"id\": 726, \"na... en \n",
245
+ "2 [{\"id\": 470, \"name\": \"spy\"}, {\"id\": 818, \"name... en \n",
246
+ "3 [{\"id\": 849, \"name\": \"dc comics\"}, {\"id\": 853,... en \n",
247
+ "4 [{\"id\": 818, \"name\": \"based on novel\"}, {\"id\":... en \n",
248
+ "\n",
249
+ " original_title \\\n",
250
+ "0 Avatar \n",
251
+ "1 Pirates of the Caribbean: At World's End \n",
252
+ "2 Spectre \n",
253
+ "3 The Dark Knight Rises \n",
254
+ "4 John Carter \n",
255
+ "\n",
256
+ " overview popularity \\\n",
257
+ "0 In the 22nd century, a paraplegic Marine is di... 150.437577 \n",
258
+ "1 Captain Barbossa, long believed to be dead, ha... 139.082615 \n",
259
+ "2 A cryptic message from Bond’s past sends him o... 107.376788 \n",
260
+ "3 Following the death of District Attorney Harve... 112.312950 \n",
261
+ "4 John Carter is a war-weary, former military ca... 43.926995 \n",
262
+ "\n",
263
+ " production_companies \\\n",
264
+ "0 [{\"name\": \"Ingenious Film Partners\", \"id\": 289... \n",
265
+ "1 [{\"name\": \"Walt Disney Pictures\", \"id\": 2}, {\"... \n",
266
+ "2 [{\"name\": \"Columbia Pictures\", \"id\": 5}, {\"nam... \n",
267
+ "3 [{\"name\": \"Legendary Pictures\", \"id\": 923}, {\"... \n",
268
+ "4 [{\"name\": \"Walt Disney Pictures\", \"id\": 2}] \n",
269
+ "\n",
270
+ " production_countries release_date revenue \\\n",
271
+ "0 [{\"iso_3166_1\": \"US\", \"name\": \"United States o... 2009-12-10 2787965087 \n",
272
+ "1 [{\"iso_3166_1\": \"US\", \"name\": \"United States o... 2007-05-19 961000000 \n",
273
+ "2 [{\"iso_3166_1\": \"GB\", \"name\": \"United Kingdom\"... 2015-10-26 880674609 \n",
274
+ "3 [{\"iso_3166_1\": \"US\", \"name\": \"United States o... 2012-07-16 1084939099 \n",
275
+ "4 [{\"iso_3166_1\": \"US\", \"name\": \"United States o... 2012-03-07 284139100 \n",
276
+ "\n",
277
+ " runtime spoken_languages status \\\n",
278
+ "0 162.0 [{\"iso_639_1\": \"en\", \"name\": \"English\"}, {\"iso... Released \n",
279
+ "1 169.0 [{\"iso_639_1\": \"en\", \"name\": \"English\"}] Released \n",
280
+ "2 148.0 [{\"iso_639_1\": \"fr\", \"name\": \"Fran\\u00e7ais\"},... Released \n",
281
+ "3 165.0 [{\"iso_639_1\": \"en\", \"name\": \"English\"}] Released \n",
282
+ "4 132.0 [{\"iso_639_1\": \"en\", \"name\": \"English\"}] Released \n",
283
+ "\n",
284
+ " tagline \\\n",
285
+ "0 Enter the World of Pandora. \n",
286
+ "1 At the end of the world, the adventure begins. \n",
287
+ "2 A Plan No One Escapes \n",
288
+ "3 The Legend Ends \n",
289
+ "4 Lost in our world, found in another. \n",
290
+ "\n",
291
+ " title vote_average vote_count \n",
292
+ "0 Avatar 7.2 11800 \n",
293
+ "1 Pirates of the Caribbean: At World's End 6.9 4500 \n",
294
+ "2 Spectre 6.3 4466 \n",
295
+ "3 The Dark Knight Rises 7.6 9106 \n",
296
+ "4 John Carter 6.1 2124 "
297
+ ]
298
+ },
299
+ "execution_count": 3,
300
+ "metadata": {},
301
+ "output_type": "execute_result"
302
+ }
303
+ ],
304
+ "source": [
305
+ "movies.head(5)"
306
+ ]
307
+ },
308
+ {
309
+ "cell_type": "code",
310
+ "execution_count": 4,
311
+ "id": "f161a85b",
312
+ "metadata": {},
313
+ "outputs": [
314
+ {
315
+ "data": {
316
+ "text/html": [
317
+ "<div>\n",
318
+ "<style scoped>\n",
319
+ " .dataframe tbody tr th:only-of-type {\n",
320
+ " vertical-align: middle;\n",
321
+ " }\n",
322
+ "\n",
323
+ " .dataframe tbody tr th {\n",
324
+ " vertical-align: top;\n",
325
+ " }\n",
326
+ "\n",
327
+ " .dataframe thead th {\n",
328
+ " text-align: right;\n",
329
+ " }\n",
330
+ "</style>\n",
331
+ "<table border=\"1\" class=\"dataframe\">\n",
332
+ " <thead>\n",
333
+ " <tr style=\"text-align: right;\">\n",
334
+ " <th></th>\n",
335
+ " <th>movie_id</th>\n",
336
+ " <th>title</th>\n",
337
+ " <th>cast</th>\n",
338
+ " <th>crew</th>\n",
339
+ " </tr>\n",
340
+ " </thead>\n",
341
+ " <tbody>\n",
342
+ " <tr>\n",
343
+ " <th>0</th>\n",
344
+ " <td>19995</td>\n",
345
+ " <td>Avatar</td>\n",
346
+ " <td>[{\"cast_id\": 242, \"character\": \"Jake Sully\", \"...</td>\n",
347
+ " <td>[{\"credit_id\": \"52fe48009251416c750aca23\", \"de...</td>\n",
348
+ " </tr>\n",
349
+ " <tr>\n",
350
+ " <th>1</th>\n",
351
+ " <td>285</td>\n",
352
+ " <td>Pirates of the Caribbean: At World's End</td>\n",
353
+ " <td>[{\"cast_id\": 4, \"character\": \"Captain Jack Spa...</td>\n",
354
+ " <td>[{\"credit_id\": \"52fe4232c3a36847f800b579\", \"de...</td>\n",
355
+ " </tr>\n",
356
+ " <tr>\n",
357
+ " <th>2</th>\n",
358
+ " <td>206647</td>\n",
359
+ " <td>Spectre</td>\n",
360
+ " <td>[{\"cast_id\": 1, \"character\": \"James Bond\", \"cr...</td>\n",
361
+ " <td>[{\"credit_id\": \"54805967c3a36829b5002c41\", \"de...</td>\n",
362
+ " </tr>\n",
363
+ " <tr>\n",
364
+ " <th>3</th>\n",
365
+ " <td>49026</td>\n",
366
+ " <td>The Dark Knight Rises</td>\n",
367
+ " <td>[{\"cast_id\": 2, \"character\": \"Bruce Wayne / Ba...</td>\n",
368
+ " <td>[{\"credit_id\": \"52fe4781c3a36847f81398c3\", \"de...</td>\n",
369
+ " </tr>\n",
370
+ " <tr>\n",
371
+ " <th>4</th>\n",
372
+ " <td>49529</td>\n",
373
+ " <td>John Carter</td>\n",
374
+ " <td>[{\"cast_id\": 5, \"character\": \"John Carter\", \"c...</td>\n",
375
+ " <td>[{\"credit_id\": \"52fe479ac3a36847f813eaa3\", \"de...</td>\n",
376
+ " </tr>\n",
377
+ " </tbody>\n",
378
+ "</table>\n",
379
+ "</div>"
380
+ ],
381
+ "text/plain": [
382
+ " movie_id title \\\n",
383
+ "0 19995 Avatar \n",
384
+ "1 285 Pirates of the Caribbean: At World's End \n",
385
+ "2 206647 Spectre \n",
386
+ "3 49026 The Dark Knight Rises \n",
387
+ "4 49529 John Carter \n",
388
+ "\n",
389
+ " cast \\\n",
390
+ "0 [{\"cast_id\": 242, \"character\": \"Jake Sully\", \"... \n",
391
+ "1 [{\"cast_id\": 4, \"character\": \"Captain Jack Spa... \n",
392
+ "2 [{\"cast_id\": 1, \"character\": \"James Bond\", \"cr... \n",
393
+ "3 [{\"cast_id\": 2, \"character\": \"Bruce Wayne / Ba... \n",
394
+ "4 [{\"cast_id\": 5, \"character\": \"John Carter\", \"c... \n",
395
+ "\n",
396
+ " crew \n",
397
+ "0 [{\"credit_id\": \"52fe48009251416c750aca23\", \"de... \n",
398
+ "1 [{\"credit_id\": \"52fe4232c3a36847f800b579\", \"de... \n",
399
+ "2 [{\"credit_id\": \"54805967c3a36829b5002c41\", \"de... \n",
400
+ "3 [{\"credit_id\": \"52fe4781c3a36847f81398c3\", \"de... \n",
401
+ "4 [{\"credit_id\": \"52fe479ac3a36847f813eaa3\", \"de... "
402
+ ]
403
+ },
404
+ "execution_count": 4,
405
+ "metadata": {},
406
+ "output_type": "execute_result"
407
+ }
408
+ ],
409
+ "source": [
410
+ "credits.head(5)"
411
+ ]
412
+ },
413
+ {
414
+ "cell_type": "code",
415
+ "execution_count": 5,
416
+ "id": "850bed94",
417
+ "metadata": {},
418
+ "outputs": [
419
+ {
420
+ "name": "stdout",
421
+ "output_type": "stream",
422
+ "text": [
423
+ "<class 'pandas.core.frame.DataFrame'>\n",
424
+ "RangeIndex: 4803 entries, 0 to 4802\n",
425
+ "Data columns (total 20 columns):\n",
426
+ " # Column Non-Null Count Dtype \n",
427
+ "--- ------ -------------- ----- \n",
428
+ " 0 budget 4803 non-null int64 \n",
429
+ " 1 genres 4803 non-null object \n",
430
+ " 2 homepage 1712 non-null object \n",
431
+ " 3 id 4803 non-null int64 \n",
432
+ " 4 keywords 4803 non-null object \n",
433
+ " 5 original_language 4803 non-null object \n",
434
+ " 6 original_title 4803 non-null object \n",
435
+ " 7 overview 4800 non-null object \n",
436
+ " 8 popularity 4803 non-null float64\n",
437
+ " 9 production_companies 4803 non-null object \n",
438
+ " 10 production_countries 4803 non-null object \n",
439
+ " 11 release_date 4802 non-null object \n",
440
+ " 12 revenue 4803 non-null int64 \n",
441
+ " 13 runtime 4801 non-null float64\n",
442
+ " 14 spoken_languages 4803 non-null object \n",
443
+ " 15 status 4803 non-null object \n",
444
+ " 16 tagline 3959 non-null object \n",
445
+ " 17 title 4803 non-null object \n",
446
+ " 18 vote_average 4803 non-null float64\n",
447
+ " 19 vote_count 4803 non-null int64 \n",
448
+ "dtypes: float64(3), int64(4), object(13)\n",
449
+ "memory usage: 750.6+ KB\n"
450
+ ]
451
+ }
452
+ ],
453
+ "source": [
454
+ "movies.info()"
455
+ ]
456
+ },
457
+ {
458
+ "cell_type": "code",
459
+ "execution_count": 6,
460
+ "id": "a6859a8a",
461
+ "metadata": {},
462
+ "outputs": [
463
+ {
464
+ "name": "stdout",
465
+ "output_type": "stream",
466
+ "text": [
467
+ "<class 'pandas.core.frame.DataFrame'>\n",
468
+ "RangeIndex: 4803 entries, 0 to 4802\n",
469
+ "Data columns (total 4 columns):\n",
470
+ " # Column Non-Null Count Dtype \n",
471
+ "--- ------ -------------- ----- \n",
472
+ " 0 movie_id 4803 non-null int64 \n",
473
+ " 1 title 4803 non-null object\n",
474
+ " 2 cast 4803 non-null object\n",
475
+ " 3 crew 4803 non-null object\n",
476
+ "dtypes: int64(1), object(3)\n",
477
+ "memory usage: 150.2+ KB\n"
478
+ ]
479
+ }
480
+ ],
481
+ "source": [
482
+ "credits.info()"
483
+ ]
484
+ },
485
+ {
486
+ "cell_type": "markdown",
487
+ "id": "1f1a4038",
488
+ "metadata": {},
489
+ "source": [
490
+ "## Merging both dataframes: Movies & Credits"
491
+ ]
492
+ },
493
+ {
494
+ "cell_type": "code",
495
+ "execution_count": 7,
496
+ "id": "26168071",
497
+ "metadata": {},
498
+ "outputs": [],
499
+ "source": [
500
+ "movies = movies.merge(credits,on='title')"
501
+ ]
502
+ },
503
+ {
504
+ "cell_type": "code",
505
+ "execution_count": 8,
506
+ "id": "28d61e2b",
507
+ "metadata": {},
508
+ "outputs": [
509
+ {
510
+ "data": {
511
+ "text/plain": [
512
+ "(4809, 23)"
513
+ ]
514
+ },
515
+ "execution_count": 8,
516
+ "metadata": {},
517
+ "output_type": "execute_result"
518
+ }
519
+ ],
520
+ "source": [
521
+ "movies.shape"
522
+ ]
523
+ },
524
+ {
525
+ "cell_type": "markdown",
526
+ "id": "958ed596",
527
+ "metadata": {},
528
+ "source": [
529
+ "## Data Pre-Processing"
530
+ ]
531
+ },
532
+ {
533
+ "cell_type": "markdown",
534
+ "id": "6cca2e2b",
535
+ "metadata": {},
536
+ "source": [
537
+ "## Important columns to be used in the recommendation system:\n",
538
+ "\n",
539
+ "* genres\n",
540
+ "* movie_id\n",
541
+ "* keywords\n",
542
+ "* title\n",
543
+ "* overview\n",
544
+ "* cast\n",
545
+ "* crew \n",
546
+ " \n",
547
+ "We keep only these columns and derive the features (tags) used for recommendation from them."
548
+ ]
549
+ },
550
+ {
551
+ "cell_type": "code",
552
+ "execution_count": 9,
553
+ "id": "0cb8d18f",
554
+ "metadata": {},
555
+ "outputs": [],
556
+ "source": [
557
+ "movies = movies[['movie_id','title','overview','genres','cast','keywords','crew']]"
558
+ ]
559
+ },
560
+ {
561
+ "cell_type": "code",
562
+ "execution_count": 10,
563
+ "id": "7cbdf69e",
564
+ "metadata": {},
565
+ "outputs": [
566
+ {
567
+ "data": {
568
+ "text/html": [
569
+ "<div>\n",
570
+ "<style scoped>\n",
571
+ " .dataframe tbody tr th:only-of-type {\n",
572
+ " vertical-align: middle;\n",
573
+ " }\n",
574
+ "\n",
575
+ " .dataframe tbody tr th {\n",
576
+ " vertical-align: top;\n",
577
+ " }\n",
578
+ "\n",
579
+ " .dataframe thead th {\n",
580
+ " text-align: right;\n",
581
+ " }\n",
582
+ "</style>\n",
583
+ "<table border=\"1\" class=\"dataframe\">\n",
584
+ " <thead>\n",
585
+ " <tr style=\"text-align: right;\">\n",
586
+ " <th></th>\n",
587
+ " <th>movie_id</th>\n",
588
+ " <th>title</th>\n",
589
+ " <th>overview</th>\n",
590
+ " <th>genres</th>\n",
591
+ " <th>cast</th>\n",
592
+ " <th>keywords</th>\n",
593
+ " <th>crew</th>\n",
594
+ " </tr>\n",
595
+ " </thead>\n",
596
+ " <tbody>\n",
597
+ " <tr>\n",
598
+ " <th>0</th>\n",
599
+ " <td>19995</td>\n",
600
+ " <td>Avatar</td>\n",
601
+ " <td>In the 22nd century, a paraplegic Marine is di...</td>\n",
602
+ " <td>[{\"id\": 28, \"name\": \"Action\"}, {\"id\": 12, \"nam...</td>\n",
603
+ " <td>[{\"cast_id\": 242, \"character\": \"Jake Sully\", \"...</td>\n",
604
+ " <td>[{\"id\": 1463, \"name\": \"culture clash\"}, {\"id\":...</td>\n",
605
+ " <td>[{\"credit_id\": \"52fe48009251416c750aca23\", \"de...</td>\n",
606
+ " </tr>\n",
607
+ " <tr>\n",
608
+ " <th>1</th>\n",
609
+ " <td>285</td>\n",
610
+ " <td>Pirates of the Caribbean: At World's End</td>\n",
611
+ " <td>Captain Barbossa, long believed to be dead, ha...</td>\n",
612
+ " <td>[{\"id\": 12, \"name\": \"Adventure\"}, {\"id\": 14, \"...</td>\n",
613
+ " <td>[{\"cast_id\": 4, \"character\": \"Captain Jack Spa...</td>\n",
614
+ " <td>[{\"id\": 270, \"name\": \"ocean\"}, {\"id\": 726, \"na...</td>\n",
615
+ " <td>[{\"credit_id\": \"52fe4232c3a36847f800b579\", \"de...</td>\n",
616
+ " </tr>\n",
617
+ " <tr>\n",
618
+ " <th>2</th>\n",
619
+ " <td>206647</td>\n",
620
+ " <td>Spectre</td>\n",
621
+ " <td>A cryptic message from Bond’s past sends him o...</td>\n",
622
+ " <td>[{\"id\": 28, \"name\": \"Action\"}, {\"id\": 12, \"nam...</td>\n",
623
+ " <td>[{\"cast_id\": 1, \"character\": \"James Bond\", \"cr...</td>\n",
624
+ " <td>[{\"id\": 470, \"name\": \"spy\"}, {\"id\": 818, \"name...</td>\n",
625
+ " <td>[{\"credit_id\": \"54805967c3a36829b5002c41\", \"de...</td>\n",
626
+ " </tr>\n",
627
+ " <tr>\n",
628
+ " <th>3</th>\n",
629
+ " <td>49026</td>\n",
630
+ " <td>The Dark Knight Rises</td>\n",
631
+ " <td>Following the death of District Attorney Harve...</td>\n",
632
+ " <td>[{\"id\": 28, \"name\": \"Action\"}, {\"id\": 80, \"nam...</td>\n",
633
+ " <td>[{\"cast_id\": 2, \"character\": \"Bruce Wayne / Ba...</td>\n",
634
+ " <td>[{\"id\": 849, \"name\": \"dc comics\"}, {\"id\": 853,...</td>\n",
635
+ " <td>[{\"credit_id\": \"52fe4781c3a36847f81398c3\", \"de...</td>\n",
636
+ " </tr>\n",
637
+ " <tr>\n",
638
+ " <th>4</th>\n",
639
+ " <td>49529</td>\n",
640
+ " <td>John Carter</td>\n",
641
+ " <td>John Carter is a war-weary, former military ca...</td>\n",
642
+ " <td>[{\"id\": 28, \"name\": \"Action\"}, {\"id\": 12, \"nam...</td>\n",
643
+ " <td>[{\"cast_id\": 5, \"character\": \"John Carter\", \"c...</td>\n",
644
+ " <td>[{\"id\": 818, \"name\": \"based on novel\"}, {\"id\":...</td>\n",
645
+ " <td>[{\"credit_id\": \"52fe479ac3a36847f813eaa3\", \"de...</td>\n",
646
+ " </tr>\n",
647
+ " </tbody>\n",
648
+ "</table>\n",
649
+ "</div>"
650
+ ],
651
+ "text/plain": [
652
+ " movie_id title \\\n",
653
+ "0 19995 Avatar \n",
654
+ "1 285 Pirates of the Caribbean: At World's End \n",
655
+ "2 206647 Spectre \n",
656
+ "3 49026 The Dark Knight Rises \n",
657
+ "4 49529 John Carter \n",
658
+ "\n",
659
+ " overview \\\n",
660
+ "0 In the 22nd century, a paraplegic Marine is di... \n",
661
+ "1 Captain Barbossa, long believed to be dead, ha... \n",
662
+ "2 A cryptic message from Bond’s past sends him o... \n",
663
+ "3 Following the death of District Attorney Harve... \n",
664
+ "4 John Carter is a war-weary, former military ca... \n",
665
+ "\n",
666
+ " genres \\\n",
667
+ "0 [{\"id\": 28, \"name\": \"Action\"}, {\"id\": 12, \"nam... \n",
668
+ "1 [{\"id\": 12, \"name\": \"Adventure\"}, {\"id\": 14, \"... \n",
669
+ "2 [{\"id\": 28, \"name\": \"Action\"}, {\"id\": 12, \"nam... \n",
670
+ "3 [{\"id\": 28, \"name\": \"Action\"}, {\"id\": 80, \"nam... \n",
671
+ "4 [{\"id\": 28, \"name\": \"Action\"}, {\"id\": 12, \"nam... \n",
672
+ "\n",
673
+ " cast \\\n",
674
+ "0 [{\"cast_id\": 242, \"character\": \"Jake Sully\", \"... \n",
675
+ "1 [{\"cast_id\": 4, \"character\": \"Captain Jack Spa... \n",
676
+ "2 [{\"cast_id\": 1, \"character\": \"James Bond\", \"cr... \n",
677
+ "3 [{\"cast_id\": 2, \"character\": \"Bruce Wayne / Ba... \n",
678
+ "4 [{\"cast_id\": 5, \"character\": \"John Carter\", \"c... \n",
679
+ "\n",
680
+ " keywords \\\n",
681
+ "0 [{\"id\": 1463, \"name\": \"culture clash\"}, {\"id\":... \n",
682
+ "1 [{\"id\": 270, \"name\": \"ocean\"}, {\"id\": 726, \"na... \n",
683
+ "2 [{\"id\": 470, \"name\": \"spy\"}, {\"id\": 818, \"name... \n",
684
+ "3 [{\"id\": 849, \"name\": \"dc comics\"}, {\"id\": 853,... \n",
685
+ "4 [{\"id\": 818, \"name\": \"based on novel\"}, {\"id\":... \n",
686
+ "\n",
687
+ " crew \n",
688
+ "0 [{\"credit_id\": \"52fe48009251416c750aca23\", \"de... \n",
689
+ "1 [{\"credit_id\": \"52fe4232c3a36847f800b579\", \"de... \n",
690
+ "2 [{\"credit_id\": \"54805967c3a36829b5002c41\", \"de... \n",
691
+ "3 [{\"credit_id\": \"52fe4781c3a36847f81398c3\", \"de... \n",
692
+ "4 [{\"credit_id\": \"52fe479ac3a36847f813eaa3\", \"de... "
693
+ ]
694
+ },
695
+ "execution_count": 10,
696
+ "metadata": {},
697
+ "output_type": "execute_result"
698
+ }
699
+ ],
700
+ "source": [
701
+ "movies.head(5)"
702
+ ]
703
+ },
704
+ {
705
+ "cell_type": "markdown",
706
+ "id": "47a979b2",
707
+ "metadata": {},
708
+ "source": [
709
+ "## Missing Values"
710
+ ]
711
+ },
712
+ {
713
+ "cell_type": "code",
714
+ "execution_count": 11,
715
+ "id": "7d57cfb0",
716
+ "metadata": {},
717
+ "outputs": [
718
+ {
719
+ "data": {
720
+ "text/plain": [
721
+ "movie_id 0\n",
722
+ "title 0\n",
723
+ "overview 3\n",
724
+ "genres 0\n",
725
+ "cast 0\n",
726
+ "keywords 0\n",
727
+ "crew 0\n",
728
+ "dtype: int64"
729
+ ]
730
+ },
731
+ "execution_count": 11,
732
+ "metadata": {},
733
+ "output_type": "execute_result"
734
+ }
735
+ ],
736
+ "source": [
737
+ "#Checking for Missing Values\n",
738
+ "movies.isnull().sum()\n",
739
+ " "
740
+ ]
741
+ },
742
+ {
743
+ "cell_type": "code",
744
+ "execution_count": 12,
745
+ "id": "e43e3fc0",
746
+ "metadata": {},
747
+ "outputs": [],
748
+ "source": [
749
+ "#Dropping the missing values\n",
750
+ "movies.dropna(inplace=True)"
751
+ ]
752
+ },
753
+ {
754
+ "cell_type": "code",
755
+ "execution_count": 13,
756
+ "id": "4165f068",
757
+ "metadata": {},
758
+ "outputs": [
759
+ {
760
+ "data": {
761
+ "text/plain": [
762
+ "movie_id 0\n",
763
+ "title 0\n",
764
+ "overview 0\n",
765
+ "genres 0\n",
766
+ "cast 0\n",
767
+ "keywords 0\n",
768
+ "crew 0\n",
769
+ "dtype: int64"
770
+ ]
771
+ },
772
+ "execution_count": 13,
773
+ "metadata": {},
774
+ "output_type": "execute_result"
775
+ }
776
+ ],
777
+ "source": [
778
+ "#Checking again after dropping the missing values\n",
779
+ "movies.isnull().sum()"
780
+ ]
781
+ },
782
+ {
783
+ "cell_type": "code",
784
+ "execution_count": 14,
785
+ "id": "7043f76e",
786
+ "metadata": {},
787
+ "outputs": [
788
+ {
789
+ "data": {
790
+ "text/plain": [
791
+ "0"
792
+ ]
793
+ },
794
+ "execution_count": 14,
795
+ "metadata": {},
796
+ "output_type": "execute_result"
797
+ }
798
+ ],
799
+ "source": [
800
+ "#Checking for any duplicated rows in the data\n",
801
+ "movies.duplicated().sum()"
802
+ ]
803
+ },
804
+ {
805
+ "cell_type": "code",
806
+ "execution_count": 15,
807
+ "id": "a7bd44cd",
808
+ "metadata": {},
809
+ "outputs": [
810
+ {
811
+ "data": {
812
+ "text/plain": [
813
+ "'[{\"id\": 28, \"name\": \"Action\"}, {\"id\": 12, \"name\": \"Adventure\"}, {\"id\": 14, \"name\": \"Fantasy\"}, {\"id\": 878, \"name\": \"Science Fiction\"}]'"
814
+ ]
815
+ },
816
+ "execution_count": 15,
817
+ "metadata": {},
818
+ "output_type": "execute_result"
819
+ }
820
+ ],
821
+ "source": [
822
+ "#Inspecting the raw genres value of the first row (index position 0)\n",
823
+ "movies.iloc[0].genres"
824
+ ]
825
+ },
826
+ {
827
+ "cell_type": "code",
828
+ "execution_count": 16,
829
+ "id": "e69162a1",
830
+ "metadata": {},
831
+ "outputs": [],
832
+ "source": [
833
+ "#The ast (Abstract Syntax Trees) module is imported for ast.literal_eval, which safely parses\n",
833
+ "#the stringified lists of dictionaries stored in the genres, keywords, cast and crew columns.\n",
835
+ "import ast"
836
+ ]
837
+ },
838
+ {
839
+ "cell_type": "markdown",
840
+ "id": "eaa1ec3f",
841
+ "metadata": {},
842
+ "source": [
843
+ "ast.literal_eval evaluates only Python literals (strings, numbers, tuples, lists, dicts, sets, booleans and None) and raises an exception for anything else, so arbitrary code in the input is never executed."
844
+ ]
845
+ },
846
+ {
847
+ "cell_type": "markdown",
848
+ "id": "7a22e725",
849
+ "metadata": {},
850
+ "source": [
851
+ "### Function for extracting values from raw data for the creation of tags"
852
+ ]
853
+ },
854
+ {
855
+ "cell_type": "code",
856
+ "execution_count": 17,
857
+ "id": "2a5e99a2",
858
+ "metadata": {},
859
+ "outputs": [],
860
+ "source": [
861
+ "#Extracting genres,keywords from raw data for the creation of tags\n",
862
+ "#Creating a function convert\n",
863
+ "\n",
864
+ "def convert(obj):\n",
865
+ " L = []\n",
866
+ " for i in ast.literal_eval(obj):\n",
867
+ " L.append(i['name'])\n",
868
+ " return L"
869
+ ]
870
+ },
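A note on the `convert` cell above: the following is a minimal, standalone sketch of what `ast.literal_eval` does with one raw `genres` value. The sample string is the one shown in the cell output earlier; `extract_names` and its `limit` parameter are hypothetical names introduced here for illustration and are not part of the notebook.

```python
import ast

# Sample value copied from the genres cell output shown earlier in the notebook.
raw = ('[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, '
       '{"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]')


def extract_names(obj, limit=None):
    """Return the 'name' of each dict in a stringified list (hypothetical helper).

    limit=None keeps every name; limit=3 would mirror the top-3 cast extraction.
    """
    names = [item["name"] for item in ast.literal_eval(obj)]
    return names if limit is None else names[:limit]


genres = extract_names(raw)
top_two = extract_names(raw, limit=2)
print(genres)
print(top_two)
```

A single helper with an optional `limit` could, under these assumptions, cover both the "all names" and the "first three names" cases that the notebook handles with two separate functions.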
871
+ {
872
+ "cell_type": "markdown",
873
+ "id": "57e31dc7",
874
+ "metadata": {},
875
+ "source": [
876
+ "### Extracting Genres"
877
+ ]
878
+ },
879
+ {
880
+ "cell_type": "code",
881
+ "execution_count": 18,
882
+ "id": "94540349",
883
+ "metadata": {},
884
+ "outputs": [],
885
+ "source": [
886
+ "#Applying the convert function to genres column to extract the required data\n",
887
+ "movies['genres'] = movies['genres'].apply(convert)"
888
+ ]
889
+ },
890
+ {
891
+ "cell_type": "markdown",
892
+ "id": "ba5a12e5",
893
+ "metadata": {},
894
+ "source": [
895
+ "### Extracting Keywords"
896
+ ]
897
+ },
898
+ {
899
+ "cell_type": "code",
900
+ "execution_count": 19,
901
+ "id": "3f486c75",
902
+ "metadata": {},
903
+ "outputs": [],
904
+ "source": [
905
+ "#Applying the convert function to keyword column to extract the required data\n",
906
+ "movies['keywords'] = movies['keywords'].apply(convert)"
907
+ ]
908
+ },
909
+ {
910
+ "cell_type": "markdown",
911
+ "id": "ad28db1c",
912
+ "metadata": {},
913
+ "source": [
914
+ "### Function for extracting top 3 actors from the movie"
915
+ ]
916
+ },
917
+ {
918
+ "cell_type": "code",
919
+ "execution_count": 20,
920
+ "id": "2e4317e4",
921
+ "metadata": {},
922
+ "outputs": [],
923
+ "source": [
924
+ "# Creating a function for extracting top 3 actors from the movie \n",
925
+ " \n",
926
+ "def convert3(obj):\n",
927
+ " L=[]\n",
928
+ " counter=0\n",
929
+ " for i in ast.literal_eval(obj):\n",
930
+ " if counter !=3:\n",
931
+ " L.append(i['name'])\n",
932
+ " counter+=1\n",
933
+ " else:\n",
934
+ " break\n",
935
+ " return L"
936
+ ]
937
+ },
938
+ {
939
+ "cell_type": "code",
940
+ "execution_count": 21,
941
+ "id": "69bfef22",
942
+ "metadata": {},
943
+ "outputs": [],
944
+ "source": [
945
+ "#Applying the convert3 function to cast column to extract the required data\n",
946
+ "movies['cast'] = movies['cast'].apply(convert3)"
947
+ ]
948
+ },
949
+ {
950
+ "cell_type": "markdown",
951
+ "id": "31059f45",
952
+ "metadata": {},
953
+ "source": [
954
+ "### Function to fetch the director of movie from crew column"
955
+ ]
956
+ },
957
+ {
958
+ "cell_type": "code",
959
+ "execution_count": 22,
960
+ "id": "c2bebb24",
961
+ "metadata": {},
962
+ "outputs": [],
963
+ "source": [
964
+ "#Creating a function to fetch the director of movie from crew column\n",
965
+ "def fetch_director(obj):\n",
966
+ " L=[]\n",
967
+ " for i in ast.literal_eval(obj):\n",
968
+ " if i['job'] == 'Director':\n",
969
+ " L.append(i['name'])\n",
970
+ " break\n",
971
+ " return L"
972
+ ]
973
+ },
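To illustrate the director lookup above, here is a small sketch that runs without the dataset. The crew string is made up (placeholder names, same shape as the `crew` column), and `first_director` is a hypothetical name; it also shows what comes back when no director is listed.

```python
import ast

# Made-up crew value in the same shape as the dataset's crew column.
raw_crew = ('[{"job": "Editor", "name": "Person A"}, '
            '{"job": "Director", "name": "Person B"}, '
            '{"job": "Director", "name": "Person C"}]')


def first_director(obj):
    """Return a one-element list with the first director's name, or [] if none is listed."""
    for member in ast.literal_eval(obj):
        if member.get("job") == "Director":
            return [member["name"]]
    return []


director = first_director(raw_crew)
no_director = first_director('[{"job": "Editor", "name": "Person A"}]')
print(director, no_director)
```

Returning a list (possibly empty) rather than a bare string keeps the column compatible with the later list concatenation that builds `tags`.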
974
+ {
975
+ "cell_type": "code",
976
+ "execution_count": 23,
977
+ "id": "1f092c8a",
978
+ "metadata": {},
979
+ "outputs": [],
980
+ "source": [
981
+ "# Applying the fetch_director function to the crew column to extract the required data\n",
982
+ "movies['crew'] = movies['crew'].apply(fetch_director)\n",
983
+ " "
984
+ ]
985
+ },
986
+ {
987
+ "cell_type": "code",
988
+ "execution_count": 24,
989
+ "id": "fd484435",
990
+ "metadata": {},
991
+ "outputs": [],
992
+ "source": [
993
+ "#Splitting the overview column text into a list of words\n",
994
+ "movies['overview'] = movies['overview'].apply(lambda x:x.split())"
995
+ ]
996
+ },
997
+ {
998
+ "cell_type": "markdown",
999
+ "id": "c13f649f",
1000
+ "metadata": {},
1001
+ "source": [
1002
+ "Here, each overview string of the form `case_1 case_2 case_3` is replaced by the list `[case_1, case_2, case_3]`; `.apply(lambda x: x.split())` does this by splitting the text on whitespace."
1003
+ ]
1004
+ },
1005
+ {
1006
+ "cell_type": "markdown",
1007
+ "id": "20cc620f",
1008
+ "metadata": {},
1009
+ "source": [
1010
+ "### Checking the Data after extracting all the required values"
1011
+ ]
1012
+ },
1013
+ {
1014
+ "cell_type": "code",
1015
+ "execution_count": 25,
1016
+ "id": "422e70e8",
1017
+ "metadata": {},
1018
+ "outputs": [
1019
+ {
1020
+ "data": {
1021
+ "text/html": [
1022
+ "<div>\n",
1023
+ "<style scoped>\n",
1024
+ " .dataframe tbody tr th:only-of-type {\n",
1025
+ " vertical-align: middle;\n",
1026
+ " }\n",
1027
+ "\n",
1028
+ " .dataframe tbody tr th {\n",
1029
+ " vertical-align: top;\n",
1030
+ " }\n",
1031
+ "\n",
1032
+ " .dataframe thead th {\n",
1033
+ " text-align: right;\n",
1034
+ " }\n",
1035
+ "</style>\n",
1036
+ "<table border=\"1\" class=\"dataframe\">\n",
1037
+ " <thead>\n",
1038
+ " <tr style=\"text-align: right;\">\n",
1039
+ " <th></th>\n",
1040
+ " <th>movie_id</th>\n",
1041
+ " <th>title</th>\n",
1042
+ " <th>overview</th>\n",
1043
+ " <th>genres</th>\n",
1044
+ " <th>cast</th>\n",
1045
+ " <th>keywords</th>\n",
1046
+ " <th>crew</th>\n",
1047
+ " </tr>\n",
1048
+ " </thead>\n",
1049
+ " <tbody>\n",
1050
+ " <tr>\n",
1051
+ " <th>0</th>\n",
1052
+ " <td>19995</td>\n",
1053
+ " <td>Avatar</td>\n",
1054
+ " <td>[In, the, 22nd, century,, a, paraplegic, Marin...</td>\n",
1055
+ " <td>[Action, Adventure, Fantasy, Science Fiction]</td>\n",
1056
+ " <td>[Sam Worthington, Zoe Saldana, Sigourney Weaver]</td>\n",
1057
+ " <td>[culture clash, future, space war, space colon...</td>\n",
1058
+ " <td>[James Cameron]</td>\n",
1059
+ " </tr>\n",
1060
+ " <tr>\n",
1061
+ " <th>1</th>\n",
1062
+ " <td>285</td>\n",
1063
+ " <td>Pirates of the Caribbean: At World's End</td>\n",
1064
+ " <td>[Captain, Barbossa,, long, believed, to, be, d...</td>\n",
1065
+ " <td>[Adventure, Fantasy, Action]</td>\n",
1066
+ " <td>[Johnny Depp, Orlando Bloom, Keira Knightley]</td>\n",
1067
+ " <td>[ocean, drug abuse, exotic island, east india ...</td>\n",
1068
+ " <td>[Gore Verbinski]</td>\n",
1069
+ " </tr>\n",
1070
+ " <tr>\n",
1071
+ " <th>2</th>\n",
1072
+ " <td>206647</td>\n",
1073
+ " <td>Spectre</td>\n",
1074
+ " <td>[A, cryptic, message, from, Bond’s, past, send...</td>\n",
1075
+ " <td>[Action, Adventure, Crime]</td>\n",
1076
+ " <td>[Daniel Craig, Christoph Waltz, Léa Seydoux]</td>\n",
1077
+ " <td>[spy, based on novel, secret agent, sequel, mi...</td>\n",
1078
+ " <td>[Sam Mendes]</td>\n",
1079
+ " </tr>\n",
1080
+ " <tr>\n",
1081
+ " <th>3</th>\n",
1082
+ " <td>49026</td>\n",
1083
+ " <td>The Dark Knight Rises</td>\n",
1084
+ " <td>[Following, the, death, of, District, Attorney...</td>\n",
1085
+ " <td>[Action, Crime, Drama, Thriller]</td>\n",
1086
+ " <td>[Christian Bale, Michael Caine, Gary Oldman]</td>\n",
1087
+ " <td>[dc comics, crime fighter, terrorist, secret i...</td>\n",
1088
+ " <td>[Christopher Nolan]</td>\n",
1089
+ " </tr>\n",
1090
+ " <tr>\n",
1091
+ " <th>4</th>\n",
1092
+ " <td>49529</td>\n",
1093
+ " <td>John Carter</td>\n",
1094
+ " <td>[John, Carter, is, a, war-weary,, former, mili...</td>\n",
1095
+ " <td>[Action, Adventure, Science Fiction]</td>\n",
1096
+ " <td>[Taylor Kitsch, Lynn Collins, Samantha Morton]</td>\n",
1097
+ " <td>[based on novel, mars, medallion, space travel...</td>\n",
1098
+ " <td>[Andrew Stanton]</td>\n",
1099
+ " </tr>\n",
1100
+ " </tbody>\n",
1101
+ "</table>\n",
1102
+ "</div>"
1103
+ ],
1104
+ "text/plain": [
1105
+ " movie_id title \\\n",
1106
+ "0 19995 Avatar \n",
1107
+ "1 285 Pirates of the Caribbean: At World's End \n",
1108
+ "2 206647 Spectre \n",
1109
+ "3 49026 The Dark Knight Rises \n",
1110
+ "4 49529 John Carter \n",
1111
+ "\n",
1112
+ " overview \\\n",
1113
+ "0 [In, the, 22nd, century,, a, paraplegic, Marin... \n",
1114
+ "1 [Captain, Barbossa,, long, believed, to, be, d... \n",
1115
+ "2 [A, cryptic, message, from, Bond’s, past, send... \n",
1116
+ "3 [Following, the, death, of, District, Attorney... \n",
1117
+ "4 [John, Carter, is, a, war-weary,, former, mili... \n",
1118
+ "\n",
1119
+ " genres \\\n",
1120
+ "0 [Action, Adventure, Fantasy, Science Fiction] \n",
1121
+ "1 [Adventure, Fantasy, Action] \n",
1122
+ "2 [Action, Adventure, Crime] \n",
1123
+ "3 [Action, Crime, Drama, Thriller] \n",
1124
+ "4 [Action, Adventure, Science Fiction] \n",
1125
+ "\n",
1126
+ " cast \\\n",
1127
+ "0 [Sam Worthington, Zoe Saldana, Sigourney Weaver] \n",
1128
+ "1 [Johnny Depp, Orlando Bloom, Keira Knightley] \n",
1129
+ "2 [Daniel Craig, Christoph Waltz, Léa Seydoux] \n",
1130
+ "3 [Christian Bale, Michael Caine, Gary Oldman] \n",
1131
+ "4 [Taylor Kitsch, Lynn Collins, Samantha Morton] \n",
1132
+ "\n",
1133
+ " keywords crew \n",
1134
+ "0 [culture clash, future, space war, space colon... [James Cameron] \n",
1135
+ "1 [ocean, drug abuse, exotic island, east india ... [Gore Verbinski] \n",
1136
+ "2 [spy, based on novel, secret agent, sequel, mi... [Sam Mendes] \n",
1137
+ "3 [dc comics, crime fighter, terrorist, secret i... [Christopher Nolan] \n",
1138
+ "4 [based on novel, mars, medallion, space travel... [Andrew Stanton] "
1139
+ ]
1140
+ },
1141
+ "execution_count": 25,
1142
+ "metadata": {},
1143
+ "output_type": "execute_result"
1144
+ }
1145
+ ],
1146
+ "source": [
1147
+ "# Checking the Final Data after extracting all the required values\n",
1148
+ "movies.head(5)"
1149
+ ]
1150
+ },
1151
+ {
1152
+ "cell_type": "code",
1153
+ "execution_count": 26,
1154
+ "id": "5f0d2e91",
1155
+ "metadata": {},
1156
+ "outputs": [],
1157
+ "source": [
1158
+ "#Removing spaces inside each value so that multi-word names (e.g. 'Sam Worthington') become single tokens\n",
1159
+ "\n",
1160
+ "movies['genres'] = movies['genres'].apply(lambda x:[i.replace(\" \",\"\") for i in x])\n",
1161
+ "movies['keywords'] = movies['keywords'].apply(lambda x:[i.replace(\" \",\"\") for i in x])\n",
1162
+ "movies['cast'] = movies['cast'].apply(lambda x:[i.replace(\" \",\"\") for i in x])\n",
1163
+ "movies['crew'] = movies['crew'].apply(lambda x:[i.replace(\" \",\"\") for i in x])"
1164
+ ]
1165
+ },
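Why the space removal matters can be shown with a toy comparison. This is only a sketch under the assumption that tags are later split on whitespace; `tokens` is a hypothetical helper, and the two names are taken from the `cast` and `crew` columns displayed above.

```python
def tokens(names, collapse):
    """Lower-case whitespace tokens for a list of names.

    With collapse=True the spaces inside each name are removed first,
    mirroring the notebook's replace(" ", "") step.
    """
    result = set()
    for name in names:
        text = name.replace(" ", "") if collapse else name
        result.update(text.lower().split())
    return result


# Without collapsing, two different people share the token "sam".
shared_raw = tokens(["Sam Worthington"], False) & tokens(["Sam Mendes"], False)

# After collapsing, each person is a single distinct token.
shared_collapsed = tokens(["Sam Worthington"], True) & tokens(["Sam Mendes"], True)

print(shared_raw, shared_collapsed)
```

Without the collapse step, unrelated movies would gain similarity merely because an actor and a director share a first name.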
1166
+ {
1167
+ "cell_type": "code",
1168
+ "execution_count": 27,
1169
+ "id": "50908dd1",
1170
+ "metadata": {},
1171
+ "outputs": [],
1172
+ "source": [
1173
+ "# Combining overview, genres, keywords, cast and crew into a single tags column for the recommendation system\n",
1174
+ "movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']"
1175
+ ]
1176
+ },
1177
+ {
1178
+ "cell_type": "code",
1179
+ "execution_count": 28,
1180
+ "id": "eacb1e82",
1181
+ "metadata": {},
1182
+ "outputs": [
1183
+ {
1184
+ "data": {
1185
+ "text/html": [
1186
+ "<div>\n",
1187
+ "<style scoped>\n",
1188
+ " .dataframe tbody tr th:only-of-type {\n",
1189
+ " vertical-align: middle;\n",
1190
+ " }\n",
1191
+ "\n",
1192
+ " .dataframe tbody tr th {\n",
1193
+ " vertical-align: top;\n",
1194
+ " }\n",
1195
+ "\n",
1196
+ " .dataframe thead th {\n",
1197
+ " text-align: right;\n",
1198
+ " }\n",
1199
+ "</style>\n",
1200
+ "<table border=\"1\" class=\"dataframe\">\n",
1201
+ " <thead>\n",
1202
+ " <tr style=\"text-align: right;\">\n",
1203
+ " <th></th>\n",
1204
+ " <th>movie_id</th>\n",
1205
+ " <th>title</th>\n",
1206
+ " <th>overview</th>\n",
1207
+ " <th>genres</th>\n",
1208
+ " <th>cast</th>\n",
1209
+ " <th>keywords</th>\n",
1210
+ " <th>crew</th>\n",
1211
+ " <th>tags</th>\n",
1212
+ " </tr>\n",
1213
+ " </thead>\n",
1214
+ " <tbody>\n",
1215
+ " <tr>\n",
1216
+ " <th>0</th>\n",
1217
+ " <td>19995</td>\n",
1218
+ " <td>Avatar</td>\n",
1219
+ " <td>[In, the, 22nd, century,, a, paraplegic, Marin...</td>\n",
1220
+ " <td>[Action, Adventure, Fantasy, ScienceFiction]</td>\n",
1221
+ " <td>[SamWorthington, ZoeSaldana, SigourneyWeaver]</td>\n",
1222
+ " <td>[cultureclash, future, spacewar, spacecolony, ...</td>\n",
1223
+ " <td>[JamesCameron]</td>\n",
1224
+ " <td>[In, the, 22nd, century,, a, paraplegic, Marin...</td>\n",
1225
+ " </tr>\n",
1226
+ " <tr>\n",
1227
+ " <th>1</th>\n",
1228
+ " <td>285</td>\n",
1229
+ " <td>Pirates of the Caribbean: At World's End</td>\n",
1230
+ " <td>[Captain, Barbossa,, long, believed, to, be, d...</td>\n",
1231
+ " <td>[Adventure, Fantasy, Action]</td>\n",
1232
+ " <td>[JohnnyDepp, OrlandoBloom, KeiraKnightley]</td>\n",
1233
+ " <td>[ocean, drugabuse, exoticisland, eastindiatrad...</td>\n",
1234
+ " <td>[GoreVerbinski]</td>\n",
1235
+ " <td>[Captain, Barbossa,, long, believed, to, be, d...</td>\n",
1236
+ " </tr>\n",
1237
+ " <tr>\n",
1238
+ " <th>2</th>\n",
1239
+ " <td>206647</td>\n",
1240
+ " <td>Spectre</td>\n",
1241
+ " <td>[A, cryptic, message, from, Bond’s, past, send...</td>\n",
1242
+ " <td>[Action, Adventure, Crime]</td>\n",
1243
+ " <td>[DanielCraig, ChristophWaltz, LéaSeydoux]</td>\n",
1244
+ " <td>[spy, basedonnovel, secretagent, sequel, mi6, ...</td>\n",
1245
+ " <td>[SamMendes]</td>\n",
1246
+ " <td>[A, cryptic, message, from, Bond’s, past, send...</td>\n",
1247
+ " </tr>\n",
1248
+ " <tr>\n",
1249
+ " <th>3</th>\n",
1250
+ " <td>49026</td>\n",
1251
+ " <td>The Dark Knight Rises</td>\n",
1252
+ " <td>[Following, the, death, of, District, Attorney...</td>\n",
1253
+ " <td>[Action, Crime, Drama, Thriller]</td>\n",
1254
+ " <td>[ChristianBale, MichaelCaine, GaryOldman]</td>\n",
1255
+ " <td>[dccomics, crimefighter, terrorist, secretiden...</td>\n",
1256
+ " <td>[ChristopherNolan]</td>\n",
1257
+ " <td>[Following, the, death, of, District, Attorney...</td>\n",
1258
+ " </tr>\n",
1259
+ " <tr>\n",
1260
+ " <th>4</th>\n",
1261
+ " <td>49529</td>\n",
1262
+ " <td>John Carter</td>\n",
1263
+ " <td>[John, Carter, is, a, war-weary,, former, mili...</td>\n",
1264
+ " <td>[Action, Adventure, ScienceFiction]</td>\n",
1265
+ " <td>[TaylorKitsch, LynnCollins, SamanthaMorton]</td>\n",
1266
+ " <td>[basedonnovel, mars, medallion, spacetravel, p...</td>\n",
1267
+ " <td>[AndrewStanton]</td>\n",
1268
+ " <td>[John, Carter, is, a, war-weary,, former, mili...</td>\n",
1269
+ " </tr>\n",
1270
+ " </tbody>\n",
1271
+ "</table>\n",
1272
+ "</div>"
1273
+ ],
1274
+ "text/plain": [
1275
+ " movie_id title \\\n",
1276
+ "0 19995 Avatar \n",
1277
+ "1 285 Pirates of the Caribbean: At World's End \n",
1278
+ "2 206647 Spectre \n",
1279
+ "3 49026 The Dark Knight Rises \n",
1280
+ "4 49529 John Carter \n",
1281
+ "\n",
1282
+ " overview \\\n",
1283
+ "0 [In, the, 22nd, century,, a, paraplegic, Marin... \n",
1284
+ "1 [Captain, Barbossa,, long, believed, to, be, d... \n",
1285
+ "2 [A, cryptic, message, from, Bond’s, past, send... \n",
1286
+ "3 [Following, the, death, of, District, Attorney... \n",
1287
+ "4 [John, Carter, is, a, war-weary,, former, mili... \n",
1288
+ "\n",
1289
+ " genres \\\n",
1290
+ "0 [Action, Adventure, Fantasy, ScienceFiction] \n",
1291
+ "1 [Adventure, Fantasy, Action] \n",
1292
+ "2 [Action, Adventure, Crime] \n",
1293
+ "3 [Action, Crime, Drama, Thriller] \n",
1294
+ "4 [Action, Adventure, ScienceFiction] \n",
1295
+ "\n",
1296
+ " cast \\\n",
1297
+ "0 [SamWorthington, ZoeSaldana, SigourneyWeaver] \n",
1298
+ "1 [JohnnyDepp, OrlandoBloom, KeiraKnightley] \n",
1299
+ "2 [DanielCraig, ChristophWaltz, LéaSeydoux] \n",
1300
+ "3 [ChristianBale, MichaelCaine, GaryOldman] \n",
1301
+ "4 [TaylorKitsch, LynnCollins, SamanthaMorton] \n",
1302
+ "\n",
1303
+ " keywords crew \\\n",
1304
+ "0 [cultureclash, future, spacewar, spacecolony, ... [JamesCameron] \n",
1305
+ "1 [ocean, drugabuse, exoticisland, eastindiatrad... [GoreVerbinski] \n",
1306
+ "2 [spy, basedonnovel, secretagent, sequel, mi6, ... [SamMendes] \n",
1307
+ "3 [dccomics, crimefighter, terrorist, secretiden... [ChristopherNolan] \n",
1308
+ "4 [basedonnovel, mars, medallion, spacetravel, p... [AndrewStanton] \n",
1309
+ "\n",
1310
+ " tags \n",
1311
+ "0 [In, the, 22nd, century,, a, paraplegic, Marin... \n",
1312
+ "1 [Captain, Barbossa,, long, believed, to, be, d... \n",
1313
+ "2 [A, cryptic, message, from, Bond’s, past, send... \n",
1314
+ "3 [Following, the, death, of, District, Attorney... \n",
1315
+ "4 [John, Carter, is, a, war-weary,, former, mili... "
1316
+ ]
1317
+ },
1318
+ "execution_count": 28,
1319
+ "metadata": {},
1320
+ "output_type": "execute_result"
1321
+ }
1322
+ ],
1323
+ "source": [
1324
+ "movies.head()"
1325
+ ]
1326
+ },
1327
+ {
1328
+ "cell_type": "code",
1329
+ "execution_count": 29,
1330
+ "id": "7fd341dc",
1331
+ "metadata": {},
1332
+ "outputs": [],
1333
+ "source": [
1334
+ "#Creating a new dataframe with 3 columns; resetting the index so row labels match row positions after dropna\n",
1335
+ "new_df = movies[['movie_id','title','tags']].reset_index(drop=True)"
1336
+ ]
1337
+ },
1338
+ {
1339
+ "cell_type": "code",
1340
+ "execution_count": 30,
1341
+ "id": "bf6743a7",
1342
+ "metadata": {},
1343
+ "outputs": [],
1344
+ "source": [
1345
+ "# Suppressing the warning messages\n",
1346
+ "import warnings\n",
1347
+ "warnings.filterwarnings('ignore')\n",
1348
+ "\n",
1349
+ "#Joining each list of tags into a single space-separated string\n",
1350
+ "new_df['tags'] = new_df['tags'].apply(lambda x:\" \".join(x))"
1351
+ ]
1352
+ },
1353
+ {
1354
+ "cell_type": "code",
1355
+ "execution_count": 31,
1356
+ "id": "e2d1d383",
1357
+ "metadata": {},
1358
+ "outputs": [
1359
+ {
1360
+ "data": {
1361
+ "text/html": [
1362
+ "<div>\n",
1363
+ "<style scoped>\n",
1364
+ " .dataframe tbody tr th:only-of-type {\n",
1365
+ " vertical-align: middle;\n",
1366
+ " }\n",
1367
+ "\n",
1368
+ " .dataframe tbody tr th {\n",
1369
+ " vertical-align: top;\n",
1370
+ " }\n",
1371
+ "\n",
1372
+ " .dataframe thead th {\n",
1373
+ " text-align: right;\n",
1374
+ " }\n",
1375
+ "</style>\n",
1376
+ "<table border=\"1\" class=\"dataframe\">\n",
1377
+ " <thead>\n",
1378
+ " <tr style=\"text-align: right;\">\n",
1379
+ " <th></th>\n",
1380
+ " <th>movie_id</th>\n",
1381
+ " <th>title</th>\n",
1382
+ " <th>tags</th>\n",
1383
+ " </tr>\n",
1384
+ " </thead>\n",
1385
+ " <tbody>\n",
1386
+ " <tr>\n",
1387
+ " <th>0</th>\n",
1388
+ " <td>19995</td>\n",
1389
+ " <td>Avatar</td>\n",
1390
+ " <td>In the 22nd century, a paraplegic Marine is di...</td>\n",
1391
+ " </tr>\n",
1392
+ " <tr>\n",
1393
+ " <th>1</th>\n",
1394
+ " <td>285</td>\n",
1395
+ " <td>Pirates of the Caribbean: At World's End</td>\n",
1396
+ " <td>Captain Barbossa, long believed to be dead, ha...</td>\n",
1397
+ " </tr>\n",
1398
+ " <tr>\n",
1399
+ " <th>2</th>\n",
1400
+ " <td>206647</td>\n",
1401
+ " <td>Spectre</td>\n",
1402
+ " <td>A cryptic message from Bond’s past sends him o...</td>\n",
1403
+ " </tr>\n",
1404
+ " <tr>\n",
1405
+ " <th>3</th>\n",
1406
+ " <td>49026</td>\n",
1407
+ " <td>The Dark Knight Rises</td>\n",
1408
+ " <td>Following the death of District Attorney Harve...</td>\n",
1409
+ " </tr>\n",
1410
+ " <tr>\n",
1411
+ " <th>4</th>\n",
1412
+ " <td>49529</td>\n",
1413
+ " <td>John Carter</td>\n",
1414
+ " <td>John Carter is a war-weary, former military ca...</td>\n",
1415
+ " </tr>\n",
1416
+ " </tbody>\n",
1417
+ "</table>\n",
1418
+ "</div>"
1419
+ ],
1420
+ "text/plain": [
1421
+ " movie_id title \\\n",
1422
+ "0 19995 Avatar \n",
1423
+ "1 285 Pirates of the Caribbean: At World's End \n",
1424
+ "2 206647 Spectre \n",
1425
+ "3 49026 The Dark Knight Rises \n",
1426
+ "4 49529 John Carter \n",
1427
+ "\n",
1428
+ " tags \n",
1429
+ "0 In the 22nd century, a paraplegic Marine is di... \n",
1430
+ "1 Captain Barbossa, long believed to be dead, ha... \n",
1431
+ "2 A cryptic message from Bond’s past sends him o... \n",
1432
+ "3 Following the death of District Attorney Harve... \n",
1433
+ "4 John Carter is a war-weary, former military ca... "
1434
+ ]
1435
+ },
1436
+ "execution_count": 31,
1437
+ "metadata": {},
1438
+ "output_type": "execute_result"
1439
+ }
1440
+ ],
1441
+ "source": [
1442
+ "#Checking the new data\n",
1443
+ "new_df.head()"
1444
+ ]
1445
+ },
1446
+ {
1447
+ "cell_type": "code",
1448
+ "execution_count": 32,
1449
+ "id": "a92ee349",
1450
+ "metadata": {},
1451
+ "outputs": [
1452
+ {
1453
+ "data": {
1454
+ "text/plain": [
1455
+ "'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver JamesCameron'"
1456
+ ]
1457
+ },
1458
+ "execution_count": 32,
1459
+ "metadata": {},
1460
+ "output_type": "execute_result"
1461
+ }
1462
+ ],
1463
+ "source": [
1464
+ "#Inspecting one of the tags to see how the data looks\n",
1465
+ "new_df['tags'][0]"
1466
+ ]
1467
+ },
1468
+ {
1469
+ "cell_type": "code",
1470
+ "execution_count": 33,
1471
+ "id": "383b01eb",
1472
+ "metadata": {},
1473
+ "outputs": [],
1474
+ "source": [
1475
+ "#Converting the tags data into lowercase\n",
1476
+ "new_df['tags'] = new_df['tags'].apply(lambda x:x.lower())"
1477
+ ]
1478
+ },
1479
+ {
1480
+ "cell_type": "code",
1481
+ "execution_count": 34,
1482
+ "id": "5f145429",
1483
+ "metadata": {},
1484
+ "outputs": [
1485
+ {
1486
+ "data": {
1487
+ "text/plain": [
1488
+ "'in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver jamescameron'"
1489
+ ]
1490
+ },
1491
+ "execution_count": 34,
1492
+ "metadata": {},
1493
+ "output_type": "execute_result"
1494
+ }
1495
+ ],
1496
+ "source": [
1497
+ "#Checking again after applying the lower case function\n",
1498
+ "new_df['tags'][0]"
1499
+ ]
1500
+ },
1501
+ {
1502
+ "cell_type": "markdown",
1503
+ "id": "e28bcd88",
1504
+ "metadata": {},
1505
+ "source": [
1506
+ "## Text Vectorization"
1507
+ ]
1508
+ },
1509
+ {
1510
+ "cell_type": "code",
1511
+ "execution_count": 35,
1512
+ "id": "e7bdc9f6",
1513
+ "metadata": {},
1514
+ "outputs": [],
1515
+ "source": [
1516
+ "#Importing this module to convert a collection of text documents to a matrix of token counts.\n",
1517
+ "from sklearn.feature_extraction.text import CountVectorizer\n",
1518
+ " "
1519
+ ]
1520
+ },
1521
+ {
1522
+ "cell_type": "code",
1523
+ "execution_count": 36,
1524
+ "id": "421b740a",
1525
+ "metadata": {},
1526
+ "outputs": [],
1527
+ "source": [
1528
+ "#Creating a variable cv to convert text to vector\n",
1529
+ "cv = CountVectorizer(max_features=5000,stop_words='english')"
1530
+ ]
1531
+ },
1532
+ {
1533
+ "cell_type": "code",
1534
+ "execution_count": 37,
1535
+ "id": "7f40be01",
1536
+ "metadata": {},
1537
+ "outputs": [],
1538
+ "source": [
1539
+ "# Transforming the data to vectors and storing as an array\n",
1540
+ "vectors = cv.fit_transform(new_df['tags']).toarray()"
1541
+ ]
1542
+ },
1543
+ {
1544
+ "cell_type": "code",
1545
+ "execution_count": 38,
1546
+ "id": "29c36bf0",
1547
+ "metadata": {},
1548
+ "outputs": [],
1549
+ "source": [
1550
+ "## The 5000 most frequent words\n",
1551
+ "# cv.get_feature_names_out()"
1552
+ ]
1553
+ },
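For readers unfamiliar with the bag-of-words step, the sketch below reproduces the idea behind `CountVectorizer(max_features=...)` in plain Python. It is an illustration only, not scikit-learn's implementation: tokenization, stop-word removal and tie-breaking differ there. The documents, `build_vocabulary` and `vectorize` are all made up for this example.

```python
from collections import Counter

# Three tiny made-up "tags" strings.
docs = [
    "action adventure space alien",
    "action crime batman",
    "space alien battle alien",
]


def build_vocabulary(documents, max_features):
    """Keep the max_features most frequent words, returned in alphabetical order."""
    counts = Counter(word for doc in documents for word in doc.split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return sorted(word for word, _ in ranked[:max_features])


def vectorize(doc, vocabulary):
    """Count how often each vocabulary word occurs in doc."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


vocab = build_vocabulary(docs, max_features=3)
vectors = [vectorize(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Each movie becomes one row of counts over a shared vocabulary, which is the form the similarity step needs.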
1554
+ {
1555
+ "cell_type": "markdown",
1556
+ "id": "4a3e66f6",
1557
+ "metadata": {},
1558
+ "source": [
1559
+ "## Applying Stemming Process"
1560
+ ]
1561
+ },
1562
+ {
1563
+ "cell_type": "markdown",
1564
+ "id": "8cbff737",
1565
+ "metadata": {},
1566
+ "source": [
1567
+ "Stemming is a natural language processing technique that reduces inflected words to their root forms, which helps normalize text during preprocessing. Simply put, it chops words down to a common stem; for example, eating becomes eat. There are several stemmers, and this notebook uses the Porter stemmer, one of the most widely used.\n",
1568
+ "\n",
1569
+ "The Porter stemmer is widely used in data mining and information retrieval, and it applies to English only. It is based on the idea that English suffixes are built from combinations of smaller, simpler suffixes, and it is known for its simplicity and speed. It is often reported to produce good output with a comparatively low error rate."
1570
+ ]
1571
+ },
1572
+ {
1573
+ "cell_type": "code",
1574
+ "execution_count": 39,
1575
+ "id": "26ff7212",
1576
+ "metadata": {},
1577
+ "outputs": [],
1578
+ "source": [
1579
+ "#Importing the NLTK library for stemming process\n",
1580
+ "import nltk "
1581
+ ]
1582
+ },
1583
+ {
1584
+ "cell_type": "code",
1585
+ "execution_count": 40,
1586
+ "id": "3f2e9abf",
1587
+ "metadata": {},
1588
+ "outputs": [],
1589
+ "source": [
1590
+ "#Importing PorterStemmer from NLTK and storing an instance in the variable ps\n",
1591
+ "from nltk.stem.porter import PorterStemmer\n",
1592
+ "ps = PorterStemmer()\n",
1593
+ " "
1594
+ ]
1595
+ },
1596
+ {
1597
+ "cell_type": "code",
1598
+ "execution_count": 41,
1599
+ "id": "5c7cd073",
1600
+ "metadata": {},
1601
+ "outputs": [],
1602
+ "source": [
1603
+ "#Defining the stemming function\n",
1604
+ "def stem(text):\n",
1605
+ " y=[]\n",
1606
+ " for i in text.split():\n",
1607
+ " y.append(ps.stem(i))\n",
1608
+ " return \" \".join(y)\n",
1609
+ " "
1610
+ ]
1611
+ },
1612
+ {
1613
+ "cell_type": "code",
1614
+ "execution_count": 42,
1615
+ "id": "9fab33f0",
1616
+ "metadata": {},
1617
+ "outputs": [
1618
+ {
1619
+ "data": {
1620
+ "text/plain": [
1621
+ "'in the 22nd century, a parapleg marin is dispatch to the moon pandora on a uniqu mission, but becom torn between follow order and protect an alien civilization. action adventur fantasi sciencefict cultureclash futur spacewar spacecoloni societi spacetravel futurist romanc space alien tribe alienplanet cgi marin soldier battl loveaffair antiwar powerrel mindandsoul 3d samworthington zoesaldana sigourneyweav jamescameron'"
1622
+ ]
1623
+ },
1624
+ "execution_count": 42,
1625
+ "metadata": {},
1626
+ "output_type": "execute_result"
1627
+ }
1628
+ ],
1629
+ "source": [
1630
+ "#Checking on the sample text\n",
1631
+ "stem('In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver JamesCameron')\n",
1632
+ " "
1633
+ ]
1634
+ },
1635
+ {
1636
+ "cell_type": "code",
1637
+ "execution_count": 43,
1638
+ "id": "7044bea2",
1639
+ "metadata": {},
1640
+ "outputs": [],
1641
+ "source": [
1642
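+ "#Applying the stemming function to the tags column in our new data.\n",
+ "#Note: vectors was computed in an earlier cell, so the CountVectorizer cell must be re-run after this step for stemming to affect similarity.\n",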
1643
+ "new_df['tags'] = new_df['tags'].apply(stem)"
1644
+ ]
1645
+ },
1646
+ {
1647
+ "cell_type": "markdown",
1648
+ "id": "0beb8098",
1649
+ "metadata": {},
1650
+ "source": [
1651
+ "## Similarity Measures"
1652
+ ]
1653
+ },
1654
+ {
1655
+ "cell_type": "markdown",
1656
+ "id": "76adf86f",
1657
+ "metadata": {},
1658
+ "source": [
1659
+ "In this case study we use cosine similarity from scikit-learn as the metric to compute the similarity between two movies.\n",
1660
+ "\n",
1661
+ "Cosine similarity measures how similar two items are. Mathematically, it is the cosine of the angle between two vectors in a multi-dimensional space. In general the value ranges from -1 to 1; for the non-negative word-count vectors used here it ranges from 0 to 1.\n",
1662
+ "\n",
1663
+ "0 means the two movies share no terms, whereas 1 means their tag vectors point in exactly the same direction.\n",
1664
+ "\n"
1665
+ ]
1666
+ },
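The definition above can be checked by hand with a few lines of plain Python. This is a sketch of the formula cos(u, v) = (u · v) / (|u| |v|), not of scikit-learn's `cosine_similarity`, which computes the same quantity for whole matrices; the function name `cosine` and the zero-vector convention are choices made for this example.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen here: a vector with no terms is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


orthogonal = cosine([1, 0], [0, 1])      # no shared terms
parallel = cosine([1, 2], [2, 4])        # same direction, different length
partial = cosine([1, 1, 1], [1, 0, 0])   # one shared term out of three
print(orthogonal, parallel, partial)
```

Because only the direction of the vectors matters, a long overview and a short one can still score close to 1 if they use the same words in similar proportions.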
1667
+ {
1668
+ "cell_type": "code",
1669
+ "execution_count": 44,
1670
+ "id": "4a7cb6e7",
1671
+ "metadata": {},
1672
+ "outputs": [],
1673
+ "source": [
1674
+ "#importing the cosine similarity from sklearn\n",
1675
+ "from sklearn.metrics.pairwise import cosine_similarity"
1676
+ ]
1677
+ },
1678
+ {
1679
+ "cell_type": "code",
1680
+ "execution_count": 45,
1681
+ "id": "9957c15b",
1682
+ "metadata": {},
1683
+ "outputs": [],
1684
+ "source": [
1685
+ "#Computing the pairwise cosine similarity between all movie vectors and storing it in similarity\n",
1686
+ "similarity = cosine_similarity(vectors)\n"
1687
+ ]
1688
+ },
1689
+ {
1690
+ "cell_type": "markdown",
1691
+ "id": "6f626560",
1692
+ "metadata": {},
1693
+ "source": [
1694
+ "## Making the recommendation function"
1695
+ ]
1696
+ },
1697
+ {
1698
+ "cell_type": "code",
1699
+ "execution_count": 46,
1700
+ "id": "9089b7cb",
1701
+ "metadata": {},
1702
+ "outputs": [],
1703
+ "source": [
1704
+ "#Creating the function for Movie Recommendation using cosine similarity\n",
1705
+ "def recommend(movie):\n",
1706
+ " #Get the index from the name of the movie input\n",
1707
+ " movie_index = new_df[new_df['title'] == movie].index[0] \n",
1708
+ " #Similarity scores between this movie and every movie in the dataset\n",
1709
+ " distances = similarity[movie_index] \n",
1710
+ " #Pairing each score with its row position and sorting in descending order (reverse=True),\n",
1711
+ " #so that the most similar movies come first.\n",
1712
+ " #Position 0 of the sorted list is normally the input movie itself, so [1:6] keeps the next 5.\n",
1713
+ " movies_list = sorted(list(enumerate(distances)),reverse=True, key=lambda x:x[1])[1:6] \n",
1714
+ " \n",
1715
+ " \n",
1716
+ " for i in movies_list:\n",
1717
+ " print(new_df.iloc[i[0]].title)"
1718
+ ]
1719
+ },
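The ranking logic of `recommend` can be tried on toy data without pandas. The sketch below is a hedged variant, not the notebook's function: it takes the titles and the similarity matrix as arguments, returns a list instead of printing, and returns an empty list for an unknown title. The titles, the matrix values and the name `recommend_titles` are all invented for illustration.

```python
def recommend_titles(title, titles, similarity, k=5):
    """Return up to k titles most similar to `title`, best match first."""
    if title not in titles:
        return []  # unknown title: nothing to recommend
    index = titles.index(title)
    # Pair each score with its position and sort by score, highest first.
    ranked = sorted(enumerate(similarity[index]), key=lambda pair: pair[1], reverse=True)
    # Skip the query movie itself rather than assuming it sorts first.
    return [titles[i] for i, _ in ranked if i != index][:k]


# Invented titles and a symmetric similarity matrix with 1.0 on the diagonal.
toy_titles = ["Movie A", "Movie B", "Movie C"]
toy_similarity = [
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.4],
    [0.9, 0.4, 1.0],
]

print(recommend_titles("Movie A", toy_titles, toy_similarity, k=2))
print(recommend_titles("Not In Dataset", toy_titles, toy_similarity))
```

Filtering on `i != index` avoids relying on the query movie always sorting first, which may not hold when two rows have identical tags.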
1720
+ {
1721
+ "cell_type": "markdown",
1722
+ "id": "07001d9a",
1723
+ "metadata": {},
1724
+ "source": [
1725
+ "## Recommendation"
1726
+ ]
1727
+ },
1728
+ {
1729
+ "cell_type": "code",
1730
+ "execution_count": 47,
1731
+ "id": "2229cfab",
1732
+ "metadata": {},
1733
+ "outputs": [
1734
+ {
1735
+ "name": "stdout",
1736
+ "output_type": "stream",
1737
+ "text": [
1738
+ "The Dark Knight\n",
1739
+ "The Dark Knight Rises\n",
1740
+ "Batman\n",
1741
+ "Batman & Robin\n",
1742
+ "Batman\n"
1743
+ ]
1744
+ }
1745
+ ],
1746
+ "source": [
1747
+ "#Enter only movie titles that are present in the dataset; any other title raises an IndexError\n",
1748
+ "recommend('Batman Begins') "
1749
+ ]
1750
+ },
1751
+ {
1752
+ "cell_type": "code",
1753
+ "execution_count": 48,
1754
+ "id": "692a4331",
1755
+ "metadata": {},
1756
+ "outputs": [
1757
+ {
1758
+ "data": {
1759
+ "text/plain": [
1760
+ "movie_id 440\n",
1761
+ "title Aliens vs Predator: Requiem\n",
1762
+ "tags a sequel to 2004' alien vs. predator, the icon...\n",
1763
+ "Name: 1216, dtype: object"
1764
+ ]
1765
+ },
1766
+ "execution_count": 48,
1767
+ "metadata": {},
1768
+ "output_type": "execute_result"
1769
+ }
1770
+ ],
1771
+ "source": [
1772
+ "new_df.iloc[1216]"
1773
+ ]
1774
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7fbbcb1c",
+ "metadata": {},
+ "source": [
+ "## Exporting the Model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 49,
+ "id": "d8b99651",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pickle"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 50,
+ "id": "2d34f863",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "pickle.dump(new_df,open('movies.pkl','wb'))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 51,
+ "id": "11c31baa",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "pickle.dump(new_df.to_dict(),open('movie_dict.pkl','wb'))\n",
+ " "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 52,
+ "id": "0a7654a3",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "pickle.dump(similarity,open('similarity.pkl','wb'))\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.7"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }
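The notebook's `recommend` sorts one row of the similarity matrix and slices `[1:6]`, and its own comment notes that an unknown title raises an error. A minimal sketch of the same top-k lookup follows, using plain lists instead of the notebook's DataFrame so it runs on its own. The function name `recommend_titles`, the parameter `k`, and the toy data are assumptions made for illustration and do not appear in the commit.

```python
from typing import List, Sequence


def recommend_titles(
    title: str,
    titles: List[str],
    similarity: Sequence[Sequence[float]],
    k: int = 5,
) -> List[str]:
    """Return up to k titles most similar to `title`, or [] if it is unknown."""
    if title not in titles:
        # The notebook version fails here; this sketch returns an empty list.
        return []
    index = titles.index(title)  # first match, like `.index[0]` in the notebook
    # Drop the query movie by position instead of relying on it sorting first.
    scored = [(j, s) for j, s in enumerate(similarity[index]) if j != index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [titles[j] for j, _ in scored[:k]]


# Toy data, made up for illustration only.
toy_titles = ["A", "B", "C", "D"]
toy_similarity = [
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
]
print(recommend_titles("A", toy_titles, toy_similarity, k=2))   # ['C', 'D']
print(recommend_titles("Missing", toy_titles, toy_similarity))  # []
```

Excluding the query by index rather than by slicing off the first result matters when another movie ties with it at the maximum similarity, in which case the sort order alone does not guarantee that the query comes first.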
app.py ADDED
@@ -0,0 +1,66 @@
+
+ #Creating the Movie_Recommendation App
+ #Importing Dependencies
+ import pickle
+ import streamlit as st
+ import requests
+
+ #Fetching posters from https://www.themoviedb.org/.
+ def fetch_poster(movie_id):
+     url = "https://api.themoviedb.org/3/movie/{}?api_key=8265bd1679663a7ea12ac168da84d2e8&language=en-US".format(movie_id)
+     data = requests.get(url, timeout=10)
+     data = data.json()
+     poster_path = data['poster_path']
+     full_path = "https://image.tmdb.org/t/p/w500/" + poster_path
+     return full_path
+
+ #Recommendation Model
+ def recommend(movie):
+     index = movies[movies['title'] == movie].index[0]
+     distances = sorted(list(enumerate(similarity[index])), reverse=True, key=lambda x: x[1])
+     recommended_movie_names = []
+     recommended_movie_posters = []
+     for i in distances[1:6]:
+         # fetch the movie poster
+         movie_id = movies.iloc[i[0]].movie_id
+         recommended_movie_posters.append(fetch_poster(movie_id))
+         recommended_movie_names.append(movies.iloc[i[0]].title)
+
+     return recommended_movie_names,recommended_movie_posters
+
+ st.markdown("<h1 style='text-align: center;background-color: teal; color: black;'>Movie Recommendation System</h1>", unsafe_allow_html=True)
+ st.markdown("<h4 style='text-align: center;background-color: LightYellow; color: black;'>Find a similar movie from a dataset of 5,000 movies!</h4>", unsafe_allow_html=True)
+
+ #Loading the pickle files saved from the notebook (.ipynb)
+ movies = pickle.load(open('movies.pkl','rb'))
+ similarity = pickle.load(open('similarity.pkl','rb'))
+
+ #Creating an input box
+ movie_list = movies['title'].values
+ selected_movie = st.selectbox(
+     "Type or select a movie you like :",
+     movie_list
+ )
+
+ if st.button('Show Recommendation'):
+     st.write("Recommended movies based on your interests are:")
+     recommended_movie_names,recommended_movie_posters = recommend(selected_movie)
+     col1, col2, col3, col4, col5 = st.columns(5)
+     with col1:
+         st.text(recommended_movie_names[0])
+         st.image(recommended_movie_posters[0])
+     with col2:
+         st.text(recommended_movie_names[1])
+         st.image(recommended_movie_posters[1])
+
+     with col3:
+         st.text(recommended_movie_names[2])
+         st.image(recommended_movie_posters[2])
+     with col4:
+         st.text(recommended_movie_names[3])
+         st.image(recommended_movie_posters[3])
+     with col5:
+         st.text(recommended_movie_names[4])
+         st.image(recommended_movie_posters[4])
+
+ st.title(" ")
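`fetch_poster` in `app.py` concatenates `data['poster_path']` directly onto the image base URL, which fails if the key is missing or its value is `None`. A minimal sketch of a more defensive URL builder follows, assuming the response may lack a usable `poster_path`. The helper name `build_poster_url` and the sample payloads are hypothetical; only the base URL comes from `app.py`. The sketch takes an already-decoded JSON dictionary, so it needs no network access.

```python
from typing import Any, Mapping, Optional

# Base URL taken from app.py, without the trailing slash.
IMAGE_BASE_URL = "https://image.tmdb.org/t/p/w500"


def build_poster_url(payload: Mapping[str, Any]) -> Optional[str]:
    """Build a poster URL from a decoded movie payload, or return None."""
    poster_path = payload.get("poster_path")
    if not poster_path:
        # Key missing, None, or empty string: there is no poster to show.
        return None
    # Normalise the leading slash so the result has exactly one separator.
    return IMAGE_BASE_URL + "/" + poster_path.lstrip("/")


# Sample payloads, made up for illustration only.
print(build_poster_url({"poster_path": "/abc.jpg"}))
print(build_poster_url({"poster_path": None}))
print(build_poster_url({}))
```

A caller would then need to skip `st.image` or show a placeholder when the result is `None`; which of the two is preferable is a design choice the commit does not address.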
movies.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0a1399c7408299a483c9d831468ce07cf13408322d23eadc363e3d6d600da446
+ size 2281025
requirements.txt ADDED
Binary file (4.03 kB).
similarity.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4053da393ab855feb54e69f352748b13a0ed34bcdece4512abd055c9fc1d4c52
+ size 184781248
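Both the notebook and `app.py` pass `open(...)` straight to `pickle.dump` and `pickle.load`, so the file handles are never closed explicitly. A minimal sketch of the same round trip with context managers follows. The helper names `save_pickle` and `load_pickle`, the file name `toy.pkl`, and the toy dictionary are assumptions made for illustration and do not appear in the commit.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Serialise obj to path; the file is closed when the block exits."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Load and return the object stored at path."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Toy stand-in for new_df.to_dict(), made up for illustration only.
toy = {"title": {0: "A", 1: "B"}}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pkl")
    save_pickle(toy, path)
    restored = load_pickle(path)
print(restored == toy)
```

As with any pickle file, `load_pickle` should only be used on files from a trusted source, since unpickling can execute arbitrary code.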