odedf2001 committed on
Commit 482c824 · verified
1 Parent(s): 5286068

Update README.md

Files changed (1)
  1. README.md +0 -207
README.md CHANGED
@@ -1,207 +0,0 @@
- 🎬 Movie Revenue Prediction Project
- 📈 Regression → Feature Engineering → Clustering → Classification → Model Deployment
- 📦 Overview
-
- This project predicts movie revenue using both regression and classification models,
- supported by feature engineering, K-Means clustering, and precision-focused evaluation.
-
- It was built as part of a Data Science assignment using the Movies Metadata dataset
- (Kaggle), processed and modeled in Google Colab.
-
- The final models are exported and published in a HuggingFace repository.
-
- 🗂️ 1. Dataset
-
- Source: Kaggle’s Movies Metadata dataset
-
- Rows after cleaning: ~5,300
-
- Original target: revenue
-
- Classification target (later): revenue_class (high vs. low revenue)
-
- 🔍 Main features used
-
- budget
-
- runtime
-
- vote_average
-
- vote_count
-
- popularity
-
- release_date → converted into release_year, decade
-
- overview → transformed into a text-length feature (overview_length)
-
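The two derived features above can be sketched roughly as follows. This is a minimal illustration on a tiny hand-made frame, not the project's actual notebook code; the sample rows and the choice to count characters (rather than words) for the overview length are assumptions.

```python
import pandas as pd

# Tiny stand-in frame (assumption); the real data comes from the Kaggle CSV.
movies = pd.DataFrame({
    "release_date": ["1995-10-30", "2009-12-18", "not a date"],
    "overview": ["A cowboy doll feels threatened.", "A marine on an alien moon.", None],
})

# errors="coerce" turns unparseable dates into NaT instead of raising.
movies["release_date"] = pd.to_datetime(movies["release_date"], errors="coerce")
movies["release_year"] = movies["release_date"].dt.year

# Integer-divide by 10 to bucket years into decades (1995 -> 1990).
movies["decade"] = (movies["release_year"] // 10) * 10

# Character count of the overview; missing text is treated as length 0.
movies["overview_length"] = movies["overview"].fillna("").str.len()

print(movies[["release_year", "decade", "overview_length"]])
```

Rows with an unparseable date end up with a missing release_year and decade, so they would need to be dropped or imputed before modeling.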
- 🧹 2. Data Cleaning & Preprocessing
-
- ✔ Converted numeric fields to proper types
- ✔ Removed impossible values (zero budget/revenue/runtime)
- ✔ Parsed release_date into datetime
- ✔ Handled missing values
- ✔ Selected only meaningful rows for modeling
-
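A minimal sketch of the cleaning steps listed above, using a small made-up frame. The README does not say how missing values were handled; dropping incomplete rows here is an assumption, and imputation would be an equally plausible reading.

```python
import pandas as pd

# Made-up rows covering each failure case (assumption, not real data).
raw = pd.DataFrame({
    "budget": ["30000000", "0", "bad", "237000000"],
    "revenue": [373554033, 1000, 500, 2787965087],
    "runtime": [81.0, 90.0, 100.0, None],
    "release_date": ["1995-10-30", "2001-01-01", "2005-05-05", "2009-12-18"],
})

numeric_cols = ["budget", "revenue", "runtime"]
df = raw.copy()

# Convert numeric fields; values that cannot be parsed become NaN.
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Parse release_date into datetime; bad dates become NaT.
df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")

# Assumption: handle missing values by dropping incomplete rows.
df = df.dropna(subset=numeric_cols + ["release_date"])

# Remove impossible values: budget, revenue and runtime must all be positive.
df = df[(df[numeric_cols] > 0).all(axis=1)].reset_index(drop=True)

print(df)
```

Only the first row survives: the second has a zero budget, the third an unparseable budget, and the fourth a missing runtime.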
- 📊 3. Exploratory Data Analysis
- 📈 Budget vs Revenue
-
- Higher budget → generally higher revenue, though with a wide spread and many outliers.
-
- ⏱️ Runtime vs Revenue
-
- No strong linear trend, but most successful films fall within a typical runtime range (80–150 minutes).
-
- 🌍 Top Original Languages
-
- English overwhelmingly dominates the dataset.
-
- Each insight was supported by Matplotlib/Seaborn visualizations.
-
- 🧱 4. Baseline Regression Model
- 🎯 Goal
-
- Predict movie revenue using simple numeric features.
-
- 🧩 Features
-
- budget, runtime, vote_average, vote_count
-
- ⚙️ Model
-
- Linear Regression
-
- 📐 Metrics
-
- MAE, MSE, RMSE, R²
-
- 📝 Insight
-
- Adequate as a baseline, but its predictive power is limited, which motivates feature engineering.
-
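The baseline described above could look roughly like this. The data is synthetic (generated with NumPy) so the block runs on its own; the 80/20 split, the random seeds and the way revenue is simulated are all assumptions, not details from the project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset (assumption).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "budget": rng.uniform(1e6, 2e8, n),
    "runtime": rng.uniform(80, 150, n),
    "vote_average": rng.uniform(3, 9, n),
    "vote_count": rng.integers(10, 10_000, n),
})
df["revenue"] = 2.5 * df["budget"] + 5_000 * df["vote_count"] + rng.normal(0, 2e7, n)

features = ["budget", "runtime", "vote_average", "vote_count"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["revenue"], test_size=0.2, random_state=42
)

baseline = LinearRegression().fit(X_train, y_train)
pred = baseline.predict(X_test)

# The four metrics named in the README.
mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE
r2 = r2_score(y_test, pred)

print(f"MAE={mae:,.0f}  MSE={mse:,.0f}  RMSE={rmse:,.0f}  R2={r2:.3f}")
```

Because the synthetic revenue is linear in the features, the scores here will look far better than a linear model would achieve on real box-office data.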
- 🛠️ 5. Feature Engineering
-
- Created new features:
-
- profit = revenue - budget
-
- profit_ratio = profit / budget
-
- overview_length (text length)
-
- release_year, decade
-
- Encoded categoricals (original_language, status)
-
- Standardized numeric features using StandardScaler
-
- Added cluster-based features from K-Means:
-
- cluster_group
-
- distance_to_centroid
-
- These features improved the performance of the downstream models.
-
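A minimal sketch of the profit features, categorical encoding and scaling listed above, on a four-row made-up frame. One-hot encoding via `pd.get_dummies` is an assumption, since the README does not name the encoder. Note that profit and profit_ratio are computed from revenue, so they carry the regression target; the README does not state whether they were fed to the revenue models.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up rows for illustration (assumption).
df = pd.DataFrame({
    "budget": [30e6, 237e6, 15e6, 60e6],
    "revenue": [373e6, 2788e6, 10e6, 90e6],
    "runtime": [81.0, 162.0, 95.0, 110.0],
    "original_language": ["en", "en", "fr", "ja"],
    "status": ["Released"] * 4,
})

# Derived financial features (both depend on the target, revenue).
df["profit"] = df["revenue"] - df["budget"]
df["profit_ratio"] = df["profit"] / df["budget"]

# Assumption: one-hot encode the categorical columns.
encoded = pd.get_dummies(df, columns=["original_language", "status"], dtype=int)

# Standardize numeric features to zero mean and unit variance.
# In a real pipeline the scaler would be fit on the training split only.
numeric_cols = ["budget", "runtime", "profit", "profit_ratio"]
scaler = StandardScaler()
encoded[numeric_cols] = scaler.fit_transform(encoded[numeric_cols])

print(encoded.round(2))
```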
- 🎯 6. Clustering (K-Means + PCA)
- 🤖 Unsupervised Learning
-
- K-Means with k = 4
-
- Features: budget, runtime, vote statistics (vote_average, vote_count), popularity, profit
-
- 🌀 PCA Visualization
-
- 2D scatter plot revealing structured groups:
-
- Low-budget films
-
- Mid-tier films
-
- High-budget blockbusters
-
- Cluster assignments were later used as new predictive features.
-
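The clustering step and the two cluster-based features could be sketched as below. The input is random data standing in for the six clustering features; `n_init`, the seeds and the scaling step before K-Means are assumptions. Only k = 4 and the 2D PCA projection come from the README.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for budget, runtime, vote_average, vote_count,
# popularity and profit (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))

# K-Means is distance based, so features are scaled first.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_scaled)

# cluster_group: the cluster label assigned to each movie.
cluster_group = kmeans.labels_

# transform() gives the distance to every centroid; keep the distance
# to the centroid of the movie's own cluster.
all_distances = kmeans.transform(X_scaled)
distance_to_centroid = all_distances[np.arange(len(X_scaled)), cluster_group]

# Project to two principal components for a scatter plot.
coords_2d = PCA(n_components=2).fit_transform(X_scaled)

print(np.bincount(cluster_group), coords_2d.shape)
```

Plotting `coords_2d` colored by `cluster_group` (for example with Matplotlib's `scatter`) would give the kind of 2D view described above.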
- 🚀 7. Improved Regression Models
-
- Trained 3 regression models:
-
- Linear Regression (improved)
-
- Random Forest Regressor
-
- Gradient Boosting Regressor ← 🏆 Winner
-
- 🏆 Winning Model
-
- Gradient Boosting Regressor
-
- Why?
-
- Best R²
-
- Lowest MAE & RMSE
-
- Captures non-linear relationships between features and revenue
-
- Exported as:
- winning_model.pkl
-
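A minimal sketch of how the three regressors might be compared and the best one exported. The data is synthetic, the hyperparameters are scikit-learn defaults, and selecting by R² alone is an assumption. The README names the output file but not the serialization library, so `pickle` is also an assumption, and the file is written to the temp directory here.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with one linear and one non-linear effect (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(400, 5))
y = 3 * X[:, 0] + np.sin(6 * X[:, 1]) + rng.normal(0, 0.1, 400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
}

# Fit each model and record its test-set scores.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {
        "r2": r2_score(y_test, pred),
        "mae": mean_absolute_error(y_test, pred),
    }

# Assumption: pick the winner by highest R2.
best_name = max(scores, key=lambda name: scores[name]["r2"])

# Serialize the winning model.
out_path = Path(tempfile.gettempdir()) / "winning_model.pkl"
with open(out_path, "wb") as f:
    pickle.dump(models[best_name], f)

print(best_name, scores[best_name])
```

The saved model can be restored with `pickle.load` and used directly for `predict`, provided the same scikit-learn version is installed.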
- 🔄 8. Regression → Classification
-
- The regression target was reframed into a binary classification problem:
-
- 🎚️ Creating revenue_class
-
- Median split
-
- Class 0 → below median
-
- Class 1 → at or above median
-
- ⚖️ Class Balance
-
- Balanced by construction (~50/50), since the split is at the median.
-
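The median split can be sketched in a few lines. The revenue values are made up; the example also shows why the balance is only approximately 50/50, since ties at the median all fall into class 1.

```python
import pandas as pd

# Made-up revenues; two movies sit exactly at the median (assumption).
df = pd.DataFrame({"revenue": [5e6, 20e6, 50e6, 120e6, 900e6, 50e6]})

median_revenue = df["revenue"].median()

# Class 1: at or above the median; class 0: below it.
df["revenue_class"] = (df["revenue"] >= median_revenue).astype(int)

print(median_revenue)
print(df["revenue_class"].value_counts().sort_index())
```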
- 🧠 Business Reasoning
-
- Precision is more important than recall
-
- False positives are costlier than false negatives:
- predicting high revenue for a movie that then underperforms can waste millions.
-
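To make the precision/recall distinction concrete, here is a small sketch with hand-made labels (the labels are illustrative, not project results). Precision penalizes false positives, which is the error type the reasoning above treats as costlier.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# 1 = high revenue, 0 = low revenue (illustrative labels).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

# For binary labels the flattened matrix is ordered tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)

print(f"tp={tp} fp={fp} fn={fn} tn={tn}")
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Here one of the three movies predicted as high-revenue is actually low-revenue, so precision is 2/3, while recall is 1/2 because two of the four true high-revenue movies were missed.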
- 🤖 9. Classification Models
-
- Trained 3 classifiers:
-
- Logistic Regression
-
- Random Forest Classifier
-
- Gradient Boosting Classifier ← 🏆 Winner
-
- 🧪 Metrics Evaluated:
-
- Accuracy
-
- Precision
-
- Recall
-
- F1-score
-
- Classification report
-
- Confusion matrix
-
- 🏆 Winning Model: Gradient Boosting Classifier
-
- Highest precision (0.990)
-
- Highest F1-score (0.990)
-
- Lowest false-positive rate (the costlier error type)
-
- Exported as:
- winning_classifier.pkl
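A minimal sketch of the classifier comparison described above, on synthetic data. The default hyperparameters, the stratified 80/20 split and selecting the winner by precision alone are assumptions; the README reports precision and F1 for the winner but does not spell out the selection rule.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split

# Synthetic binary problem with a non-linear term (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.5, 600) > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

classifiers = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Fit each classifier and collect the scalar metrics.
results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "f1": f1_score(y_test, pred),
    }

# Assumption: precision decides the winner, matching the business reasoning.
best_name = max(results, key=lambda name: results[name]["precision"])
best_pred = classifiers[best_name].predict(X_test)

print(best_name, results[best_name])
print(classification_report(y_test, best_pred))
print(confusion_matrix(y_test, best_pred))
```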