odedf2001
/

movies_metadata.csv

@@ -1,207 +0,0 @@
-🎬 Movie Revenue Prediction Project
-📈 Regression → Feature Engineering → Clustering → Classification → Model Deployment
-📦 Overview
-This project predicts movie revenue with both regression and classification models,
-supported by feature engineering, K-Means clustering, and metric-based model comparison.
-It was built as part of a Data Science assignment using the Movies Metadata dataset
-(Kaggle), processed and modeled in Google Colab.
-The final models are exported and published in a HuggingFace repository.
-🗂️ 1. Dataset
-Source: Kaggle’s Movies Metadata dataset
-Rows after cleaning: ~5,300
-Original target: revenue
-Classification target (later): revenue_class (high vs. low revenue)
-🔍 Main features used
-budget
-runtime
-vote_average
-vote_count
-popularity
-release_date → converted into release_year, decade
-overview → transformed into text length feature
-🧹 2. Data Cleaning & Preprocessing
-✔ Converted numeric fields to proper types
-✔ Removed impossible values (zero budget/revenue/runtime)
-✔ Parsed release_date into datetime
-✔ Handled missing values
-✔ Selected only meaningful rows for modeling
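The cleaning steps above could look roughly like the following pandas sketch. The column names come from the feature list; the helper name `clean_movies`, the sample rows, and the choice to drop (rather than impute) rows with missing values are assumptions, since the original notebook is not shown here.

```python
import pandas as pd

NUMERIC_COLS = ["budget", "revenue", "runtime"]


def clean_movies(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning helper; a sketch, not the project's actual code."""
    df = df.copy()
    # Convert numeric fields stored as strings; unparseable values become NaN.
    for col in NUMERIC_COLS:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    # Parse release_date; invalid dates become NaT.
    df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")
    # Assumption: rows with missing values are dropped rather than imputed.
    df = df.dropna(subset=NUMERIC_COLS + ["release_date"])
    # Remove impossible values: non-positive budget, revenue, or runtime.
    keep = (df[NUMERIC_COLS] > 0).all(axis=1)
    return df[keep].reset_index(drop=True)


# Made-up rows for illustration only.
raw = pd.DataFrame({
    "budget": ["30000000", "0", "abc", "5000000"],
    "revenue": ["373554033", "1000", "500", "0"],
    "runtime": ["81", "90", "100", "95"],
    "release_date": ["1995-10-30", "1995-12-15", "1996-01-01", "not a date"],
})
cleaned = clean_movies(raw)
```

Only the first made-up row survives: the others have a zero budget, a non-numeric budget, or an unparseable date.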
-📊 3. Exploratory Data Analysis
-📈 Budget vs Revenue
-Higher budget → generally higher revenue, though with big spread and outliers.
-⏱️ Runtime vs Revenue
-No strong linear trend, but most successful films fall within a typical runtime range (80–150 minutes).
-🌍 Top Original Languages
-English overwhelmingly dominates the dataset.
-Each insight was supported by Matplotlib/Seaborn visualizations.
-🧱 4. Baseline Regression Model
-🎯 Goal
-Predict movie revenue using simple numeric features.
-🧩 Features
-budget, runtime, vote_average, vote_count
-⚙️ Model
-Linear Regression
-📐 Metrics
-MAE, MSE, RMSE, R²
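For reference, the four metrics can be computed directly from their definitions. The sketch below uses only the standard library; the function name `regression_metrics` and the sample numbers are made up for illustration and are not results from this project.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R² from paired true/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # R² compares the residual error with the variance around the mean.
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Illustrative values only.
metrics = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 330.0])
```

MAE and RMSE are in the same units as revenue, which makes them easier to interpret than MSE; R² is unitless and equals 1.0 for a perfect fit.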
-📝 Insight
-A useful baseline, but its predictive power is limited, which motivates the feature engineering in the next step.
-🛠️ 5. Feature Engineering
-Created new features:
-profit = revenue – budget
-profit_ratio = profit / budget
-overview_length (text length)
-release_year, decade
-Encoded categoricals (original_language, status)
-Standardized numeric features using StandardScaler
-Added cluster-based features from K-Means:
-cluster_group
-distance_to_centroid
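-These features improved model fit over the baseline. Note that profit and profit_ratio are computed from revenue itself, so they are only available once revenue is known.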
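The row-level features listed above can be sketched for a single record as follows. The helper name `engineer_features`, the sample record, measuring `overview_length` in characters, and defining `decade` as the year rounded down to a multiple of 10 are all assumptions made for illustration.

```python
from datetime import date


def engineer_features(movie: dict) -> dict:
    """Hypothetical helper deriving the engineered features for one film."""
    released = date.fromisoformat(movie["release_date"])
    profit = movie["revenue"] - movie["budget"]
    return {
        "profit": profit,
        # Budget is assumed to be > 0 after cleaning, so this cannot divide by zero.
        "profit_ratio": profit / movie["budget"],
        # Assumption: length counted in characters, not words.
        "overview_length": len(movie["overview"]),
        "release_year": released.year,
        # Assumption: decade is the year rounded down to a multiple of 10.
        "decade": released.year // 10 * 10,
    }


# Made-up record for illustration only.
example = {
    "budget": 50,
    "revenue": 150,
    "release_date": "1995-10-30",
    "overview": "A toy story.",
}
features = engineer_features(example)
```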
-🎯 6. Clustering (K-Means + PCA)
-🤖 Unsupervised Learning
-K-Means with k = 4
-Features: budget, runtime, vote stats, popularity, profit
-🌀 PCA Visualization
-2D scatter plot revealing structured groups:
-Low-budget films
-Mid-tier films
-High-budget blockbusters
-Clusters later used as new predictive features.
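A minimal scikit-learn sketch of this step, assuming the features are standardized before clustering. The data here is synthetic, and `random_state`, `n_init`, and the variable names are illustrative choices rather than the project's actual settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the numeric feature matrix (rows = films).
X = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 6))

# K-Means is distance-based, so features are put on a common scale first.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_scaled)
cluster_group = kmeans.labels_
# transform() returns the distance from each row to every centroid;
# the distance to the row's own (nearest) centroid is the smallest one.
distance_to_centroid = kmeans.transform(X_scaled).min(axis=1)

# Project to two principal components for a 2D scatter plot.
coords_2d = PCA(n_components=2).fit_transform(X_scaled)
```

`cluster_group` and `distance_to_centroid` correspond to the two cluster-based features named in the feature engineering section, and `coords_2d` can be passed to a scatter plot colored by `cluster_group`.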
-🚀 7. Improved Regression Models
-Trained 3 regression models:
-Linear Regression (improved)
-Random Forest Regressor
-Gradient Boosting Regressor ← 🏆 Winner
-🏆 Winning Model
-Gradient Boosting Regressor
-Why?
-Best R²
-Lowest MAE & RMSE
-Captures non-linear relationships between the features and revenue
-Exported as:
-winning_model.pkl
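One way to produce such a file is the standard `pickle` module, sketched below on synthetic data. Whether the project used `pickle` or `joblib` is not stated, so the serialization method, the data, and the hyperparameters here are assumptions.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
# Synthetic data with a non-linear term, standing in for the real features.
X = rng.uniform(0.0, 1.0, size=(100, 3))
y = 3.0 * X[:, 0] + np.sin(6.0 * X[:, 1]) + rng.normal(0.0, 0.1, size=100)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "winning_model.pkl"
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
```

A pickled scikit-learn model should be loaded with a scikit-learn version compatible with the one that saved it, and only from a trusted source, since unpickling can execute arbitrary code.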
-🔄 8. Regression → Classification
-The regression target was reframed into a binary classification problem:
-🎚️ Creating revenue_class
-Median split
-Class 0 → below median
-Class 1 → at or above median
-⚖️ Class Balance
-Approximately balanced (~50/50), as expected from a median split.
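The median split described above can be sketched with the standard library. The function name `make_revenue_class` and the revenue values are made up for illustration.

```python
from statistics import median


def make_revenue_class(revenues):
    """Label each revenue 1 if at or above the median, else 0."""
    threshold = median(revenues)
    classes = [1 if r >= threshold else 0 for r in revenues]
    return classes, threshold


# Illustrative revenues only.
revenues = [5, 12, 40, 75, 120, 300]
classes, threshold = make_revenue_class(revenues)
share_high = sum(classes) / len(classes)
```

With an odd number of rows, or with several films tied at the median, the `>=` rule puts slightly more than half of the rows in class 1, which is why the balance is approximate rather than exact.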
-🧠 Business Reasoning
-Precision matters more than recall here.
-False positives are more costly than false negatives:
-predicting a movie as high-revenue when it won’t be wastes millions.
-🤖 9. Classification Models
-Trained 3 classifiers:
-Logistic Regression
-Random Forest Classifier
-Gradient Boosting Classifier ← 🏆 Winner
-🧪 Metrics Evaluated:
-Accuracy
-Precision
-Recall
-F1-score
-Classification report
-Confusion matrix
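The scalar metrics in this list all derive from the four cells of the confusion matrix. The sketch below computes them from scratch for the binary case; the function name and the labels are made up for illustration and are not the project's results.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = high revenue)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative labels: a cautious classifier that never raises a false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]
report = classification_metrics(y_true, y_pred)
```

In this made-up example precision is 1.0 while recall is only 0.5, which shows how a model can avoid false positives entirely at the cost of missing some high-revenue films.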
-🏆 Winning Model: Gradient Boosting Classifier
-Highest precision (0.990)
-Highest F1-score (0.990)
-Lowest rate of false positives, the costly error type identified above
-Exported as:
-winning_classifier.pkl