🍽️ Swiggy Restaurant Rating Predictor — Classification, Regression & Clustering

Reichman University | Adelson School of Entrepreneurship | Introduction to Data Science | 2026

📌 Project Overview

This project applies a full end-to-end data science pipeline to the Swiggy Restaurants Dataset — a dataset of over 107,000 restaurants across India, collected from the Swiggy food delivery platform.

"Can we predict a restaurant's rating based on its features — cuisine, price, location, and popularity?"


Research Question	Can we predict a restaurant's rating based on its features?
Dataset	Swiggy Restaurants Dataset — Kaggle
Dataset Size	~107,000 restaurants across India
Target Variable	`Rating` — customer rating (1.0–5.0)
Task Types	Regression + Classification + Clustering

📊 Dataset

Source: Kaggle — Swiggy Restaurants Dataset
Size: 107,000 rows × 6 columns
Numeric features: Average Price, Number of Ratings, Number of Offers
Categorical features: Cuisine, Location, Pure Veg
Target (Regression): Rating — customer rating (1.0–5.0)
Target (Classification): Rating converted to 3 classes (Low, Medium, High) by quantile binning; the resulting split is roughly, not exactly, balanced

🧹 Part 2: Data Cleaning & EDA

Data Cleaning

Removed 33,719 rows with missing Rating values — our target variable
Converted Average Price from text format (e.g. "₹250 for two") to numeric
Converted Number of Ratings from text format (e.g. "10+ ratings") to numeric
Encoded categorical columns: Cuisine, Location, Pure Veg
Final clean dataset: ~107,000 rows ready for modeling
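
The two text-to-numeric conversions above can be sketched with plain regular expressions. This is a minimal illustration rather than the notebook's actual code: the function names are hypothetical, and the handling of a `K` suffix (as in "1K+ ratings") is an assumption that the source does not confirm.

```python
import re


def parse_price(text):
    """Extract the first number from a string such as '₹250 for two'."""
    if text is None:
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", str(text))
    if match is None:
        return None
    # Drop thousands separators before converting.
    return float(match.group().replace(",", ""))


def parse_rating_count(text):
    """Turn '10+ ratings' into 10. A 'K' suffix (assumed) multiplies by 1000."""
    if text is None:
        return None
    match = re.search(r"(\d+(?:\.\d+)?)\s*([Kk]?)\+?", str(text))
    if match is None:
        return None
    value = float(match.group(1))
    if match.group(2):
        value *= 1000
    return int(value)


print(parse_price("₹250 for two"))        # numeric price
print(parse_rating_count("10+ ratings"))  # lower bound of the rating bucket
```

Strings without any digits return `None`, so they can be dropped or imputed in a later step.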

🔍 Outlier Detection

The Rating distribution is slightly left-skewed, with most values between 3.5 and 4.5. Ratings outside the valid 1.0–5.0 range were treated as invalid and removed during cleaning. No extreme outliers were found in numeric features.

Decision: Keep all valid data — no artificial capping was applied.

❓ Question 1: What is the average rating by cuisine type?

Answer: All top 10 cuisine types have very similar average ratings, clustered tightly around 4.0. South Indian and Chinese cuisines rank slightly higher, but the differences are minimal — confirming that cuisine type alone does not determine a restaurant's success on Swiggy.

❓ Question 2: Does price affect rating?

Answer: There is almost no linear correlation between price and rating (r = 0.06). Expensive restaurants (₹1,000+) are rated no higher than budget ones (₹100–300). In this dataset, ratings appear to reflect the experience rather than the price tag.

❓ Question 3: Do vegetarian restaurants get better ratings?

Answer: Pure vegetarian and non-vegetarian restaurants receive nearly identical average ratings (~4.0). Being vegetarian gives no rating advantage on Swiggy — quality matters more than dietary category.

❓ Question 4: Which cities have the most restaurants on Swiggy?

Answer: The top 10 cities each have between 1,400–1,600 restaurants on Swiggy. Kanchipuram and Kanpur lead slightly. Swiggy has broad and balanced geographic coverage across India — no single city dominates the platform.

❓ Question 5: Do more popular restaurants get better scores?

Answer: Restaurants with more ratings tend to cluster around stable scores of 3.5–4.5, while restaurants with very few ratings show more extreme and less reliable scores, as expected when an average is computed from only a handful of reviews. Popularity stabilizes ratings but does not guarantee higher ones. Social proof may also play a role in anchoring expectations.

🌍 BONUS: Correlation Heatmap

Note: The interactive version of this heatmap is available in the notebook.

Average Price has very weak correlation with Rating (r = 0.06)
Number of Ratings has a marginally stronger, though still very weak, correlation with Rating (r = 0.07)
No single numeric feature strongly predicts the rating alone

This finding motivated the entire feature engineering approach in Part 4.
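
To make the r values above concrete, here is a small sketch of the Pearson correlation coefficient in plain Python. The toy price and rating lists are illustrative only and are not taken from the dataset. Squaring r gives the share of variance that a single feature could explain in a simple linear fit, which is why r = 0.06 leaves so little to work with.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data with a perfectly linear relationship (illustrative, not real rows).
toy_prices = [100, 200, 300, 400]
toy_ratings = [4.0, 4.1, 4.2, 4.3]
print(round(pearson_r(toy_prices, toy_ratings), 6))

# Share of variance a lone feature with r = 0.06 could explain linearly.
explained_share = 0.06 ** 2
print(f"{explained_share:.4f}")
```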

⚙️ Part 3: Baseline Model

Before building complex models, we established a Linear Regression baseline using only raw features — no engineering, no transformations. This gives us a clear reference point to measure how much our improvements actually help.

Metric	Value
MAE	0.3635
RMSE	0.4916
R²	0.0114

The baseline explains only 1.1% of the variance in ratings — a humble but expected result. Most restaurants cluster tightly between 3.5–4.5, leaving little signal for a simple linear model to learn from.
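
For reference, the three metrics in the table can be computed by hand as sketched below. The ratings are toy values chosen for easy arithmetic. The example predicts the mean rating for every restaurant, which yields R² = 0 by definition, so a baseline R² of 0.0114 is only slightly better than always guessing the average.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Toy ratings (illustrative only); their mean is 4.0.
ratings = [3.5, 4.0, 4.5, 4.0, 3.0, 5.0]
mean_rating = sum(ratings) / len(ratings)

# A "model" that always predicts the mean rating.
mae, rmse, r2 = regression_metrics(ratings, [mean_rating] * len(ratings))
print(mae, round(rmse, 4), r2)
```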

The model struggles to predict outside the 4.0–4.5 range — confirming that raw features alone are not enough. Most predictions cluster around the mean, missing the full range of actual ratings.

All feature importance values in the baseline are extremely low, which is consistent with the weak correlations seen in the heatmap.

Challenge set: Can we do significantly better with feature engineering and more powerful models?

🔧 Part 4: Feature Engineering

Raw data alone is rarely enough. I engineered 4 new features designed to capture patterns the original columns couldn't express:

Feature	Description	Intuition
`Is_Expensive`	Price above ₹300	Captures the premium restaurant segment
`Has_Many_Offers`	4 or more offers available	Captures promotional activity level
`Is_Popular`	More than 50 ratings	Captures established vs new restaurants
`Cluster`	K-Means cluster ID (k=4)	Groups similar restaurants by behavioral profile
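
The three threshold features in the table can be expressed as simple comparisons. The sketch below works on one row represented as a dictionary. The helper name is hypothetical, and the notebook presumably applies the same rules column-wise to a DataFrame.

```python
def add_binary_features(row):
    """Return a copy of `row` with the three threshold flags added."""
    enriched = dict(row)
    enriched["Is_Expensive"] = int(row["Average Price"] > 300)     # price above ₹300
    enriched["Has_Many_Offers"] = int(row["Number of Offers"] >= 4)  # 4 or more offers
    enriched["Is_Popular"] = int(row["Number of Ratings"] > 50)     # more than 50 ratings
    return enriched


# Illustrative row, not taken from the dataset.
example = {"Average Price": 450, "Number of Ratings": 20, "Number of Offers": 4}
flags = add_binary_features(example)
print(flags)
```

Note the boundary behaviour implied by the table: a price of exactly ₹300 is not expensive, while exactly 4 offers does count as many offers.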

K-Means Clustering (k=4)

Applied K-Means clustering on Average Price, Number of Ratings, and Number of Offers to automatically group restaurants into 4 behavioral profiles — unsupervised learning working alongside our supervised models:

Cluster 0: Budget restaurants with few ratings
Cluster 1: Mid-range restaurants
Cluster 2: Popular restaurants with many ratings
Cluster 3: High-offer restaurants

📐 BONUS: Elbow Method

Used the Elbow Method to check K=4 as the number of clusters: the inertia drops sharply up to K=4 and flattens afterwards, which supports our choice.
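
A minimal sketch of the elbow check with scikit-learn follows. Several things here are assumptions rather than details from the notebook: the data is synthetic (four well-separated blobs standing in for Average Price, Number of Ratings, and Number of Offers), the features are standardized before clustering, and `n_init` and `random_state` are arbitrary choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three numeric columns (not real Swiggy data).
centers = [[0, 0, 0], [10, 0, 0], [0, 10, 0], [0, 0, 10]]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0, random_state=42)

# Assumption: scale first, because K-Means relies on Euclidean distance and
# a column with large values would otherwise dominate the others.
X_scaled = StandardScaler().fit_transform(X)

# Inertia (within-cluster sum of squares) for a range of k values.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(f"k={k}: inertia={value:.1f}")

# Fit the chosen k and keep the labels as a new Cluster feature.
cluster_ids = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_scaled)
```

The elbow is the k after which each additional cluster buys only a small reduction in inertia. On real data the bend is usually less sharp than on these synthetic blobs.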

💡 Feature Engineering Insights

Number of Ratings — remained the top feature in both regression and classification
Location ranked second in regression and third in classification
Cluster ranked fourth in both tasks, suggesting that clustering added some predictive signal
Binary features (Is_Expensive, Is_Popular) helped the model distinguish restaurant segments

Business Insight: WHERE a restaurant is located and HOW POPULAR it already is matter far more than what it serves or how expensive it is. Location and reputation drive ratings more than menu or pricing strategy.

🤖 Part 5: Regression Models

Trained 3 different regression models on the engineered dataset and compared them against the baseline — an iterative improvement process:

Model	MAE	R²
Linear Regression (Baseline)	0.3635	0.0114
Linear Regression (Improved)	0.3623	0.0160
Random Forest	0.3481	0.0510
Gradient Boosting ✅	0.3463	0.0776

🏆 Winner: Gradient Boosting (MAE = 0.3463, R² = 0.0776)

Why Gradient Boosting wins: It builds trees iteratively, each one correcting the mistakes of the previous — better at capturing the subtle patterns in tightly-clustered restaurant ratings.

Feature Importance

Top predictors:

Number of Ratings (#1) — more reviewed restaurants are more predictable
Location (#2) — where a restaurant is matters more than what it serves
Cuisine (#3) — cuisine type has some influence
Cluster (#4) — our engineered feature added real predictive value!

BONUS: Residual Analysis

Residuals vs Predicted: Residuals centered around 0 — the model is unbiased
Distribution of Residuals: Near-normal distribution — the model makes symmetric errors
The spread reflects the inherent difficulty of predicting ratings in the narrow 3.5–4.5 range

BONUS: Hyperparameter Tuning

Used GridSearchCV with 3-fold cross validation to find optimal parameters:

Best params: learning_rate=0.05, max_depth=5, n_estimators=100
Best cross-validated MAE: 0.3497, close to the 0.3463 MAE reported above for the initial parameters, which suggests they were already well chosen
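
The tuning step above might look roughly like the sketch below. The parameter grid is hypothetical: only the reported best values (learning_rate=0.05, max_depth=5, n_estimators=100) come from this README. The synthetic data and the `neg_mean_absolute_error` scoring are also assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem standing in for the engineered dataset.
X, y = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

# Hypothetical grid; the notebook's actual candidate values are not given.
param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
    "n_estimators": [50, 100],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=3,                                # 3-fold cross validation
    scoring="neg_mean_absolute_error",   # scikit-learn maximizes, hence the sign
)
search.fit(X, y)

best_mae = -search.best_score_  # flip the sign back to a positive MAE
print(search.best_params_, round(best_mae, 4))
```

Because `best_score_` is an average over validation folds, it is not directly comparable to an MAE measured on a single held-out test set.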

💾 Part 6: Saved Model

The winning Gradient Boosting Regressor was saved and uploaded to this HuggingFace repository.


Model	Gradient Boosting Regressor
File	`swiggy_model.pkl`
MAE	0.3463
R²	0.0776
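
Loading the saved file could follow the pattern sketched below, assuming it was written with Python's `pickle` module (the `.pkl` extension suggests this, but `joblib` is also common and the README does not say which was used). The example round-trips a `DummyRegressor` as a stand-in, since the real model's feature order is not documented here.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.dummy import DummyRegressor

# Stand-in model that always predicts the mean of its training targets.
stand_in = DummyRegressor(strategy="mean").fit([[0], [1]], [3.5, 4.5])

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "swiggy_model.pkl"

    # Save the fitted estimator.
    with path.open("wb") as fh:
        pickle.dump(stand_in, fh)

    # Load it back, as a user of the repository would.
    with path.open("rb") as fh:
        restored = pickle.load(fh)

predictions = restored.predict([[0]])
print(predictions)
```

Only unpickle files from a source you trust, and use a scikit-learn version compatible with the one that produced the file.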

🏷️ Part 7: Regression → Classification

Converted Rating into 3 meaningful classes using quantile binning:

Class	Definition	% of Data
0 — Low	Bottom 33%	33.8%
1 — Medium	Middle 33%	42.6%
2 — High	Top 33%	23.7%

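Why quantile binning? It splits restaurants into Low / Medium / High groups at the rating terciles and aims for equal-sized classes. The split above is not perfectly even, most likely because many restaurants share the same rating value and tied values cannot be divided across a bin edge.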
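
The sketch below illustrates tercile binning with the standard library and shows how tied ratings can make the classes uneven. The toy ratings and the function name are illustrative, and the notebook may well use a pandas helper instead.

```python
from collections import Counter
from statistics import quantiles


def tercile_classes(ratings):
    """Label each rating 0 (Low), 1 (Medium) or 2 (High) using tercile cut points."""
    low_cut, high_cut = quantiles(ratings, n=3)
    labels = []
    for rating in ratings:
        if rating <= low_cut:
            labels.append(0)
        elif rating <= high_cut:
            labels.append(1)
        else:
            labels.append(2)
    return labels


# Toy ratings with several ties at 4.0 (illustrative only).
toy_ratings = [3.0, 3.5, 3.9, 4.0, 4.0, 4.0, 4.0, 4.1, 4.2, 4.3, 4.4, 4.5]
counts = Counter(tercile_classes(toy_ratings))
print(dict(counts))
```

All restaurants rated exactly 4.0 land in the same class, so the three groups cannot be equal in size even though the cut points are terciles.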

Why F1 over accuracy? The dataset has mild class imbalance: a model predicting "Medium" for everything would reach about 43% accuracy without learning anything. Weighted F1 is a fairer metric.

Why Recall matters more here: It's worse to miss a truly great restaurant (False Negative) than to occasionally recommend a mediocre one (False Positive). False Negative is more critical — predicting LOW when the restaurant is actually HIGH means hiding good restaurants from users.
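
To see why accuracy can flatter a model that has learned nothing, the sketch below scores a predictor that always answers "Medium". The class counts (34 / 43 / 23 out of 100) are the table percentages rounded to whole numbers, and the helper is a hand-written version of support-weighted F1, not a library call.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label in sorted(support):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += support[label] / total * f1
    return score


# Rounded class shares: 0 = Low, 1 = Medium, 2 = High.
y_true = [0] * 34 + [1] * 43 + [2] * 23
y_pred = [1] * len(y_true)  # always predict Medium

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = weighted_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, weighted F1={f1:.2f}")
```

The always-Medium predictor matches the majority share in accuracy but scores far lower on weighted F1, because its F1 for Low and High is zero.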

🧠 Part 8: Classification Models

Trained 3 different classification models to predict restaurant rating classes (Low / Medium / High):

Model	Accuracy	F1 (weighted)
Logistic Regression	0.43	0.33
Random Forest	0.45	0.46
Gradient Boosting ✅	0.50	0.46

🏆 Winner: Gradient Boosting Classifier, with the best accuracy (0.50) and a weighted F1 (0.46) tied with Random Forest.

Confusion Matrices

Key observations:

Logistic Regression — predicted almost everything as "Medium", struggled with class separation
Random Forest — more balanced predictions across all 3 classes
Gradient Boosting — best overall accuracy (50%) with most correct predictions

Feature Importance (Classification)

Number of Ratings (#1) — most reviewed restaurants are most predictable
Cuisine_encoded (#2) — cuisine type plays a strong role in rating class
Location_encoded (#3) — where a restaurant is matters
Cluster (#4) — our engineered feature contributed real signal!

🎯 BONUS: Hyperparameter Tuning — Before vs After

Applied GridSearchCV with 12 combinations and 3-fold cross validation:

	Before Tuning	After Tuning
F1 (weighted)	0.46	0.58
Accuracy	0.50	0.50
Best Params	—	`n_estimators=200`, `max_depth=None`, `min_samples_split=5`

+26% relative improvement in weighted F1 (0.46 → 0.58) from tuning alone.
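
As an illustration of what "12 combinations and 3-fold cross validation" means in terms of work, the sketch below enumerates a hypothetical grid. Only the best parameters in the table come from this README. The other candidate values are assumptions, chosen so that the grid has 12 entries.

```python
from itertools import product

# Hypothetical grid: 2 x 3 x 2 = 12 combinations.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 5],
}

# Every combination of one value per parameter.
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

n_folds = 3
total_fits = len(combinations) * n_folds  # one fit per combination per fold

reported_best = {"n_estimators": 200, "max_depth": None, "min_samples_split": 5}
print(len(combinations), total_fits, reported_best in combinations)
```

Grid search cost grows multiplicatively with each added parameter, which is why grids are usually kept small.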

💡 Part 8 Summary

Model	Accuracy	F1 (weighted)
Logistic Regression	0.43	0.33
Random Forest	0.45	0.46
Gradient Boosting	0.50	0.46
Gradient Boosting (Tuned) ✅	0.50	0.58

Number of Ratings ranks first and Cluster fourth in both the regression and classification feature importances, which suggests that the feature engineering choices carried over across tasks.

📦 Repository Contents

File	Description
`README.md`	This file
`Copy_of_Assignment_2_...ipynb`	Full Colab notebook with all code and outputs
`swiggy_model.pkl`	Winning Gradient Boosting Regression model
`swiggy_classifier.pkl`	Winning Gradient Boosting Classification model

📝 Project Summary

This project demonstrates a complete data science pipeline applied to the Swiggy Restaurants Dataset — over 107,000 restaurants across India. Starting from raw restaurant data, we built a system that predicts customer ratings and classifies restaurants into Low, Medium, and High rated groups.

Feature engineering, clustering, and a stronger model raised regression R² from 0.0114 to 0.0776, a roughly 6.8-fold increase (about 580% in relative terms), although the best model still explains under 8% of the variance in ratings. Hyperparameter tuning alone improved classification weighted F1 by 26% in relative terms (0.46 → 0.58).
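
The relative changes quoted in this summary can be checked with a few lines of arithmetic, using only numbers reported in the tables above:

```python
def relative_change(before, after):
    """Change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


r2_change = relative_change(0.0114, 0.0776)   # baseline vs Gradient Boosting R²
f1_change = relative_change(0.46, 0.58)       # weighted F1 before vs after tuning
mae_change = relative_change(0.3635, 0.3463)  # baseline vs Gradient Boosting MAE

print(f"R2: {r2_change:+.0%}, F1: {f1_change:+.0%}, MAE: {mae_change:+.1%}")
```

Large relative gains on a very small starting value, as with R² here, are best read alongside the absolute numbers.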

Author: Amit Ben Avraham | Reichman University — Adelson School of Entrepreneurship | Introduction to Data Science | 2026

🤖 AI Usage Disclosure

This project was completed with assistance from Claude (Anthropic) for code debugging, chart design, and README writing. All analysis, decisions, and interpretations are my own.
