Video

Pakistan House Price Prediction: Classification, Regression, Clustering, Evaluation Project

Overview

To what extent can we accurately predict residential property prices in Pakistan based on features such as location, square footage, and the number of bedrooms? This project builds a full machine learning pipeline to analyze and predict property values using regression models and clustering.

Dataset Description

Source: Kaggle, Pakistan House Price Prediction (https://www.kaggle.com/datasets/ebrahimhaquebhatti/pakistan-house-price-prediction)

Author: Ebrahim Haque Bhatti

Size: 168,446 rows × 18 features

Target variable: price (rent / sale price of a house)

Numeric features: price, latitude, longitude, bath, bedrooms, date_added, total_area

Categorical features: property_type, location, city, province_name, purpose, agency, agent

Exploratory Data Analysis (EDA)

1. Data Cleaning & Handling

Missing values:

The agency and agent columns had over 44,000 missing values.

Instead of deleting these columns, I labeled the missing entries as 'Unknown'. This keeps every row in the dataset and lets the model check whether being listed with a specific agency or agent changes the house price.
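A minimal sketch of this step, assuming pandas and a small toy table (the agency names, agent names, and prices below are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the listings table (values are illustrative only).
df = pd.DataFrame({
    "agency": ["Acme Realty", None, "Prime Estates", None],
    "agent": ["Ali", None, None, "Sara"],
    "price": [5_000_000, 7_500_000, 3_200_000, 9_000_000],
})

# Replace missing agency/agent entries with an explicit "Unknown" category,
# so no rows are dropped and "no agency listed" becomes a level of its own.
df[["agency", "agent"]] = df[["agency", "agent"]].fillna("Unknown")

print(df)
```

Treating "Unknown" as its own category means the encoder later sees it like any other agency name.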

Datetime Parsing:

The date_added feature was converted to datetime format to extract year_listed and month_listed, enabling the model to account for time-based market fluctuations.
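The conversion could look roughly like this with pandas. The date strings and their format are assumptions made for the example; only the column names date_added, year_listed, and month_listed come from the project.

```python
import pandas as pd

# Toy dates; the format used in the real dataset may differ.
df = pd.DataFrame({"date_added": ["2019-06-14", "2018-11-02", "2019-01-30"]})

# Parse the strings into datetimes, then pull out the calendar parts.
df["date_added"] = pd.to_datetime(df["date_added"])
df["year_listed"] = df["date_added"].dt.year
df["month_listed"] = df["date_added"].dt.month

print(df)
```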

Typos / impossible values:

Prices were converted from PKR to ILS to make the financial data more relatable and easier to analyze.

image

After reviewing the descriptive statistics, several impossible or extreme values were identified that would introduce significant noise into the model:

Properties with a price of 0 were removed because they represent incomplete listings or data entry errors.

Extreme Room Counts: Properties with more than 15 bathrooms or 15 bedrooms (such as a record with 403 baths) were removed. These were treated as data entry errors or highly unique properties that do not represent the general residential market. Removing them allows the model to better understand the relationship between typical home features and price.

Zero-Area Correction: Properties listed with a Total_Area of 0 were removed, as a property with no square footage is not relevant for this price prediction analysis.

I decided not to delete properties with 0 bedrooms or bathrooms because these aren't necessarily errors. In many cases, these listings represent land where no building exists yet. They could also be studio apartments.
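The cleaning rules above can be sketched as a single boolean filter. This is an illustration on toy rows, assuming the columns are named price, bedrooms, baths, and Total_Area:

```python
import pandas as pd

# Toy rows: one zero price, one 403-bath record, one zero area, two valid listings.
df = pd.DataFrame({
    "price": [0, 4_000_000, 6_500_000, 2_000_000, 8_000_000],
    "bedrooms": [3, 0, 4, 2, 5],
    "baths": [2, 0, 403, 1, 4],
    "Total_Area": [1200.0, 2500.0, 1800.0, 0.0, 3000.0],
})

MAX_ROOMS = 15  # threshold stated in the text above

keep = (
    (df["price"] > 0)                # drop zero-price listings
    & (df["bedrooms"] <= MAX_ROOMS)  # drop extreme bedroom counts
    & (df["baths"] <= MAX_ROOMS)     # drop extreme bathroom counts
    & (df["Total_Area"] > 0)         # drop zero-area listings
)
clean = df[keep].reset_index(drop=True)

print(clean)
```

Note that the row with 0 bedrooms and 0 baths survives the filter, matching the decision to keep land and studio listings.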

2. Outlier Detection & Handling

Outliers were identified using boxplots, and only the most extreme ones (the top 1%) were removed.

Removing only the top 1% keeps the model focused on the standard housing market rather than on a few unique palaces. Deleting every expensive house, by contrast, would leave the model unable to predict prices in the luxury market.
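One way to trim the top 1% is a quantile cutoff. The sketch below applies it to a toy price series; which columns the cutoff was applied to in the project is not stated, so treating price this way is an assumption.

```python
import pandas as pd

# Toy prices 1..1000 so the effect of the cutoff is easy to see.
prices = pd.Series(range(1, 1001), name="price")

# Everything above the 99th percentile is treated as an extreme outlier.
cutoff = prices.quantile(0.99)
trimmed = prices[prices <= cutoff]

print(f"cutoff={cutoff:.2f}, kept {len(trimmed)} of {len(prices)} rows")
```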

Bedrooms:

image

The boxplot shows a positive relationship between bedroom count and property price up to 5 bedrooms. Because the relationship does not hold beyond that point, the number of bedrooms isn't the only factor that determines a house price.

Bath:

image

The boxplot shows a positive relationship between bathroom count and property price up to 8 bathrooms. Because the relationship does not hold beyond that point, the number of bathrooms isn't the only factor that determines a house price.

Year Listed:

image

While the overall market median remains stable between 2018 and 2019, 2019 has a broader distribution of prices. This shows that the year a property was listed isn't the only factor that determines a house price.

Month Listed:

image

The median prices are quite stable (there isn't any major jump). This shows that the month listed of a property isn't the only factor that indicates a house price.

3. Descriptive Statistics

The average listing:

image

The average listing in this dataset:

Baths: 3

Bedrooms: 3

Listed in June 2019

Price in ILS: 75,268

4. Visualizations

Correlation Heatmap:

image

Bedrooms and baths have the strongest positive correlation with price.

Histogram โ€” Distribution of Price:

image

Most houses sit in a lower price range (usually under 10 thousand shekels). While there are some luxury houses, they are rare.

Scatter Plot โ€” Bedrooms vs. Price:

image

There is a positive correlation. As the number of bedrooms increases, the price generally rises. The vertical spread of dots at each bedroom count suggests that other factors play a significant role in pricing.

Answer to my research question:

The two strongest predictors of a house's price are the number of bedrooms and the number of baths.

5. Research Questions & Answers

Q1: Does a larger land area automatically mean a higher house price?

image

Answer: no

The red line goes up a little as the area gets bigger, which shows a trend of bigger being more expensive. On the other hand, the heavy clustering at the bottom left shows that for the majority of homes, area isn't the main thing driving the price.

Q2: Does the property type affect the price?

image

Answer: yes

Farm houses are the most expensive category but also the least predictable, as shown by the long vertical error bar (the same holds for penthouses). Flats and houses have very consistent pricing, and rooms and portions have the lowest prices.

Q3: What city has the most expensive houses?

image

Answer: Lahore. I used a bar plot to check which city has the most expensive properties.

Q4: Does the time of listing (date_added) influence property prices?

image

Answer: maybe

There is a major price dip in January 2019 followed by a consistent upward trend peaking in June 2019. This suggests that time-based trends may carry some signal for the model, but other features are likely affecting the price as well.

Training a Baseline Model

Regression Goal:

Predict the price of a property in Pakistan based on its features.

Feature Selection

  1. Numeric: bedrooms, baths, Total_Area.
  2. Categorical: city, property_type.
  3. Target: price.

Linear Regression Model

I used a linear regression model as my baseline. The model learned a weight (coefficient) for each feature from the training set.

After training the model, I used standard regression metrics to evaluate how well it predicts property prices on the unseen test set.

R2 Score: 0.2828. The baseline explains only about 28% of the variance in price, which indicates that price is influenced by many factors the model does not yet capture. Improving accuracy will likely require further feature engineering or more complex algorithms.
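To make the metric concrete, here is a small from-scratch sketch of R2 and MAE in plain Python. The numbers are toy values, not results from this project. R2 compares the model's squared error with that of always predicting the mean, so a score of 0.28 means the model removes about 28% of that baseline error.

```python
def r2_score(y_true, y_pred):
    """R2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between the true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [10.0, 20.0, 30.0, 40.0]

perfect = r2_score(y_true, y_true)         # predictions match exactly
mean_only = r2_score(y_true, [25.0] * 4)   # always predict the mean
mae = mean_absolute_error(y_true, [12.0, 18.0, 33.0, 40.0])

print(perfect, mean_only, mae)
```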

Insights:

Screenshot 2026-05-02 at 17.10.53

There is a clear positive correlation, indicating the baseline model captures the basic relationship between property features and price.

However, the significant dispersion of data points around the identity line shows that the current features do not fully account for price volatility.

Feature Engineering

Creating New Features

To improve model performance beyond the raw dataset features, several engineered features were created:

By implementing a preprocessing pipeline and polynomial interactions, I transformed the original 5 features into a high-dimensional dataset of 120 features.

This approach enables the model to understand complex interactions such as how the value of an additional square meter varies by city, leading to a more accurate price prediction.
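A pipeline of this kind could be sketched with scikit-learn as follows. The choice of StandardScaler, OneHotEncoder, and pairwise-only interactions is an assumption for illustration; the project's exact transformers and settings are not given here. On the toy table the output has 36 columns rather than 120, because the toy data has fewer categories.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# Toy rows using the five selected features (values are illustrative).
X = pd.DataFrame({
    "bedrooms": [3, 4, 2, 5],
    "baths": [2, 3, 1, 4],
    "Total_Area": [1200.0, 2500.0, 900.0, 4000.0],
    "city": ["Lahore", "Karachi", "Lahore", "Islamabad"],
    "property_type": ["House", "Flat", "Flat", "House"],
})

# Scale numeric columns and one-hot encode categorical ones.
# sparse_threshold=0.0 forces a dense array for the next step.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["bedrooms", "baths", "Total_Area"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city", "property_type"]),
    ],
    sparse_threshold=0.0,
)

# Add pairwise products, e.g. Total_Area x city_Lahore, so a linear model
# can learn a different price per unit of area in each city.
features = Pipeline([
    ("preprocess", preprocess),
    ("interactions", PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
])

X_expanded = features.fit_transform(X)
print(X_expanded.shape)  # 8 base columns + 28 pairwise products
```

With this particular setting, 15 preprocessed columns would expand to 15 + 105 = 120, which is one of several configurations consistent with the reported feature count.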

Clustering

image

The PCA plot shows that most properties are concentrated in a dense line along the bottom, representing the standard housing market. The isolated dots high on the Y-axis (especially the light green dot for Cluster 3) represent significant outliers. This visual confirms that the clustering algorithm successfully separated typical homes from extreme or unique properties.
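The clustering algorithm is not named above, so the sketch below uses KMeans as an assumption, with four clusters because the plot mentions a Cluster 3. It runs on synthetic data containing a dense group of typical homes and a few extreme ones, then projects the points to two dimensions with PCA for plotting.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic (bedrooms, baths, area) rows: 200 typical homes and 5 extreme ones.
typical = rng.normal(loc=[3, 2, 1500], scale=[1, 1, 400], size=(200, 3))
extreme = rng.normal(loc=[10, 9, 20000], scale=[1, 1, 2000], size=(5, 3))
X = np.vstack([typical, extreme])

# Scale first so that area does not dominate the distance computation.
X_scaled = StandardScaler().fit_transform(X)

# Assign each property to one of four clusters (KMeans is an assumed choice).
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Reduce to two components for a scatter plot colored by cluster label.
coords = PCA(n_components=2).fit_transform(X_scaled)

print(np.bincount(labels))
```

The labels can then be one-hot encoded and appended to the feature matrix as cluster features.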

Three Improved Models

Three regression models were trained and evaluated:

Baseline Linear Regression (with engineered features)

MAE: 66,585 | R2: 0.33 (up from 0.28 without engineered features)

Gradient Boosting Regressor with Engineered Features

MAE: 48,066 | R2: 0.57

Random Forest Regressor with Engineered Features

MAE: 45,296 | R2: 0.59

Screenshot 2026-05-02 at 17.29.10

Random Forest is the winning regression model, achieving an R2 of 0.59, a substantial improvement over the 0.28 baseline.

The drop in both MAE and RMSE across the ensemble models suggests that feature engineering and clustering provided meaningful context for understanding complex pricing patterns.
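A comparison like this can be sketched with scikit-learn. The data here is synthetic, with a built-in interaction term, and the hyperparameters are library defaults, so the numbers it prints are not the project's results. It only illustrates why tree ensembles can beat a linear model when features interact.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic target with an interaction a plain linear model cannot represent.
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(600, 3))
y = 100 * X[:, 0] * X[:, 1] + 20 * X[:, 2] + rng.normal(0, 5, size=600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model on the training split and score it on the held-out split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "mae": mean_absolute_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }

for name, scores in results.items():
    print(f"{name}: MAE={scores['mae']:.1f} | R2={scores['r2']:.2f}")
```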

Visualize Feature Importance

Screenshot 2026-05-02 at 17.31.57

The fact that several cluster features rank in the top 15 supports the strategy of using unsupervised learning to give the model better market context.
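Feature importances of a fitted random forest can be ranked as in the sketch below. The data is synthetic and the column name cluster_2 is a hypothetical one-hot cluster feature invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 300

# Synthetic features; "cluster_2" is a hypothetical cluster indicator column.
X = pd.DataFrame({
    "Total_Area": rng.uniform(500, 5000, n),
    "bedrooms": rng.integers(1, 7, n),
    "cluster_2": rng.integers(0, 2, n),
    "noise": rng.normal(size=n),
})
# Price depends mostly on area, partly on the cluster flag.
y = 40 * X["Total_Area"] + 20_000 * X["cluster_2"] + rng.normal(0, 5_000, n)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort them and keep the top 15.
importances = pd.Series(model.feature_importances_, index=X.columns)
top = importances.sort_values(ascending=False).head(15)

print(top)
```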

Binning

I converted the price target into three categorical tiers to simplify the problem from predicting an exact price to predicting a price range. This makes the model more robust against price outliers.

Class 0 (Low): Below 26,881.72 ILS.

Class 1 (Mid): Between 26,881.72 and 129,032.26 ILS.

Class 2 (High): Above 129,032.26 ILS.
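The three tiers can be assigned with a simple threshold lookup. This sketch uses the two cut points listed above. How a price that lands exactly on a cut point is handled is not stated, so placing it in the upper tier is an assumption.

```python
from bisect import bisect_right

# Cut points in ILS, taken from the tier definitions above.
TIER_EDGES = [26_881.72, 129_032.26]
TIER_NAMES = ["Low", "Mid", "High"]


def price_tier(price_ils):
    """Return 0, 1, or 2 depending on which side of the cut points the price falls."""
    # bisect_right counts how many edges are <= price, which is the class index.
    return bisect_right(TIER_EDGES, price_ils)


for price in (10_000, 75_268, 200_000):
    tier = price_tier(price)
    print(f"{price:>9,} ILS -> class {tier} ({TIER_NAMES[tier]})")
```

The dataset's average price of 75,268 ILS falls in Class 1 (Mid).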

Train & Eval Classification Models

Q1: In the context of your dataset/task, explain what would be more important - precision or recall?

I believe Precision is slightly more important.

In real estate, it is better for the model to be sure about a price tier before recommending it to a user.

A high-precision model builds more trust because the predicted price tier will almost always match the reality of the property's value.

Q2: In the context of your dataset/task, explain what would be more critical - False Positive or False Negative?

False Positive is more critical.

Whether a user is looking for a luxury High Tier home or a budget Low Tier home, the model's credibility depends on its predictions being accurate.

Providing a False Positive tells a user a house fits their price segment when it actually doesn't.

This creates a poor user experience and financial risk.
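To illustrate the trade-off, here is a small plain-Python sketch that computes precision and recall for one tier. The labels are made up: the toy model only predicts the High tier (class 2) when it is very sure, so it never produces a false positive for that tier but misses some truly high-priced homes.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, treating it as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up tiers: three High homes, two Mid, three Low.
y_true = [2, 2, 2, 1, 1, 0, 0, 0]
# A cautious model: it labels only one home as High.
y_pred = [2, 1, 1, 1, 1, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred, positive=2)
print(f"High tier: precision={precision:.2f}, recall={recall:.2f}")
```

Every High prediction here is correct (no false positives), at the cost of missing two of the three High homes.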

Train three different kinds of classification models:

1. K-Nearest Neighbors (KNN):

Screenshot 2026-05-02 at 17.52.20

Accuracy: 0.73

2. Naive Bayes:

Screenshot 2026-05-02 at 17.52.40

Accuracy: 0.57

3. Multilayer Perceptron:

Screenshot 2026-05-02 at 17.53.01

Accuracy: 0.64
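Training the three classifier families side by side could look like the sketch below. It uses a synthetic three-class dataset from scikit-learn, and the Gaussian variant of Naive Bayes, the scaling step, and all hyperparameters are assumptions, so the accuracies it prints are unrelated to the figures above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three price tiers.
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# KNN and the MLP are sensitive to feature scale, so they get a scaler.
classifiers = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Naive Bayes": GaussianNB(),
    "MLP": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
    ),
}

accuracies = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, clf.predict(X_test))

for name, acc in accuracies.items():
    print(f"{name}: accuracy={acc:.2f}")
```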

Screenshot 2026-05-02 at 17.46.34

Insights

  1. Bedrooms and baths are the two strongest numeric predictors of property price.
  2. Cluster features ranked among the top 15 most important features, supporting the feature engineering strategy.
  3. Ensemble models (Random Forest, Gradient Boosting) dramatically outperform linear baselines, confirming that pricing patterns in this dataset are highly non-linear.
  4. The average property in this dataset is priced at 75,268 ILS, with 3 bedrooms, 3 bathrooms, and listed in June 2019.

Overall:

Property price prediction in Pakistan is driven primarily by structural features (bedrooms, baths, area) and market context (location, cluster segment).

Ensemble regression models and proximity-based classifiers best capture the complex, non-linear relationships in this market.
