Austin Housing Price Prediction Models

This repository contains two trained machine learning models created for a first-year Data Science assignment.

The project predicts housing prices in Austin, Texas, using regression, clustering-based feature engineering, and classification.

Files in this Repository

regression_model.pkl — trained regression model for predicting housing prices.
classification_model.pkl — trained classification model for classifying houses into price categories.
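
As a minimal sketch of how the two files might be loaded, the snippet below uses Python's standard pickle module. The helper name load_model is hypothetical, and the repository does not document which library produced the models or which input columns they expect, so this only shows the unpickling step. The library used for training must be installed in a compatible version for the load to succeed.

```python
import pickle
from pathlib import Path


def load_model(path):
    """Unpickle a saved model (hypothetical helper, not part of the repository).

    pickle can execute arbitrary code while loading, so only open files
    that come from a source you trust.
    """
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


if __name__ == "__main__":
    for name in ("regression_model.pkl", "classification_model.pkl"):
        if Path(name).exists():
            model = load_model(name)
            print(name, "->", type(model).__name__)
        else:
            print(name, "not found in the current directory")
```

Printing the type of each loaded object is a quick way to confirm which estimator a file contains before calling it on new data.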

Dataset

The models were trained on the Austin housing dataset.

Each row in the dataset represents a property listing.
The dataset includes fields such as:

Living area
Number of bedrooms
Number of bathrooms
Property type
School rating
Geographic location
Property tax rate
Sale date
Latest property price

The main target variable for the regression task is latestPrice.

Regression Task

The regression task predicts the estimated price of a property.

Several regression models were tested:

Baseline Linear Regression
Ridge Regression
Random Forest Regressor
Gradient Boosting Regressor

The selected regression model is:

Gradient Boosting Regressor

It was selected because it achieved the lowest Mean Absolute Error (MAE). MAE is the average dollar amount by which a prediction misses the actual price, which makes it the most directly interpretable metric for a housing price prediction task.

Approximate regression results:

Model	R²	MAE
Baseline Linear Regression	0.2347	$163,697.27
Ridge Regression	0.3573	$135,460.67
Random Forest Regressor	0.2441	$119,029.93
Gradient Boosting Regressor	0.2603	$117,222.75

Although Ridge Regression achieved the highest R² score, Gradient Boosting was selected as the preferred operational model because it achieved the lowest average dollar-level prediction error.
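
The two metrics can rank models differently because R² is built on squared errors, so a few large misses weigh heavily, while MAE counts every dollar of error equally. The sketch below illustrates this with made-up toy prices (not taken from the Austin dataset) and then applies both selection rules to the reported results from the table above. The function and variable names are illustrative only.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average size of the miss, in the target's units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy prices in thousands of dollars (made up for illustration).
y_true = [200, 300, 400, 500, 600]
one_big_miss = [200, 300, 400, 500, 400]   # exact except for one 200k miss
steady_misses = [260, 240, 460, 440, 660]  # off by 60k on every listing

for name, preds in [("one big miss", one_big_miss), ("steady misses", steady_misses)]:
    print(f"{name}: MAE={mae(y_true, preds):.0f}k, R2={r2(y_true, preds):.2f}")

# Reported results copied from the table: model -> (R², MAE in dollars).
reported = {
    "Baseline Linear Regression": (0.2347, 163_697.27),
    "Ridge Regression": (0.3573, 135_460.67),
    "Random Forest Regressor": (0.2441, 119_029.93),
    "Gradient Boosting Regressor": (0.2603, 117_222.75),
}
best_by_mae = min(reported, key=lambda m: reported[m][1])
best_by_r2 = max(reported, key=lambda m: reported[m][0])
print("Lowest MAE:", best_by_mae)
print("Highest R2:", best_by_r2)
```

In the toy data, the predictions with one big miss have the lower MAE but the lower R², which mirrors the trade-off between Gradient Boosting and Ridge Regression in the reported results.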

Clustering

K-Means clustering was used as part of the feature engineering process.

An exploratory clustering step with k=6 was used for geographic visualization.
The final clustering value used in the model pipeline was k=4, based on the elbow analysis.

Cluster-based features were added to help the models capture geographic and structural market patterns.
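
The elbow analysis can be sketched as follows. The repository does not state which K-Means implementation or which input columns were used, so this example is an assumption throughout: it uses a small pure-Python version of Lloyd's algorithm and synthetic 2-D points standing in for property coordinates. It prints the inertia (within-cluster sum of squared distances) for several values of k; the "elbow" is the k after which inertia stops dropping sharply.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    inertia = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, inertia


def make_blobs(seed=42):
    """Synthetic stand-in data: four noisy groups of 30 points each."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0), (5.0, 5.0)]
    return [
        (cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
        for cx, cy in centers
        for _ in range(30)
    ]


points = make_blobs()
inertia_by_k = {k: kmeans(points, k)[2] for k in range(1, 9)}
for k, inertia in inertia_by_k.items():
    print(f"k={k}: inertia={inertia:.1f}")

# Cluster labels for the chosen k can then be appended as an extra feature column.
labels, centroids, _ = kmeans(points, 4)
```

A library implementation with several random restarts and scaled input features would be the more usual choice in practice; the stand-in above only shows what quantity the elbow plot tracks.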

Classification Task

The regression problem was also converted into a classification problem.

Instead of predicting an exact price, the classifier assigns each property to one of three price categories:

Affordable
Mid-Range
Luxury
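
The price thresholds that separate the three categories are not stated here. As one possible approach, the sketch below assumes the cut points are the terciles of the observed prices, so each category holds roughly a third of the listings. The function name make_price_labeler and the sample prices are hypothetical.

```python
from statistics import quantiles


def make_price_labeler(prices):
    """Build a labeling function from tercile cut points (an assumed rule)."""
    low_cut, high_cut = quantiles(prices, n=3)  # two cut points split into thirds

    def label(price):
        if price <= low_cut:
            return "Affordable"
        if price <= high_cut:
            return "Mid-Range"
        return "Luxury"

    return label


# Made-up prices in dollars, for illustration only.
sample_prices = [100_000, 200_000, 300_000, 400_000, 500_000,
                 600_000, 700_000, 800_000, 900_000]
label = make_price_labeler(sample_prices)
for price in (100_000, 500_000, 900_000):
    print(price, "->", label(price))
```

Fixed dollar thresholds would work the same way; only the two cut points change.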

The classification models tested were:

Logistic Regression
Random Forest Classifier
Gradient Boosting Classifier

The selected classification model is:

Gradient Boosting Classifier

Approximate classification results:

Model	Macro F1	Macro ROC-AUC
Logistic Regression	~0.72	~0.90
Random Forest Classifier	~0.79	~0.93
Gradient Boosting Classifier	~0.80	~0.936

Gradient Boosting Classifier was selected because it achieved the highest Macro F1 and Macro ROC-AUC of the three models, although its margin over Random Forest Classifier is small.
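
For readers unfamiliar with the metric, Macro F1 is the unweighted mean of the per-class F1 scores, so each price category counts equally regardless of how many listings it contains. The sketch below computes it from scratch on a tiny made-up set of labels; it is an illustration of the definition, not the evaluation code used for the reported results.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, where F1 = 2*TP / (2*TP + FP + FN)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels for illustration only.
y_true = ["Affordable", "Affordable", "Mid-Range", "Mid-Range", "Luxury", "Luxury"]
y_pred = ["Affordable", "Mid-Range", "Mid-Range", "Mid-Range", "Luxury", "Affordable"]
print(f"Macro F1: {macro_f1(y_true, y_pred):.3f}")
```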

Intended Use

These models were created for an academic Data Science assignment.

They are intended to demonstrate:

Data cleaning
Exploratory Data Analysis
Feature engineering
Clustering
Regression modeling
Classification modeling
Model evaluation
Saving trained models with pickle

Limitations

These models should not be used for real financial or real estate decisions.

The predictions are based on a specific dataset and may not generalize to other cities, time periods, or real estate markets.

The models are intended for educational purposes only.

Author

Created by Bar Wachsman as part of a Data Science assignment.
