Austin Housing Price Prediction Models

This repository contains two trained machine learning models created for a first-year Data Science assignment.

The project focuses on predicting housing prices in Austin, Texas using machine learning techniques, including regression, clustering-based feature engineering, and classification.

Files in this Repository

  • regression_model.pkl — trained regression model for predicting housing prices.
  • classification_model.pkl — trained classification model for classifying houses into price categories.
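The two files above can be read back with Python's standard pickle module. The following is a minimal sketch, not the author's code: the helper names `save_model` and `load_model` are hypothetical, and it assumes the files are ordinary pickles of fitted estimators.

```python
import pickle
from pathlib import Path


def save_model(model, path):
    """Serialize a fitted model (or any picklable object) to disk."""
    with Path(path).open("wb") as fh:
        pickle.dump(model, fh)


def load_model(path):
    """Load a pickled object from disk. Only unpickle files you trust."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


# Hypothetical usage, assuming the file sits in the working directory:
# regressor = load_model("regression_model.pkl")
```

Unpickling generally requires the library that created the object to be installed, ideally in the same version that was used for training. The input columns and their order must also match the training pipeline, which this card does not specify.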

Dataset

The models were trained on the Austin housing dataset.

Each row in the dataset represents a property listing.
The dataset includes columns such as:

  • Living area
  • Number of bedrooms
  • Number of bathrooms
  • Property type
  • School rating
  • Geographic location
  • Property tax rate
  • Sale date
  • Latest property price

The main target variable for the regression task is latestPrice.

Regression Task

The regression task predicts the estimated price of a property.

Several regression models were tested:

  • Baseline Linear Regression
  • Ridge Regression
  • Random Forest Regressor
  • Gradient Boosting Regressor

The selected regression model is:

Gradient Boosting Regressor

It was selected because it achieved the lowest Mean Absolute Error (MAE). MAE is expressed in dollars, so it reads directly as the typical size of a pricing error, which makes it the most practical metric for a housing price prediction task.

Approximate regression results:

Model                          R²       MAE
Baseline Linear Regression     0.2347   $163,697.27
Ridge Regression               0.3573   $135,460.67
Random Forest Regressor        0.2441   $119,029.93
Gradient Boosting Regressor    0.2603   $117,222.75

Although Ridge Regression achieved the highest R² score, Gradient Boosting was selected as the preferred operational model because it achieved the lowest average dollar-level prediction error.
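The two metrics can rank models differently because R² is built on squared errors and so penalizes a few large misses heavily, while MAE weights every dollar of error equally. The sketch below illustrates this with made-up prices and predictions (not values from the Austin dataset); the function names are illustrative.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical sale prices in dollars.
actual = [200_000, 300_000, 400_000, 500_000, 1_000_000]

# Model A: exact on four homes, one large miss on the most expensive home.
pred_a = [200_000, 300_000, 400_000, 500_000, 700_000]

# Model B: off by $100,000 on every home.
pred_b = [300_000, 200_000, 500_000, 400_000, 1_100_000]

mae_a, mae_b = mean_absolute_error(actual, pred_a), mean_absolute_error(actual, pred_b)
r2_a, r2_b = r2_score(actual, pred_a), r2_score(actual, pred_b)

print(f"Model A: MAE=${mae_a:,.0f}, R2={r2_a:.3f}")
print(f"Model B: MAE=${mae_b:,.0f}, R2={r2_b:.3f}")
```

In this toy case Model A has the lower MAE while Model B has the higher R², the same kind of disagreement seen between Gradient Boosting and Ridge Regression above.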

Clustering

K-Means clustering was used as part of the feature engineering process.

An exploratory clustering step with k=6 was used for geographic visualization.
The final clustering value used in the model pipeline was k=4, based on the elbow analysis.

Cluster-based features were added to help the models capture geographic and structural market patterns.
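One common way to turn fitted clusters into a feature is to label each listing with the index of its nearest centroid. The sketch below assumes four already-fitted centroids in latitude/longitude space. The centroid coordinates and the column names `latitude` and `longitude` are hypothetical, and the inputs actually clustered in this project are not specified here.

```python
import math


def nearest_cluster(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def add_cluster_feature(rows, centroids, keys=("latitude", "longitude")):
    """Return copies of the rows with an added integer 'cluster' column."""
    return [
        {**row, "cluster": nearest_cluster([row[k] for k in keys], centroids)}
        for row in rows
    ]


# Hypothetical centroids for k=4 (illustrative only, not the fitted values).
centroids = [
    (30.40, -97.72),
    (30.20, -97.80),
    (30.27, -97.74),
    (30.35, -97.90),
]

listings = [
    {"latitude": 30.41, "longitude": -97.70, "livingAreaSqFt": 1800},
    {"latitude": 30.21, "longitude": -97.81, "livingAreaSqFt": 2400},
]

for row in add_cluster_feature(listings, centroids):
    print(row)
```

The cluster id is a category rather than a quantity, so downstream models usually treat it as a categorical feature (for example, one-hot encoded) rather than as an ordered number.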

Classification Task

The regression problem was also converted into a classification problem.

Instead of predicting the exact price, the properties were divided into three price categories:

  • Affordable
  • Mid-Range
  • Luxury
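How the category boundaries were chosen is not stated here. One simple option, shown as an assumption in the sketch below, is to split at the terciles of the training prices so that each class holds roughly a third of the listings. The prices and function names are illustrative.

```python
import statistics


def tercile_cutoffs(prices):
    """Return the two cut points that split prices into three equal-sized groups."""
    low, high = statistics.quantiles(prices, n=3)
    return low, high


def price_category(price, low, high):
    """Map a price to one of the three categories using the given cut points."""
    if price <= low:
        return "Affordable"
    if price <= high:
        return "Mid-Range"
    return "Luxury"


# Hypothetical training prices in dollars (assumption: tercile-based binning).
train_prices = [100_000 * i for i in range(1, 10)]
low, high = tercile_cutoffs(train_prices)

categories = [price_category(p, low, high) for p in train_prices]
print(low, high)
print(categories)
```

The cut points should be computed on the training split only and then reused for validation and test data, so that the class definition does not leak information from held-out listings.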

The classification models tested were:

  • Logistic Regression
  • Random Forest Classifier
  • Gradient Boosting Classifier

The selected classification model is:

Gradient Boosting Classifier

Approximate classification results:

Model                           Macro F1   Macro ROC-AUC
Logistic Regression             ~0.72      ~0.90
Random Forest Classifier        ~0.79      ~0.93
Gradient Boosting Classifier    ~0.80      ~0.936

Gradient Boosting Classifier was selected because it achieved the highest Macro F1 and Macro ROC-AUC of the three models, although its margin over the Random Forest Classifier is small.
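Macro F1 is the unweighted mean of the per-class F1 scores, so Affordable, Mid-Range, and Luxury each count equally regardless of how many listings fall in each class. A minimal sketch of the calculation, using made-up labels rather than the project's predictions:

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


labels = ["Affordable", "Mid-Range", "Luxury"]

# Hypothetical true and predicted categories for six listings.
y_true = ["Affordable", "Affordable", "Mid-Range", "Mid-Range", "Luxury", "Luxury"]
y_pred = ["Affordable", "Mid-Range", "Mid-Range", "Mid-Range", "Luxury", "Luxury"]

print(round(macro_f1(y_true, y_pred, labels), 3))
```

Here the per-class F1 scores are 2/3, 4/5, and 1, so one misclassified Affordable listing lowers the macro average even though the other two classes are predicted well.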

Intended Use

These models were created for an academic Data Science assignment.

They are intended to demonstrate:

  • Data cleaning
  • Exploratory Data Analysis
  • Feature engineering
  • Clustering
  • Regression modeling
  • Classification modeling
  • Model evaluation
  • Saving trained models with Pickle

Limitations

These models should not be used for real financial or real estate decisions.

The predictions are based on a specific dataset and may not generalize to other cities, time periods, or real estate markets.

The models are intended for educational purposes only.

Author

Created by Bar Wachsman as part of a Data Science assignment.
