House Prices - Tabular Models (CatBoost + XGBoost baseline)

Pre-trained baseline models for the t22000t/house-prices-tabular dataset, produced by the tabular-data-modelling-pipeline.

This is the v1 baseline drop - CatBoost + XGBoost trained with sensible defaults (no Optuna tuning) on an 80/20 random split. A follow-up release will add the six deep-learning architectures (CANN, CANN-GBM, FT-Transformer, TabM, LocalGLMnet, DRN) once they're retrained on this dataset.

Results

| Model | Test Gini | Train Gini | Test MAE (USD) | Test RMSE (USD) | A/E ratio | n params | Training time |
|---|---|---|---|---|---|---|---|
| CatBoost | 0.2061 | 0.2203 | 16,868 | 27,063 | 1.025 | 1,041 trees | 4.4 s |
| XGBoost | 0.2049 | 0.2212 | 17,204 | 29,716 | 0.999 | 462 trees | 0.3 s |
| Stacked ensemble (NNLS) | 0.2049 | 0.2212 | 17,204 | 29,716 | 0.999 | (2 weights) | - |

  • Test set: 304 of 1,460 rows (about 21%, from the nominal 80/20 random split)
  • Target: SalePrice (USD)
  • Loss: Gamma deviance (gamma family, log link)
  • Target cap: 99.5th percentile = $555,355 (6 rows winsorised)
  • Random seed: 42
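
The target cap and the reported metrics can be sketched in NumPy as follows. This is an illustrative sketch, not the pipeline's code: the function names are hypothetical, and the exact Gini and A/E definitions used by tabular-data-modelling-pipeline are assumptions (here, a Lorenz-curve Gini with rows ordered by prediction, and A/E as total actual over total predicted). The demo data is synthetic.

```python
import numpy as np


def cap_target(y, percentile=99.5):
    """Winsorise the target at the given upper percentile; return (capped, cap)."""
    y = np.asarray(y, dtype=float)
    cap = np.percentile(y, percentile)
    return np.minimum(y, cap), cap


def mean_gamma_deviance(y_true, y_pred):
    """Mean unit gamma deviance, 2 * (y/mu - 1 - log(y/mu)); needs y, mu > 0."""
    ratio = np.asarray(y_true, dtype=float) / np.asarray(y_pred, dtype=float)
    return float(np.mean(2.0 * (ratio - 1.0 - np.log(ratio))))


def ae_ratio(y_true, y_pred):
    """Actual over expected: total observed divided by total predicted."""
    return float(np.sum(y_true) / np.sum(y_pred))


def lorenz_gini(y_true, y_pred):
    """Unnormalised Gini from the Lorenz curve of actuals, sorted by prediction."""
    y_true = np.asarray(y_true, dtype=float)
    order = np.argsort(y_pred, kind="stable")
    cum_share = np.cumsum(y_true[order]) / np.sum(y_true)
    n = len(y_true)
    # One minus twice the trapezoidal area under the Lorenz curve.
    return float(1.0 - 2.0 * (np.sum(cum_share) - 0.5) / n)


# Toy example with synthetic prices (not the Ames data).
rng = np.random.default_rng(42)
y_demo = rng.lognormal(mean=12.0, sigma=0.4, size=1_000)
y_capped, cap_value = cap_target(y_demo)
pred_demo = y_capped * rng.normal(1.0, 0.1, size=y_capped.size)
print(
    mean_gamma_deviance(y_capped, pred_demo),
    ae_ratio(y_capped, pred_demo),
    lorenz_gini(y_capped, pred_demo),
)
```

Note that this unnormalised form is bounded above by the Gini of the target itself (the value a perfect ranking attains); some pipelines divide by that bound to normalise.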

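The NNLS-stacked ensemble currently degenerates to XGBoost (weights of 0.0 for CatBoost and 1.0 for XGBoost), which is why its row matches the XGBoost row exactly. With more diverse base learners (the upcoming DL drop) it is expected to pick a non-trivial blend.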
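
For readers unfamiliar with NNLS stacking, the idea can be sketched with `scipy.optimize.nnls`: find non-negative weights that minimise the squared error between a weighted sum of base predictions and the target. This is a minimal sketch on synthetic data; which predictions the pipeline fits on (for example out-of-fold) and whether it renormalises the weights are not stated here, so treat those details as assumptions.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

# Synthetic target and two synthetic base learners of different quality.
y = rng.gamma(shape=2.0, scale=50_000.0, size=200)
pred_a = y * rng.normal(1.0, 0.20, size=200)  # noisier learner
pred_b = y * rng.normal(1.0, 0.05, size=200)  # more accurate learner

# One column per base learner; solve min ||P w - y||_2 subject to w >= 0.
P = np.column_stack([pred_a, pred_b])
weights, residual_norm = nnls(P, y)

blend = P @ weights
print(dict(zip(["pred_a", "pred_b"], weights)))
```

When a learner's unconstrained least-squares weight would be negative, the non-negativity constraint clamps it to zero, which is one way a blend can collapse onto a single model.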

Files

| File | What it is | Size |
|---|---|---|
| catboost.cbm | Trained CatBoost model (native format) | 1.2 MB |
| xgboost.json | Trained XGBoost Booster (native JSON format) | 1.3 MB |
| evaluation_summary.csv | Per-model train/test Gini, MAE, RMSE, A/E ratio, gamma deviance | 315 B |
| ensemble_weights.json | NNLS-stacked weights over base predictions | 53 B |
| dashboard_dl_models.html | Interactive Plotly dashboard (Lorenz curves, calibration deciles, ensemble plots) | 4.6 MB |
| figures/fig_dl_*.png | Publication-quality figures matching the dashboard | ~6 MB total |
| model_summary.json | Structured run record (config, metrics, timing) | 3.2 KB |

Loading and inference

CatBoost

from huggingface_hub import hf_hub_download
from catboost import CatBoostRegressor
import pandas as pd

path = hf_hub_download(
    repo_id="t22000t/house-prices-tabular-models",
    filename="catboost.cbm",
)
model = CatBoostRegressor()
model.load_model(path)

# Load the dataset and predict
df = pd.read_csv("hf://datasets/t22000t/house-prices-tabular/train.csv")
# Use only the columns the model was trained on (see model_summary.json)
features = [
    "LotArea", "YearBuilt", "YearRemodAdd", "TotalBsmtSF", "1stFlrSF",
    "2ndFlrSF", "GrLivArea", "FullBath", "BedroomAbvGr", "TotRmsAbvGrd",
    "GarageCars", "GarageArea", "OverallQual", "OverallCond",
    "MSZoning", "Street", "LotShape", "Neighborhood", "BldgType",
    "HouseStyle", "RoofStyle", "ExterQual", "Foundation", "Heating",
    "CentralAir", "KitchenQual", "SaleType", "SaleCondition",
]
preds = model.predict(df[features])

XGBoost

from huggingface_hub import hf_hub_download
import xgboost as xgb

path = hf_hub_download(
    repo_id="t22000t/house-prices-tabular-models",
    filename="xgboost.json",
)
booster = xgb.Booster()
booster.load_model(path)

# Predictions require the exact feature order used at training time;
# easiest path is to re-run the pipeline's preprocessing - see the
# GitHub repo for the full feature build code.
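
The feature-order requirement can be enforced with a small helper like the one below. It is a hedged sketch: `align_features` is a hypothetical name, the frame is toy data, and the helper only reorders and validates columns. It does not reproduce the pipeline's categorical encoding, which still has to come from the repo's preprocessing code.

```python
import pandas as pd


def align_features(df: pd.DataFrame, expected: list[str]) -> pd.DataFrame:
    """Return df restricted to `expected`, in that exact column order."""
    missing = [name for name in expected if name not in df.columns]
    if missing:
        raise KeyError(f"missing feature columns: {missing}")
    return df.loc[:, expected]


# Toy frame with columns in the wrong order plus an extra column.
frame = pd.DataFrame({"GrLivArea": [1710], "Id": [1], "LotArea": [8450]})
aligned = align_features(frame, ["LotArea", "GrLivArea"])
print(list(aligned.columns))  # ['LotArea', 'GrLivArea']
```

With the loaded model, `expected` could come from `booster.feature_names` (which may be `None` if names were not stored in the file). The aligned, already-preprocessed frame would then be wrapped in `xgb.DMatrix` before calling `booster.predict`.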

Stacked ensemble

import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="t22000t/house-prices-tabular-models",
    filename="ensemble_weights.json",
)
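with open(path) as f:
    weights = json.load(f)
# weights = {"catboost": 0.0, "xgboost": 1.0}  (NNLS picked XGBoost only)
# cb_pred / xgb_pred: CatBoost and XGBoost predictions for the same rows
# (e.g. cb_pred = preds from the CatBoost example above)
ensemble_pred = weights["catboost"] * cb_pred + weights["xgboost"] * xgb_pred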

Training configuration

| Setting | Value |
|---|---|
| Pipeline | tabular-data-modelling-pipeline v0.1.0 |
| Architecture mix | CatBoost + XGBoost (DL models excluded from this drop) |
| Hyperparameters | Defaults (see modelling/models/__init__.py) - no Optuna tuning |
| Optimiser | CatBoost: ordered boosting; XGBoost: hist tree method |
| Family / link | Gamma / log |
| Train/test split | Random 80/20, seed 42 |
| Cap percentile | 99.5 |
| CV folds | 5 (for stability check) |
| Hardware | Apple M-series, CPU |

To reproduce exactly, run:

git clone https://github.com/timothy22000/tabular_data_modelling_pipeline
cd tabular_data_modelling_pipeline
pip install -e ".[gbm,viz]"
python scripts/download_data.py --dataset house_prices

OMP_NUM_THREADS=1 python train.py \
    --config configs/example_house_prices.py \
    --input data/house_prices.csv \
    --skip-tuning --skip-interpretability \
    --architectures catboost xgboost

(OMP_NUM_THREADS=1 is only needed on macOS arm64 to avoid an OpenMP conflict between XGBoost and Python's threading; Linux runs are unaffected.)
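
When the models are loaded from a notebook or script rather than through the shell command above, the same setting can be applied from Python. The sketch below is an assumption about how one might do that, not part of the pipeline: `needs_single_omp_thread` is a hypothetical helper, and the variable has to be set before the libraries that use OpenMP are imported.

```python
import os
import platform


def needs_single_omp_thread(system: str, machine: str) -> bool:
    """True for macOS on Apple silicon, the case the note above describes."""
    return system == "Darwin" and machine == "arm64"


if needs_single_omp_thread(platform.system(), platform.machine()):
    # setdefault keeps any value the user has already exported.
    os.environ.setdefault("OMP_NUM_THREADS", "1")

# Import xgboost / catboost only after this point.
```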

Limitations

  • Defaults only. No hyperparameter tuning was run. Tuning may narrow the train-test gap and lift Gini; the 0.02-0.05 gain suggested here is an estimate, not a measured result.
  • GBM only. This drop omits the six DL architectures. CANN-GBM in particular would likely outperform raw XGBoost since it adds a neural residual on top of the GBM base. v2 will include these.
  • Random split, not stratified. SalePrice has a heavy right tail, so a split stratified on SalePrice quantiles would give a more representative test set. The random split is the pipeline's default and is kept here for reproducibility.
  • Trained on training set only. The Kaggle competition's test.csv is unlabelled and not used here. To compare against the official leaderboard, train on the full set and submit predictions on test.
  • Gini scores look modest. Gini in [0.20, 0.22] is reasonable for this dataset's modest signal-to-noise ratio. Kaggle leaderboard RMSLE is the more conventional metric for House Prices, but the pipeline reports Gini and MAE so that results are comparable across architectures and datasets.
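
A quantile-stratified split of the kind mentioned in the limitations could look like the sketch below. It is not the pipeline's behaviour: the function name, the choice of 10 bins, and the synthetic target are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def quantile_stratified_split(y, test_size=0.2, n_bins=10, seed=42):
    """Return (train_idx, test_idx), sampling test rows within target quantile bins."""
    y = pd.Series(np.asarray(y, dtype=float))
    # Assign each row to a quantile bin of the target.
    bins = pd.qcut(y, q=n_bins, labels=False, duplicates="drop").to_numpy()
    rng = np.random.default_rng(seed)
    test_parts = []
    for b in np.unique(bins):
        idx = np.flatnonzero(bins == b)
        n_test = int(round(test_size * len(idx)))
        # Draw the same fraction from every bin so the tail is represented.
        test_parts.append(rng.choice(idx, size=n_test, replace=False))
    test_idx = np.sort(np.concatenate(test_parts))
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    return train_idx, test_idx


# Synthetic right-skewed target with the same row count as the dataset.
y_demo = np.random.default_rng(0).lognormal(mean=12.0, sigma=0.4, size=1_460)
train_idx, test_idx = quantile_stratified_split(y_demo)
print(len(train_idx), len(test_idx))
```

An equivalent route is to pass the bin labels to scikit-learn's `train_test_split` through its `stratify` argument.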

Intended use

  • Baseline for tabular DL research. Compare a new architecture against these numbers.
  • Teaching. Demonstrate a calibrated tabular pricing pipeline end to end.
  • Sanity check. Confirm that your reimplementation of CatBoost/XGBoost on this data reaches similar numbers.

Citation

@software{tabular_data_modelling_pipeline,
  author = {Mun, Timothy},
  title  = {tabular-data-modelling-pipeline},
  url    = {https://github.com/timothy22000/tabular_data_modelling_pipeline},
  year   = {2026}
}

@article{decock2011ames,
  author  = {De Cock, Dean},
  title   = {Ames, Iowa: Alternative to the Boston Housing Data},
  journal = {Journal of Statistics Education},
  volume  = {19},
  number  = {3},
  year    = {2011}
}

License

MIT for the model code and pipeline. The underlying dataset is distributed under Kaggle competition terms (free use with attribution).
