House Prices - Tabular Models (CatBoost + XGBoost baseline)
Pre-trained baseline models for the t22000t/house-prices-tabular dataset, produced by the tabular-data-modelling-pipeline.
This is the v1 baseline drop - CatBoost + XGBoost trained with sensible defaults (no Optuna tuning) on an 80/20 random split. A follow-up release will add the six deep-learning architectures (CANN, CANN-GBM, FT-Transformer, TabM, LocalGLMnet, DRN) once they're retrained on this dataset.
Results
| Model | Test Gini | Train Gini | Test MAE (USD) | Test RMSE (USD) | A/E ratio | n params | Training time |
|---|---|---|---|---|---|---|---|
| CatBoost | 0.2061 | 0.2203 | 16,868 | 27,063 | 1.025 | 1,041 trees | 4.4 s |
| XGBoost | 0.2049 | 0.2212 | 17,204 | 29,716 | 0.999 | 462 trees | 0.3 s |
| Stacked ensemble (NNLS) | 0.2049 | 0.2212 | 17,204 | 29,716 | 0.999 | (2 weights) | - |
- Test set: 304 rows (20% of 1,460)
- Target: SalePrice (USD)
- Loss: Gamma deviance (gamma family, log link)
- Target cap: 99.5th percentile = $555,355 (6 rows winsorised)
- Random seed: 42
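The metrics reported in the table can be sketched in plain Python. This is an illustration under stated assumptions, not the pipeline's own code: it assumes Gini is read off the Lorenz curve of actuals ordered by ascending prediction (unnormalised), that A/E is total actual over total predicted, and that gamma deviance is the usual unit deviance averaged over rows. The function names and toy values are hypothetical.

```python
import math


def lorenz_gini(actual, predicted):
    """Gini from the Lorenz curve of actuals ordered by ascending prediction."""
    order = sorted(range(len(actual)), key=lambda i: predicted[i])
    total = sum(actual)
    n = len(actual)
    cum, prev_share, area = 0.0, 0.0, 0.0
    for i in order:
        cum += actual[i]
        share = cum / total
        area += (prev_share + share) / (2 * n)  # trapezoid of width 1/n
        prev_share = share
    return 1.0 - 2.0 * area


def ae_ratio(actual, predicted):
    """Actual-over-expected: total observed divided by total predicted."""
    return sum(actual) / sum(predicted)


def mean_gamma_deviance(actual, predicted):
    """Mean gamma unit deviance: 2 * (log(mu / y) + y / mu - 1)."""
    terms = (
        2.0 * (math.log(mu / y) + y / mu - 1.0)
        for y, mu in zip(actual, predicted)
    )
    return sum(terms) / len(actual)


# Toy values for illustration only (not taken from the dataset).
y_true = [120_000.0, 150_000.0, 210_000.0, 340_000.0]
y_pred = [130_000.0, 145_000.0, 200_000.0, 320_000.0]
print(lorenz_gini(y_true, y_pred))
print(ae_ratio(y_true, y_pred))
print(mean_gamma_deviance(y_true, y_pred))
```

Under this definition a model that ranks rows perfectly scores the Gini coefficient of the target distribution itself, which is why values well below 1 can still indicate good ranking.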
The NNLS-stacked ensemble currently degenerates to XGBoost; with more diverse base learners (the upcoming DL drop) it will pick a non-trivial blend.
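To see how a non-negative least squares (NNLS) stack can collapse onto one model, here is a minimal sketch for exactly two base learners. It is not the pipeline's solver (which is not shown here); for more learners a general routine such as `scipy.optimize.nnls` would be the usual choice. The helper name `nnls_two` and the toy numbers are assumptions, and the sketch assumes neither prediction vector is all zeros.

```python
def _dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def nnls_two(y, pred_a, pred_b):
    """Non-negative least squares weights for exactly two base learners."""
    aa, bb, ab = _dot(pred_a, pred_a), _dot(pred_b, pred_b), _dot(pred_a, pred_b)
    ay, by = _dot(pred_a, y), _dot(pred_b, y)

    def sse(w):
        return sum(
            (t - w[0] * p - w[1] * q) ** 2 for t, p, q in zip(y, pred_a, pred_b)
        )

    # Boundary candidates: one weight pinned at zero, the other clamped at zero.
    candidates = [(max(ay / aa, 0.0), 0.0), (0.0, max(by / bb, 0.0))]

    # Interior candidate: the unconstrained solution, kept only if non-negative.
    det = aa * bb - ab * ab
    if det > 1e-12:
        w = ((ay * bb - by * ab) / det, (by * aa - ay * ab) / det)
        if min(w) >= 0.0:
            candidates.append(w)

    return min(candidates, key=sse)


# Toy data: the unconstrained fit would put a negative weight on model A,
# so the non-negativity constraint pins A at zero and keeps only model B.
y = [1.0, 2.0, 3.0]
model_a = [1.0, 2.0, 4.0]
model_b = [1.0, 2.0, 3.2]
w_a, w_b = nnls_two(y, model_a, model_b)
print({"model_a": w_a, "model_b": w_b})
```

When two learners are highly correlated, as two gradient-boosted models on the same features tend to be, this kind of corner solution is common.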
Files
| File | What it is | Size |
|---|---|---|
| catboost.cbm | Trained CatBoost model (native format) | 1.2 MB |
| xgboost.json | Trained XGBoost Booster (native JSON format) | 1.3 MB |
| evaluation_summary.csv | Per-model train/test Gini, MAE, RMSE, A/E ratio, gamma deviance | 315 B |
| ensemble_weights.json | NNLS-stacked weights over base predictions | 53 B |
| dashboard_dl_models.html | Interactive Plotly dashboard (Lorenz curves, calibration deciles, ensemble plots) | 4.6 MB |
| figures/fig_dl_*.png | Publication-quality figures matching the dashboard | ~6 MB total |
| model_summary.json | Structured run record (config, metrics, timing) | 3.2 KB |
Loading and inference
CatBoost
from huggingface_hub import hf_hub_download
from catboost import CatBoostRegressor
import pandas as pd
path = hf_hub_download(
    repo_id="t22000t/house-prices-tabular-models",
    filename="catboost.cbm",
)
model = CatBoostRegressor()
model.load_model(path)
# Load the dataset and predict
df = pd.read_csv("hf://datasets/t22000t/house-prices-tabular/train.csv")
# Use only the columns the model was trained on (see model_summary.json)
features = [
    "LotArea", "YearBuilt", "YearRemodAdd", "TotalBsmtSF", "1stFlrSF",
    "2ndFlrSF", "GrLivArea", "FullBath", "BedroomAbvGr", "TotRmsAbvGrd",
    "GarageCars", "GarageArea", "OverallQual", "OverallCond",
    "MSZoning", "Street", "LotShape", "Neighborhood", "BldgType",
    "HouseStyle", "RoofStyle", "ExterQual", "Foundation", "Heating",
    "CentralAir", "KitchenQual", "SaleType", "SaleCondition",
]
preds = model.predict(df[features])
XGBoost
from huggingface_hub import hf_hub_download
import xgboost as xgb
path = hf_hub_download(
    repo_id="t22000t/house-prices-tabular-models",
    filename="xgboost.json",
)
booster = xgb.Booster()
booster.load_model(path)
# Predictions require the exact feature order used at training time;
# easiest path is to re-run the pipeline's preprocessing - see the
# GitHub repo for the full feature build code.
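The feature-order requirement can be guarded with a small check before building the model input. The sketch below is plain Python and does not reproduce the pipeline's preprocessing (categorical encoding and any other transforms are not covered); `align_features` and the three-column example are hypothetical.

```python
def align_features(row, feature_order):
    """Return the row's values in training order, failing loudly on gaps."""
    missing = [name for name in feature_order if name not in row]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [row[name] for name in feature_order]


# Illustrative subset of columns; the real order comes from the training run.
feature_order = ["LotArea", "YearBuilt", "GrLivArea"]
row = {"GrLivArea": 1710, "LotArea": 8450, "YearBuilt": 2003, "Id": 1}
print(align_features(row, feature_order))
```

With a pandas DataFrame the equivalent selection is `df[feature_order]`, which also raises `KeyError` when a column is absent.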
Stacked ensemble
import json
from huggingface_hub import hf_hub_download
path = hf_hub_download(
    repo_id="t22000t/house-prices-tabular-models",
    filename="ensemble_weights.json",
)
with open(path) as f:
    weights = json.load(f)
# weights = {"catboost": 0.0, "xgboost": 1.0} (NNLS picked XGBoost only)
# cb_pred and xgb_pred are the two models' predictions for the same rows, in the same order
ensemble_pred = weights["catboost"] * cb_pred + weights["xgboost"] * xgb_pred
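The blend line assumes `cb_pred` and `xgb_pred` already exist. A minimal sketch of applying a weights dictionary to any number of base models follows, using made-up prediction values; the helper name `blend` is hypothetical and not part of the pipeline.

```python
def blend(weights, predictions):
    """Weighted sum of per-model prediction lists, keyed by model name."""
    unknown = set(weights) - set(predictions)
    if unknown:
        raise KeyError(f"no predictions for: {sorted(unknown)}")
    n_rows = len(next(iter(predictions.values())))
    return [
        sum(weights[name] * predictions[name][i] for name in weights)
        for i in range(n_rows)
    ]


# Made-up predictions for three rows, in USD.
cb_pred = [181_000.0, 142_500.0, 305_000.0]
xgb_pred = [179_500.0, 140_000.0, 311_000.0]
weights = {"catboost": 0.0, "xgboost": 1.0}

ensemble_pred = blend(weights, {"catboost": cb_pred, "xgboost": xgb_pred})
print(ensemble_pred)
```

With the weights shown, the ensemble output equals the XGBoost predictions, which matches the identical XGBoost and ensemble rows in the results table.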
Training configuration
| Setting | Value |
|---|---|
| Pipeline | tabular-data-modelling-pipeline v0.1.0 |
| Architecture mix | CatBoost + XGBoost (DL models excluded from this drop) |
| Hyperparameters | Defaults (see modelling/models/__init__.py) - no Optuna tuning |
| Optimiser | CatBoost: ordered boosting; XGBoost: hist tree method |
| Family / link | Gamma / log |
| Train/test split | Random 80/20, seed 42 |
| Cap percentile | 99.5 |
| CV folds | 5 (for stability check) |
| Hardware | Apple M-series, CPU |
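The cap percentile setting corresponds to upper winsorising of the target. A minimal sketch follows, assuming a linearly interpolated percentile; which rows the pipeline computes the cap on, and its interpolation rule, are not stated here, so treat both as assumptions. The function names are hypothetical.

```python
import math


def percentile(values, q):
    """Linearly interpolated percentile, with q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def winsorise_upper(values, q=99.5):
    """Cap values above the q-th percentile; return (capped, cap, n_capped)."""
    cap = percentile(values, q)
    capped = [min(v, cap) for v in values]
    n_capped = sum(v > cap for v in values)
    return capped, cap, n_capped


# Toy target: 1, 2, ..., 1000 (not real sale prices).
prices = [float(v) for v in range(1, 1001)]
capped, cap, n_capped = winsorise_upper(prices, q=99.5)
print(cap, n_capped)
```

Only the upper tail is capped, so the heavy right tail of the target has less influence on the fitted loss while the rest of the distribution is unchanged.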
To reproduce exactly, run:
git clone https://github.com/timothy22000/tabular_data_modelling_pipeline
cd tabular_data_modelling_pipeline
pip install -e ".[gbm,viz]"
python scripts/download_data.py --dataset house_prices
OMP_NUM_THREADS=1 python train.py \
--config configs/example_house_prices.py \
--input data/house_prices.csv \
--skip-tuning --skip-interpretability \
--architectures catboost xgboost
(OMP_NUM_THREADS=1 is only needed on macOS arm64 to avoid an OpenMP
conflict between XGBoost and Python's threading; Linux runs are unaffected.)
Limitations
- Defaults only. No hyperparameter tuning was run; tuned models would likely narrow the train-test gap and lift Gini by an estimated 0.02-0.05.
- GBM only. This drop omits the six DL architectures. CANN-GBM in particular would likely outperform raw XGBoost since it adds a neural residual on top of the GBM base. v2 will include these.
- Random split, not stratified. SalePrice has a heavy right tail; a stratified split (or quantile-stratified) would give a more representative test set. Default behaviour, kept for reproducibility.
- Trained on training set only. The Kaggle competition's test.csv is unlabelled and not used here. To compare against the official leaderboard, train on the full set and submit predictions on test.
- Gini scores look modest. Gini in [0.20, 0.22] is reasonable for this dataset's modest signal-to-noise ratio - Kaggle leaderboard RMSLE is the more conventional metric for House Prices, but the pipeline uses Gini and MAE for cross-comparability across architectures and datasets.
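The quantile-stratified split mentioned in the limitations could look like the following. This is a sketch of the alternative, not the pipeline's behaviour (the pipeline uses a plain random split); the function name, the choice of 10 bins, and the toy targets are assumptions.

```python
import random


def quantile_stratified_split(targets, test_frac=0.2, n_bins=10, seed=42):
    """Return (train_idx, test_idx), sampling test rows within each quantile bin."""
    rng = random.Random(seed)
    order = sorted(range(len(targets)), key=lambda i: targets[i])
    train_idx, test_idx = [], []
    for b in range(n_bins):
        # A contiguous slice of the target-sorted order is one quantile bin.
        start = b * len(order) // n_bins
        stop = (b + 1) * len(order) // n_bins
        bin_idx = order[start:stop]
        rng.shuffle(bin_idx)
        n_test = round(len(bin_idx) * test_frac)
        test_idx.extend(bin_idx[:n_test])
        train_idx.extend(bin_idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy target with 100 distinct values (not real sale prices).
targets = [float(v) for v in range(100)]
train_idx, test_idx = quantile_stratified_split(targets)
print(len(train_idx), len(test_idx))
```

Because every bin contributes the same share of rows to the test set, the expensive tail is represented in proportion rather than left to chance.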
Intended use
- Baseline for tabular DL research. Compare your new architecture against these numbers.
- Teaching. Demonstrate a calibrated tabular pricing pipeline end to end.
- Sanity check. Make sure your reimplementation of CatBoost/XGBoost on this data hits similar numbers.
Citation
@software{tabular_data_modelling_pipeline,
author = {Mun, Timothy},
title = {tabular-data-modelling-pipeline},
url = {https://github.com/timothy22000/tabular_data_modelling_pipeline},
year = {2026}
}
@article{decock2011ames,
author = {De Cock, Dean},
title = {Ames, Iowa: Alternative to the Boston Housing Data},
journal = {Journal of Statistics Education},
volume = {19},
number = {3},
year = {2011}
}
License
MIT for the model code and pipeline. The underlying dataset is distributed under Kaggle competition terms (free use with attribution).
Related
- Dataset: t22000t/house-prices-tabular
- Pipeline: tabular-data-modelling-pipeline
- Privacy Lab Space - anonymize tabular data + red-team it