player-value-lgbm
LightGBM model that forecasts the 6-month log-change in Transfermarkt
market value for football players. Trained on the engineered features
in player-value-features.
- Code: https://github.com/DanielRegaladoUMiami/player-value-ml
- Demo: https://huggingface.co/spaces/DanielRegaladoCardoso/player-value-ml
- Dataset: player-value-features
Performance (held-out test, ≥ 2024)
| Model | Test MAE (log) | Test R² |
|---|---|---|
| Naive (y=0) | 0.2145 | −0.006 |
| Ridge regression | 0.2164 | +0.130 |
| AutoETS | 0.2361 | −0.198 |
| AutoTheta | 0.2740 | −0.298 |
| LightGBM | 0.1970 | +0.205 |
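LightGBM has the highest R² (0.205, against 0.130 for Ridge, the best baseline) and is the only model that beats the naive predictor on both metrics: its MAE is about 8% lower (0.1970 vs 0.2145), while Ridge improves R² but not MAE.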
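The slightly negative R² of the naive row follows from how R² is defined: any constant prediction other than the test-set mean scores below zero. A minimal sketch of both metrics in log space, using made-up toy targets (not the real test set), plus the arithmetic behind the table comparison:

```python
def mae(y_true, y_pred):
    """Mean absolute error of the predicted log-changes."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy 6-month log-changes, invented purely for illustration.
y_true = [0.10, -0.05, 0.00, 0.30, -0.20]
naive_pred = [0.0] * len(y_true)  # "value does not change"

naive_mae = mae(y_true, naive_pred)
naive_r2 = r2(y_true, naive_pred)  # below zero because mean(y_true) != 0

# Comparisons computed from the numbers in the table above.
mae_reduction_vs_naive = 1 - 0.1970 / 0.2145  # relative MAE improvement
r2_gain_vs_ridge = 0.205 - 0.130              # absolute R² improvement
```

Because the naive predictor is a constant, its R² can never be positive; it only reaches zero when the mean log-change on the test set is exactly zero.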
Architecture
- LightGBM regression with L1 objective (MAE)
- 351 trees (early-stopped on validation)
- num_leaves=127, learning_rate=0.05, feature_fraction=0.8, bagging_fraction=0.8, lambda_l1=0.1, lambda_l2=0.1
- 84 features (92 raw, 8 dropped by the audit), 5 categorical handled natively (position, sub_position, foot, career_stage, country_of_citizenship)
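As a rough illustration, the settings above could be expressed as a LightGBM parameter dictionary like the one below. This is a sketch, not the repository's training script: `bagging_freq`, `max_rounds`, `patience` and the function name `train_sketch` are assumptions that do not appear in this card.

```python
# Hyperparameters as listed in this card.
PARAMS = {
    "objective": "regression_l1",  # L1 objective, i.e. optimises MAE
    "num_leaves": 127,
    "learning_rate": 0.05,
    "feature_fraction": 0.8,
    "bagging_fraction": 0.8,
    # ASSUMPTION: not stated in the card. LightGBM only applies
    # bagging_fraction when bagging_freq is greater than zero.
    "bagging_freq": 1,
    "lambda_l1": 0.1,
    "lambda_l2": 0.1,
}

# The five columns the card says are handled natively as categoricals.
CATEGORICAL_FEATURES = [
    "position",
    "sub_position",
    "foot",
    "career_stage",
    "country_of_citizenship",
]


def train_sketch(X_train, y_train, X_valid, y_valid,
                 max_rounds=5000, patience=100):
    """Train with early stopping on a validation set.

    max_rounds and patience are hypothetical values chosen for this sketch.
    """
    # Imported lazily so the config above can be read without LightGBM installed.
    import lightgbm as lgb

    train_set = lgb.Dataset(X_train, label=y_train,
                            categorical_feature=CATEGORICAL_FEATURES)
    valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)
    return lgb.train(
        PARAMS,
        train_set,
        num_boost_round=max_rounds,
        valid_sets=[valid_set],
        callbacks=[lgb.early_stopping(stopping_rounds=patience)],
    )
```

With early stopping, the returned booster's `best_iteration` records how many trees were kept; the card reports 351 for the published model.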
Top features by gain
- country_of_citizenship: the market prices passport over performance
- as_of_year: captures market inflation
- value_diff_1: recent momentum
- log_value: base level
- age
- as_of_month: transfer-window seasonality
- value_diff_3
- international_caps
- age_minus_position_peak: distance from position-specific peak
- value_lag_1
How to use
import lightgbm as lgb
import pandas as pd
from huggingface_hub import hf_hub_download
model_path = hf_hub_download("DanielRegaladoCardoso/player-value-lgbm",
"model.lgb")
booster = lgb.Booster(model_file=model_path)
# X must be a DataFrame whose categorical columns are pd.Categorical
# (see scripts/train_lgbm.py in the repo for the exact preprocessing)
preds = booster.predict(X) # predicts log(value_T+6mo / value_T)
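The categorical requirement mentioned in the comments can be sketched as follows. This is only an illustration of the dtype cast, with made-up rows and a hypothetical helper name `cast_categoricals`; the category levels should match those seen during training, so scripts/train_lgbm.py remains the reference for the full preprocessing.

```python
import pandas as pd

# The five categorical columns named in the Architecture section.
CATEGORICAL_COLUMNS = [
    "position",
    "sub_position",
    "foot",
    "career_stage",
    "country_of_citizenship",
]


def cast_categoricals(frame, columns=CATEGORICAL_COLUMNS):
    """Return a copy of frame with the given columns cast to category dtype."""
    out = frame.copy()
    for col in columns:
        out[col] = out[col].astype("category")
    return out


# Two invented rows; a real X needs all 84 features in training order.
example = pd.DataFrame({
    "position": ["Attack", "Defender"],
    "sub_position": ["Centre-Forward", "Centre-Back"],
    "foot": ["right", "left"],
    "career_stage": ["rising", "peak"],
    "country_of_citizenship": ["Spain", "Brazil"],
    "age": [24, 31],
})
X_example = cast_categoricals(example)
```

The values in `career_stage` and the other columns here are placeholders; the actual label vocabulary comes from the player-value-features dataset.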
To convert a prediction back into euros:
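import numpy as np
# predicted_log_ratio is an entry of preds; current_value_eur is the player's latest market value
predicted_value_eur = current_value_eur * np.exp(predicted_log_ratio)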
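A worked example of that conversion, using only the standard library. The player value and the predicted log-ratio are invented for illustration, and `to_eur` is a hypothetical helper name. The last line applies the same exponential to the test MAE from the table, which gives a rough sense of the typical multiplicative error.

```python
import math


def to_eur(current_value_eur, predicted_log_ratio):
    """Turn a predicted 6-month log-ratio into a euro value."""
    return current_value_eur * math.exp(predicted_log_ratio)


# Invented inputs: a player valued at EUR 10M with a predicted log-change of +0.10.
example_value = to_eur(10_000_000, 0.10)

# A log-change of zero leaves the value unchanged.
unchanged_value = to_eur(10_000_000, 0.0)

# The test MAE of 0.1970 in log space corresponds to this multiplicative factor.
typical_error_factor = math.exp(0.1970)
```

Here `example_value` comes out a little above EUR 11M, and `typical_error_factor` is roughly 1.22, meaning a typical prediction is off by about a fifth of the player's value in either direction.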
Limitations
- The 6-month horizon is a median, not a fixed interval: Transfermarkt updates values irregularly, about 2–3 times per year. The model is trained on that irregular cadence, so predictions are not calibrated for a fixed 180-day horizon.
- R² of +0.20 means ~80% of valuation-change variance is NOT explained by these features — markets are noisy.
- country_of_citizenship being the strongest feature reflects market bias the model captures (and does not endorse).
- Trained only on Transfermarkt's tracked competitions; lower-tier leagues are under-represented.
License
Apache 2.0.