⚽🏀🎾🏏 Sports-Trends Models

Calibrated, leakage-safe match-outcome models, with a model picked for each sport.

Part of Ruslan Magana Sports Intelligence: AI match predictions, live results and trending games, refreshed every day.

🌐 Live dashboard πŸ€— Dataset GitHub License: MIT

TL;DR: This repository hosts the production models behind ruslanmv.com/sports-trends. Each sport gets the algorithm best suited to its dynamics, every model outputs calibrated outcome probabilities (win/draw/loss where draws exist, win/loss otherwise), and the whole training pipeline is leakage-safe by construction. Models are retrained automatically and published under <sport>/latest/.

📊 For information & entertainment only, not betting advice.


🎯 What these models do

Given an upcoming fixture, the models estimate the probability of each outcome:

| Sport | Outcome space | Model | Why this model |
|---|---|---|---|
| ⚽ Football | home / draw / away (3-way) | HistGradientBoostingClassifier | Draws + non-linear Elo×form interactions; gradient boosting handles the 3-way target and feature interactions best. |
| 🏀 Basketball | home / away (2-way) | LogisticRegression | No draws and a strong linear Elo signal; calibrated logistic regression gives clean, well-behaved probabilities. |
| 🎾 Tennis | player 1 / player 2 (2-way) | GradientBoostingClassifier | Head-to-head, surface and form interactions are non-linear; GBDT captures them on short player histories. |
| 🏏 Cricket | home / away (2-way) | RandomForestClassifier | Format-dependent, noisy results; bagged trees are robust to variance and outliers. |
| 🏆 World Cup / international | 90-min result + to-advance | Elo + tournament model | Adds host advantage, neutral venue, confederation strength and stage importance, plus an extra-time/penalties "who advances" layer. |

Every estimator is wrapped in probability calibration (isotonic for tree models, sigmoid for logistic) so a published "62%" really behaves like 62% over many games, with a safe fallback to the raw estimator when a dataset is too small to calibrate.
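The calibration wrapper described above could look roughly like the sketch below. It assumes scikit-learn's CalibratedClassifierCV is the mechanism (the card only names the isotonic and sigmoid methods), and the row threshold, the synthetic data and the helper name `fit_with_calibration` are hypothetical.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

MIN_ROWS_TO_CALIBRATE = 200  # hypothetical threshold, not taken from the pipeline


def fit_with_calibration(estimator, X, y, method):
    """Fit `estimator`, wrapped in calibration when there is enough data."""
    if len(y) < MIN_ROWS_TO_CALIBRATE:
        # Fallback: too few rows to calibrate reliably, so use the raw estimator.
        return estimator.fit(X, y)
    # With cv=3 each fold fits the estimator on 2/3 of the rows
    # and fits the calibrator on the remaining 1/3.
    return CalibratedClassifierCV(estimator, method=method, cv=3).fit(X, y)


# Synthetic stand-in for Elo/form features and a 2-way outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=600) > 0).astype(int)

tree_model = fit_with_calibration(
    RandomForestClassifier(n_estimators=50, random_state=0), X, y, "isotonic")
linear_model = fit_with_calibration(LogisticRegression(), X, y, "sigmoid")
small_model = fit_with_calibration(LogisticRegression(), X[:50], y[:50], "sigmoid")

print(type(tree_model).__name__, type(small_model).__name__)
print(tree_model.predict_proba(X[:2]).round(3))
```

Isotonic calibration is more flexible but needs more data, which is one common reason to pair it with tree ensembles and keep the simpler sigmoid method for a logistic model.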


🧠 How the predictions work

Predictions are not a black box and not scraped odds. They are produced by a transparent, reproducible pipeline. Every published prediction ships with a short plain-language explanation of the drivers behind it.

   Free sports APIs                 Feature engineering            Per-sport model
 ┌──────────────────┐   normalize  ┌────────────────────┐  infer  ┌────────────────┐
 │ fixtures, results│ ───────────▶ │ Elo · form · H2H · │ ──────▶ │ calibrated     │
 │ (multi-source)   │   canonical  │ rest · home adv ·  │         │ probabilities  │
 │  + offline mock  │   schema     │ league/social      │         │ + explanation  │
 └──────────────────┘              └────────────────────┘         └───────┬────────┘
        │                                   ▲                              │
        │            leakage guard: only matches with date < fixture date  │
        └──────────────────────────────────────────────────────────── publish JSON

1. Ingest. Fixtures and results are pulled from free sports APIs (with a public-domain World Cup feed and a deterministic offline mock as fallbacks), then normalized to one canonical schema with stable IDs and de-duplication.
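As an illustration of the ingest step, here is a minimal sketch of normalizing source records and de-duplicating them by a stable ID. The canonical field names, the hashing scheme and the raw record layout are all assumptions made for the example, not the pipeline's actual schema.

```python
import hashlib
from datetime import date


def stable_id(sport, match_date, home, away):
    """Deterministic ID: the same fixture from any source hashes to the same value."""
    key = "|".join([sport, match_date.isoformat(),
                    home.strip().lower(), away.strip().lower()])
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]


def normalize(raw, sport):
    """Map one source-specific record onto a (hypothetical) canonical schema."""
    match_date = date.fromisoformat(raw["date"][:10])  # keep the calendar date only
    home, away = raw["home"].strip(), raw["away"].strip()
    return {
        "id": stable_id(sport, match_date, home, away),
        "sport": sport,
        "match_date": match_date,
        "home": home,
        "away": away,
        "home_score": raw.get("home_score"),
        "away_score": raw.get("away_score"),
    }


def dedupe(matches):
    """Keep one record per ID, preferring a record that already has a result."""
    by_id = {}
    for m in matches:
        kept = by_id.get(m["id"])
        if kept is None or (kept["home_score"] is None and m["home_score"] is not None):
            by_id[m["id"]] = m
    return list(by_id.values())


# Two sources describing the same match: one fixture-only, one with the result.
raw_feed = [
    {"date": "2024-05-01T19:00:00Z", "home": "Team A ", "away": "Team B"},
    {"date": "2024-05-01", "home": "team a", "away": "Team B",
     "home_score": 2, "away_score": 1},
]
canonical = dedupe([normalize(r, "football") for r in raw_feed])
print(len(canonical), canonical[0]["home_score"])
```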

2. Features (leakage-safe). For each fixture we compute only information available before kickoff:

  • Elo ratings: a self-correcting team/player strength rating updated after each result.
  • Recent form: rolling performance over the last N matches.
  • Head-to-head: historical record between the two sides.
  • Rest days & congestion: fatigue from fixture density.
  • Home advantage: venue effect (and, for the World Cup, host-nation + neutral-venue handling).
  • League / tournament strength & stage importance: context weighting.
  • Social interest: popularity signal used for ranking, not for the core outcome.
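A minimal sketch of how two of these features (Elo difference and recent form) can be computed from strictly earlier matches. It uses the standard Elo expected-score formula; the K-factor, the starting rating, the record layout and the function names are assumptions for illustration.

```python
from datetime import date

K = 20.0       # Elo update step (assumed)
BASE = 1500.0  # starting rating for unseen teams (assumed)


def expected_score(r_a, r_b):
    """Standard Elo expectation for side A against side B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def home_result(match):
    """Score from the home side's view: 1 win, 0.5 draw, 0 loss."""
    if match["home_score"] > match["away_score"]:
        return 1.0
    return 0.5 if match["home_score"] == match["away_score"] else 0.0


def pre_match_features(history, fixture, n_form=5):
    """Elo difference and recent form, from matches dated before the fixture."""
    past = sorted((m for m in history if m["match_date"] < fixture["match_date"]),
                  key=lambda m: m["match_date"])
    elo, results = {}, {}
    for m in past:
        h, a = m["home"], m["away"]
        r_h, r_a = elo.get(h, BASE), elo.get(a, BASE)
        delta = K * (home_result(m) - expected_score(r_h, r_a))
        elo[h], elo[a] = r_h + delta, r_a - delta  # zero-sum update
        results.setdefault(h, []).append(home_result(m))
        results.setdefault(a, []).append(1.0 - home_result(m))

    def form(team):
        last = results.get(team, [])[-n_form:]
        return sum(last) / len(last) if last else 0.5  # neutral when no history

    home, away = fixture["home"], fixture["away"]
    return {
        "elo_diff": elo.get(home, BASE) - elo.get(away, BASE),
        "home_form": form(home),
        "away_form": form(away),
    }


history = [
    {"match_date": date(2024, 1, 1), "home": "A", "away": "B",
     "home_score": 2, "away_score": 0},
    # Dated after the fixture below, so it must not influence the features.
    {"match_date": date(2024, 3, 1), "home": "B", "away": "A",
     "home_score": 5, "away_score": 0},
]
fixture = {"match_date": date(2024, 2, 1), "home": "A", "away": "B"}
feats = pre_match_features(history, fixture)
print(feats)
```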

3. Train. Labels and features are joined, split chronologically (train → validation → test, never random), and checked by automated leakage assertions before any model sees them. Each sport trains its zoo model (table above) and the probabilities are calibrated.

4. Infer. The inference window passes through the model to produce per-outcome probabilities. When a trained model isn't available for a fixture, the system degrades gracefully to a transparent Elo heuristic, so the product never shows a blank.
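The chronological split and leakage assertions might be sketched as follows. The 70/15/15 proportions, the `features_as_of` field and the function names are hypothetical; the point is only that rows are ordered by date before splitting and that the ordering is asserted afterwards.

```python
import random
from datetime import date, timedelta


def chronological_split(rows, train_frac=0.7, val_frac=0.15):
    """Sort by match date, then cut into train / validation / test in that order."""
    ordered = sorted(rows, key=lambda r: r["match_date"])
    n = len(ordered)
    i, j = int(n * train_frac), int(n * (train_frac + val_frac))
    return ordered[:i], ordered[i:j], ordered[j:]


def assert_no_leakage(train, val, test):
    """Fail loudly if a later split contains matches older than an earlier one."""
    assert max(r["match_date"] for r in train) <= min(r["match_date"] for r in val)
    assert max(r["match_date"] for r in val) <= min(r["match_date"] for r in test)
    for r in train + val + test:
        # Features must have been computed from data before the match itself.
        assert r["features_as_of"] < r["match_date"]


start = date(2023, 1, 1)
rows = [{"match_date": start + timedelta(days=d),
         "features_as_of": start + timedelta(days=d - 1)} for d in range(100)]
random.Random(0).shuffle(rows)  # input order should not matter

train, val, test = chronological_split(rows)
assert_no_leakage(train, val, test)
print(len(train), len(val), len(test))
```

A random split would mix future matches into training, which inflates holdout metrics for any feature that drifts over time, such as Elo.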
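One way such a fallback could work for a 2-way sport is sketched below: the Elo expected-score formula turns a rating gap into a win probability when no trained model is supplied. The home-advantage value and the function names are assumptions, and a 3-way sport would need an additional assumption about the draw share that is not shown here.

```python
HOME_ADVANTAGE = 60.0  # Elo points credited to the home side (assumed value)


def elo_win_probability(elo_home, elo_away, home_advantage=HOME_ADVANTAGE):
    """Home win probability from the Elo expected-score formula."""
    diff = elo_home + home_advantage - elo_away
    return 1.0 / (1.0 + 10 ** (-diff / 400.0))


def predict_two_way(model, feature_row, elo_home, elo_away):
    """Use the trained model when present, otherwise fall back to Elo."""
    if model is not None:
        proba = model.predict_proba([feature_row])[0]
        probs = {str(label): float(p) for label, p in zip(model.classes_, proba)}
        return {"probabilities": probs, "source": "model"}
    p_home = elo_win_probability(elo_home, elo_away)
    return {"probabilities": {"home": p_home, "away": 1.0 - p_home},
            "source": "elo_heuristic"}


# No trained model available for this fixture, so the heuristic is used.
fallback = predict_two_way(None, [], elo_home=1550.0, elo_away=1500.0)
print(fallback)
```

Recording the `source` alongside the probabilities keeps the fallback transparent to whoever reads the output.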

5. Publish. Outputs are written as static JSON and rendered on the live dashboard, each card carrying its probabilities, a confidence value, and the human-readable reasoning.
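A rough sketch of writing prediction cards as static JSON. The card fields, the file name `predictions.json` and the definition of confidence as the largest class probability are assumptions for illustration; the published format may differ.

```python
import json
import tempfile
from pathlib import Path


def build_card(fixture_id, probabilities, reasons):
    """Assemble one prediction card, refusing probabilities that do not sum to 1."""
    total = sum(probabilities.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"probabilities sum to {total}, expected 1.0")
    return {
        "fixture_id": fixture_id,
        "probabilities": {k: round(v, 3) for k, v in probabilities.items()},
        "confidence": round(max(probabilities.values()), 3),  # assumed definition
        "reasoning": list(reasons),
    }


def publish(cards, out_dir):
    """Write all cards to one JSON file and return its path."""
    path = Path(out_dir) / "predictions.json"
    path.write_text(json.dumps(cards, indent=2, sort_keys=True), encoding="utf-8")
    return path


card = build_card("abc123", {"home": 0.58, "draw": 0.18, "away": 0.24},
                  ["Home side has the higher Elo rating", "Better recent form"])
with tempfile.TemporaryDirectory() as tmp:
    written = json.loads(publish([card], tmp).read_text(encoding="utf-8"))
print(written[0]["confidence"])
```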

🛡️ Why you can trust the numbers

  • Leakage-safe by construction. Features for a fixture use only matches with date < fixture.match_date. A regression test plants a future blowout and asserts the pre-match Elo is unchanged, which guards against peeking at the result.
  • Calibrated probabilities. Outputs are calibrated, so they are meaningful as odds, not just rankings. We report log loss (calibration quality), not only accuracy.
  • Honest baselines. These are well-understood, auditable scikit-learn models, chosen for reliability and explainability over hype.
  • Reproducible & open. The full pipeline is open source on GitHub and runs automatically in CI.
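The regression test mentioned in the first bullet can be sketched like this. It is an illustrative reconstruction, not the repository's actual test: the Elo parameters, the record layout and the function names are assumed.

```python
from datetime import date


def pre_match_elo(history, team, as_of, k=20.0, base=1500.0):
    """Rating of `team` just before `as_of`, using only matches dated before it."""
    elo = {}
    for m in sorted(history, key=lambda m: m["match_date"]):
        if m["match_date"] >= as_of:
            continue  # leakage guard: same-day and future results are ignored
        r_h, r_a = elo.get(m["home"], base), elo.get(m["away"], base)
        expected_h = 1.0 / (1.0 + 10 ** ((r_a - r_h) / 400.0))
        if m["home_score"] > m["away_score"]:
            score_h = 1.0
        elif m["home_score"] == m["away_score"]:
            score_h = 0.5
        else:
            score_h = 0.0
        elo[m["home"]] = r_h + k * (score_h - expected_h)
        elo[m["away"]] = r_a - k * (score_h - expected_h)
    return elo.get(team, base)


def test_future_blowout_does_not_move_pre_match_elo():
    fixture_date = date(2024, 6, 1)
    history = [{"match_date": date(2024, 5, 1), "home": "A", "away": "B",
                "home_score": 1, "away_score": 1}]
    before = pre_match_elo(history, "A", fixture_date)

    # Plant the fixture's own (lopsided) result into the history.
    planted = history + [{"match_date": fixture_date, "home": "A", "away": "B",
                          "home_score": 9, "away_score": 0}]
    assert pre_match_elo(planted, "A", fixture_date) == before

    # Sanity check: the planted result does count once the clock moves past it.
    assert pre_match_elo(planted, "A", date(2024, 6, 2)) > before


test_future_blowout_does_not_move_pre_match_elo()
print("leakage regression test passed")
```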

📦 Repository layout

<sport>/latest/
    model.pkl              # calibrated scikit-learn estimator (joblib)
    feature_schema.json    # ordered feature names + dtypes expected at inference
    metrics.json           # holdout accuracy + log loss for this version
    README.md              # per-sport card
registry/
    latest_versions.json   # production pointer: sport -> {version, path, metrics}

registry/latest_versions.json is the source of truth for which version is live.


🚀 Use a model

import json
import joblib
import pandas as pd
from huggingface_hub import hf_hub_download

REPO = "ruslanmv/sports-trends-models"
SPORT = "football"

model = joblib.load(hf_hub_download(REPO, f"{SPORT}/latest/model.pkl"))
schema = json.load(open(hf_hub_download(REPO, f"{SPORT}/latest/feature_schema.json")))

# Build one row with the features named in feature_schema.json (same order).
features = {name: 0.0 for name in schema["features"]}
X = pd.DataFrame([features])[schema["features"]]

proba = model.predict_proba(X)[0]
print(dict(zip(model.classes_, proba.round(3))))
# e.g. {'home': 0.58, 'draw': 0.18, 'away': 0.24}
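When filling in real values instead of zeros, a small helper can catch missing or misspelled feature names before they reach the model. This is a sketch under the assumption that the schema exposes an ordered "features" list as used above; the feature names in the example and the helper itself are hypothetical.

```python
def build_feature_row(feature_names, values, default=None):
    """Return values ordered as in the schema; raise on missing or unknown names."""
    missing = [name for name in feature_names if name not in values]
    if missing and default is None:
        raise KeyError(f"missing features: {missing}")
    unknown = sorted(set(values) - set(feature_names))
    if unknown:
        raise KeyError(f"features not in schema: {unknown}")
    return [float(values.get(name, default)) for name in feature_names]


# Hypothetical schema content; the real names come from feature_schema.json.
schema = {"features": ["elo_diff", "home_form", "away_form"]}
row = build_feature_row(
    schema["features"],
    {"away_form": 0.4, "elo_diff": 35.0, "home_form": 0.6},
)
print(row)
```

The resulting list can be wrapped in a one-row DataFrame with `columns=schema["features"]` so the column order matches what the model saw in training.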

Check the live production versions and metrics:

import json
from huggingface_hub import hf_hub_download
reg = json.load(open(hf_hub_download("ruslanmv/sports-trends-models",
                                     "registry/latest_versions.json")))
print(reg["football"])  # -> {'version': ..., 'path': 'football/latest/', 'metrics': {...}}

📈 Evaluation

Each version stores its own metrics.json (holdout accuracy and calibrated log loss) produced on a chronological hold-out split. Sports outcomes are high-variance, so treat metrics as relative model-quality signals rather than guarantees: a well-calibrated football model typically lands meaningfully above the 3-way random/majority baseline, and log loss is the metric we optimise for because calibration matters more than raw accuracy for probabilistic predictions.
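To make the two metrics concrete, the sketch below computes multiclass log loss and accuracy by hand and compares them with a uniform 3-way baseline, whose log loss is ln 3 ≈ 1.0986. The probabilities are made-up illustrations, not outputs of the published models.

```python
import math


def log_loss(y_true, probas, eps=1e-15):
    """Mean negative log-probability assigned to the outcome that happened."""
    return -sum(math.log(max(p[y], eps)) for y, p in zip(y_true, probas)) / len(y_true)


def accuracy(y_true, probas):
    """Share of matches where the most probable outcome was the actual one."""
    hits = sum(max(p, key=p.get) == y for y, p in zip(y_true, probas))
    return hits / len(y_true)


y_true = ["home", "draw", "away", "home"]

# Baseline: no information, every outcome equally likely.
uniform = {"home": 1 / 3, "draw": 1 / 3, "away": 1 / 3}
baseline_loss = log_loss(y_true, [uniform] * len(y_true))

# Illustrative model outputs (invented numbers).
model_probas = [
    {"home": 0.60, "draw": 0.25, "away": 0.15},
    {"home": 0.40, "draw": 0.30, "away": 0.30},  # actual result was a draw
    {"home": 0.30, "draw": 0.20, "away": 0.50},
    {"home": 0.55, "draw": 0.25, "away": 0.20},
]
model_loss = log_loss(y_true, model_probas)

print(f"baseline log loss: {baseline_loss:.4f}")
print(f"model log loss:    {model_loss:.4f}")
print(f"model accuracy:    {accuracy(y_true, model_probas):.2f}")
```

Log loss rewards putting probability mass on what actually happened, so a confident wrong prediction costs more than a hedged one; accuracy treats both the same.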

Numbers in the model-index (shown under "Evaluation results") are indicative placeholders; the authoritative, per-version metrics always live in each <sport>/latest/metrics.json.


⚠️ Intended use & limitations

Intended use β€” research, education, sports analytics, and powering the ruslanmv.com/sports-trends dashboard.

Out of scope / limitations

  • 🚫 Not betting advice. Predictions are informational and for entertainment only. No outcome is guaranteed. Please gamble responsibly, if at all.
  • Models reflect their training data: lower-tier competitions and rare matchups carry more uncertainty than top leagues.
  • Free data sources can be delayed or incomplete; the system favours graceful degradation (Elo heuristic / mock) over fabricated precision.
  • The models estimate probabilities, not certainties: variance and upsets are expected.

🧾 Citation

@software{magana_sports_trends_2026,
  author  = {Ruslan Magana Vsevolodovna},
  title   = {Sports-Trends: Calibrated, leakage-safe sports outcome models},
  year    = {2026},
  url      = {https://huggingface.co/ruslanmv/sports-trends-models},
  note     = {Live dashboard: https://ruslanmv.com/sports-trends/}
}

👀 About the author

Built and maintained by Ruslan Magana Vsevolodovna, an AI / ML engineer working on data platforms, MLOps and applied machine learning.

🌐 ruslanmv.com · 📊 Live dashboard · 🤗 Dataset · 💻 GitHub

Powered by Hugging Face 🤗 + GitHub Actions ⚙️ · Licensed MIT · Not betting advice.

Evaluation results

  • Accuracy (holdout, sport-dependent) on ruslanmv/sports-trends-dataset (self-reported): 0.550
  • Log loss (calibrated) on ruslanmv/sports-trends-dataset (self-reported): 0.980