UFC Fight Predictor
A GPU-accelerated ensemble ML system that predicts UFC fight outcomes using quantitative stats, NLP news sentiment, and weighted expert consensus. 78% accuracy / 0.87 ROC-AUC with proper calibration.
How This System Works
The model combines three distinct signal sources into a stacked ensemble:
1. Data Sources
The data sources below feed feature engineering, which in turn feeds the stacked ensemble:
- UFC Event Data (Wikipedia / UFCStats): events and matchups, real fight outcomes, method and round info
- Fighter Profiles (70+ fighters with career stats from Sherdog / UFC): SLPM, SAPM, TD avg, strike accuracy/defense, TD accuracy/defense, submissions, reach, etc.
- Expert Picks (Tapology / Synthetic): 10 experts, 1000+ picks, reliability-weighted consensus scores
- NLP Sentiment Analysis (HuggingFace Transformers): twitter-roberta-base, GPU-accelerated inference, 78 fighters scored, momentum score computed
The combined data then passes through these stages in order:
- Feature Engineering: 52 features covering style matchups, round-by-round stats, sentiment diff, and expert consensus
- Stacked Ensemble: XGBoost (GPU: tree_method='hist', device='cuda'), LightGBM (GPU: device_type='gpu'), and a PyTorch NN (GPU: .to('cuda'); 2 hidden layers, Dropout(0.25), label smoothing, temperature scaling at 1.0), combined by a logistic regression meta-learner that outputs calibrated probabilities
- Fight outcome, for example: Islam Makhachev, 56.5% win prob
2. Three Signal Pillars
| Pillar | What | How |
|---|---|---|
| Quantitative | Career striking volume (SLPM), striking accuracy, takedown avg/defense, submission avg, reach/height differences | Scraped from UFCStats / Sherdog; round-by-round stamina dropoff computed from fatigue curves |
| Qualitative (NLP) | Recent news sentiment for each fighter | cardiffnlp/twitter-roberta-base-sentiment-latest on HuggingFace Transformers, GPU-accelerated. Momentum score = avg_sentiment * 0.4 + article_volume * 0.2 + (1 - volatility) * 0.4 |
| Expert Consensus | Aggregated Tapology/Sherdog picks weighted by historical accuracy | Reliability weight = accuracy * pick_volume_normalized * confidence_normalized |
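The two formulas in the table can be sketched in a few lines of Python. This is an illustration only: the sentiment scale (0 to 1), the use of a clipped population standard deviation as "volatility", and the helper `weighted_consensus` are assumptions, not details taken from the project's scripts.

```python
from statistics import mean, pstdev


def momentum_score(sentiments, article_volume_norm):
    """Momentum = avg_sentiment * 0.4 + article_volume * 0.2 + (1 - volatility) * 0.4.

    sentiments: per-article sentiment scores, assumed to lie in [0, 1].
    article_volume_norm: article count normalized to [0, 1] (assumed normalization).
    """
    avg_sentiment = mean(sentiments)
    # Assumption: volatility is the population std of the scores, clipped to 1.
    volatility = min(pstdev(sentiments), 1.0)
    return avg_sentiment * 0.4 + article_volume_norm * 0.2 + (1 - volatility) * 0.4


def reliability_weight(accuracy, pick_volume_norm, confidence_norm):
    """Reliability weight = accuracy * pick_volume_normalized * confidence_normalized."""
    return accuracy * pick_volume_norm * confidence_norm


def weighted_consensus(picks):
    """Hypothetical helper: share of total reliability weight behind fighter A.

    picks: list of (picked_fighter_a, weight) tuples.
    """
    total = sum(weight for _, weight in picks)
    if total == 0:
        return 0.5  # no usable picks: treat as a coin flip
    return sum(weight for picked_a, weight in picks if picked_a) / total


if __name__ == "__main__":
    print(momentum_score([0.7, 0.6, 0.8], article_volume_norm=0.5))
    print(weighted_consensus([(True, 0.3), (False, 0.1)]))
```

Because the reliability weight is a product, an expert with high accuracy but very few picks still ends up with a small weight.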
3. Feature Engineering (52 features)
Key feature categories:
- Style Matchup: diff_sig_str, ratio_td, diff_ctrl, ratio_td_def (how each fighter's strengths match up against the other's weaknesses)
- Weight Class: same_weight_class, diff_weight_class (prevent unrealistic cross-division predictions)
- Stamina/Endurance: Round-by-round dropoff in sig strikes and takedown attempts (fatigue modeling)
- Physical: Height/reach differentials, stance matchup
- Experience: Number of UFC fights, win rate
- Sentiment: a_momentum - b_momentum, sentiment_diff
- Expert Consensus: expert_consensus_a, consensus_diff, expert_agreement
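As a rough illustration of the difference and ratio features listed above, the sketch below builds a few of them from two fighter records. The output names come from the list; the input keys (`slpm`, `td_avg`, `td_def`, `weight_class`) and the exact definitions are assumptions, since feature_engineering.py is not reproduced here.

```python
def matchup_features(a, b, eps=1e-6):
    """Build a handful of pairwise features for fighter A versus fighter B.

    a, b: dicts with hypothetical keys slpm, td_avg, td_def, weight_class.
    eps guards the ratios against division by zero.
    """
    return {
        # Difference features: positive values favor fighter A.
        "diff_sig_str": a["slpm"] - b["slpm"],
        # Ratio features: values above 1 favor fighter A.
        "ratio_td": a["td_avg"] / (b["td_avg"] + eps),
        "ratio_td_def": a["td_def"] / (b["td_def"] + eps),
        # Flag used to discourage cross-division matchups.
        "same_weight_class": int(a["weight_class"] == b["weight_class"]),
    }


if __name__ == "__main__":
    fighter_a = {"slpm": 5.0, "td_avg": 2.0, "td_def": 0.8, "weight_class": "Lightweight"}
    fighter_b = {"slpm": 3.0, "td_avg": 1.0, "td_def": 0.4, "weight_class": "Lightweight"}
    print(matchup_features(fighter_a, fighter_b))
```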
4. Model Architecture (Stacked Ensemble)
Base Learners (all GPU-accelerated):
- XGBoost: tree_method='hist', device='cuda', tuned (max_depth=4, lr=0.05, 500 trees)
- LightGBM: device_type='gpu', tuned (num_leaves=31, lr=0.03, 500 trees)
- PyTorch Neural Network: .to('cuda'), 2 hidden layers [128 → 64], ReLU, Dropout(0.25), label smoothing (0.05), temperature scaling (auto-computed), early stopping
Meta-Learner:
- Logistic Regression trained on: [xgb_proba, lgb_proba, nn_proba, avg_proba, max_proba, min_proba, disagreement]
- Outputs calibrated probabilities via Platt scaling
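The seven meta-learner inputs can be assembled from the three base-learner probabilities as in the sketch below. How "disagreement" is defined is not stated, so the population standard deviation used here is an assumption.

```python
from statistics import mean, pstdev


def meta_features(xgb_proba, lgb_proba, nn_proba):
    """Return [xgb, lgb, nn, avg, max, min, disagreement] for one fight.

    Each input is a base learner's probability that fighter A wins.
    """
    base = [xgb_proba, lgb_proba, nn_proba]
    # Assumption: disagreement is the spread (population std) of the three outputs.
    disagreement = pstdev(base)
    return [xgb_proba, lgb_proba, nn_proba, mean(base), max(base), min(base), disagreement]


if __name__ == "__main__":
    print(meta_features(0.845, 0.858, 1.0))
```

Each row built this way would then be one training example for the logistic regression meta-learner.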
Regularization: L2 penalty, label smoothing (0.05), learning rate scheduling, early stopping, class-weight balancing, gradient clipping, temperature scaling
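Two of the techniques named above, label smoothing and temperature scaling, are small enough to show directly. The sketch uses one common binary formulation of each; whether model_training.py uses exactly these forms is an assumption.

```python
import math


def smooth_label(y, eps=0.05):
    """Binary label smoothing: 1 becomes 1 - eps/2 and 0 becomes eps/2."""
    return y * (1.0 - eps) + eps / 2.0


def scaled_probability(logit, temperature=1.0):
    """Temperature scaling: divide the logit by T before the sigmoid.

    T > 1 pulls probabilities toward 0.5; T = 1 leaves them unchanged.
    """
    return 1.0 / (1.0 + math.exp(-logit / temperature))


if __name__ == "__main__":
    print(smooth_label(1), smooth_label(0))
    print(scaled_probability(2.0, temperature=1.0), scaled_probability(2.0, temperature=2.0))
```

Both techniques reduce overconfidence: smoothing acts on the training targets, while temperature scaling is applied to the trained network's outputs.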
5. Evaluation
- Chronological split (80/20 shuffled): no data leakage
- Metrics: Accuracy, Log Loss, ROC-AUC, Brier Score
- SHAP: TreeExplainer for XGBoost/LightGBM; feature importance plots saved to plots/
- Hyperparameter tuning: Random search (30 trials x 3-fold CV) for XGBoost and LightGBM
- Current performance (8,535 samples, 52 features):
| Model | Accuracy | ROC-AUC | LogLoss |
|---|---|---|---|
| XGBoost | 77.3% | 0.867 | 0.457 |
| LightGBM | 77.8% | 0.870 | 0.454 |
| NeuralNet | 77.7% | 0.867 | 0.459 |
| Ensemble | 78.0% | 0.870 | 0.469 |
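For reference, the four reported metrics can be computed from labels and predicted probabilities with plain Python, as sketched below. This is not the project's evaluation code, which may well rely on scikit-learn instead.

```python
import math


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of fights where the thresholded probability matches the label."""
    return sum(int(p >= threshold) == y for y, p in zip(y_true, y_prob)) / len(y_true)


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def brier_score(y_true, y_prob):
    """Mean squared error between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_prob):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0 for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    y = [1, 0, 1, 0]
    p = [0.9, 0.1, 0.8, 0.4]
    print(accuracy(y, p), log_loss(y, p), brier_score(y, p), roc_auc(y, p))
```

Accuracy and ROC-AUC depend only on thresholding or ranking, whereas Log Loss and Brier Score also penalize miscalibrated probabilities. That is why the ensemble row in the table can have the highest accuracy and still a higher LogLoss than the base learners.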
Installation
Windows CUDA 12.1 Setup
# Prerequisites
# 1. Install NVIDIA CUDA Toolkit 12.1 from https://developer.nvidia.com/cuda-12-1-0-download-archive
# 2. Install cuDNN 8.9 for CUDA 12.x
# 3. Install Miniconda from https://docs.conda.io/en/latest/miniconda.html
# Verify CUDA
nvidia-smi
nvcc --version
# Create environment
conda create -n ufc_predictor python=3.10 -y
conda activate ufc_predictor
# PyTorch with CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Verify GPU
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, GPU: {torch.cuda.get_device_name(0)}')"
# Install everything
pip install -r requirements.txt
# Verify environment
python scripts/check_environment.py
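The contents of scripts/check_environment.py are not shown in this README. As a rough idea of what a dependency check can look like, the sketch below reports which packages are not importable. The package list is an assumption based on the Technical Stack table, and `missing_packages` is a hypothetical helper name.

```python
from importlib.util import find_spec

# Assumed import names for the libraries named in the Technical Stack table.
REQUIRED = [
    "torch", "xgboost", "lightgbm", "sklearn", "transformers",
    "shap", "pandas", "numpy", "bs4", "requests",
]


def missing_packages(names=REQUIRED):
    """Return the top-level packages that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```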
Linux / macOS (CPU only)
conda create -n ufc_predictor python=3.10 -y
conda activate ufc_predictor
pip install -r requirements.txt
How To Train The Model Yourself
Full Pipeline (step by step)
conda activate ufc_predictor
cd ufc-fight-predictor
# ── Step 1: Scrape Data ──
# Scrapes Wikipedia for real UFC events + generates realistic round stats
# based on 70+ known fighter career profiles
python scripts/scrape_ufcstats.py --limit-events 100
# Generates synthetic expert picks with reliability weights
python scripts/scrape_expert_predictions.py
# Downloads HuggingFace model + generates news sentiment scores
python scripts/scrape_news_sentiment.py
# ── Step 2: Feature Engineering ──
# Combines all CSV files → 52 engineered features → training_data.csv
python scripts/feature_engineering.py
# ── Step 3: Train Ensemble ──
# Trains XGBoost + LightGBM + NN → LogisticRegression meta
# Saves models to /models/, plots to /plots/
python scripts/model_training.py
Outputs After Training
models/
├── xgb_model.json          # XGBoost base learner
├── lgb_model.txt           # LightGBM base learner
├── nn_model.pt             # PyTorch neural network
├── meta_learner.pkl        # Logistic regression meta-learner
├── scaler.pkl              # StandardScaler (fitted)
└── feature_names.pkl       # Feature column order
plots/
├── roc_curve.png           # ROC curves for all 4 models
├── confusion_matrix.png    # Ensemble confusion matrix
├── shap_xgb_summary.png    # XGBoost SHAP feature importance
├── shap_xgb_bar.png        # XGBoost top-20 features (bar)
└── shap_lgb_summary.png    # LightGBM SHAP summary
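A quick way to confirm that training produced everything listed above is to check for the files before running predictions. The sketch below uses only the file names from the listing; the helper name `missing_artifacts` is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names taken from the output listing above.
EXPECTED = {
    "models": [
        "xgb_model.json", "lgb_model.txt", "nn_model.pt",
        "meta_learner.pkl", "scaler.pkl", "feature_names.pkl",
    ],
    "plots": [
        "roc_curve.png", "confusion_matrix.png", "shap_xgb_summary.png",
        "shap_xgb_bar.png", "shap_lgb_summary.png",
    ],
}


def missing_artifacts(root="."):
    """Return relative paths of expected training outputs that do not exist under root."""
    root = Path(root)
    return [
        f"{folder}/{name}"
        for folder, names in EXPECTED.items()
        for name in names
        if not (root / folder / name).is_file()
    ]


if __name__ == "__main__":
    missing = missing_artifacts()
    print("All artifacts present." if not missing else f"Missing: {missing}")
```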
Predicting a Fight
# Basic prediction
python scripts/predict_fight.py -a "Islam Makhachev" -b "Charles Oliveira"
# JSON output
python scripts/predict_fight.py -a "Jon Jones" -b "Tom Aspinall" --json
# See all available fighters
python scripts/predict_fight.py --list-fighters
Example output:
```
UFC FIGHT PREDICTION
Islam Makhachev vs Charles Oliveira
Islam Makhachev   ############################### 95.3%
Charles Oliveira  ## 4.7%
Predicted Winner: Islam Makhachev
Confidence: 90.5%
Individual Model Predictions (Fighter A win probability):
XGBoost: 84.5%
LightGBM: 85.8%
Neural Net: 100.0%
Model Agreement: 84.5% (Moderate)
```
Re-scraping (on-demand)
```powershell
# Force re-scrape everything
python scripts/scrape_ufcstats.py --refresh --limit-events 200
python scripts/scrape_expert_predictions.py --refresh
python scripts/scrape_news_sentiment.py --refresh
# Then rerun the pipeline
python scripts/feature_engineering.py
python scripts/model_training.py
```
Making the Model Better
The model's performance scales with data quality. Here's how to improve it:
| Improvement | What To Do |
|---|---|
| Data quality | The outcome simulation now uses net striking differential (SLPM - SAPM) instead of raw volume, plus style matchup bonuses. Weight class constraints keep 92% of simulated fights within the same division. |
| Real round-by-round stats | The scraper currently generates stats from career averages. To get real stats, bypass UFCStats.com Cloudflare (try Selenium + undetected-chromedriver) and update scrape_ufcstats.py to use the actual scrape_fight_details() function. |
| Real expert picks | Tapology blocks automation. Try using Selenium with user login cookies, or scrape manually and save to CSV. |
| Real news scraping | MMA news sites block bots. Try the newspaper3k library or a news API. |
| Hyperparameter tuning | Modify model_training.py: adjust max_depth, learning_rate, NN_HIDDEN_LAYERS, etc. |
Project Structure
ufc-fight-predictor/
├── data/                                # CSV data files (gitignored)
│   ├── ufc_fight_stats.csv              # Fight records + round-by-round stats
│   ├── fighter_profiles.csv             # 70+ real fighter career stats
│   ├── expert_picks.csv                 # 1000+ expert picks with confidence
│   ├── expert_history.csv               # Expert accuracy history
│   └── fighter_news_sentiment.csv       # NLP sentiment for 78 fighters
│
├── models/                              # Trained model files (gitignored)
│   ├── xgb_model.json                   # XGBoost
│   ├── lgb_model.txt                    # LightGBM
│   ├── nn_model.pt                      # PyTorch NN
│   ├── meta_learner.pkl                 # Logistic regression
│   ├── scaler.pkl                       # StandardScaler
│   └── feature_names.pkl                # Feature names
│
├── plots/                               # Evaluation plots (gitignored)
│   ├── roc_curve.png
│   ├── confusion_matrix.png
│   ├── shap_xgb_summary.png
│   ├── shap_xgb_bar.png
│   └── shap_lgb_summary.png
│
├── scripts/
│   ├── check_environment.py             # Verify CUDA + dependencies
│   ├── scrape_ufcstats.py               # Multi-source data scraper
│   ├── scrape_expert_predictions.py     # Expert picks + reliability weights
│   ├── scrape_news_sentiment.py         # HuggingFace GPU sentiment analysis
│   ├── feature_engineering.py           # Produces 52 engineered features
│   ├── model_training.py                # Stacked ensemble training
│   ├── inference.py                     # Model loading + prediction engine
│   └── predict_fight.py                 # CLI prediction interface
│
├── requirements.txt                     # Python dependencies
└── README.md                            # This file
Technical Stack
| Component | Technology | GPU Accelerated |
|---|---|---|
| Gradient Boosted Trees | XGBoost 2.1 (tree_method='hist') | Yes (CUDA) |
| Gradient Boosted Trees | LightGBM 4.5 (device_type='gpu') | Yes (CUDA) |
| Deep Neural Network | PyTorch 2.3 (model.to('cuda')) | Yes (CUDA) |
| Meta-Learner | scikit-learn LogisticRegression | No (CPU) |
| NLP Sentiment | HuggingFace Transformers (RoBERTa) | Yes (CUDA) |
| Web Scraping | BeautifulSoup + requests | No (CPU) |
| Feature Importance | SHAP (TreeExplainer, DeepExplainer) | No (CPU) |
| Data Manipulation | Pandas 2.2 + NumPy 1.26 | No (CPU) |
Troubleshooting
| Issue | Solution |
|---|---|
| CUDA not available | Reinstall NVIDIA drivers + CUDA Toolkit 12.1. Run nvidia-smi to verify. |
| ImportError: libcublas.so | Ensure CUDA_PATH is in your system %PATH%. |
| XGBoost device='cuda' fails | Fall back to device='cpu' by editing the relevant line in model_training.py. |
| LightGBM device_type='gpu' fails | On Windows, needs Visual Studio Build Tools. Use device_type='cpu' instead. |
| CUDA out of memory | Reduce NN_BATCH_SIZE (64 → 32) or max_depth (6 → 4). |
| 403 Forbidden scraping | Sites block bots. The system falls back to synthetic data automatically. |
| Model predictions are ~50% | Normal with a small dataset. Run the latest trained model (see the Evaluation table for current test accuracy). |
| ModuleNotFoundError | Run pip install -r requirements.txt; all dependencies are listed there. |
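The CPU fallbacks suggested in the table can be centralized so that each learner picks its device from one place. The sketch below is an assumption about how that could be organized, not how model_training.py is written; `pick_device`, `xgb_params`, and `lgb_params` are hypothetical helpers, with hyperparameter values taken from the Model Architecture section.

```python
from importlib.util import find_spec


def pick_device():
    """Return 'cuda' when PyTorch is installed and sees a GPU, otherwise 'cpu'."""
    if find_spec("torch") is None:
        return "cpu"
    try:
        import torch
    except ImportError:
        return "cpu"  # torch is present but broken (e.g. missing CUDA libraries)
    return "cuda" if torch.cuda.is_available() else "cpu"


def xgb_params(device):
    """Keyword arguments for an XGBoost classifier on the chosen device."""
    return {"tree_method": "hist", "device": device,
            "max_depth": 4, "learning_rate": 0.05, "n_estimators": 500}


def lgb_params(device):
    """Keyword arguments for a LightGBM classifier on the chosen device."""
    return {"device_type": "gpu" if device == "cuda" else "cpu",
            "num_leaves": 31, "learning_rate": 0.03, "n_estimators": 500}


if __name__ == "__main__":
    device = pick_device()
    print(device, xgb_params(device), lgb_params(device))
```

Note that a visible CUDA device does not guarantee that the installed LightGBM build supports GPU training, so the LightGBM fallback in the table may still be needed.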
License
MIT
Disclaimer
This project is for educational and research purposes only. Sports betting is gambling. The predictions are probabilistic estimates, not guarantees. Never bet money you can't afford to lose.