🏃 NYC 2025 Marathon - Finish Time Predictor

📹 Presentation Video

📌 Project Overview

This project analyzes the NYC 2025 Marathon dataset from HuggingFace, containing 1.88 million rows of split times for every runner at every kilometer checkpoint.

Goal: Predict a runner's total finish time based on early race performance and demographic features.

Dataset: NYC 2025 Marathon Splits

📊 Key Results

Model	MAE	R²
Baseline Linear Regression	10.13 min	0.946
Linear Regression (Engineered)	2.52 min	0.996
Random Forest ✅	1.37 min	0.997
XGBoost	1.82 min	0.997
Random Forest (Tuned) 🏆	0.03 min	0.977

Classification Results (3 Classes: Fast / Average / Slow)

Model	Accuracy
Logistic Regression 🏆	99%
Random Forest	98%
XGBoost	98%

🔧 Methodology

1. Data Preparation

Sampled 12,000 unique runners (random_state=42)
Converted all time columns from H:MM:SS to seconds
Removed outliers (finish times under 2 hours or over 12 hours)
Pivoted from long format to wide format (one row per runner)
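The preparation steps above can be sketched on a toy table as follows. This is an illustration only, not the notebook's code: the column names (`runner_id`, `checkpoint`, `time`) and the helper `hms_to_seconds` are assumptions, and the outlier filter is applied after the pivot here purely for brevity.

```python
import pandas as pd

def hms_to_seconds(value: str) -> int:
    """Convert an 'H:MM:SS' string to a whole number of seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds

# Toy long-format data: one row per runner per checkpoint.
# Column names are hypothetical, not taken from the real dataset.
long_df = pd.DataFrame({
    "runner_id": [1, 1, 1, 2, 2, 2],
    "checkpoint": ["5K", "HALF", "FINISH"] * 2,
    "time": ["0:25:00", "1:50:00", "3:45:30", "0:20:00", "1:05:00", "1:55:00"],
})

# Sample unique runners reproducibly (the project sampled 12,000; here only 2 exist).
N_RUNNERS = 2
sampled_ids = long_df["runner_id"].drop_duplicates().sample(n=N_RUNNERS, random_state=42)
long_df = long_df[long_df["runner_id"].isin(sampled_ids)].copy()

# Convert H:MM:SS strings to seconds.
long_df["seconds"] = long_df["time"].map(hms_to_seconds)

# Pivot to wide format: one row per runner, one column per checkpoint.
wide_df = long_df.pivot(index="runner_id", columns="checkpoint", values="seconds")
wide_df = wide_df.add_prefix("split_")

# Drop implausible finish times (under 2 hours or over 12 hours).
plausible = wide_df["split_FINISH"].between(2 * 3600, 12 * 3600)
wide_df = wide_df[plausible]
```

In this toy example runner 2 is dropped, because a 1:55:00 finish falls below the 2-hour threshold.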

2. EDA & Research Questions

Q1: Do different countries use different pacing strategies?
Q2: Do older runners start more conservatively?
Q3: Which countries have the fastest runners?

3. Feature Engineering

One-Hot Encoding for Gender
Pacing ratio features (5K pace / overall pace)
First half speed feature
K-Means Clustering (4 clusters) → cluster_id as new feature
Distance to centroid feature
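A minimal sketch of these features on synthetic data is shown below. Several details are assumptions rather than the notebook's definitions: the column names, `first_half_speed` computed as half-marathon distance divided by `split_HALF`, and the choice of features fed to K-Means. The pacing ratio here divides 5K pace by first-half pace instead of overall pace, since overall pace requires the finish time and could leak the target.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

HALF_MARATHON_KM = 21.0975

rng = np.random.default_rng(42)
n = 200

# Synthetic splits in seconds (hypothetical columns, not the real dataset).
split_5k = rng.uniform(1_200, 2_400, size=n)
split_half = split_5k * HALF_MARATHON_KM / 5 * rng.uniform(0.98, 1.10, size=n)
df = pd.DataFrame({
    "gender": np.tile(["F", "M"], n // 2),
    "split_5K": split_5k,
    "split_HALF": split_half,
})

# One-hot encode gender into gender_F / gender_M columns.
df = pd.get_dummies(df, columns=["gender"])

# Assumed definition: average speed over the first half, in metres per second.
df["first_half_speed"] = HALF_MARATHON_KM * 1000 / df["split_HALF"]

# Pacing ratio: 5K pace relative to first-half pace (seconds per km).
# Values below 1 mean the runner started faster than their first-half average.
pace_5k = df["split_5K"] / 5.0
pace_half = df["split_HALF"] / HALF_MARATHON_KM
df["pacing_ratio"] = pace_5k / pace_half

# Cluster on scaled features so neither dominates the distance metric.
features = StandardScaler().fit_transform(df[["first_half_speed", "pacing_ratio"]])
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(features)

# transform() returns the distance from each row to every centroid;
# the nearest centroid gives the cluster id and its distance.
distances = kmeans.transform(features)
df["cluster_id"] = distances.argmin(axis=1)
df["dist_to_centroid"] = distances.min(axis=1)
```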

4. Models Trained

Regression: Linear Regression, Random Forest, XGBoost + Hyperparameter Tuning

Classification: Logistic Regression, Random Forest, XGBoost
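As a rough illustration of how the regression models can be compared on MAE (in minutes) and R², the sketch below trains two of them on synthetic data. The features, the synthetic target, and the hyperparameters are assumptions made for the example; XGBoost and the tuning step are omitted to keep it short.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500

# Synthetic early-race features (hypothetical, not the real dataset).
split_half = rng.uniform(4_000, 9_000, size=n)   # half-marathon split, seconds
pacing_ratio = rng.uniform(0.90, 1.05, size=n)   # below 1 = fast start

# Synthetic target: roughly double the half split, slower after a fast start.
finish = 2.0 * split_half * (1.0 + 0.5 * (1.05 - pacing_ratio))
finish = finish + rng.normal(0, 120, size=n)     # add noise, in seconds

X = np.column_stack([split_half, pacing_ratio])
X_train, X_test, y_train, y_test = train_test_split(
    X, finish, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "mae_min": mean_absolute_error(y_test, pred) / 60.0,  # seconds to minutes
        "r2": r2_score(y_test, pred),
    }

for name, scores in results.items():
    print(f"{name}: MAE {scores['mae_min']:.2f} min, R2 {scores['r2']:.3f}")
```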

🏆 Key Findings

first_half_speed is the most important feature (48.9% importance)
split_HALF is the second most important feature (36.8%)
Runners who start too fast in the first 5K tend to finish slower
Brazil and Great Britain produce the fastest recreational runners
Peak marathon performance age is 25-34

🎨 Bonus Work

Hyperparameter Tuning → 97.5% MAE improvement
Interactive 3D World Map (Plotly)
Animated Racing Bar Chart
Interactive Scatter Plot with hover details
Business & Domain Insights for marathon coaches

📊 Visualizations

EDA - Research Questions

Pacing strategies by country, age vs pacing, and fastest countries

Runner Clusters (K-Means)

4 distinct runner groups identified: Fast, Mid-Pack Younger, Mid-Pack Older, and Slow Runners

Feature Importance - Tuned Random Forest

first_half_speed (48.9%) and split_HALF (36.8%) are the strongest predictors

Model Comparison - MAE

Among the untuned models, Random Forest achieved the lowest MAE (1.37 minutes), an 87% improvement over the baseline
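The improvement figure can be reproduced as a relative reduction in MAE. The helper name `mae_improvement` is hypothetical. With the rounded MAE values from the results table the result is about 86.5%, so a figure computed from unrounded MAEs may differ slightly.

```python
def mae_improvement(baseline_mae: float, model_mae: float) -> float:
    """Relative MAE reduction versus a baseline, as a percentage."""
    return (baseline_mae - model_mae) / baseline_mae * 100.0

# Baseline Linear Regression (10.13 min) vs. Random Forest (1.37 min).
print(round(mae_improvement(10.13, 1.37), 1))  # 86.5
```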

Confusion Matrices - Classification

Logistic Regression achieved 99% accuracy across all 3 classes

Class Distribution

Perfectly balanced classes: Fast / Average / Slow (33.3% each)
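An even 33.3% split is what tertile binning of the finish time would produce. The sketch below assumes that approach with `pd.qcut`; the notebook may define the classes differently, and the finish times here are synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic finish times in minutes (not the real dataset).
finish_min = pd.Series(rng.normal(260, 45, size=900))

# Split at the 1/3 and 2/3 quantiles; the lowest times are labelled "Fast".
speed_class = pd.qcut(finish_min, q=3, labels=["Fast", "Average", "Slow"])

# Share of runners in each class.
shares = speed_class.value_counts(normalize=True)
print(shares)
```

Because the cut points are quantiles of the sample itself, the classes are balanced by construction, which also makes plain accuracy a reasonable metric for the classifiers.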

📁 Repository Contents

random_forest_model.pkl - Best regression model
logistic_regression_classifier.pkl - Best classification model
notebook.ipynb - Full analysis notebook

🔍 Reflections & Lessons Learned

Biggest Challenge: The raw data was in long format (1.88M rows), with one row per runner per checkpoint. Converting it to wide format (one row per runner) was the most complex data engineering step.
Data Leakage: Early model results were suspiciously perfect because I had accidentally included end-of-race data as features. Removing those columns brought the model back to realistic performance.
Surprising Finding: Logistic Regression, the simplest classification model, outperformed Random Forest and XGBoost with 99% accuracy. Sometimes the simplest solution wins.
Key Insight: first_half_speed alone accounts for about 49% of the tuned Random Forest's feature importance. Pacing in the first half of the race is the strongest signal of the final finish time.
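One way to guard against the leakage described above is to whitelist split columns by checkpoint before training. The sketch below is a hypothetical helper, not the notebook's code: the checkpoint order, the `split_` prefix convention, and the function name `early_race_columns` are all assumptions.

```python
# Hypothetical checkpoint order; the real dataset may use different labels.
CHECKPOINT_ORDER = [
    "5K", "10K", "15K", "20K", "HALF", "25K", "30K", "35K", "40K", "FINISH",
]

def early_race_columns(columns, last_allowed="HALF", prefix="split_"):
    """Keep non-split columns plus split columns up to `last_allowed`."""
    cutoff = CHECKPOINT_ORDER.index(last_allowed)
    allowed = {prefix + name for name in CHECKPOINT_ORDER[: cutoff + 1]}
    return [
        col for col in columns
        if not col.startswith(prefix) or col in allowed
    ]

all_columns = ["age", "gender_M", "split_5K", "split_HALF", "split_30K", "split_FINISH"]
feature_columns = early_race_columns(all_columns)
print(feature_columns)  # ['age', 'gender_M', 'split_5K', 'split_HALF']
```

Any derived feature computed from late-race or finish data would need the same check, since this filter only inspects column names.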

🛠 Libraries Used

pandas, numpy, scikit-learn, xgboost, matplotlib, seaborn, plotly, datasets
