🏃 NYC 2025 Marathon - Finish Time Predictor

📹 Presentation Video

🎥 Click here to watch the presentation


📌 Project Overview

This project analyzes the NYC 2025 Marathon dataset hosted on Hugging Face, which contains 1.88 million rows of split times: one row per runner per checkpoint.

Goal: Predict a runner's total finish time based on early race performance and demographic features.

Dataset: NYC 2025 Marathon Splits


📊 Key Results

| Model | MAE | R² |
|---|---|---|
| Baseline Linear Regression | 10.13 min | 0.946 |
| Linear Regression (Engineered) | 2.52 min | 0.996 |
| Random Forest ✅ | 1.37 min | 0.997 |
| XGBoost | 1.82 min | 0.997 |
| Random Forest (Tuned) 🏆 | 0.03 min | 0.977 |
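
For reference, the two metrics in the table can be computed as in the sketch below. It uses plain Python and hypothetical finish times in minutes (not values from the dataset); scikit-learn's `mean_absolute_error` and `r2_score` compute the same quantities.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical finish times in minutes, for illustration only
actual = [180.0, 240.0, 300.0, 360.0]
predicted = [182.0, 237.0, 301.0, 358.0]

mae = mean_absolute_error(actual, predicted)  # 2.0 minutes
r2 = r2_score(actual, predicted)              # 0.999
```

MAE is based on absolute errors and R² on squared errors, so the two metrics do not always rank models in the same order.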

Classification Results (3 Classes: Fast / Average / Slow)

| Model | Accuracy |
|---|---|
| Logistic Regression 🏆 | 99% |
| Random Forest | 98% |
| XGBoost | 98% |

🔧 Methodology

1. Data Preparation

  • Sampled 12,000 unique runners (random_state=42)
  • Converted all time columns from H:MM:SS to seconds
  • Removed outliers (finish times under 2 hours or over 12 hours)
  • Pivoted from long format to wide format (one row per runner)
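
The preparation steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the notebook's code: the column names `runner_id`, `checkpoint`, and `time` and the tiny sample frame are assumptions, and the `split_` prefix simply mirrors the `split_HALF` feature name used in this README.

```python
import pandas as pd

# Hypothetical long-format sample: one row per runner per checkpoint.
# Column names are assumptions, not the dataset's actual schema.
long_df = pd.DataFrame({
    "runner_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "checkpoint": ["5K", "HALF", "FINISH"] * 3,
    "time": ["0:25:00", "1:45:00", "3:35:00",
             "0:30:00", "2:10:00", "4:30:00",
             "0:20:00", "1:00:00", "1:55:00"],  # runner 3 finishes in under 2 hours
})

# Sample unique runners reproducibly (the README samples 12,000; here n=3).
sampled_ids = pd.Series(long_df["runner_id"].unique()).sample(n=3, random_state=42)
long_df = long_df[long_df["runner_id"].isin(sampled_ids)].copy()

# Convert H:MM:SS strings to seconds.
long_df["seconds"] = pd.to_timedelta(long_df["time"]).dt.total_seconds()

# Pivot long -> wide: one row per runner, one column per checkpoint.
wide_df = long_df.pivot(index="runner_id", columns="checkpoint", values="seconds")
wide_df = wide_df.add_prefix("split_")

# Remove outliers: keep finish times between 2 and 12 hours.
in_range = wide_df["split_FINISH"].between(2 * 3600, 12 * 3600)
wide_df = wide_df[in_range]
```

In this toy sample, runner 3 is dropped by the outlier filter, leaving one row each for runners 1 and 2.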

2. EDA & Research Questions

  • Q1: Do different countries use different pacing strategies?
  • Q2: Do older runners start more conservatively?
  • Q3: Which countries have the fastest runners?

3. Feature Engineering

  • One-Hot Encoding for Gender
  • Pacing ratio features (5K pace / overall pace)
  • First half speed feature
  • K-Means Clustering (4 clusters) → cluster_id as new feature
  • Distance to centroid feature
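
A minimal sketch of the pacing-ratio and first-half-speed features is shown below. The README does not define "overall pace" precisely; this sketch assumes it is the average pace up to the half-marathon checkpoint, so that the feature does not depend on the finish time. The function names and the km/h unit are also assumptions.

```python
HALF_MARATHON_KM = 21.0975


def pace_sec_per_km(split_seconds, distance_km):
    """Average pace in seconds per kilometer over the given distance."""
    return split_seconds / distance_km


def pacing_ratio(split_5k, split_half):
    """5K pace divided by first-half pace (assumed definition).

    Values below 1 mean the first 5K was run faster than the
    runner's average over the first half.
    """
    pace_5k = pace_sec_per_km(split_5k, 5.0)
    pace_half = pace_sec_per_km(split_half, HALF_MARATHON_KM)
    return pace_5k / pace_half


def first_half_speed(split_half):
    """Average speed over the first half, in km/h."""
    return HALF_MARATHON_KM / (split_half / 3600.0)


# Hypothetical runner holding an even 5:00 min/km pace through the half
ratio = pacing_ratio(split_5k=1500.0, split_half=6329.25)
speed = first_half_speed(6329.25)
```

An evenly paced runner gets a ratio close to 1.0; here the speed works out to about 12 km/h.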

4. Models Trained

Regression: Linear Regression, Random Forest, XGBoost + Hyperparameter Tuning

Classification: Logistic Regression, Random Forest, XGBoost
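
The regression comparison can be sketched with scikit-learn as below. The data is synthetic and the hyperparameters are placeholders, so the numbers it produces are unrelated to the results table. XGBoost is left out to keep the dependencies small; `xgboost.XGBRegressor` follows the same `fit`/`predict` pattern.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wide-format runner table (not real race data).
rng = np.random.default_rng(42)
n = 500
split_5k = rng.normal(1700, 250, n)                   # seconds at 5K
split_half = split_5k * 4.3 + rng.normal(0, 120, n)   # seconds at the half
finish_min = (split_half * 2.1 + rng.normal(0, 300, n)) / 60.0

X = np.column_stack([split_5k, split_half])
X_train, X_test, y_train, y_test = train_test_split(
    X, finish_min, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model and collect MAE (minutes) and R² on the held-out split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "mae_min": mean_absolute_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }
```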


🏆 Key Findings

  • first_half_speed is the most important feature (48.9% importance)
  • split_HALF is the second most important feature (36.8%)
  • Runners who start too fast in the first 5K tend to finish slower
  • Among the sampled runners, Brazil and Great Britain have the fastest recreational runners
  • Peak marathon performance age is 25-34

🎨 Bonus Work

  • Hyperparameter Tuning → 97.5% MAE improvement
  • Interactive 3D World Map (Plotly)
  • Animated Racing Bar Chart
  • Interactive Scatter Plot with hover details
  • Business & Domain Insights for marathon coaches

📊 Visualizations

EDA - Research Questions

EDA Plots: Pacing strategies by country, age vs. pacing, and fastest countries

Runner Clusters (K-Means)

Clusters: Four distinct runner groups identified (Fast, Mid-Pack Younger, Mid-Pack Older, and Slow Runners)
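
The cluster_id and distance-to-centroid features could be derived along these lines. This sketch assumes the clustering inputs were age and first-half speed (suggested by the group names, but not stated in this README), uses synthetic data, and adds standardization as a design choice of the sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic features (assumed): age in years, first-half speed in km/h.
rng = np.random.default_rng(42)
X = np.column_stack([rng.uniform(20, 70, 200), rng.normal(10.5, 1.5, 200)])

# Scale so that age and speed contribute comparably to Euclidean distance.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
cluster_id = kmeans.fit_predict(X_scaled)

# transform() returns each runner's distance to every centroid;
# select the column belonging to the runner's own cluster.
distances = kmeans.transform(X_scaled)
dist_to_centroid = distances[np.arange(len(X_scaled)), cluster_id]
```

Both `cluster_id` and `dist_to_centroid` can then be appended as columns to the feature table.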

Feature Importance - Tuned Random Forest

Feature Importance: first_half_speed (48.9%) and split_HALF (36.8%) are the strongest predictors

Model Comparison - MAE

Model Comparison: Among the untuned models, Random Forest achieved the lowest MAE (1.37 minutes), an 87% improvement over the baseline
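
The improvement figure is the relative reduction in MAE compared with the baseline. A short worked check using the values from the results table (the helper name is hypothetical):

```python
def relative_improvement(baseline_mae, model_mae):
    """Fraction by which model_mae is lower than baseline_mae."""
    return (baseline_mae - model_mae) / baseline_mae


# Baseline Linear Regression (10.13 min) vs. Random Forest (1.37 min)
improvement = relative_improvement(10.13, 1.37)
print(f"{improvement:.1%}")  # 86.5%, which rounds to the 87% quoted above
```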

Confusion Matrices - Classification

Confusion Matrices: Logistic Regression achieved 99% accuracy across all 3 classes

Class Distribution

Class Distribution: Perfectly balanced classes, Fast / Average / Slow (33.3% each)
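
Classes of equal size like these typically come from cutting the finish times at their terciles. The sketch below assumes that is how the labels were built (the README does not say) and uses hypothetical finish times in minutes.

```python
import statistics


def tercile_labels(finish_minutes):
    """Label each finish time Fast / Average / Slow using tercile cut points."""
    lower, upper = statistics.quantiles(finish_minutes, n=3)
    labels = []
    for t in finish_minutes:
        if t <= lower:
            labels.append("Fast")
        elif t <= upper:
            labels.append("Average")
        else:
            labels.append("Slow")
    return labels


# Hypothetical finish times in minutes
times = [180, 200, 220, 240, 260, 280, 300, 320, 340]
labels = tercile_labels(times)
```

With nine evenly spaced times, each class receives three runners.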

📁 Repository Contents

  • random_forest_model.pkl - Best regression model
  • logistic_regression_classifier.pkl - Best classification model
  • notebook.ipynb - Full analysis notebook

🔍 Reflections & Lessons Learned

  • Biggest Challenge: The raw data was in long format (1.88M rows), with one row per runner per checkpoint. Converting it to wide format (one row per runner) was the most complex data engineering step.
  • Data Leakage: Early model results were suspiciously perfect because I had accidentally included end-of-race data as features. Removing those columns brought the model back to realistic performance.
  • Surprising Finding: Logistic Regression, the simplest classification model, outperformed Random Forest and XGBoost with 99% accuracy. Sometimes the simplest solution wins.
  • Key Insight: first_half_speed alone accounts for about 49% of the tuned Random Forest's feature importance. Pacing strategy in the early kilometers is everything in marathon running.
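
One way to guard against the leakage described above is to filter split columns by checkpoint distance before training. The sketch below assumes a half-marathon cutoff and hypothetical column names (only `split_HALF` appears in this README).

```python
# Checkpoint distances in km. Column names are hypothetical except split_HALF.
CHECKPOINT_KM = {
    "split_5K": 5.0,
    "split_10K": 10.0,
    "split_HALF": 21.0975,
    "split_30K": 30.0,
    "split_40K": 40.0,
    "split_FINISH": 42.195,
}


def leak_free_features(columns, cutoff_km=21.0975):
    """Keep split columns recorded at or before cutoff_km.

    Columns that are not known checkpoints (e.g. age) pass through unchanged,
    so derived features built from late splits must be checked separately.
    """
    kept = []
    for col in columns:
        km = CHECKPOINT_KM.get(col)
        if km is None or km <= cutoff_km:
            kept.append(col)
    return kept


features = leak_free_features(
    ["age", "split_5K", "split_HALF", "split_30K", "split_FINISH"]
)
```

Here `features` keeps `age`, `split_5K`, and `split_HALF`, and drops the two later checkpoints.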

🛠 Libraries Used

pandas, numpy, scikit-learn, xgboost, matplotlib, seaborn, plotly, datasets
