NYC 2025 Marathon - Finish Time Predictor
Presentation Video
Click here to watch the presentation
Project Overview
This project analyzes the NYC 2025 Marathon dataset from HuggingFace, containing 1.88 million rows of split times for every runner at every kilometer checkpoint.
Goal: Predict a runner's total finish time based on early race performance and demographic features.
Dataset: NYC 2025 Marathon Splits
Key Results
| Model | MAE | R² |
|---|---|---|
| Baseline Linear Regression | 10.13 min | 0.946 |
| Linear Regression (Engineered) | 2.52 min | 0.996 |
| Random Forest | 1.37 min | 0.997 |
| XGBoost | 1.82 min | 0.997 |
| Random Forest (Tuned) | 0.03 min | 0.977 |
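
For readers unfamiliar with the two metrics in the table, here is a minimal sketch of how MAE and R² are computed. It uses plain Python and made-up finish times, not the project's data; the notebook presumably uses the scikit-learn equivalents.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up finish times in minutes, for illustration only.
actual = [200.0, 240.0, 280.0, 320.0]
predicted = [202.0, 238.0, 281.0, 317.0]

mae = mean_absolute_error(actual, predicted)
r2 = r_squared(actual, predicted)
```

MAE is expressed in the same unit as the target (minutes here), while R² is unitless, so the two columns answer different questions about the same predictions.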
Classification Results (3 Classes: Fast / Average / Slow)
| Model | Accuracy |
|---|---|
| Logistic Regression | 99% |
| Random Forest | 98% |
| XGBoost | 98% |
Methodology
1. Data Preparation
- Sampled 12,000 unique runners (random_state=42)
- Converted all time columns from H:MM:SS to seconds
- Removed outliers (finish times under 2 hours or over 12 hours)
- Pivoted from long format to wide format (one row per runner)
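
The preparation steps above can be sketched in plain Python as follows. The row layout, runner IDs, and checkpoint names are hypothetical, and the sketch pivots before filtering for brevity; the notebook itself may do this with pandas and in a different order.

```python
from collections import defaultdict


def hms_to_seconds(text):
    """Convert an 'H:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Hypothetical long-format rows: (runner_id, checkpoint, elapsed time).
long_rows = [
    (1, "5K", "0:25:00"), (1, "HALF", "1:45:30"), (1, "FINISH", "3:35:10"),
    (2, "5K", "0:10:00"), (2, "HALF", "0:45:00"), (2, "FINISH", "1:40:00"),
]

# Pivot long -> wide: one dict per runner, one key per checkpoint.
wide = defaultdict(dict)
for runner_id, checkpoint, elapsed in long_rows:
    wide[runner_id][f"split_{checkpoint}"] = hms_to_seconds(elapsed)

# Drop implausible finish times (under 2 hours or over 12 hours).
MIN_FINISH, MAX_FINISH = 2 * 3600, 12 * 3600
clean = {
    runner_id: splits
    for runner_id, splits in wide.items()
    if MIN_FINISH <= splits["split_FINISH"] <= MAX_FINISH
}
```

In this toy input, runner 2 finishes in 1:40:00 and is removed by the outlier filter, leaving a single wide row for runner 1.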
2. EDA & Research Questions
- Q1: Do different countries use different pacing strategies?
- Q2: Do older runners start more conservatively?
- Q3: Which countries have the fastest runners?
3. Feature Engineering
- One-Hot Encoding for Gender
- Pacing ratio features (5K pace / overall pace)
- First half speed feature
- K-Means Clustering (4 clusters), with cluster_id added as a new feature
- Distance to centroid feature
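
A rough sketch of how the pacing and centroid features might be derived is shown below. The function names and exact formulas are assumptions: in particular, "overall pace" is taken here as the pace over the first half so that the finish time is never used, and the centroid is assumed to come from an already fitted K-Means model.

```python
import math

HALF_MARATHON_KM = 21.0975


def pacing_features(split_5k_s, split_half_s):
    """Early-race features from the 5K and half-marathon splits (in seconds)."""
    pace_5k = split_5k_s / 5.0                    # seconds per km, first 5K
    pace_half = split_half_s / HALF_MARATHON_KM   # seconds per km, first half
    return {
        # Below 1.0 means the first 5K was run faster than the first-half average.
        "pacing_ratio": pace_5k / pace_half,
        # Average speed over the first half, in km/h.
        "first_half_speed": HALF_MARATHON_KM / (split_half_s / 3600),
    }


def distance_to_centroid(point, centroid):
    """Euclidean distance between a runner's feature vector and a cluster centroid."""
    return math.dist(point, centroid)


features = pacing_features(split_5k_s=1500, split_half_s=6330)
```

In practice the feature vectors would be scaled before clustering, since raw splits in seconds would otherwise dominate the distance.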
4. Models Trained
Regression: Linear Regression, Random Forest, XGBoost + Hyperparameter Tuning
Classification: Logistic Regression, Random Forest, XGBoost
Key Findings
- first_half_speed is the most important feature (48.9% importance)
- split_HALF is the second most important feature (36.8%)
- Runners who start too fast in the first 5K tend to finish slower
- Brazil and Great Britain produce the fastest recreational runners
- Peak marathon performance age is 25-34
Bonus Work
- Hyperparameter tuning: 97.5% MAE improvement
- Interactive 3D World Map (Plotly)
- Animated Racing Bar Chart
- Interactive Scatter Plot with hover details
- Business & Domain Insights for marathon coaches
Visualizations
EDA - Research Questions
Pacing strategies by country, age vs pacing, and fastest countries
Runner Clusters (K-Means)
4 distinct runner groups identified: Fast, Mid-Pack Younger, Mid-Pack Older, and Slow Runners
Feature Importance - Tuned Random Forest
first_half_speed (48.9%) and split_HALF (36.8%) are the strongest predictors
Model Comparison - MAE
Among the untuned models, Random Forest achieved the lowest MAE of 1.37 minutes, an 87% improvement over the baseline
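
The 87% figure follows from the MAE values in the Key Results table. A small check, with a helper name chosen here for illustration:

```python
def relative_improvement(baseline_mae, model_mae):
    """Percentage reduction in MAE relative to the baseline."""
    return (baseline_mae - model_mae) / baseline_mae * 100


# Baseline Linear Regression (10.13 min) vs. Random Forest (1.37 min).
rf_gain = relative_improvement(10.13, 1.37)
print(f"{rf_gain:.1f}%")  # 86.5%, which rounds to the 87% quoted above
```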
Confusion Matrices - Classification
Logistic Regression achieved 99% accuracy across all 3 classes
Class Distribution
Perfectly balanced classes: Fast / Average / Slow (33.3% each)
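
Classes that come out at exactly one third each suggest the labels were cut at the terciles of finish time. That is an assumption, not something stated here; under it, the labelling could look roughly like this (in pandas, `qcut` with three bins would do the same job):

```python
def tercile_labels(finish_times):
    """Label each runner Fast / Average / Slow by rank, one third per class."""
    names = ["Fast", "Average", "Slow"]
    n = len(finish_times)
    # Indices of runners ordered from fastest to slowest.
    order = sorted(range(n), key=lambda i: finish_times[i])
    labels = [None] * n
    for rank, idx in enumerate(order):
        labels[idx] = names[rank * 3 // n]
    return labels


# Made-up finish times in minutes.
times = [250, 190, 310, 220, 280, 205]
labels = tercile_labels(times)
```

Rank-based cuts guarantee balanced classes, but they also mean the class boundaries depend on the sample rather than on fixed time thresholds.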
Repository Contents
- random_forest_model.pkl - Best regression model
- logistic_regression_classifier.pkl - Best classification model
- notebook.ipynb - Full analysis notebook
Reflections & Lessons Learned
- Biggest Challenge: The raw data was in long format (1.88M rows), with one row per runner per checkpoint. Converting it to wide format (one row per runner) was the most complex data engineering step.
- Data Leakage: Early model results were suspiciously perfect. The cause: I had accidentally included end-of-race data as features. Removing those columns brought the model back to realistic performance.
- Surprising Finding: Logistic Regression, the simplest classification model, outperformed Random Forest and XGBoost with 99% accuracy. Sometimes the simplest solution wins.
- Key Insight: first_half_speed alone accounts for about 49% of the tuned Random Forest's feature importance. Pacing strategy in the early kilometers matters enormously in marathon running.
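
The data leakage lesson above can be turned into a simple guard: keep only splits recorded up to a chosen cutoff and drop everything later, including the target. The column names and the half-marathon cutoff below are hypothetical.

```python
# Checkpoints a prediction is allowed to see (assumption: up to the half).
ALLOWED_CHECKPOINTS = {"5K", "10K", "15K", "20K", "HALF"}


def drop_leaky_columns(columns, target="split_FINISH"):
    """Keep non-split columns and splits at or before the cutoff; drop the rest."""
    kept = []
    for name in columns:
        if name == target:
            continue  # the target itself must never be a feature
        if name.startswith("split_") and name[len("split_"):] not in ALLOWED_CHECKPOINTS:
            continue  # a late-race split would leak the answer
        kept.append(name)
    return kept


columns = ["age", "split_5K", "split_HALF", "split_35K", "split_40K", "split_FINISH"]
safe = drop_leaky_columns(columns)
```

An allow-list is used rather than a block-list so that any new late-race column is excluded by default.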
Libraries Used
pandas, numpy, scikit-learn, xgboost, matplotlib, seaborn, plotly, datasets