TB Vulnerability Hotspot Predictor - India

Model Description

A hybrid ML pipeline for predicting district-level tuberculosis (TB) vulnerability hotspots across India. It combines gradient-boosted decision trees (XGBoost, LightGBM) with FT-Transformer contextual embeddings and outputs calibrated risk probabilities for each district. The data is synthetic, generated from published distributions (see Files and Acknowledgments).

Architecture

Input (153 features) → Quantile Transform
    ├── FT-Transformer (d=64, L=2, h=4) → [CLS] embedding (64-dim)
    └── Quantile-transformed features
         ↓
    Concatenated features (153 + 64 = 217)
         ↓
    ├── XGBoost (500 trees, depth=5)
    └── LightGBM (500 trees, depth=5)
         ↓
    Ensemble (0.5 × XGB + 0.5 × LGB)
         ↓
    Probabilistic Calibration (Platt + Isotonic + Temperature)
         ↓
    Calibrated Risk Probabilities
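
The data flow in the diagram can be sketched with NumPy alone. This is only an illustration of the shapes and the equal-weight blend, not the code in hybrid_model.py: the function names are hypothetical, the rank-based quantile transform is a stand-in for whatever transformer the pipeline fits, and the embedding and model probabilities are random placeholders.

```python
import numpy as np

N_FEATURES = 153  # input width from the diagram
EMBED_DIM = 64    # [CLS] embedding width from the diagram

def quantile_transform(X):
    """Map each column to (0, 1) by rank (stand-in for a fitted quantile transformer)."""
    ranks = X.argsort(axis=0).argsort(axis=0)
    return (ranks + 0.5) / X.shape[0]

def build_gbdt_input(X_q, cls_embedding):
    """Concatenate transformed features with the transformer [CLS] embedding."""
    return np.concatenate([X_q, cls_embedding], axis=1)

def blend(p_xgb, p_lgb, w_xgb=0.5):
    """Weighted average of the two boosted models' probabilities (0.5 / 0.5 by default)."""
    return w_xgb * p_xgb + (1.0 - w_xgb) * p_lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(729, N_FEATURES))    # placeholder district features
X_q = quantile_transform(X)
cls = rng.normal(size=(729, EMBED_DIM))   # placeholder for the FT-Transformer output
Z = build_gbdt_input(X_q, cls)
print(Z.shape)  # (729, 217)

# Placeholder probabilities standing in for XGBoost and LightGBM predictions.
p_ensemble = blend(rng.uniform(size=729), rng.uniform(size=729))
```

The blended probabilities would then go to the calibration step.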

Performance

Classification (Hotspot Detection)

Metric	Test Set	3-Fold CV
AUC-ROC	0.926	0.984
Accuracy	0.911	0.930
F1 (macro)	0.850	0.905
Precision	0.870	-
Recall	0.667	-
ECE	0.105	-
Brier Score	0.074	-
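
For readers unfamiliar with the two calibration metrics in the table, here is a minimal sketch of how ECE and the Brier score are commonly computed for a binary classifier. The binning scheme (10 equal-width bins on the positive-class probability) is an assumption; the card does not state which variant produced the reported values.

```python
import numpy as np

def brier_score(y_true, p):
    """Mean squared difference between predicted probability and the 0/1 label."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.mean((p - y_true) ** 2))

def expected_calibration_error(y_true, p, n_bins=10):
    """Sample-weighted mean gap between observed positive rate and mean predicted probability."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.asarray(p, dtype=float)
    # Equal-width bins on [0, 1]; p == 1.0 is placed in the last bin.
    idx = np.minimum((p * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(y_true[mask].mean() - p[mask].mean())
            ece += mask.mean() * gap
    return float(ece)
```

Lower is better for both metrics. ECE depends on the number of bins, so values are only comparable when the binning matches.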

Regression (TB Notification Rate)

Metric	Value
RMSE	9.87 per 100K
R²	0.915
MAE	6.39 per 100K
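
The three regression metrics follow their standard definitions. The sketch below assumes targets and predictions are notification rates per 100K population; the function name is hypothetical and the numbers in the usage example are made up.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R² for predicted vs. observed notification rates."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return {"rmse": rmse, "mae": mae, "r2": r2}

# Illustrative values only, not taken from the model's outputs.
m = regression_metrics([100, 150, 200], [110, 150, 190])
```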

Ablation Study Results

Spatial Encoding Impact (AUC-ROC)

Method	AUC	Δ vs No Spatial
No spatial	0.971	baseline
Spatial lag	0.981	+0.010
Fourier encoding	0.982	+0.011
Graph proximity	0.979	+0.008
Clustering	0.978	+0.007
All spatial	0.984	+0.013
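
Two of the encodings in the table, spatial lag and Fourier encoding, can be sketched as follows. This is one common formulation and not necessarily what spatial_features.py does: the row-standardised weight matrix, the choice of frequencies, and both function names are assumptions.

```python
import numpy as np

def spatial_lag(values, W):
    """Row-standardised spatial lag: the weighted mean of each district's neighbours."""
    W = np.asarray(W, dtype=float)
    row_sums = W.sum(axis=1, keepdims=True)
    # Districts with no neighbours get a lag of 0 instead of a division by zero.
    W_std = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)
    return W_std @ np.asarray(values, dtype=float)

def fourier_encode(lat, lon, n_freqs=4):
    """sin/cos features of latitude and longitude at frequencies 1, 2, 4, ..."""
    coords = np.radians(np.stack([lat, lon], axis=1))   # shape (n, 2)
    freqs = 2.0 ** np.arange(n_freqs)
    angles = coords[:, :, None] * freqs                 # shape (n, 2, n_freqs)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=2)
    return feats.reshape(len(coords), -1)               # shape (n, 4 * n_freqs)

# Three districts: district 0 borders districts 1 and 2.
W = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
lag = spatial_lag([10.0, 20.0, 30.0], W)                # [25., 10., 10.]
enc = fourier_encode(np.array([28.6, 19.1]), np.array([77.2, 72.9]))
```

The lag gives each district a summary of its neighbours' values, while the Fourier features let tree models split on smooth functions of location at several scales.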

Feature Group Importance (Single Group AUC)

Group	AUC (alone)
Environmental	0.786
Demographic	0.707
Socioeconomic	0.562
Nutritional	0.568
Healthcare	0.558

Calibration Methods

Method	ECE ↓	Brier ↓
Uncalibrated	0.038	0.048
Platt scaling	0.029	0.048
Isotonic	0.034	0.047
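
The technical stack also lists temperature scaling, which does not appear in this table. As a minimal sketch of the idea for a binary classifier: divide the logits by a single scalar T chosen to minimise negative log-likelihood on held-out data. The grid search and function names below are assumptions made for simplicity; calibration.py may fit T differently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(y, p, eps=1e-12):
    """Binary negative log-likelihood (log loss)."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def fit_temperature(logits, y, candidates=None):
    """Pick the temperature T that minimises NLL of sigmoid(logits / T)."""
    if candidates is None:
        # T = 1.0 is always a candidate, so the result is never worse than no scaling.
        candidates = np.append(np.linspace(0.25, 5.0, 96), 1.0)
    losses = [nll(y, sigmoid(logits / t)) for t in candidates]
    return float(candidates[int(np.argmin(losses))])

# Synthetic over-confident model: the true logits are z, the model reports 3 * z.
rng = np.random.default_rng(0)
z = rng.normal(size=2000)
y = (rng.uniform(size=2000) < sigmoid(z)).astype(float)
logits = 3.0 * z
T = fit_temperature(logits, y)
p_calibrated = sigmoid(logits / T)
```

Because dividing by a positive T is monotonic, the ranking of districts and therefore AUC-ROC is unchanged. Only the confidence of the probabilities shifts.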

Visualizations

The model repository includes 10 publication-quality visualizations:

Vulnerability Map - District-level TB vulnerability across India
Hotspot/Coldspot Map - Getis-Ord Gi* classification
Feature Importance - Top 20 predictive features
ROC Curve - Classification performance
Calibration Curve - Reliability diagram
Ablation Heatmap - Spatial encoding & feature group comparisons
Confusion Matrix - Classification outcomes
Regional TB Rates - Distribution by Indian region
Correlation Matrix - Key TB risk factor correlations
State Vulnerability - Distribution across top 15 states

Key Findings

Spatial features improve discrimination: combining spatial lag, Fourier encoding, graph proximity, and geographic clustering raises AUC-ROC by 0.013 (0.971 → 0.984) over the no-spatial baseline
Environmental features (PM2.5, temperature, rainfall, altitude) are the strongest single feature group (AUC 0.786 when used alone)
Platt scaling achieves the best calibration in the ablation (ECE 0.029 vs. 0.038 uncalibrated) while preserving AUC
Transformer embeddings capture cross-feature interactions that trees miss
Multi-scale clustering (K=5,10,20,50 + DBSCAN) outperforms any single granularity

Technical Stack

Models: XGBoost, LightGBM, FT-Transformer (PyTorch)
Spatial: Getis-Ord Gi*, Spatial Lag, Fourier PE, K-hop Graph Aggregation
Calibration: Platt Scaling, Isotonic Regression, Temperature Scaling
Data: 729 Indian districts × 153 features
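
For reference, the Getis-Ord Gi* statistic behind the hotspot/coldspot map can be computed as below. This sketch assumes binary contiguity weights and a hypothetical function name; the repository's own implementation may rely on PySAL or use different weights.

```python
import numpy as np

def getis_ord_gi_star(x, W):
    """Gi* z-score per district; positive values suggest hotspots, negative coldspots."""
    x = np.asarray(x, dtype=float)
    W = np.asarray(W, dtype=float).copy()
    np.fill_diagonal(W, 1.0)              # Gi* includes the focal district itself
    n = x.size
    x_bar = x.mean()
    s = np.sqrt((x ** 2).mean() - x_bar ** 2)
    w_sum = W.sum(axis=1)
    w_sq_sum = (W ** 2).sum(axis=1)
    numerator = W @ x - x_bar * w_sum
    denominator = s * np.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
    return numerator / denominator

# Six districts in a chain; the high rates sit at one end.
rates = np.array([10.0, 10.0, 10.0, 10.0, 50.0, 60.0])
W = np.zeros((6, 6))
for i in range(5):
    W[i, i + 1] = W[i + 1, i] = 1.0
z = getis_ord_gi_star(rates, W)
```

The statistic is undefined when all values are equal (zero variance) or when a district neighbours every other district, so real data needs a guard for those cases.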

Usage

# Clone and run the full pipeline
git clone https://huggingface.co/t6harsh/tb-vulnerability-hotspot-predictor
cd tb-vulnerability-hotspot-predictor/src
python main_pipeline.py

Files

├── src/
│   ├── main_pipeline.py          # Main orchestrator
│   ├── data_generation.py        # Synthetic data based on real distributions
│   ├── spatial_features.py       # All spatial feature engineering
│   ├── ft_transformer.py         # FT-Transformer implementation
│   ├── hybrid_model.py           # Hybrid GBDT + Transformer model
│   ├── calibration.py            # Platt/Isotonic/Temperature calibration
│   └── ablation_study.py         # Comprehensive ablation framework
├── vulnerability_map.png
├── hotspot_coldspot_map.png
├── feature_importance.png
├── roc_curve.png
├── calibration_curve.png
├── ablation_heatmap.png
├── confusion_matrix.png
├── regional_tb_rates.png
├── correlation_matrix.png
├── state_vulnerability.png
├── metrics.json
├── ablation_results.json
└── REPORT.md

License

Apache 2.0

Acknowledgments

Based on published distributions from NFHS-5, Census 2011, WHO GTB, NIKSHAY, and methodology from:

Gorishniy et al., "Revisiting Deep Learning Models for Tabular Data", NeurIPS 2021 (FT-Transformer)
Guo et al., "On Calibration of Modern Neural Networks", ICML 2017 (Calibration)
Spatial analysis informed by PySAL methodology
