TB Vulnerability Hotspot Predictor - India
Model Description
An industry-grade hybrid ML pipeline for predicting tuberculosis (TB) vulnerability hotspots across India at district level. Combines gradient-boosted decision trees with transformer-based contextual embeddings for high-resolution vulnerability mapping.
Architecture
Input (153 features) β Quantile Transform
βββ FT-Transformer (d=64, L=2, h=4) β [CLS] embedding (64-dim)
βββ Quantile-transformed features
β
Concatenated features (153 + 64 = 217)
β
βββ XGBoost (500 trees, depth=5)
βββ LightGBM (500 trees, depth=5)
β
Ensemble (0.5 Γ XGB + 0.5 Γ LGB)
β
Probabilistic Calibration (Platt + Isotonic + Temperature)
β
Calibrated Risk Probabilities
Performance
Classification (Hotspot Detection)
| Metric |
Test Set |
3-Fold CV |
| AUC-ROC |
0.926 |
0.984 |
| Accuracy |
0.911 |
0.930 |
| F1 (macro) |
0.850 |
0.905 |
| Precision |
0.870 |
- |
| Recall |
0.667 |
- |
| ECE |
0.105 |
- |
| Brier Score |
0.074 |
- |
Regression (TB Notification Rate)
| Metric |
Value |
| RMSE |
9.87 per 100K |
| RΒ² |
0.915 |
| MAE |
6.39 per 100K |
Ablation Study Results
Spatial Encoding Impact (AUC-ROC)
| Method |
AUC |
Ξ vs No Spatial |
| No spatial |
0.971 |
baseline |
| Spatial lag |
0.981 |
+0.010 |
| Fourier encoding |
0.982 |
+0.011 |
| Graph proximity |
0.979 |
+0.008 |
| Clustering |
0.978 |
+0.007 |
| All spatial |
0.984 |
+0.013 |
Feature Group Importance (Single Group AUC)
| Group |
AUC (alone) |
| Environmental |
0.786 |
| Demographic |
0.707 |
| Socioeconomic |
0.562 |
| Nutritional |
0.568 |
| Healthcare |
0.558 |
Calibration Methods
| Method |
ECE β |
Brier β |
| Uncalibrated |
0.038 |
0.048 |
| Platt scaling |
0.029 |
0.048 |
| Isotonic |
0.034 |
0.047 |
Visualizations
The model repository includes 10 publication-quality visualizations:
- Vulnerability Map - District-level TB vulnerability across India
- Hotspot/Coldspot Map - Getis-Ord Gi* classification
- Feature Importance - Top 20 predictive features
- ROC Curve - Classification performance
- Calibration Curve - Reliability diagram
- Ablation Heatmap - Spatial encoding & feature group comparisons
- Confusion Matrix - Classification outcomes
- Regional TB Rates - Distribution by Indian region
- Correlation Matrix - Key TB risk factor correlations
- State Vulnerability - Distribution across top 15 states
Key Findings
- Spatial features are critical: +1.3% AUC from spatial lag, Fourier encoding, graph proximity, and geographic clustering
- Environmental features (PM2.5, temperature, rainfall, altitude) are the strongest individual predictors
- Platt scaling achieves best calibration (ECE: 0.029) while preserving AUC
- Transformer embeddings capture cross-feature interactions that trees miss
- Multi-scale clustering (K=5,10,20,50 + DBSCAN) outperforms any single granularity
Technical Stack
- Models: XGBoost, LightGBM, FT-Transformer (PyTorch)
- Spatial: Getis-Ord Gi*, Spatial Lag, Fourier PE, K-hop Graph Aggregation
- Calibration: Platt Scaling, Isotonic Regression, Temperature Scaling
- Data: 729 Indian districts Γ 153 features
Usage
git clone https://huggingface.co/t6harsh/tb-vulnerability-hotspot-predictor
cd tb-vulnerability-hotspot-predictor/src
python main_pipeline.py
Files
βββ src/
β βββ main_pipeline.py # Main orchestrator
β βββ data_generation.py # Synthetic data based on real distributions
β βββ spatial_features.py # All spatial feature engineering
β βββ ft_transformer.py # FT-Transformer implementation
β βββ hybrid_model.py # Hybrid GBDT + Transformer model
β βββ calibration.py # Platt/Isotonic/Temperature calibration
β βββ ablation_study.py # Comprehensive ablation framework
βββ vulnerability_map.png
βββ hotspot_coldspot_map.png
βββ feature_importance.png
βββ roc_curve.png
βββ calibration_curve.png
βββ ablation_heatmap.png
βββ confusion_matrix.png
βββ regional_tb_rates.png
βββ correlation_matrix.png
βββ state_vulnerability.png
βββ metrics.json
βββ ablation_results.json
βββ REPORT.md
License
Apache 2.0
Acknowledgments
Based on published distributions from NFHS-5, Census 2011, WHO GTB, NIKSHAY, and methodology from:
- Gorishniy et al., "Revisiting Deep Learning Models for Tabular Data" (FT-Transformer)
- Guo et al., "On Calibration of Modern Neural Networks" (Calibration)
- Spatial analysis informed by PySAL methodology