TB Vulnerability Hotspot Predictor - India

Model Description

A hybrid ML pipeline that predicts tuberculosis (TB) vulnerability hotspots across India at district level. It combines gradient-boosted decision trees (XGBoost, LightGBM) with FT-Transformer contextual embeddings and outputs calibrated risk probabilities for each district.

Architecture

Input (153 features) → Quantile Transform
    ├── FT-Transformer (d=64, L=2, h=4) → [CLS] embedding (64-dim)
    └── Quantile-transformed features
         ↓
    Concatenated features (153 + 64 = 217)
         ↓
    ├── XGBoost (500 trees, depth=5)
    └── LightGBM (500 trees, depth=5)
         ↓
    Ensemble (0.5 × XGB + 0.5 × LGB)
         ↓
    Probabilistic Calibration (Platt + Isotonic + Temperature)
         ↓
    Calibrated Risk Probabilities
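
The concatenation and ensemble stages of the diagram reduce to simple list operations. The sketch below illustrates them in plain Python. The function names and toy probabilities are illustrative assumptions and are not taken from `hybrid_model.py`.

```python
def concat_features(tabular_row, cls_embedding):
    """Append the [CLS] embedding to the quantile-transformed features."""
    return list(tabular_row) + list(cls_embedding)


def ensemble_proba(p_xgb, p_lgb, w_xgb=0.5):
    """Weighted average of the two boosters' positive-class probabilities."""
    w_lgb = 1.0 - w_xgb
    return [w_xgb * a + w_lgb * b for a, b in zip(p_xgb, p_lgb)]


# Toy input: one district with 153 features and a 64-dim embedding.
row = concat_features([0.0] * 153, [0.0] * 64)
print(len(row))  # 217

# Hypothetical per-district probabilities from each booster.
blended = ensemble_proba([0.9, 0.2, 0.6], [0.7, 0.4, 0.6])
print(blended)
```

With equal weights, the blended score is the mean of the two model outputs, so a district is only ranked high when both boosters agree.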

Performance

Classification (Hotspot Detection)

Metric         Test Set    3-Fold CV
AUC-ROC        0.926       0.984
Accuracy       0.911       0.930
F1 (macro)     0.850       0.905
Precision      0.870       -
Recall         0.667       -
ECE            0.105       -
Brier Score    0.074       -
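
For readers unfamiliar with the two calibration metrics in this table, the following is a minimal sketch of how ECE and the Brier score are commonly computed for binary predictions. The bin count (10), equal-width binning, and the toy labels are assumptions; the repository's own implementation may differ.

```python
def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def expected_calibration_error(y_true, p_pred, n_bins=10):
    """Bin by predicted probability, then average |observed rate - mean
    prediction| over bins, weighted by the share of samples in each bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_pred):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += len(members) / n * abs(observed - predicted)
    return ece


# Toy labels (1 = hotspot) and predicted probabilities.
y = [1, 0, 1, 1, 0]
p = [0.9, 0.1, 0.8, 0.6, 0.4]
brier = brier_score(y, p)
ece = expected_calibration_error(y, p)
print(round(brier, 3))  # 0.076
print(round(ece, 3))    # 0.24
```

Lower is better for both metrics. ECE depends on the binning scheme, so values are only comparable when computed the same way.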

Regression (TB Notification Rate)

Metric    Value
RMSE      9.87 per 100K
R²        0.915
MAE       6.39 per 100K
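
The three regression metrics can be reproduced from predictions with a few lines of Python. This is a sketch using the standard definitions; the notification rates below are made-up toy values, not outputs of this model.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R² for paired true and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Toy TB notification rates per 100K (illustrative only).
observed = [100, 150, 200, 250]
predicted = [110, 140, 205, 245]
metrics = regression_metrics(observed, predicted)
print({k: round(v, 2) for k, v in metrics.items()})
# {'rmse': 7.91, 'mae': 7.5, 'r2': 0.98}
```

RMSE and MAE are in the same units as the target (cases per 100K), while R² is unitless.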

Ablation Study Results

Spatial Encoding Impact (AUC-ROC)

Method              AUC      Δ vs No Spatial
No spatial          0.971    baseline
Spatial lag         0.981    +0.010
Fourier encoding    0.982    +0.011
Graph proximity     0.979    +0.008
Clustering          0.978    +0.007
All spatial         0.984    +0.013
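
To make two of the encodings in this table concrete, here is a minimal sketch of a spatial lag feature and a Fourier encoding of coordinates. The neighbor lists, the number of frequencies, the doubling frequency schedule, and the zero fallback for isolated districts are all assumptions; `spatial_features.py` may use different choices.

```python
import math


def spatial_lag(values, neighbors):
    """Mean of each district's neighbors' values (row-standardized weights).
    Districts with no neighbors get 0.0 here, which is an arbitrary choice."""
    return [
        sum(values[j] for j in nbrs) / len(nbrs) if nbrs else 0.0
        for nbrs in neighbors
    ]


def fourier_encode(lat, lon, n_freqs=4):
    """sin/cos of latitude and longitude at doubling frequencies,
    giving 4 * n_freqs features per district."""
    feats = []
    for k in range(n_freqs):
        freq = 2 ** k
        for angle in (math.radians(lat), math.radians(lon)):
            feats.append(math.sin(freq * angle))
            feats.append(math.cos(freq * angle))
    return feats


# Four toy districts arranged in a chain: 0 - 1 - 2 - 3.
rates = [10, 20, 30, 40]
adjacency = [[1], [0, 2], [1, 3], [2]]
lag = spatial_lag(rates, adjacency)
print(lag)  # [20.0, 20.0, 30.0, 30.0]

encoded = fourier_encode(28.61, 77.21)  # approximate coordinates of New Delhi
print(len(encoded))  # 16
```

The spatial lag tells the model what is happening in adjacent districts, while the Fourier features give it a smooth, multi-scale representation of location.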

Feature Group Importance (Single Group AUC)

Group            AUC (alone)
Environmental    0.786
Demographic      0.707
Socioeconomic    0.562
Nutritional      0.568
Healthcare       0.558

Calibration Methods

Method           ECE ↓    Brier ↓
Uncalibrated     0.038    0.048
Platt scaling    0.029    0.048
Isotonic         0.034    0.047
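
As an illustration of the Platt scaling row, the sketch below fits a two-parameter sigmoid on top of uncalibrated probabilities. It is a simplified version under stated assumptions: the input scores are the logits of the uncalibrated probabilities, the fit uses plain gradient descent rather than Platt's original solver with target smoothing, and the learning rate, iteration count, and toy data are arbitrary. It is not the code in `calibration.py`.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def _logit(p, eps=1e-6):
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return math.log(p / (1.0 - p))


def fit_platt(p_uncal, y_true, lr=0.1, n_iter=2000):
    """Fit a, b so that sigmoid(a * logit(p) + b) minimizes log-loss."""
    scores = [_logit(p) for p in p_uncal]
    a, b = 1.0, 0.0  # start from the identity mapping
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, y_true):
            err = _sigmoid(a * s + b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_platt(p_uncal, a, b):
    return [_sigmoid(a * _logit(p) + b) for p in p_uncal]


# Toy held-out set from an overconfident model (illustrative only).
p_raw = [0.95, 0.9, 0.85, 0.8, 0.2, 0.15, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
a, b = fit_platt(p_raw, labels)
calibrated = apply_platt(p_raw, a, b)
print(round(a, 3), round(b, 3))
print([round(p, 3) for p in calibrated])
```

Because the fitted slope `a` is positive, the mapping is strictly increasing, so the ranking of districts (and therefore AUC) is unchanged. Only the probability values move, here toward 0.5 because the toy model is overconfident.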

Visualizations

The model repository includes 10 publication-quality visualizations:

  1. Vulnerability Map - District-level TB vulnerability across India
  2. Hotspot/Coldspot Map - Getis-Ord Gi* classification
  3. Feature Importance - Top 20 predictive features
  4. ROC Curve - Classification performance
  5. Calibration Curve - Reliability diagram
  6. Ablation Heatmap - Spatial encoding & feature group comparisons
  7. Confusion Matrix - Classification outcomes
  8. Regional TB Rates - Distribution by Indian region
  9. Correlation Matrix - Key TB risk factor correlations
  10. State Vulnerability - Distribution across top 15 states

Key Findings

  1. Spatial features help: combining spatial lag, Fourier encoding, graph proximity, and geographic clustering raises AUC by 0.013 (0.971 to 0.984)
  2. Environmental features (PM2.5, temperature, rainfall, altitude) are the strongest single feature group (AUC 0.786 when used alone)
  3. Platt scaling achieves the lowest ECE (0.029, down from 0.038 uncalibrated) while preserving AUC
  4. Transformer embeddings capture cross-feature interactions that trees miss
  5. Multi-scale clustering (K=5,10,20,50 + DBSCAN) outperforms any single granularity

Technical Stack

  • Models: XGBoost, LightGBM, FT-Transformer (PyTorch)
  • Spatial: Getis-Ord Gi*, Spatial Lag, Fourier PE, K-hop Graph Aggregation
  • Calibration: Platt Scaling, Isotonic Regression, Temperature Scaling
  • Data: 729 Indian districts × 153 features
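
The Getis-Ord Gi* statistic listed under "Spatial" assigns each district a z-score that compares its neighborhood (including itself) with the national mean. The following is a minimal sketch with binary contiguity weights. The neighbor lists, toy values, and the 1.96 threshold (a two-sided 95% level) are assumptions, and a production analysis would more likely use a library such as PySAL.

```python
import math


def getis_ord_gi_star(values, neighbors):
    """Gi* z-scores with binary weights; each district's window includes itself."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum(v * v for v in values) / n - mean ** 2)
    z_scores = []
    for i, nbrs in enumerate(neighbors):
        window = set(nbrs) | {i}
        w_sum = len(window)  # binary weights, so sum(w) == sum(w**2)
        numerator = sum(values[j] for j in window) - mean * w_sum
        denominator = std * math.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        # Undefined when values are constant or the window covers everything.
        z_scores.append(numerator / denominator if denominator > 0 else 0.0)
    return z_scores


def classify(z, threshold=1.96):
    if z > threshold:
        return "hotspot"
    if z < -threshold:
        return "coldspot"
    return "not significant"


# Six toy districts in a chain: high rates on one end, low on the other.
rates = [80, 90, 85, 20, 15, 10]
adjacency = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
z = getis_ord_gi_star(rates, adjacency)
print([classify(v) for v in z])
```

In this toy chain, district 1 sits in a cluster of high values and district 4 in a cluster of low values, so they are the ones flagged as hotspot and coldspot respectively.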

Usage

# Clone and run the full pipeline
git clone https://huggingface.co/t6harsh/tb-vulnerability-hotspot-predictor
cd tb-vulnerability-hotspot-predictor/src
python main_pipeline.py

Files

├── src/
│   ├── main_pipeline.py          # Main orchestrator
│   ├── data_generation.py        # Synthetic data based on real distributions
│   ├── spatial_features.py       # All spatial feature engineering
│   ├── ft_transformer.py         # FT-Transformer implementation
│   ├── hybrid_model.py           # Hybrid GBDT + Transformer model
│   ├── calibration.py            # Platt/Isotonic/Temperature calibration
│   └── ablation_study.py         # Comprehensive ablation framework
├── vulnerability_map.png
├── hotspot_coldspot_map.png
├── feature_importance.png
├── roc_curve.png
├── calibration_curve.png
├── ablation_heatmap.png
├── confusion_matrix.png
├── regional_tb_rates.png
├── correlation_matrix.png
├── state_vulnerability.png
├── metrics.json
├── ablation_results.json
└── REPORT.md

License

Apache 2.0

Acknowledgments

The synthetic data is based on published distributions from NFHS-5, Census 2011, WHO GTB, and NIKSHAY. The methodology draws on:

  • Gorishniy et al., "Revisiting Deep Learning Models for Tabular Data" (FT-Transformer)
  • Guo et al., "On Calibration of Modern Neural Networks" (Calibration)
  • Spatial analysis informed by PySAL methodology