Teaching an LLM to Triage Disasters 🚨

How we built a real RL environment for emergency response — and what we learned when the model hallucinated an entire rescue team

Built for the 2026 Meta & Scalar AI Hackathon, Bangalore.


🎬 Demo Video

▶️ Watch the live demo on YouTube (2 minutes, fast-forwarded) to see the agent triage 15 simultaneous disaster incidents in real time on the live command center dashboard.


It started with a question nobody was asking

What if an LLM had to make the same decisions as the person who picks up the phone during a catastrophe?

Not "write me a poem." Not "solve this math problem."

"The dam is overflowing. 300 people are on rooftops. You have one helicopter. What do you do?"

That's the problem we built for.


🏗️ Architecture

[Figure: Architecture Diagram]

The agent runs locally, sends actions to the deployed HF Space OpenEnv server, and the live dashboard updates in real time via WebSocket.

The agent is fully decoupled from the environment. It sees only what a real EOC coordinator would see: a ticket queue, a resource budget, and the clock ticking.

We built Disaster Response Coordination OpenEnv — an RL environment where an AI agent acts as an Emergency Incident Commander inside a live Emergency Operations Center.

The agent receives a queue of incident tickets. Real ones. Modeled after:

  • 🌊 2018 Kerala Floods — 483 dead, the largest evacuation since Indian Independence. Dam spillway overflow. Communication blackouts. We recreated the exact decision tree EOC coordinators faced.
  • ☠️ 2020 Vizag LG Polymers Gas Leak — 11 dead, 1000+ hospitalized. A toxic plume drifting over residential areas. Do you evacuate north or south? Wind direction matters.
  • 2012 North India Grid Failure — 620 million people without power. Cold-chain medicines failing in hospitals across 7 states. Which hospital gets the generator truck first?

Every ticket the agent sees is based on a real event. Every decision has real stakes baked into the reward function.

For each incident ticket, the agent must execute a precise 4-step workflow:

classify → set_priority → draft_reply → submit_ticket

Miss a step? Penalty. Wrong team? Partial credit. Right team, wrong priority? You still lose something. There is no lucky guess that beats the system.
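
As a rough illustration of how a fixed step order like this can be enforced, here is a minimal sketch. The four step names come from the workflow above; the `TicketWorkflow` class, its `invalid_actions` counter, and the `run_ticket` helper are hypothetical names for illustration and are not taken from the environment's code.

```python
from dataclasses import dataclass, field

# The four steps named in the text, in the required order.
WORKFLOW = ("classify", "set_priority", "draft_reply", "submit_ticket")


@dataclass
class TicketWorkflow:
    """Tracks how far one ticket has progressed (hypothetical helper)."""

    next_index: int = 0
    invalid_actions: int = 0  # assumption: a penalty term could read this count
    history: list = field(default_factory=list)

    @property
    def done(self) -> bool:
        # True once all four steps have been accepted.
        return self.next_index == len(WORKFLOW)

    def apply(self, action: str) -> bool:
        """Accept the action only if it is the next expected step."""
        if self.done or action != WORKFLOW[self.next_index]:
            self.invalid_actions += 1
            return False
        self.history.append(action)
        self.next_index += 1
        return True


def run_ticket(actions):
    """Feed a sequence of action names through a fresh workflow."""
    workflow = TicketWorkflow()
    for action in actions:
        workflow.apply(action)
    return workflow
```

Under this sketch, an agent that skips `set_priority` and jumps straight to `draft_reply` has that action rejected and counted, and the ticket stays open until the steps arrive in order.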


The Reward Function: Built to Be Unhackable

Most RL environments get reward-hacked in under 100 steps. We designed around that from day one.

ticket_score = 0.40 × team_routing
             + 0.30 × priority_score  
             + 0.30 × reply_quality

task_score   = avg(ticket_scores)
             - invalid_action_penalty   (max 0.15)
             - loop_detection_penalty   (max 0.10)
             - reroute_penalty          (max 0.12)
             - budget_overflow_penalty  (max 0.18)
             - time_pressure_multiplier (Hard mode: 0.75×)

5 independent signals. Dense partial rewards at every step. No sparse end-of-episode surprise. If you get the team right but fumble the priority, you learn something. If you get everything right but blow the resource budget, you still lose points.
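
The formula can be sketched in a few lines of Python. The weights and penalty caps are the ones listed above. Everything else is an assumption made for illustration: that each component score lies in [0, 1], that each penalty is clamped to its cap, that the Hard-mode 0.75× factor multiplies the score after penalties, and that the result is floored at zero. The function and dictionary names are hypothetical.

```python
from statistics import mean

# Weights and penalty caps as listed in the formula.
WEIGHTS = {"team_routing": 0.40, "priority_score": 0.30, "reply_quality": 0.30}
PENALTY_CAPS = {
    "invalid_action": 0.15,
    "loop_detection": 0.10,
    "reroute": 0.12,
    "budget_overflow": 0.18,
}
HARD_MODE_MULTIPLIER = 0.75


def ticket_score(components):
    """Weighted sum of the three per-ticket components (each assumed in [0, 1])."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def task_score(tickets, penalties, hard_mode=False):
    """Average ticket score minus capped penalties.

    Assumptions: penalties are clamped to [0, cap], the Hard-mode factor is
    applied after the penalties, and the final score never goes below zero.
    """
    base = mean(ticket_score(t) for t in tickets)
    total_penalty = sum(
        min(max(penalties.get(name, 0.0), 0.0), cap)
        for name, cap in PENALTY_CAPS.items()
    )
    score = base - total_penalty
    if hard_mode:
        score *= HARD_MODE_MULTIPLIER
    return max(0.0, score)
```

For example, a ticket with the right team, the wrong priority, and a half-credit reply scores 0.40 + 0.00 + 0.15 = 0.55 under these weights, which is the partial-credit behavior described above.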

"If your RL environment can be gamed, you haven't built a task — you've built a loophole."


📊 Training Results

[Figure: Reward Curve] GRPO training reward across 3 stages, 135 steps.

[Figure: Epoch Comparison] Average reward per training epoch.

[Figure: Before vs After Training] Behavioral comparison of model outputs.

[Figure: Training Hyperparameters] Full config used for the v2 run.

We fine-tuned Qwen2.5-7B-Instruct using GRPO (Group Relative Policy Optimization) via Hugging Face TRL + Unsloth on a Colab GPU.
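
For readers new to GRPO, the core idea is that several completions are sampled for the same prompt and each one is scored relative to its own group rather than against a learned value function. The sketch below shows only that normalization step; it is not the TRL or Unsloth implementation, and the function name, the use of the population standard deviation, and the `eps` value are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when every completion gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Completions that score above their group's mean get a positive advantage and are reinforced; those below the mean are pushed down. If every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.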

The first thing we discovered? The base model immediately hallucinated an entirely new rescue team.

❌  team: "emergency_services"   (not in the valid set)
❌  team: "utility repair"       (the agent made this up)
❌  priority: "very-high"        (also made up)
❌  priority: "immediately"      (still wrong)

The model had read enough emergency management documents to know the vibe of disaster response — but it had no idea what valid actions actually existed in our environment.
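
A check of this kind is easy to sketch. The version below is a hypothetical validator, not the environment's own code: the `team` and `priority` field names mirror the outputs shown above, and the allowed sets contain only the values this post actually mentions ("rescue", "urgent", "high"). The real sets are larger and are not reproduced here.

```python
import json

# Only values that appear in this post; the real action space has more entries.
VALID_TEAMS = {"rescue"}
VALID_PRIORITIES = {"urgent", "high"}


def validate_action(raw):
    """Return a list of problems found in one model output (empty if valid)."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(action, dict):
        return ["output is not a JSON object"]

    errors = []
    team = action.get("team")
    if not isinstance(team, str) or team not in VALID_TEAMS:
        errors.append(f"unknown team: {team!r}")
    priority = action.get("priority")
    if not isinstance(priority, str) or priority not in VALID_PRIORITIES:
        errors.append(f"unknown priority: {priority!r}")
    return errors
```

Run against the base model's outputs above, `emergency_services` and `very-high` would each produce an error, which is the kind of signal an invalid-action penalty can be built on.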

That's exactly the kind of failure RL is designed to fix.

After 3 training stages and 135 steps:

✅  team: "rescue"
✅  priority: "urgent"  
✅  JSON output: perfectly structured

The model learned to stop inventing teams and priority levels and to operate within the defined action space. The base model's initial failure is an instance of sparse reward collapse, a documented RL failure mode in which small models struggle to optimize multi-step, interdependent workflows. Our environment was hard enough to expose it. That's a feature, not a bug.


The Benchmark Results

We ran the trained model across all 3 difficulty tiers against the live deployed environment:

Agent                                   Easy    Medium  Hard    Avg
Heuristic Baseline (hardcoded rules)    0.704   0.683   0.660   0.682
GRPO Qwen2.5-7B v2 (ours)               0.641   0.665   0.601   0.636

All 3 tiers: ✅ PASS ✅ PASS ✅ PASS

The heuristic baseline uses hand-crafted regex patterns and keyword matching. Zero generalisation. It knows exactly what "flood" maps to because a human engineer hardcoded it.

Our model generates unique, contextually accurate handoff notes for every incident, with no hardcoded rules and no templates. It reads the situation and decides. The fact that it averages 0.636 against the hand-tuned baseline's 0.682, a gap of under 0.05, while doing actual reasoning is the result that matters.
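
The averages and the gap between the two agents can be recomputed directly from the per-tier scores in the table. The short script below does only that arithmetic; the variable names are illustrative.

```python
# Per-tier scores (Easy, Medium, Hard) copied from the benchmark table.
scores = {
    "heuristic": [0.704, 0.683, 0.660],
    "grpo": [0.641, 0.665, 0.601],
}

# Unweighted mean across the three tiers.
averages = {name: sum(tiers) / len(tiers) for name, tiers in scores.items()}

# Gap in raw score points, and as a fraction of the baseline's average.
absolute_gap = averages["heuristic"] - averages["grpo"]
relative_gap = absolute_gap / averages["heuristic"]

for name, avg in averages.items():
    print(f"{name}: {avg:.3f}")
print(f"absolute gap: {absolute_gap:.3f}")
print(f"relative gap: {relative_gap:.1%}")
```

On unrounded averages the gap is about 0.047 score points, which is just under 7% of the baseline's average. It is worth stating which of the two numbers is meant whenever the gap is quoted as a percentage.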


The Dashboard: Because Judges Are Human Too

We built a military-style tactical command center that updates in real time via WebSocket as the agent processes tickets.

  • 🗺️ OpenStreetMap with color-coded incident markers (red = urgent, orange = high, ✓ = resolved)
  • ⚡ ARIA — an AI Incident Analyst powered by Gemini, available for live analysis of any incident
  • 📊 Real-time score tracker, resource budget bar, team routing feed
  • 🔔 Operations feed with audio alerts

It is not a static demo. When you run inference.py, the dashboard updates live, so you can watch the agent work in real time.

▶️ Open the Command Center


Try It Yourself

git clone https://github.com/letsjoyn/meta-scalar-hack.git
cd meta-scalar-hack
pip install -e .

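# Run the agent against the live environment
# (PowerShell syntax; in bash, set each variable with `export NAME="value"` and run `python inference.py`)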
$env:OPENENV_BASE_URL = "https://joynnayvedya-disaster-response-openenv.hf.space"
$env:API_BASE_URL     = "https://router.huggingface.co/v1"
$env:MODEL_NAME       = "Qwen/Qwen2.5-72B-Instruct"
$env:HF_TOKEN         = "hf_YOUR_TOKEN"
py inference.py
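
Before launching, it can help to confirm that all four variables are actually set. The snippet below is an optional pre-flight check, assuming inference.py reads exactly the variables named above. That is an assumption based on this section, not on the script's source, and `missing_settings` is a hypothetical helper.

```python
import os

# Variable names taken from the setup commands above.
REQUIRED_VARS = ("OPENENV_BASE_URL", "API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


def missing_settings(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```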

Links

Resource                          URL
🤗 HF Space (Live Environment)    joynnayvedya/disaster-response-openenv
🧠 Trained Model                  joynnayvedya/disaster-response-v2
📓 Training Notebook (Colab)      Open in Colab
💻 GitHub                         letsjoyn/meta-scalar-hack

Built for the 2026 Meta & Scalar AI Hackathon — Grand Finale, Bangalore.

Every scenario based on a real disaster. Every reward signal designed to be unhackable.

Uploaded finetuned model

  • Developed by: joynnayvedya
  • License: apache-2.0
  • Finetuned from model: unsloth/Qwen2.5-7B-Instruct-bnb-4bit

This qwen2 model was trained 2x faster with Unsloth and Hugging Face's TRL library.
