Teaching LLMs to Run an Incident Command Center
India's Biggest Mega AI Hackathon β Built on Meta OpenEnv Β· Round 2
TL;DR β I built an OpenEnv environment where three specialist agents β Triage, Investigator, and Ops Manager β cooperate to resolve real-world tech incidents under SLA pressure, budget constraints, and customer-tier business impact. I then fine-tuned Qwen2.5-1.5B-Instruct on heuristic rollouts and watched it close a +10.17-reward gap on hard incidents, matching the hand-coded expert policy component-for-component. A separate 0.5B ablation shows that model scale is the story β same pipeline, same data schema, but the smaller backbone never closes a single hard incident.
π Everything in one place
| What | Where |
|---|---|
| π’ Live environment (OpenEnv-compatible) | swapnilpatil28-multi-agent-incident-command-center.hf.space β |
| π€ Hugging Face Space page | huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center β |
| π» GitHub source code | github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center β |
| π Reproducible training (Colab T4) | Open in Colab β |
| π Full README (story + technical deep-dive) | github.com/.../README.md β |
| β Submission checklist | docs/SUBMISSION_CHECKLIST.md |
1. The story in 2 minutes (for anyone)
When a real tech company has an outage, three people's phones buzz at once. A Triage engineer scans logs and dashboards. An Investigator forms a hypothesis and applies a fix. An Ops Manager decides who owns the work, whether to escalate, and when to officially close the incident.
Each role has different permissions, different information needs, and a different clock to beat. Get it wrong and you bleed budget, bust the SLA, and β if the customer is on an enterprise contract β lose serious money (~3Γ what a free-tier outage costs).
I built a simulator of that war room β an OpenEnv-compatible environment with 13 realistic incidents, 3 specialist roles, and 14+ named reward signals β and fine-tuned an LLM to run it.
| Role | Can do | Cannot do |
|---|---|---|
| π Triage agent | Pull logs Β· check metrics Β· consult KB articles | Close a ticket |
| π§ͺ Investigator | Apply a fix Β· roll back a deploy | Escalate or file a post-mortem |
| π· Ops Manager | Escalate Β· file post-mortem Β· close the ticket | Apply a code fix |
The headline number: the fine-tuned LLM earns +10.17 more reward on hard incidents than the untrained base β and matches the human-written expert policy component-for-component.
One picture, four policies, three difficulty tiers. Random is the floor. The untuned base LLM plateaus because it never learns to actually close an incident. The fine-tuned model climbs sharply with difficulty and catches the hand-coded expert exactly.
2. Why this is a real RL problem (three themes in one environment)
Most RL environments for LLMs are single-agent, single-step, or turn-based games. Real enterprise work is none of those. This environment deliberately tests all three Round-2 hackathon themes simultaneously:
| Hackathon theme | How this environment satisfies it |
|---|---|
| π€ #1 β Multi-Agent Interactions | Three distinct specialist roles with non-overlapping permissions. Acting out-of-role triggers a wrong_actor_penalty (β0.08). Correct handoffs earn +0.15. Collaboration is trained, not hard-coded. |
| β±οΈ #2 β Long-Horizon Planning | Each episode carries 3β5 sequential incidents, 20β60 steps apiece, under a single ticking SLA clock. The big reward (+0.80 Γ tier) only fires after clues β fix β post-mortem. Sparse and delayed by design β the 20-step credit-assignment problem is the whole point. |
| π’ #3 β World Modeling / Professional Tasks | Incidents carry real logs, metrics, KB articles, red-herring signals, and business metadata (customer tier, affected users, $/min revenue impact). Closure rewards scale by tier (free Γ0.6 Β· standard Γ1.0 Β· premium Γ1.4 Β· enterprise Γ1.8), and wrong closures are punished the same way. Close an enterprise ticket incorrectly and it hurts ~3Γ what a free-tier one does. |
3. What the environment looks like under the hood
The environment runs as a standard OpenEnv FastAPI server β same Gym-style reset / step contract, same Pydantic observation/action schemas, same Docker image format for Hugging Face Spaces.
Observation (partial)
{
"incident_id": "inc-cert-expiry",
"incident_title": "mTLS cert expired β all microservices throwing 500s",
"incident_description": "Alerting fired at 03:12 UTC ...",
"customer_tier": "enterprise",
"affected_users_estimate": 140000,
"revenue_impact_usd_per_min": 4800,
"postmortem_required": true,
"visible_signals": ["mtls handshake errors", "5xx spike in checkout"],
"investigation_targets": {
"logs": ["cert-manager", "auth-service"],
"metrics": ["dash-mesh", "dash-auth"],
"kb": ["kb-mtls-chain", "kb-cert-rotation"]
},
"allowed_actors_by_action": {
"apply_fix": ["investigator_agent"],
"close_incident": ["ops_manager_agent"]
},
"budget_remaining": 18,
"sla_minutes_remaining": 40,
"clues_found": 2,
"mitigation_applied": false,
"reward_components": {"step_cost": -0.04, "clue_bonus": +0.12}
}
Action space
| action_type | Typical actor | Purpose |
|---|---|---|
inspect_logs / inspect_metrics / consult_kb |
triage / investigator | Gather clues (reward shapes here) |
negotiate_handoff |
ops_manager | Route to correct owner |
apply_fix |
investigator | Apply mitigation (scored vs ground truth) |
rollback |
investigator | Revert last change |
escalate |
ops_manager | Engage senior staff |
submit_postmortem |
ops_manager | Required on tier-1 / high-revenue incidents |
close_incident |
ops_manager | Terminal action β final score depends on clues found + mitigation quality + post-mortem + speed |
Reward rubric (composable, not monolithic)
The reward engine emits named components at every step so training curves β and judges β can see exactly where reward came from:
| Component | When it fires | Sign |
|---|---|---|
step_cost |
Every action (β0.01 to β0.08 by action type) | β |
clue_bonus |
Unique log/metric/KB lookup that surfaces a real fact | + |
handoff_correct / handoff_wrong |
Ops manager routes to allowed / disallowed owner | Β± |
mitigation_correct / mitigation_wrong / mitigation_empty |
Fix matches / contradicts / omits ground-truth keywords | Β± |
rollback_effective / rollback_ineffective |
Rollback summary matches the incident's accepted playbook | Β± |
escalation_needed / escalation_not_needed |
Escalation raised for an incident that actually warrants it | Β± |
closure_correct / closure_wrong |
Final close decision matches incident state | Β± (scaled by customer tier) |
closure_mitigation_bonus |
Close after a correct mitigation | + |
closure_under_investigated |
Close without enough clues found | β |
speed_bonus |
Close in β€ 7 / β€ 4 steps | + |
postmortem_bonus / postmortem_missing |
Post-mortem filed / skipped on a high-impact incident | Β± |
repeated_lookup_penalty |
Re-querying the same log/metric/KB | β |
wrong_actor_penalty |
Action invoked by a role that's not authorised | β |
invalid_action |
Unrecognised action_type |
β |
sla_exhausted / budget_exhausted |
Terminal penalty when SLA / action budget hits zero | β |
Anti-gaming: closing early with zero clues is penalised; spamming cheap inspect_logs racks up repeated_lookup_penalty; triggering apply_fix without investigator permissions gives wrong_actor_penalty. A policy cannot shortcut its way to a high score.
4. Training: HF TRL SFT on heuristic rollouts
I first wrote a deterministic HeuristicCoordinator that uses the observation's investigation_targets and role constraints to play through the environment. On hard tasks it earns +5.89 reward where random scores β12.50 β so that gives us ~680 (prompt, completion) pairs of "good" behavior to imitate.
Training script: train_trl.py. One command on Colab T4 (or open the reproducible notebook β) runs the entire pipeline:
os.environ["BASE_MODEL"] = "Qwen/Qwen2.5-1.5B-Instruct"
os.environ["EPISODES_PER_TASK"] = "8"
os.environ["TRAIN_EPOCHS"] = "3"
os.environ["EVAL_LLM_MODELS"] = "true"
os.environ["MAX_LLM_EVAL_STEPS"] = "120"
!python train_trl.py
The script:
- Rolls out the heuristic against the live environment and collects prompts/completions.
- Runs TRL
SFTTrainerwith a singletextcolumn (chat-template applied). - Saves the fine-tuned checkpoint to
artifacts/sft_model/. - Rolls out four policies under identical seeds β random, heuristic, base LLM, fine-tuned LLM.
- Writes
reward_curve.png,training_curve.png,reward_components.png,summary_metrics.json, andtraining_log.json.
Training loss + token accuracy
Loss drops from ~2.84 β ~0.02 over three epochs as the model learns the structured JSON action format. Mean token accuracy climbs from ~0.49 β ~0.99. Satisfies the hackathon "loss AND reward plots" minimum requirement.
Four-policy reward comparison
| Task | Random | Base LLM | Fine-tuned (SFT) | Heuristic |
|---|---|---|---|---|
| easy | β5.96 | β2.92 | β4.72 | β4.72 |
| medium | β11.48 | β4.00 | β0.87 | β0.87 |
| hard | β12.50 | β4.28 | +5.89 | +5.89 |
Fine-tuned vs untrained base: +10.17 reward delta on hard-difficulty incidents.
- Random is the floor on every task.
- Base LLM already beats random on easy because it produces well-formed-ish JSON β but it never closes a single incident, so it just racks up step-costs and SLA penalties.
- Fine-tuned LLM catches the heuristic teacher exactly. The environment is deterministic and SFT hit ~0.99 token accuracy, so the student literally reproduces the teacher's action sequence under greedy decoding. This is imitation learning converging to the expert β the meaningful headline number is therefore SFT vs base, not SFT vs heuristic.
Reward sources β what each policy actually earns
This is the chart I'm proudest of, because it makes the training signal legible. Summed across all three tasks:
- Random bleeds out:
closure_wrong: β17.82Β·wrong_actor_penalty: β3.12Β·mitigation_wrong: β2.10. - Base LLM earns
clue_bonus: +0.24but then gets crushed bystep_cost: β5.16andsla_exhausted: β5.04. It never fires a single positive closure component. - Fine-tuned LLM unlocks the high-value positive components the base never sees:
closure_correct: +7.36Β·mitigation_correct: +2.10Β·closure_mitigation_bonus: +1.80Β·postmortem_bonus: +0.60Β·handoff_correct: +0.75Β·speed_bonus: +0.60.
Training has moved the LLM from "bleeding" to "solving."
5. Why does SFT exactly match the heuristic?
Honest framing matters. The environment is deterministic (same task β same incidents β same observations β same seeds). The heuristic coordinator is also deterministic (same observation β same action). So every rollout of a given task produces a byte-identical trajectory. Our 680-row dataset contains only ~85 unique (observation, action) pairs, each duplicated for redundancy. At ~0.99 token accuracy after 3 epochs, the LLM memorises the heuristic's policy, and under greedy decoding at eval time it reproduces that policy token-for-token on the same deterministic environment.
This is the defining success condition for behavior cloning: the student has become the teacher.
The gap we can legitimately celebrate is therefore SFT vs the untrained base model, where:
- On hard incidents, SFT earns +10.17 more reward than base.
- SFT unlocks reward components (
closure_correct,mitigation_correct,postmortem_bonus) that the base model literally never fires. - On easy tasks, SFT inherits the teacher's known weakness (easy tasks have tight SLA budgets that punish thorough investigation). This is exactly what imitation learning should do β including the teacher's mistakes.
The obvious next step to go beyond the heuristic ceiling is RL with the environment's native reward signal β GRPO or PPO against the same rubric β which is the natural Round 3 work.
6. The surprise finding β scale is the story
I ran the exact same pipeline with the smaller Qwen2.5-0.5B-Instruct backbone (same environment, same seeds, same heuristic teacher, same reward rubric). The story flips entirely:
| Task | Random | Base 0.5B | SFT 0.5B | Heuristic | SFT β Base (0.5B) |
|---|---|---|---|---|---|
| easy | β5.96 | β2.92 | β2.49 | β4.72 | +0.43 |
| medium | β11.48 | β4.00 | β3.86 | β0.87 | +0.14 |
| hard | β12.50 | β2.40 | β2.40 | +5.89 | +0.00 |
The punchline: with a 0.5B backbone, SFT delivers only a +0.43 / +0.14 / +0.00 improvement over the base model and never closes a single hard incident. Bumping the backbone to 1.5B β same SFT code, same data pipeline, same environment β unlocks a β1.80 / +3.13 / +10.17 improvement and makes the LLM match the heuristic's component-for-component behavior on hard incidents.
| Run config | 0.5B | 1.5B (headline) |
|---|---|---|
| Base model | Qwen2.5-0.5B-Instruct | Qwen2.5-1.5B-Instruct |
| Episodes / task (rollout) | 3 | 8 |
| Dataset rows | 255 | 680 |
| Train epochs | 1 | 3 |
| Base β SFT improvement on hard | +0.00 | +10.17 |
| Hard incidents closed by SFT | 0 | full heuristic behavior |
Interpretation: at 0.5B the model is too small to absorb this multi-step, role-gated policy from SFT, even though it can emit syntactically valid JSON. At 1.5B the capacity suddenly becomes sufficient to internalise the full action schedule, and behavior cloning converges. This is the kind of finding the environment is designed to surface β the composable rubric makes it visible in one plot, not hidden behind a single aggregate score.
7. Everything you need to reproduce this
| Live environment | swapnilpatil28-multi-agent-incident-command-center.hf.space (OpenEnv-compatible, Docker-backed) |
| Training notebook | One-click Colab (T4, ~1 h 15 min end-to-end) |
| Source + tests | GitHub repo (21 passing tests, Dockerfile with HEALTHCHECK) |
| Full docs | README β Part 1 story + Part 2 technical deep-dive |
| Committed evidence | artifacts/ β all 4 PNGs + both JSON metric files |
| Submission checklist | docs/SUBMISSION_CHECKLIST.md |
8. What's next (Planned)
- Replace SFT with GRPO or PPO using the environment's native reward signal β no heuristic teacher, let the rubric itself shape the policy and push past the imitation ceiling.
- Scale the incident catalog from 13 templates to 50+ (drop in JSON-defined scenarios).
- Add a second "adversarial" agent that injects misleading signals to test robustness.
If you want to run it yourself, the Space and the repo are fully self-contained β docker run the image and point any OpenEnv-compatible client at it. Or just hit /reset and /step yourself from any language that can speak HTTP JSON.
Built with β₯ on Meta OpenEnv for the OpenEnv India 2026 Round 2 hackathon. Code: GitHub Β· Space: HF Space Β· Training notebook: Colab.



