
Teaching LLMs to Run an Incident Command Center

India's Biggest Mega AI Hackathon – Built on Meta OpenEnv · Round 2

TL;DR – I built an OpenEnv environment where three specialist agents (Triage, Investigator, and Ops Manager) cooperate to resolve real-world tech incidents under SLA pressure, budget constraints, and customer-tier business impact. I then fine-tuned Qwen2.5-1.5B-Instruct on heuristic rollouts and watched it gain +10.17 reward over the untuned base on hard incidents, matching the hand-coded expert policy component-for-component. A separate 0.5B ablation shows that model scale is the story: same pipeline, same data schema, but the smaller backbone never closes a single hard incident.

🔗 Everything in one place

What Where
🟢 Live environment (OpenEnv-compatible) swapnilpatil28-multi-agent-incident-command-center.hf.space ↗
🤗 Hugging Face Space page huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center ↗
💻 GitHub source code github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center ↗
🎓 Reproducible training (Colab T4) Open in Colab ↗
📖 Full README (story + technical deep-dive) github.com/.../README.md ↗
✅ Submission checklist docs/SUBMISSION_CHECKLIST.md

1. The story in 2 minutes (for anyone)

When a real tech company has an outage, three people's phones buzz at once. A Triage engineer scans logs and dashboards. An Investigator forms a hypothesis and applies a fix. An Ops Manager decides who owns the work, whether to escalate, and when to officially close the incident.

Each role has different permissions, different information needs, and a different clock to beat. Get it wrong and you bleed budget, bust the SLA, and, if the customer is on an enterprise contract, lose serious money (~3× what a free-tier outage costs).

I built a simulator of that war room – an OpenEnv-compatible environment with 13 realistic incidents, 3 specialist roles, and 14+ named reward signals – and fine-tuned an LLM to run it.

Role Can do Cannot do
πŸ” Triage agent Pull logs Β· check metrics Β· consult KB articles Close a ticket
πŸ§ͺ Investigator Apply a fix Β· roll back a deploy Escalate or file a post-mortem
πŸ‘· Ops Manager Escalate Β· file post-mortem Β· close the ticket Apply a code fix

The headline number: the fine-tuned LLM earns +10.17 more reward on hard incidents than the untrained base – and matches the human-written expert policy component-for-component.

Reward curve comparing random, base LLM, fine-tuned LLM, and heuristic on easy, medium, and hard tasks

One picture, four policies, three difficulty tiers. Random is the floor. The untuned base LLM plateaus because it never learns to actually close an incident. The fine-tuned model climbs sharply with difficulty and catches the hand-coded expert exactly.


2. Why this is a real RL problem (three themes in one environment)

Most RL environments for LLMs are single-agent, single-step, or turn-based games. Real enterprise work is none of those. This environment deliberately tests all three Round-2 hackathon themes simultaneously:

Hackathon theme How this environment satisfies it
🤝 #1 – Multi-Agent Interactions Three distinct specialist roles with non-overlapping permissions. Acting out-of-role triggers a wrong_actor_penalty (−0.08). Correct handoffs earn +0.15. Collaboration is trained, not hard-coded.
⏱️ #2 – Long-Horizon Planning Each episode carries 3–5 sequential incidents, 20–60 steps apiece, under a single ticking SLA clock. The big reward (+0.80 × tier) only fires after clues → fix → post-mortem. Sparse and delayed by design – the 20-step credit-assignment problem is the whole point.
🏢 #3 – World Modeling / Professional Tasks Incidents carry real logs, metrics, KB articles, red-herring signals, and business metadata (customer tier, affected users, $/min revenue impact). Closure rewards scale by tier (free ×0.6 · standard ×1.0 · premium ×1.4 · enterprise ×1.8), and wrong closures are punished the same way. Close an enterprise ticket incorrectly and it hurts ~3× what a free-tier one does.

3. What the environment looks like under the hood

The environment runs as a standard OpenEnv FastAPI server – same Gym-style reset / step contract, same Pydantic observation/action schemas, same Docker image format for Hugging Face Spaces.

Observation (partial)

{
  "incident_id": "inc-cert-expiry",
  "incident_title": "mTLS cert expired β€” all microservices throwing 500s",
  "incident_description": "Alerting fired at 03:12 UTC ...",
  "customer_tier": "enterprise",
  "affected_users_estimate": 140000,
  "revenue_impact_usd_per_min": 4800,
  "postmortem_required": true,
  "visible_signals": ["mtls handshake errors", "5xx spike in checkout"],
  "investigation_targets": {
    "logs": ["cert-manager", "auth-service"],
    "metrics": ["dash-mesh", "dash-auth"],
    "kb": ["kb-mtls-chain", "kb-cert-rotation"]
  },
  "allowed_actors_by_action": {
    "apply_fix": ["investigator_agent"],
    "close_incident": ["ops_manager_agent"]
  },
  "budget_remaining": 18,
  "sla_minutes_remaining": 40,
  "clues_found": 2,
  "mitigation_applied": false,
  "reward_components": {"step_cost": -0.04, "clue_bonus": +0.12}
}

Action space

action_type Typical actor Purpose
inspect_logs / inspect_metrics / consult_kb triage / investigator Gather clues (reward shapes here)
negotiate_handoff ops_manager Route to correct owner
apply_fix investigator Apply mitigation (scored vs ground truth)
rollback investigator Revert last change
escalate ops_manager Engage senior staff
submit_postmortem ops_manager Required on tier-1 / high-revenue incidents
close_incident ops_manager Terminal action – final score depends on clues found + mitigation quality + post-mortem + speed
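
The whole loop is plain HTTP: /reset starts an episode and /step takes one action (those two endpoint names come from the docs below). Here's a minimal client sketch in Python – the reset payload, the action wrapper, and the response keys beyond the observation fields shown above are my assumptions rather than the canonical schema, so treat it as illustrative.

import requests

BASE = "https://swapnilpatil28-multi-agent-incident-command-center.hf.space"

# Start a fresh episode (endpoints /reset and /step are from the docs; the payload shape is a guess)
obs = requests.post(f"{BASE}/reset", json={"task": "hard"}).json()

# Triage gathers a clue from the first log target; the actor/target field names mirror the
# observation above but are assumptions about the exact action schema
action = {
    "actor": "triage_agent",
    "action_type": "inspect_logs",
    "target": obs["investigation_targets"]["logs"][0],
}
result = requests.post(f"{BASE}/step", json={"action": action}).json()
print(result.get("reward"), result.get("reward_components"))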

Reward rubric (composable, not monolithic)

The reward engine emits named components at every step so training curves – and judges – can see exactly where reward came from:

Component When it fires Sign
step_cost Every action (−0.01 to −0.08 by action type) −
clue_bonus Unique log/metric/KB lookup that surfaces a real fact +
handoff_correct / handoff_wrong Ops manager routes to allowed / disallowed owner ±
mitigation_correct / mitigation_wrong / mitigation_empty Fix matches / contradicts / omits ground-truth keywords ±
rollback_effective / rollback_ineffective Rollback summary matches the incident's accepted playbook ±
escalation_needed / escalation_not_needed Escalation raised for an incident that actually warrants it ±
closure_correct / closure_wrong Final close decision matches incident state ± (scaled by customer tier)
closure_mitigation_bonus Close after a correct mitigation +
closure_under_investigated Close without enough clues found −
speed_bonus Close in ≤ 7 / ≤ 4 steps +
postmortem_bonus / postmortem_missing Post-mortem filed / skipped on a high-impact incident ±
repeated_lookup_penalty Re-querying the same log/metric/KB −
wrong_actor_penalty Action invoked by a role that's not authorised −
invalid_action Unrecognised action_type −
sla_exhausted / budget_exhausted Terminal penalty when SLA / action budget hits zero −

Anti-gaming: closing early with zero clues is penalised; spamming cheap inspect_logs racks up repeated_lookup_penalty; triggering apply_fix without investigator permissions gives wrong_actor_penalty. A policy cannot shortcut its way to a high score.
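
To make the composition concrete, here's a rough sketch of how the terminal closure score might be assembled from named components. The 0.80 base closure reward and the tier multipliers come from the tables above; the helper name, the clue threshold, and the smaller magnitudes are illustrative guesses, not the environment's actual code.

TIER_MULTIPLIER = {"free": 0.6, "standard": 1.0, "premium": 1.4, "enterprise": 1.8}

def closure_components(obs, mitigation_correct, steps_taken):
    """Sketch: decompose the terminal close_incident reward into named components."""
    comps = {}
    tier = TIER_MULTIPLIER[obs["customer_tier"]]
    investigated = obs["clues_found"] >= 2              # clue threshold is a guess
    if obs["mitigation_applied"] and investigated:
        comps["closure_correct"] = 0.80 * tier           # big reward scales with business impact
        if mitigation_correct:
            comps["closure_mitigation_bonus"] = 0.30     # illustrative magnitude
        if steps_taken <= 7:
            comps["speed_bonus"] = 0.20                  # illustrative magnitude
    else:
        comps["closure_wrong"] = -0.80 * tier            # wrong closures punished at the same scale
        if not investigated:
            comps["closure_under_investigated"] = -0.25  # illustrative magnitude
    return comps

# e.g. an enterprise ticket (x1.8) closed correctly after a good fix in 6 steps:
# {"closure_correct": 1.44, "closure_mitigation_bonus": 0.30, "speed_bonus": 0.20}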


4. Training: HF TRL SFT on heuristic rollouts

I first wrote a deterministic HeuristicCoordinator that uses the observation's investigation_targets and role constraints to play through the environment. On hard tasks it earns +5.89 reward where random scores −12.50. Rolling it out across all tasks yields ~680 (prompt, completion) pairs of "good" behavior to imitate.
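
The real HeuristicCoordinator lives in the repo; the gist – read investigation_targets, respect the role gates, fix, then close – is roughly this (the field names come from the observation schema above, everything else is a compressed sketch, not the actual implementation):

def heuristic_action(obs):
    """Greedy expert sketch: gather clues, apply the fix, file the post-mortem, close."""
    targets = obs["investigation_targets"]
    # 1. Triage gathers clues, walking the log targets in order so lookups never repeat
    if obs["clues_found"] < len(targets["logs"]) and not obs["mitigation_applied"]:
        return {"actor": "triage_agent", "action_type": "inspect_logs",
                "target": targets["logs"][obs["clues_found"]]}
    # 2. Only the investigator may apply the fix (allowed_actors_by_action gates this)
    if not obs["mitigation_applied"]:
        return {"actor": "investigator_agent", "action_type": "apply_fix",
                "target": targets["kb"][0]}
    # 3. Ops manager files the post-mortem when required (postmortem_filed is an assumed flag)
    if obs["postmortem_required"] and not obs.get("postmortem_filed", False):
        return {"actor": "ops_manager_agent", "action_type": "submit_postmortem"}
    # 4. Ops manager closes the ticket
    return {"actor": "ops_manager_agent", "action_type": "close_incident"}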

Training script: train_trl.py. One command on Colab T4 (or open the reproducible notebook ↗) runs the entire pipeline:

import os

os.environ["BASE_MODEL"]         = "Qwen/Qwen2.5-1.5B-Instruct"
os.environ["EPISODES_PER_TASK"]  = "8"
os.environ["TRAIN_EPOCHS"]       = "3"
os.environ["EVAL_LLM_MODELS"]    = "true"
os.environ["MAX_LLM_EVAL_STEPS"] = "120"
!python train_trl.py

The script:

  1. Rolls out the heuristic against the live environment and collects prompts/completions.
  2. Runs TRL SFTTrainer with a single text column (chat-template applied).
  3. Saves the fine-tuned checkpoint to artifacts/sft_model/.
  4. Rolls out four policies under identical seeds – random, heuristic, base LLM, fine-tuned LLM.
  5. Writes reward_curve.png, training_curve.png, reward_components.png, summary_metrics.json, and training_log.json.
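
Step 2 above in miniature. This is a hedged sketch of what a TRL SFT run over a single chat-templated text column looks like – the dataset construction, variable names, and hyperparameters here are illustrative, not lifted from train_trl.py (which reads its settings from the environment variables shown earlier):

from datasets import Dataset
from transformers import AutoTokenizer
from trl import SFTConfig, SFTTrainer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

# Hypothetical rollout pairs: serialized observation in, heuristic's JSON action out
rollout_pairs = [
    {"prompt": "incident: inc-cert-expiry ...", "completion": '{"action_type": "inspect_logs", ...}'},
]

def to_text(example):
    # One chat-templated training string per (prompt, completion) pair
    messages = [
        {"role": "user", "content": example["prompt"]},
        {"role": "assistant", "content": example["completion"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = Dataset.from_list(rollout_pairs).map(to_text)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    train_dataset=dataset,
    args=SFTConfig(output_dir="artifacts/sft_model", num_train_epochs=3,
                   dataset_text_field="text"),
)
trainer.train()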

Training loss + token accuracy

SFT training loss dropping from ~2.84 to ~0.02 and token accuracy climbing from ~0.49 to ~0.99 over 3 epochs

Loss drops from ~2.84 → ~0.02 over three epochs as the model learns the structured JSON action format. Mean token accuracy climbs from ~0.49 → ~0.99. Satisfies the hackathon "loss AND reward plots" minimum requirement.

Four-policy reward comparison

Reward curve comparing random, base LLM, fine-tuned LLM, and heuristic on easy, medium, and hard tasks

Task Random Base LLM Fine-tuned (SFT) Heuristic
easy −5.96 −2.92 −4.72 −4.72
medium −11.48 −4.00 −0.87 −0.87
hard −12.50 −4.28 +5.89 +5.89

Fine-tuned vs untrained base: +10.17 reward delta on hard-difficulty incidents.

  • Random is the floor on every task.
  • Base LLM already beats random on easy because it produces well-formed-ish JSON – but it never closes a single incident, so it just racks up step-costs and SLA penalties.
  • Fine-tuned LLM catches the heuristic teacher exactly. The environment is deterministic and SFT hit ~0.99 token accuracy, so the student literally reproduces the teacher's action sequence under greedy decoding. This is imitation learning converging to the expert – the meaningful headline number is therefore SFT vs base, not SFT vs heuristic.

Reward sources – what each policy actually earns

Stacked-bar chart showing where each policy earns or loses reward, broken down by rubric component

This is the chart I'm proudest of, because it makes the training signal legible. Summed across all three tasks:

  • Random bleeds out: closure_wrong: −17.82 · wrong_actor_penalty: −3.12 · mitigation_wrong: −2.10.
  • Base LLM earns clue_bonus: +0.24 but then gets crushed by step_cost: −5.16 and sla_exhausted: −5.04. It never fires a single positive closure component.
  • Fine-tuned LLM unlocks the high-value positive components the base never sees: closure_correct: +7.36 · mitigation_correct: +2.10 · closure_mitigation_bonus: +1.80 · postmortem_bonus: +0.60 · handoff_correct: +0.75 · speed_bonus: +0.60.

Training has moved the LLM from "bleeding" to "solving."


5. Why does SFT exactly match the heuristic?

Honest framing matters. The environment is deterministic (same task → same incidents → same observations → same seeds). The heuristic coordinator is also deterministic (same observation → same action). So every rollout of a given task produces a byte-identical trajectory. Our 680-row dataset contains only ~85 unique (observation, action) pairs, each simply repeated across the 8 episodes per task. At ~0.99 token accuracy after 3 epochs, the LLM memorises the heuristic's policy, and under greedy decoding at eval time it reproduces that policy token-for-token on the same deterministic environment.
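
A quick way to see that duplication for yourself is to count unique pairs in the dumped dataset – the file name and field names below are assumptions about the artifact layout:

import json

# Count unique (prompt, completion) pairs in the SFT dataset dump
rows = [json.loads(line) for line in open("artifacts/sft_dataset.jsonl")]
unique_pairs = {(r["prompt"], r["completion"]) for r in rows}
print(len(rows), "rows,", len(unique_pairs), "unique (observation, action) pairs")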

This is the defining success condition for behavior cloning: the student has become the teacher.

The gap we can legitimately celebrate is therefore SFT vs the untrained base model, where:

  • On hard incidents, SFT earns +10.17 more reward than base.
  • SFT unlocks reward components (closure_correct, mitigation_correct, postmortem_bonus) that the base model literally never fires.
  • On easy tasks, SFT inherits the teacher's known weakness (easy tasks have tight SLA budgets that punish thorough investigation). This is exactly what imitation learning should do – including the teacher's mistakes.

The obvious next step to go beyond the heuristic ceiling is RL with the environment's native reward signal – GRPO or PPO against the same rubric – which is the natural Round 3 work.


6. The surprise finding – scale is the story

I ran the exact same pipeline with the smaller Qwen2.5-0.5B-Instruct backbone (same environment, same seeds, same heuristic teacher, same reward rubric). The story flips entirely:

Reward curve for the 0.5B ablation – SFT barely improves over base and never closes a hard incident

Task Random Base 0.5B SFT 0.5B Heuristic SFT − Base (0.5B)
easy −5.96 −2.92 −2.49 −4.72 +0.43
medium −11.48 −4.00 −3.86 −0.87 +0.14
hard −12.50 −2.40 −2.40 +5.89 +0.00

The punchline: with a 0.5B backbone, SFT delivers only a +0.43 / +0.14 / +0.00 improvement over the base model and never closes a single hard incident. Bumping the backbone to 1.5B – same SFT code, same data pipeline, same environment – shifts that base-to-SFT delta to −1.80 / +3.13 / +10.17 and makes the LLM match the heuristic's behavior component-for-component on hard incidents.

Run config 0.5B 1.5B (headline)
Base model Qwen2.5-0.5B-Instruct Qwen2.5-1.5B-Instruct
Episodes / task (rollout) 3 8
Dataset rows 255 680
Train epochs 1 3
Base → SFT improvement on hard +0.00 +10.17
Hard incidents closed by SFT 0 full heuristic behavior

Interpretation: at 0.5B the model is too small to absorb this multi-step, role-gated policy from SFT, even though it can emit syntactically valid JSON. At 1.5B the capacity suddenly becomes sufficient to internalise the full action schedule, and behavior cloning converges. This is the kind of finding the environment is designed to surface – the composable rubric makes it visible in one plot, not hidden behind a single aggregate score.


7. Everything you need to reproduce this

Live environment swapnilpatil28-multi-agent-incident-command-center.hf.space (OpenEnv-compatible, Docker-backed)
Training notebook One-click Colab (T4, ~1 h 15 min end-to-end)
Source + tests GitHub repo (21 passing tests, Dockerfile with HEALTHCHECK)
Full docs README – Part 1 story + Part 2 technical deep-dive
Committed evidence artifacts/ – all 4 PNGs + both JSON metric files
Submission checklist docs/SUBMISSION_CHECKLIST.md

8. What's next (Planned)

  • Replace SFT with GRPO or PPO using the environment's native reward signal – no heuristic teacher, let the rubric itself shape the policy and push past the imitation ceiling.
  • Scale the incident catalog from 13 templates to 50+ (drop in JSON-defined scenarios).
  • Add a second "adversarial" agent that injects misleading signals to test robustness.

If you want to run it yourself, the Space and the repo are fully self-contained – docker run the image and point any OpenEnv-compatible client at it. Or just hit /reset and /step yourself from any language that can speak HTTP JSON.


Built with ♥ on Meta OpenEnv for the OpenEnv India 2026 Round 2 hackathon. Code: GitHub · Space: HF Space · Training notebook: Colab.