NEXUS Enhanced: Training AI to Respond to CrowdStrike-Scale Incidents

Team Falcons β€” Meta PyTorch OpenEnv Hackathon Grand Finale, April 2026

Live Demo: huggingface.co/spaces/kunalkachru23/nexus-enhanced
Training Notebook: grpo_colab.ipynb


The Problem: A $5.4 Billion Outage Nobody Wants to Repeat

On July 19, 2024, a single faulty CrowdStrike update crashed 8.5 million Windows machines worldwide in 78 minutes. Airlines, hospitals, and banks went dark. The total economic loss exceeded $5.4 billion. At peak, the outage was burning through an estimated $48,000 per minute across affected enterprises.

The tragedy wasn't just technical β€” it was coordination failure. Multiple teams had different pieces of the picture, no one had authority to synthesize them fast enough, and the escalation protocols were too slow for the pace of the crisis.

We built NEXUS Enhanced to train AI to do better.


What We Built

NEXUS Enhanced is a multi-agent reinforcement learning environment where an Incident Commander (IC) AI learns to orchestrate five specialist agents to detect, triage, investigate, mitigate, and resolve production incidents β€” from a simple payment service timeout all the way to a CrowdStrike-scale global failure.

The IC is trained using GRPO (Group Relative Policy Optimization) via HuggingFace TRL on Qwen2.5-1.5B. The specialist agents are deterministic simulators β€” this keeps the training signal clean and tractable for a 1.5B model in 48 hours.

The 6-Agent Team

Agent Tool Role
Incident Commander All (coordinator) Orchestrates all agents β€” the trained policy
L2 Engineer SimDatadog Analyzes metrics and logs
SRE Agent SimRunbook Executes infrastructure runbooks
Product Manager SimJira Tracks SLA and revenue impact
L1 Support SimSlack + Portal Customer communication
Oversight Agent Monitor Compliance, explainability, protocol

Each agent operates with partial observability β€” the L2 Engineer sees Datadog logs but not SLA metrics; the Product Manager sees revenue impact but not runbook steps. The IC must synthesize partial views into a coherent incident response.

The IC also starts with minimal information (Halluminate sub-theme): at step 0, it only sees "3 alerts firing" β€” not which services or what metrics. Scope is discovered by dispatching agents to investigate.


The Reward Model: 6 Dimensions, Zero Shortcuts

A naive reward like "did the incident resolve?" creates shortcuts. We designed a 6-dimensional sparse reward that requires genuine evidence of incident mastery:

episode_reward = (
    0.30 * mttr_score           # faster resolution = higher score
  + 0.25 * diagnosis_score      # root cause accuracy + tool evidence (gated)
  + 0.20 * customer_score       # proactive notification required
  + 0.15 * coordination_score   # unique tool queries, no duplicates
  + 0.05 * oversight_score      # protocol compliance
  + depth_bonus                 # UNCAPPED β€” scales with reasoning word count
)

Anti-shortcuts built in:

  • Diagnosis requires a tool output implicating the root-cause service β€” guessing scores 0.1 max
  • Customer score requires dispatching send_notification β€” awareness alone scores nothing
  • Coordination penalises duplicate (agent, tool, params) calls β€” redundancy is punished
  • Depth bonus ignores assessments under 30 words β€” boilerplate strings earn zero

Mercor sub-theme: The depth bonus is uncapped and scales with word count past 80 words plus structural keywords (root cause, coalition, blast radius, etc.). A 200-word structured analysis earns ~0.3–0.5 on this dimension alone.


The 7 Incident Cases

Case Difficulty Key Challenge Revenue Burn
INC001 Easy Payment service, Stripe API version mismatch $8,400/min
INC002 Easy DB pool exhaustion, cascade to 3 services $12,000/min
INC003 Medium Red herrings + ML model memory leak $15,600/min
INC004 Hard Vendor retry storm, root cause masked $22,000/min
INC005 Hard JWT key rotation, conflicting signals $18,500/min
INC006 Very Hard Multi-region CDN misrouting $36,000/min
INC007 Nightmare CrowdStrike-scale + live schema drift $48,000/min

INC003 (primary demo): Three alerts fire simultaneously. The recommendation-service has 96% memory (real root cause: ML model v4 feature vector cache with no LRU eviction). The search-service has 8% errors (red herring β€” within threshold). The ad-service has 78% CPU (red herring β€” load-correlated). A trained IC learns to ignore the red herrings, form a coalition vote on the correct hypothesis, and execute the runbook β€” all within 28 steps.


Training Results (Pre-Event Baseline)

Before on-site GPU training, 30 scripted baseline episodes established the floor:

  • Baseline avg reward: 0.265 (range 0.195–0.336)
  • Baseline behaviour: 8-word canned assessments, no coalition, late/no notification
  • Improvement headroom to trained target (0.65): +0.306

Expected trained model performance after 200 GRPO steps: 0.55–0.75

The gap shows clear, observable training signal β€” satisfying BRD Criterion 3 (observable reward improvement evidence).


Coalition Mechanic (Halluminate)

For medium+ incidents, the IC must cast a coalition vote naming a hypothesis. Specialist agents evaluate it against ground-truth keywords. A correct vote unlocks:

  • +0.15 to diagnosis score
  • +0.2 to coordination score

This incentivises the IC to build genuine consensus, communicate clearly, and commit to a hypothesis with evidence β€” not just guess.


Schema Drift (Patronus AI)

INC007 includes a live schema change at step 18:

  • RunBook: step_id β†’ runbook_ref, expected_outcome β†’ expected_output + success_criteria
  • CustomerPortal: notification API now requires gdpr_compliant=true

An IC trained only on earlier incidents will encounter API errors on INC007 and must adapt mid-episode.


Expert Review Board (Snorkel AI)

The reward weights adapt based on recent IC performance:

Criteria When activated Effect
Speed Default / mttr weak Boosts MTTR weight 1.5Γ—
Communication Customer score weak (<0.4) Boosts customer weight 1.8Γ—
Technical Diagnosis weak (<0.4) Boosts diagnosis weight 1.6Γ—
Cost Coordination weak (<0.4) Boosts coordination + oversight

The expert board notices IC weaknesses and shifts its evaluation focus β€” simulating real subject-matter experts with changing requirements.


Try It

# Live demo β€” no install needed
curl -X POST https://kunalkachru23-nexus-enhanced.hf.space/demo/run/INC003 | python -m json.tool

# Or open the web dashboard
https://huggingface.co/spaces/kunalkachru23/nexus-enhanced

Training notebook (GRPO on Qwen2.5-1.5B, cells 1–3 run without GPU):
notebooks/grpo_colab.ipynb in the Space repo.


Built by Team Falcons (Kunal Kachru) for the Meta PyTorch OpenEnv Hackathon Grand Finale.
OpenEnv v0.2.3 | HuggingFace TRL GRPO | Qwen2.5-1.5B | 185 tests passing

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support