
# Submission Checklist — OpenEnv India 2026 Round 2

Status against every hard gate in the official judging rules, plus every polish item that moves the judging needle. Last verified: all 21 tests passing, HF Space live, all artifacts committed.


## Hard gates (from the official rules)

| # | Rule | Status | Evidence |
|---|------|--------|----------|
| 1 | Use OpenEnv (latest release). Build on top of the framework; don't reinvent the wheel. | ✅ | `requirements.txt` pins `openenv-core>=0.2.2`, `openenv.yaml` has `version: "3.0"`, `server/environment.py` extends `openenv.core.environment.Environment`, app built via `openenv.core.env_server.create_fastapi_app`. |
| 2 | Working training script (Unsloth / HF TRL / any RL framework), ideally as a Colab notebook so judges can re-run it. | ✅ | `train_trl.py` uses HF TRL `SFTTrainer`. One-click Colab notebook ↗ runs the whole pipeline end-to-end on a T4 in ~1 h 15 min. |
| 3 | Evidence that you actually trained: at minimum, loss and reward plots from a real run. | ✅ | Four plots committed to `artifacts/`: `training_curve.png` (loss + token accuracy), `reward_curve.png` (4-policy reward by tier), `reward_components.png` (per-component breakdown), plus the 0.5B ablation `reward_curve_qwen0p5b.png`. Full `training_log.json` + `summary_metrics.json` committed alongside. |
| 4 | Short writeup or video: mini-blog on Hugging Face OR <2-min YouTube video, linked from README. | ✅ | Mini-blog lives as `docs/BLOG_POST.md`, shipped as part of the HF Space (rule 4 says "mini-blog on Hugging Face"; the Space is on HF and contains this file, so it renders at huggingface.co/spaces/.../blob/main/docs/BLOG_POST.md). All four training plots render inline via raw GitHub URLs. README and dashboard both link to it. (No separate video submission.) |
| 5 | Push your environment to a Hugging Face Space so it's discoverable and runnable. | ✅ | Live at swapnilpatil28-multi-agent-incident-command-center.hf.space · Space page: huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center. |
| 6 | README motivates the problem, explains how the env works, and shows results. | ✅ | `README.md`: Part 1 ("Story in 2 minutes") opens with the problem in plain English, walks through the environment via role-permission tables, and shows all four plots + headline numbers. Part 2 is the full technical deep-dive (architecture, action/observation spaces, reward rubric, training pipeline, 0.5B ablation, ops/observability, testing, repo layout). |
| 7 | README links to the HF Space + all additional materials (blog, slides, etc.). | ✅ | "Live links" table inside Part 2 of the README lists every resource. Part 1 also has a "Try it in 30 seconds" CTA table. The dashboard header plus "Resources & documentation" grid surface the same links from the live Space itself. |
| 8 | Do not include big video files in the HF submission — only public URLs. | ✅ | No video files committed. All assets in `artifacts/` are PNG plots (≤ 162 KB each) + JSON. Repo weight is dominated by text and small images. |

## Judging-rubric alignment

### Environment Innovation (40%)

- Multi-role, multi-agent — `triage_agent`, `investigator_agent`, `ops_manager_agent` with non-overlapping permissions (`server/domain/roles.py`).
- Long-horizon — 3–5 sequential incidents per episode, 20–60 steps each, shared SLA + budget counters.
- Professional / enterprise task simulation — realistic logs, metrics, KB articles, customer-tier revenue impact, SLA timers.
- 13 unique incident templates across easy / medium / hard (`server/domain/incidents.py`).
- Rich observation schema — customer tier, revenue impact, allowed actors per action, investigation targets grouped by tool, playbook hints, `reward_components`, `last_action_notes`.
- Composable reward rubric with 14+ named components and anti-gaming safeguards (`server/domain/reward.py`).
- Tier-weighted business impact (free ×0.6 · standard ×1.0 · premium ×1.4 · enterprise ×1.8).
- Role-based permissions + handoff scoring (`wrong_actor_penalty`, `handoff_correct`/`handoff_wrong`).
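The tier weighting above is the kind of multiplier that makes the same actions worth more for higher-value customers. A minimal sketch of how such a scorer could look — the real rubric lives in `server/domain/reward.py`; the component names and base values below are hypothetical, and only the four tier multipliers come from this checklist:

```python
# Minimal sketch of tier-weighted reward scoring. The real rubric lives in
# server/domain/reward.py; component names and base values here are
# hypothetical -- only the tier multipliers come from the checklist above.
TIER_WEIGHTS = {"free": 0.6, "standard": 1.0, "premium": 1.4, "enterprise": 1.8}

def score_step(components: dict[str, float], tier: str) -> float:
    """Sum named reward components, then scale by customer-tier weight."""
    base = sum(components.values())
    return base * TIER_WEIGHTS[tier]

# The same step is worth 3x more for an enterprise customer than a
# free-tier one (1.8 / 0.6 = 3.0).
components = {"correct_diagnosis": 2.0, "sla_met": 1.0, "wrong_actor_penalty": -0.5}
print(score_step(components, "free"))        # 2.5 * 0.6 = 1.5
print(score_step(components, "enterprise"))  # 2.5 * 1.8 = 4.5
```

Keeping components named (rather than folding everything into one scalar) is what lets the stacked `reward_components.png` chart attribute improvement to individual behaviours.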

### Storytelling (30%)

- README Part 1 ("The story in 2 minutes") is written in plain English, readable by a non-technical judge in under three minutes.
- Every plot has a one-line caption explaining what it shows.
- Blog post `docs/BLOG_POST.md` — eight labelled sections, four plots inline via raw GitHub URLs (so they render everywhere), a 0.5B-vs-1.5B ablation narrative, and an explicit hackathon-theme mapping.
- The live HF Space dashboard has a "Story in 2 minutes" hero panel at the top, a role-permission table, a three-card theme mapping, and a "Resources & documentation" grid with click-through links (README, blog, checklist, Colab, Space, etc.).
- All documentation cross-links cleanly: README ↔ dashboard ↔ blog post ↔ checklist.

### Improvement in Rewards (20%)

- 4-policy reward curve (`reward_curve.png`) across easy / medium / hard.
- Training loss + token-accuracy curve (`training_curve.png`).
- Reward-components stacked bar chart (`reward_components.png`) — shows where the improvement came from.
- Ablation plot (`reward_curve_qwen0p5b.png`) for the Qwen2.5-0.5B-Instruct backbone.
- Per-task `improvement_sft_over_base` numbers in `summary_metrics.json`: −1.80 / +3.13 / +10.17 (easy / medium / hard).
- Final headline run: Qwen2.5-1.5B-Instruct, 8 episodes/task, 3 epochs, 680 rows — full `training_log.json` committed.
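The `improvement_sft_over_base` metric is simply the SFT policy's mean reward minus the base policy's, per difficulty tier. A sketch of that computation — the actual schema of `summary_metrics.json` is not reproduced here, and the base/SFT values below are made-up placeholders; only the metric name, the task split, and the reported deltas come from this checklist:

```python
# Hypothetical sketch of how improvement_sft_over_base is derived.
# The base/sft mean-reward values below are invented placeholders chosen
# to reproduce the deltas reported above (-1.80 / +3.13 / +10.17); they
# are NOT the real numbers from summary_metrics.json.
metrics = {
    "easy":   {"mean_reward_base": 10.0, "mean_reward_sft": 8.20},
    "medium": {"mean_reward_base": 5.0,  "mean_reward_sft": 8.13},
    "hard":   {"mean_reward_base": 1.0,  "mean_reward_sft": 11.17},
}

for task, m in metrics.items():
    improvement = m["mean_reward_sft"] - m["mean_reward_base"]
    print(f"{task}: improvement_sft_over_base = {improvement:+.2f}")
```

The pattern — small regression on easy, large gain on hard — is the interesting part of the story: the fine-tuned policy trades a little easy-task reward for much stronger performance where the rubric is hardest to satisfy.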

### Reward & Training Pipeline (10%)

- Reward logic is coherent — a rubric engine with module-level constants and unit tests (`tests/test_reward.py`).
- The training pipeline genuinely connects to the running environment (no static dataset — rollouts are collected from a live `IncidentCommandCenterEnvironment`).
- The SFT checkpoint is saved to `artifacts/sft_model/` and reloaded for the 4-policy evaluation, closing the loop.
- 21 unit + integration tests passing (`tests/test_reward.py`, `tests/test_incidents.py`, `tests/test_environment.py`).
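The "no static dataset" claim means training rows come from a rollout loop against the gym-style `reset`/`step` interface mentioned elsewhere in this checklist. A runnable sketch of that loop shape — `StubEnv`, its observation fields, and the `check_metrics` action are hypothetical stand-ins so the example runs without the real server:

```python
import random

# Sketch of the rollout-collection loop that feeds SFT training data.
# The checklist only states that rollouts come from a live gym-style
# environment (reset/step); this StubEnv and its observation fields are
# hypothetical stand-ins so the loop shape is runnable on its own.
class StubEnv:
    def reset(self, seed=None):
        self.rng = random.Random(seed)   # per-episode seeded RNG
        self.steps_left = 5
        return {"incident": "db_latency_spike", "tier": "premium"}

    def step(self, action):
        self.steps_left -= 1
        reward = 1.0 if action == "check_metrics" else -0.1
        done = self.steps_left == 0
        return {"incident": "db_latency_spike", "tier": "premium"}, reward, done

def collect_rollout(env, policy, seed=0):
    """Roll one episode and return (obs, action, reward) transitions."""
    obs = env.reset(seed=seed)
    transitions = []
    done = False
    while not done:
        action = policy(obs)
        next_obs, reward, done = env.step(action)
        transitions.append((obs, action, reward))
        obs = next_obs
    return transitions

rollout = collect_rollout(StubEnv(), policy=lambda obs: "check_metrics")
print(len(rollout), sum(r for _, _, r in rollout))  # 5 transitions, total reward 5.0
```

In the real pipeline the transitions are serialized into prompt/response rows for `SFTTrainer`, and the saved checkpoint is then used as the policy in the same loop for the 4-policy evaluation.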

## Engineering table-stakes

- Uses the OpenEnv `Environment` base class properly.
- Clean client/server separation — the client only uses Pydantic models + HTTP (`client.py`).
- Gym-style `reset` / `step` / `state` + OpenEnv `/close`.
- Valid `openenv.yaml` manifest (version 3.0).
- No reserved MCP tool names.
- Structured JSON logging with a per-episode seeded RNG (`server/logging_utils.py`).
- Health / version / env-info / metrics endpoints (`/healthz`, `/version`, `/env-info`, `/metrics`).
- Static `/artifacts` mount so the Space serves its own plots — no external hotlinking.
- Dockerfile with `HEALTHCHECK` (`Dockerfile`, `server/Dockerfile`).
- `pytest` passes cleanly: 21 / 21.
- `.dockerignore` keeps the image slim (excludes the `sft_model/` checkpoint, keeps the evidence plots).
- `pre_validate.sh` + `validate-submission.sh` for one-command pre-submission smoke tests.
- `LICENSE` (MIT) in the repo root.

## Final submission steps

| # | Step | Status |
|---|------|--------|
| 1 | Final training run (Qwen2.5-1.5B, 8 eps/task, 3 epochs) → all artifacts committed | ✅ |
| 2 | Commit artifacts (`reward_curve.png`, `training_curve.png`, `reward_components.png`, `reward_curve_qwen0p5b.png`, `training_log.json`, `summary_metrics.json`, `summary_metrics_qwen0p5b.json`) | ✅ |
| 3 | Update README with real numbers + real Space / Colab / GitHub / blog links | ✅ |
| 4 | Deploy HF Space from the same commit | ✅ |
| 5 | Dashboard upgraded: hero story panel, 4 stacked plots, resources grid with README / blog / checklist links | ✅ |
| 6 | Blog post updated (`docs/BLOG_POST.md`) with fixed image paths (raw GitHub URLs) and a 0.5B ablation section | ✅ |
| 7 | All 21 tests passing on the latest commit | ✅ |
| 8 | Run `openenv validate` remotely against the Space — `./validate-submission.sh <space-url>` | ✅ |
| 9 | Submit the Space URL in the hackathon form: https://swapnilpatil28-multi-agent-incident-command-center.hf.space | ✅ |
| 10 | Do not push commits after the submission deadline — post-deadline commits won't be considered | ✅ |

## Pre-submission smoke test (copy-paste)

```bash
# 1. HF Space is serving
curl -fsS https://swapnilpatil28-multi-agent-incident-command-center.hf.space/healthz

# 2. Env-info endpoint advertises metadata
curl -s https://swapnilpatil28-multi-agent-incident-command-center.hf.space/env-info

# 3. OpenEnv validator passes remotely
./validate-submission.sh https://swapnilpatil28-multi-agent-incident-command-center.hf.space

# 4. A remote episode works
ENV_URL=https://swapnilpatil28-multi-agent-incident-command-center.hf.space python inference.py
```

## Where the judges will find each artefact

| Artefact | Primary URL |
|----------|-------------|
| Live environment (OpenEnv-compatible) | swapnilpatil28-multi-agent-incident-command-center.hf.space |
| Hugging Face Space page | Space page ↗ |
| GitHub repository | GitHub ↗ |
| README (Part 1 story + Part 2 deep-dive) | `README.md` |
| Mini blog post (MD file in the repo, renders on both the HF Space and GitHub) | `docs/BLOG_POST.md` |
| Reproducible training notebook | Colab ↗ |
| Training evidence (all 4 plots + JSON metrics) | `artifacts/` folder |