# Submission Checklist – OpenEnv India 2026 Round 2
Status against every hard gate in the official judging rules, plus every polish item that moves the judging needle. Last verified: all 21 tests passing, HF Space live, all artifacts committed.
## Hard gates (from the official rules)
| # | Rule | Status | Evidence |
|---|---|---|---|
| 1 | Use OpenEnv (latest release). Build on top of the framework; don't reinvent the wheel. | ✅ | `requirements.txt` pins `openenv-core>=0.2.2`, `openenv.yaml` has `version: "3.0"`, `server/environment.py` extends `openenv.core.environment.Environment`, app built via `openenv.core.env_server.create_fastapi_app`. |
| 2 | Working training script (Unsloth / HF TRL / any RL framework), ideally as a Colab notebook so judges can re-run it. | ✅ | `train_trl.py` uses the HF TRL `SFTTrainer`. One-click Colab notebook runs the whole pipeline end-to-end on a T4 in ~1 h 15 min. |
| 3 | Evidence that you actually trained: at minimum, loss and reward plots from a real run. | ✅ | Four plots committed to `artifacts/`: `training_curve.png` (loss + token accuracy), `reward_curve.png` (4-policy reward by tier), `reward_components.png` (per-component breakdown), plus the 0.5B ablation `reward_curve_qwen0p5b.png`. Full `training_log.json` + `summary_metrics.json` committed alongside. |
| 4 | Short writeup or video: mini-blog on Hugging Face OR <2-min YouTube video, linked from README. | ✅ | Mini-blog lives as `docs/BLOG_POST.md`, shipped as part of the HF Space (rule 4 says "mini-blog on Hugging Face"; the Space is on HF and contains this file, so it renders at huggingface.co/spaces/.../blob/main/docs/BLOG_POST.md). All four training plots render inline via raw GitHub URLs. README and dashboard both link to it. (No separate video submission.) |
| 5 | Push your environment to a Hugging Face Space so it's discoverable and runnable. | ✅ | Live at swapnilpatil28-multi-agent-incident-command-center.hf.space · Space page: huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center. |
| 6 | README motivates the problem, explains how the env works, and shows results. | ✅ | `README.md` Part 1 ("Story in 2 minutes") opens with the problem in plain English, walks through the environment via role-permission tables, and shows all four plots + headline numbers. Part 2 is the full technical deep-dive (architecture, action/observation spaces, reward rubric, training pipeline, 0.5B ablation, ops/observability, testing, repo layout). |
| 7 | README links to the HF Space + all additional materials (blog, slides, etc.). | ✅ | The "Live links" table inside Part 2 of the README lists every resource. Part 1 also has a "Try it in 30 seconds" CTA table. The dashboard header plus the "Resources & documentation" grid surface the same links from the live Space itself. |
| 8 | Do not include big video files in the HF submission: only public URLs. | ✅ | No video files committed. All assets in `artifacts/` are PNG plots (≤ 162 KB each) + JSON. Repo weight is dominated by text and small images. |
## Judging-rubric alignment

### Environment Innovation (40%)
- Multi-role, multi-agent: `triage_agent`, `investigator_agent`, `ops_manager_agent` with non-overlapping permissions (`server/domain/roles.py`).
- Long-horizon: 3–5 sequential incidents per episode, 20–60 steps each, shared SLA + budget counters.
- Professional / enterprise task simulation: realistic logs, metrics, KB articles, customer-tier revenue impact, SLA timers.
- 13 unique incident templates across easy / medium / hard (`server/domain/incidents.py`).
- Rich observation schema: customer tier, revenue impact, allowed actors per action, investigation targets grouped by tool, playbook hints, `reward_components`, `last_action_notes`.
- Composable reward rubric with 14+ named components and anti-gaming safeguards (`server/domain/reward.py`).
- Tier-weighted business impact (free ×0.6 · standard ×1.0 · premium ×1.4 · enterprise ×1.8).
- Role-based permissions + handoff scoring (`wrong_actor_penalty`, `handoff_correct` / `handoff_wrong`).
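The tier weighting above amounts to a scalar multiplier on a base business-impact term. A minimal sketch under that assumption (the constant and function names here are hypothetical; the real rubric lives in `server/domain/reward.py`):

```python
# Hypothetical sketch of tier-weighted business impact.
# Only the multipliers (0.6 / 1.0 / 1.4 / 1.8) come from the checklist;
# the names and call shape are illustrative.
TIER_WEIGHTS = {
    "free": 0.6,
    "standard": 1.0,
    "premium": 1.4,
    "enterprise": 1.8,
}

def tier_weighted_impact(base_reward: float, customer_tier: str) -> float:
    """Scale a base business-impact reward by the customer tier."""
    return base_reward * TIER_WEIGHTS[customer_tier]

print(tier_weighted_impact(5.0, "enterprise"))  # 9.0
```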
### Storytelling (30%)
- README Part 1, "The story in 2 minutes": written in plain English, readable by a non-technical judge in under three minutes.
- Every plot has a one-line caption explaining what it shows.
- Blog post `docs/BLOG_POST.md`: eight labelled sections, four plots inline via raw GitHub URLs (they render everywhere), a 0.5B-vs-1.5B ablation narrative, and an explicit hackathon-theme mapping.
- Live HF Space dashboard has a "Story in 2 minutes" hero panel at the top, a role-permission table, a three-card theme mapping, and a "Resources & documentation" grid with click-through links (README, blog, checklist, Colab, Space, etc.).
- All documentation cross-links cleanly: README ↔ dashboard ↔ blog post ↔ checklist.
### Improvement in Rewards (20%)
- 4-policy reward curve (`reward_curve.png`) across easy / medium / hard.
- Training loss + token-accuracy curve (`training_curve.png`).
- Reward-components stacked bar chart (`reward_components.png`): shows where the improvement came from.
- Ablation plot (`reward_curve_qwen0p5b.png`) for the Qwen2.5-0.5B-Instruct backbone.
- Per-task `improvement_sft_over_base` numbers in `summary_metrics.json`: −1.80 / +3.13 / +10.17 (easy / medium / hard).
- Final headline run: Qwen2.5-1.5B-Instruct, 8 episodes/task, 3 epochs, 680 rows; full `training_log.json` committed.
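For context, `improvement_sft_over_base` is a per-task delta: SFT-policy mean reward minus base-policy mean reward. A hedged sketch of that computation; the field names and the base/SFT values below are invented for illustration and are not the file's actual schema, chosen only so the deltas match the reported −1.80 / +3.13 / +10.17:

```python
# Illustrative only: the real summary_metrics.json schema may differ,
# and these base/SFT rewards are made-up placeholder values.
summary = {
    "easy":   {"base_mean_reward": 12.0, "sft_mean_reward": 10.2},
    "medium": {"base_mean_reward":  8.0, "sft_mean_reward": 11.13},
    "hard":   {"base_mean_reward":  2.0, "sft_mean_reward": 12.17},
}

# Delta of SFT over base, per task, rounded to two decimals.
improvement_sft_over_base = {
    task: round(m["sft_mean_reward"] - m["base_mean_reward"], 2)
    for task, m in summary.items()
}
print(improvement_sft_over_base)
# {'easy': -1.8, 'medium': 3.13, 'hard': 10.17}
```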
### Reward & Training Pipeline (10%)
- Reward logic is coherent: a rubric engine with module-level constants and unit tests (`tests/test_reward.py`).
- Training pipeline genuinely connects to the running environment: no static dataset; rollouts are collected from a live `IncidentCommandCenterEnvironment`.
- SFT checkpoint is saved to `artifacts/sft_model/` and reloaded for 4-policy evaluation, closing the loop.
- 21 unit + integration tests passing (`tests/test_reward.py`, `tests/test_incidents.py`, `tests/test_environment.py`).
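The kind of check `tests/test_reward.py` performs can be illustrated with a toy rubric: assert that a named component such as `wrong_actor_penalty` fires when an action comes from a disallowed role. Everything below is a hypothetical stand-in, not the repo's actual implementation:

```python
# Toy stand-in for the rubric engine; the real logic and constants
# live in server/domain/reward.py. Values here are invented.
WRONG_ACTOR_PENALTY = -2.0  # hypothetical module-level constant

def score_action(actor: str, allowed_actors: set[str]) -> dict[str, float]:
    """Return named reward components for a single action."""
    components: dict[str, float] = {}
    if actor not in allowed_actors:
        components["wrong_actor_penalty"] = WRONG_ACTOR_PENALTY
    return components

# pytest-style unit tests
def test_wrong_actor_is_penalised():
    comps = score_action("triage_agent", {"ops_manager_agent"})
    assert comps["wrong_actor_penalty"] == WRONG_ACTOR_PENALTY

def test_allowed_actor_no_penalty():
    assert score_action("ops_manager_agent", {"ops_manager_agent"}) == {}
```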
## Engineering table-stakes
- Uses the OpenEnv `Environment` base class properly.
- Clean client/server separation: the client only uses Pydantic models + HTTP (`client.py`).
- Gym-style `reset` / `step` / `state` + OpenEnv `/close`.
- Valid `openenv.yaml` manifest (version 3.0).
- No reserved MCP tool names.
- Structured JSON logging with per-episode seeded RNG (`server/logging_utils.py`).
- Health / version / env-info / metrics endpoints (`/healthz`, `/version`, `/env-info`, `/metrics`).
- Static `/artifacts` mount so the Space serves its own plots, with no external hotlinking.
- Dockerfile with `HEALTHCHECK` (`Dockerfile`, `server/Dockerfile`).
- `pytest` passes cleanly: 21 / 21.
- `.dockerignore` keeps the image slim (excludes the `sft_model/` checkpoint, keeps the evidence plots).
- `pre_validate.sh` + `validate-submission.sh` for one-command pre-submission smoke tests.
- LICENSE (MIT) in the repo root.
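Per-episode seeded RNG, as listed above, is usually done by deriving an independent `random.Random` stream from a base seed plus the episode id so every rollout is reproducible. A minimal stdlib sketch of that pattern (the helper name and mixing formula are hypothetical, not the repo's actual code):

```python
import random

def episode_rng(base_seed: int, episode_id: int) -> random.Random:
    """Derive a reproducible per-episode RNG (hypothetical helper)."""
    # Mix the episode id into the base seed so each episode gets an
    # independent but fully reproducible random stream.
    return random.Random(base_seed * 1_000_003 + episode_id)

# Same (seed, episode) pair -> identical stream, every run.
a = episode_rng(42, 0)
b = episode_rng(42, 0)
assert [a.random() for _ in range(3)] == [b.random() for _ in range(3)]
```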
## Final submission steps
| # | Step | Status |
|---|---|---|
| 1 | Final training run (Qwen2.5-1.5B, 8 eps/task, 3 epochs): all artifacts committed | ✅ |
| 2 | Commit artifacts (`reward_curve.png`, `training_curve.png`, `reward_components.png`, `reward_curve_qwen0p5b.png`, `training_log.json`, `summary_metrics.json`, `summary_metrics_qwen0p5b.json`) | ✅ |
| 3 | Update README with real numbers + real Space / Colab / GitHub / blog links | ✅ |
| 4 | Deploy HF Space from the same commit | ✅ |
| 5 | Dashboard upgraded: hero story panel, 4 stacked plots, resources grid with README / blog / checklist links | ✅ |
| 6 | Blog post updated (`docs/BLOG_POST.md`) with fixed image paths (raw GitHub URLs) and a 0.5B ablation section | ✅ |
| 7 | All 21 tests passing on the latest commit | ✅ |
| 8 | Run `openenv validate` remotely against the Space: `./validate-submission.sh <space-url>` | ✅ |
| 9 | Submit the Space URL in the hackathon form: https://swapnilpatil28-multi-agent-incident-command-center.hf.space | ✅ |
| 10 | Do not push commits after the submission deadline: post-deadline commits won't be considered | ✅ |
## Pre-submission smoke test (copy-paste)
```bash
# 1. HF Space is serving
curl -fsS https://swapnilpatil28-multi-agent-incident-command-center.hf.space/healthz

# 2. Env-info endpoint advertises metadata
curl -s https://swapnilpatil28-multi-agent-incident-command-center.hf.space/env-info

# 3. OpenEnv validator passes remotely
./validate-submission.sh https://swapnilpatil28-multi-agent-incident-command-center.hf.space

# 4. A remote episode works
ENV_URL=https://swapnilpatil28-multi-agent-incident-command-center.hf.space python inference.py
```
## Where the judges will find each artefact
| Artefact | Primary URL |
|---|---|
| Live environment (OpenEnv-compatible) | swapnilpatil28-multi-agent-incident-command-center.hf.space |
| Hugging Face Space page | Space page → |
| GitHub repository | GitHub → |
| README (Part 1 story + Part 2 deep-dive) | README.md |
| Mini blog post (MD file in the repo, renders on both HF Space and GitHub) | docs/BLOG_POST.md |
| Reproducible training notebook | Colab → |
| Training evidence (all 4 plots + JSON metrics) | artifacts/ folder |