VERGIL — Commitment Dependency Graph Engine
OpenEnv Hackathon · India 2026 — Submission
An OpenEnv-compatible environment that teaches an LLM to manage a graph of interlocking real-world commitments under partial observability, capacity limits and stakeholder trust dynamics.
Themes addressed: #2 (Super) Long-Horizon Planning & Instruction Following · #3.2 Personalized Tasks (Executive Assistant)
Quick links (judges start here)
| Material | Link |
|---|---|
| Live demo (this Space) | https://huggingface.co/spaces/thekrishdshah/vergil-sota-trainer |
| Training dashboard | https://huggingface.co/spaces/thekrishdshah/vergil-sota-trainer/training |
| Trained model + plots + logs | https://huggingface.co/thekrishdshah/vergil-sota-trainer |
| Combined training curves PNG | https://huggingface.co/thekrishdshah/vergil-sota-trainer/blob/main/plots/training_curves.png |
| Eval comparison plots | https://huggingface.co/thekrishdshah/vergil-sota-trainer/tree/main/eval/eval_compare/plots |
| Source code (GitHub) | https://github.com/krishdshah/vergil |
| Reward function | vergil/agent/rewards.py |
| Training script (Colab-runnable) | scripts/train_vergil_sota.py |
| Eval harness | scripts/eval_vergil.py |
| OpenEnv manifest | openenv.yaml |
(Blog / 2-min video links — drop them here once recorded.)
Submission checklist (hackathon minimums)
- Uses OpenEnv: env subclasses `gymnasium.Env`, exposes `reset` / `step` / `state`; `openenv.yaml` declared; no reserved tool names used. POMDP wrapper in `vergil/core/pomdp.py`.
- Training script using HF TRL: `scripts/train_vergil_sota.py` (TRL `SFTTrainer` + `GRPOTrainer`, runnable in Colab or HF Jobs).
- Hosted on Hugging Face Spaces: this Space (`thekrishdshah/vergil-sota-trainer`).
- Trained, with reward + loss plots from a real run: pushed to the model repo; also embedded on the `/training` dashboard.
- README explains problem, env, results: this file + `/training`.
- Mini-blog or < 2-min video: attach link here.
1. The problem we're targeting
Most RL-for-LLM environments score each task in isolation. Real personal assistants (and most professional schedulers) live in a commitment dependency graph (CDG) where:
- accepting a task changes the feasibility of every other task,
- promises break when a prerequisite slips, cascading through downstream edges,
- counter-proposing a deadline can save trust at the cost of completion-rate,
- and "do nothing" is sometimes the optimal action — but only when the schedule is genuinely blocked.
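The cascade described in the second bullet can be illustrated with a minimal sketch. This is not VERGIL's CDG implementation: it assumes the graph is a plain dict mapping each commitment to the commitments that depend on it, and the node names and the `downstream_of` helper are hypothetical.

```python
from collections import deque


def downstream_of(graph: dict, slipped: str) -> set:
    """Return every commitment reachable from `slipped` via dependency edges."""
    at_risk = set()
    queue = deque(graph.get(slipped, []))
    while queue:
        node = queue.popleft()
        if node in at_risk:
            continue  # already visited; also keeps cyclic graphs from looping
        at_risk.add(node)
        queue.extend(graph.get(node, []))
    return at_risk


# Hypothetical toy graph: an edge A -> B means "B needs A to finish first".
toy_graph = {
    "draft_report": ["review_report", "send_slides"],
    "review_report": ["board_meeting"],
    "send_slides": ["board_meeting"],
    "board_meeting": [],
    "dinner": [],
}

at_risk = downstream_of(toy_graph, "draft_report")
print(sorted(at_risk))
```

If `draft_report` slips, all three downstream commitments are put at risk, while an isolated node such as `dinner` affects nothing else.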
VERGIL captures these tensions in an OpenEnv-compatible Gym-style env. The agent's observation is a partially-observable view of the CDG plus a trust score per stakeholder. Its action space is
{accept, decline, counter_propose, do_nothing} × node_id
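As a rough sketch of that product space, the snippet below enumerates one action per (action type, node) pair. The class and function names are hypothetical, and restricting `node_id` to the pending commitments is an assumption taken from the format check described in the reward section, not from the env code.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class ActionType(str, Enum):
    ACCEPT = "accept"
    DECLINE = "decline"
    COUNTER_PROPOSE = "counter_propose"
    DO_NOTHING = "do_nothing"


@dataclass(frozen=True)
class Action:
    kind: ActionType
    node_id: str


def enumerate_actions(pending: list) -> list:
    """Cartesian product of the four action types and the pending node ids."""
    return [Action(kind, node) for kind, node in product(ActionType, pending)]


actions = enumerate_actions(["dinner", "vendor_call"])
print(len(actions))  # 4 action types x 2 pending nodes
```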
2. Why this maps to two hackathon themes
Theme #2: Long-horizon planning. Each commitment has prerequisites, deadlines and durations that ripple downstream. The agent must reason about trajectories tens of steps long, where a single bad accept early on cascades through the graph and tanks fulfillment many steps later. The reward is sparse in time (final fulfillment) but rich in shape (trust deltas, feasibility), so the task demands exactly the deep, multi-step reasoning under sparse or delayed rewards that the theme calls out.
Theme #3.2: Personalized tasks. The agent acts as your over-committed self, managing real-world delegations: dinner conflicts, work overlap, vendor deadlines. We embed it as a backend so that it could plug into a real executive-assistant (EA) product.
3. Reward design (10% of judging)
The v1 model collapsed to always-accept because the original reward was
trivially gameable. The v2 reward in
vergil/agent/rewards.py adds three correctives:
R(s, a) = R_env(s, a) # honest signal
+ λ_fmt · 1[parseable JSON, valid action label, target ∈ pending]
+ λ_cap · CapacityPressure(s, a) # shaping
+ λ_div · GroupDiversity(a; group) # anti-collapse
- Format penalty: agent's response must be parseable JSON with a valid action label and a `target` ∈ the PENDING set.
- Capacity-pressure shaping: pushes `decline`/`counter_propose` when accepting would break the calendar's 85% buffer.
- Group-diversity bonus: within each GRPO group of N rollouts of the same prompt, under-represented actions get a small bonus so the advantage estimator can't lock in a degenerate policy.
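A minimal sketch of how the three correctives could combine is shown below. It is not the code in `vergil/agent/rewards.py`: the λ weights, the `"action"` JSON key, and the exact forms of the capacity and diversity terms are all assumptions chosen for illustration.

```python
import json
from collections import Counter

VALID_ACTIONS = {"accept", "decline", "counter_propose", "do_nothing"}
# Hypothetical weights; the real values live in vergil/agent/rewards.py.
LAMBDA_FMT, LAMBDA_CAP, LAMBDA_DIV = 0.2, 0.3, 0.1
CAPACITY_BUFFER = 0.85  # the 85% calendar buffer


def parse_action(response, pending):
    """Return the action label if the response passes the format check, else None."""
    try:
        data = json.loads(response)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(data, dict):
        return None
    action, target = data.get("action"), data.get("target")
    if not isinstance(action, str) or action not in VALID_ACTIONS:
        return None
    if not isinstance(target, str) or target not in pending:
        return None
    return action


def capacity_pressure(action, load_after_accept):
    """Assumed form: +1 for protecting the buffer, -1 for breaking it."""
    if load_after_accept <= CAPACITY_BUFFER:
        return 0.0
    if action in ("decline", "counter_propose"):
        return 1.0
    return -1.0 if action == "accept" else 0.0


def group_diversity(action, group_actions):
    """Assumed form: rarer actions within the GRPO group earn a larger bonus."""
    if not group_actions:
        return 0.0
    return 1.0 - Counter(group_actions)[action] / len(group_actions)


def shaped_reward(r_env, response, pending, load_after_accept, group_actions):
    action = parse_action(response, pending)
    if action is None:
        return r_env  # format check failed: no bonus, no shaping terms
    return (r_env + LAMBDA_FMT
            + LAMBDA_CAP * capacity_pressure(action, load_after_accept)
            + LAMBDA_DIV * group_diversity(action, group_actions))


group = ["accept", "accept", "accept", "decline"]
decline_r = shaped_reward(0.0, '{"action": "decline", "target": "dinner"}',
                          {"dinner"}, 0.95, group)
accept_r = shaped_reward(0.0, '{"action": "accept", "target": "dinner"}',
                         {"dinner"}, 0.95, group)
print(decline_r > accept_r)
```

With the calendar at 95% load and three of four rollouts choosing `accept`, the lone `decline` scores higher than `accept` under these assumed weights, which is the direction the shaping is meant to push.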
4. Training pipeline (single-L4 on HF Jobs)
| Phase | Method | Purpose |
|---|---|---|
| A | SFT (LoRA r=32, α=64) on expert-oracle data | Non-degenerate prior over all 4 actions |
| B | GRPO with the hardened reward above | Refine the policy under capacity pressure |
| C | Eval on 12 hand-crafted scenarios + 8 curriculum episodes | Heuristic baseline vs. trained, same RNG |
| D | Push everything to the model repo | Adapter + plots + logs + tensorboard + eval |
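Phase B depends on GRPO scoring each rollout relative to the other rollouts of the same prompt. The sketch below shows that group normalization (mean-centred, scaled by the group standard deviation) using only the standard library. It is an illustration, not TRL's implementation; the use of the population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A collapsed group: every rollout earned the same reward.
collapsed = group_advantages([0.5, 0.5, 0.5, 0.5])
# A diverse group: one rollout stands out from the rest.
mixed = group_advantages([0.575, -0.075, -0.075, -0.075])

print(collapsed)
print(mixed[0] > 0, mixed[1] < 0)
```

When every rollout in a group receives the same reward, all advantages are zero and the policy gets no gradient signal from that prompt. This is why a non-degenerate SFT prior and the group-diversity bonus matter for the GRPO phase.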
The training script is scripts/train_vergil_sota.py.
The job entrypoint is scripts/hf_job_train.sh.
Re-launch with:
# from a clean checkout, with $HF_TOKEN exported
python scripts/hf_jobs_launch.py \
--flavor l4x1 --skip-eval 0 \
--grpo-steps 80 --num-generations 4 --max-completion 256
The launcher auto-detects 1× vs. 4× L4 — multi-GPU uses
accelerate launch with configs/accelerate_4xL4.yaml.
5. Showing improvement (20% of judging)
All plots and metrics are pushed to the model repo at the end of every
training run. They are mirrored on this Space's /training
dashboard, which auto-refreshes every 60 s while a job is active:
- Combined training curves (`plots/training_curves.png`): SFT loss + GRPO reward + reward-component decomposition + action-distribution share over time, all on one image. The single image to start with.
- 6-panel GRPO dashboard (`plots/grpo_dashboard.png`): mean reward, policy loss, KL, learning rate, components, action share.
- SFT loss / GRPO reward / GRPO KL as separate close-ups.
- Eval comparison plots (`eval/eval_compare/plots/`): trained vs. heuristic, side-by-side: per-scenario cumulative reward, action distribution, schedule-satisfiability curve on the `simultaneous_infeasibility` scenario.
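Two of the quantities in the eval comparison, action share and cumulative reward, are simple to compute from episode logs. The sketch below shows one way to do it; the log format, helper names, and every number are hypothetical and are not results from the real run.

```python
from collections import Counter
from itertools import accumulate


def action_share(actions):
    """Fraction of steps spent on each action label."""
    counts = Counter(actions)
    total = len(actions)
    return {name: count / total for name, count in counts.items()}


def cumulative_reward(step_rewards):
    """Running total of per-step rewards, one entry per step."""
    return list(accumulate(step_rewards))


# Hypothetical episode logs, for illustration only.
heuristic_actions = ["accept", "accept", "accept", "accept"]
trained_actions = ["accept", "decline", "counter_propose", "accept"]

heuristic_share = action_share(heuristic_actions)
trained_share = action_share(trained_actions)
curve = cumulative_reward([1.0, -0.5, 2.0])
print(heuristic_share, trained_share, curve)
```

A collapsed policy shows up as a single action holding a share of 1.0, which is the pattern the action-distribution plots are meant to rule out.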
6. How the env works (judge-friendly walkthrough)
The interactive demo on this Space lets a judge:
- Pick one of 12 hand-crafted scenarios (or a fresh curriculum draw).
- Watch the trained agent decide, with full reasoning visible.
- Toggle between the trained LoRA agent and the heuristic baseline to confirm the policy actually learned something non-trivial.
- Inspect trust scores, capacity pressure, and the live CDG.
Try it locally:
git clone https://huggingface.co/spaces/thekrishdshah/vergil-sota-trainer vergil
cd vergil
pip install -r requirements-space.txt
VERGIL_MODEL_PATH=thekrishdshah/vergil-sota-trainer python app.py
# open http://localhost:7860
7. OpenEnv compliance
- Env subclasses `gymnasium.Env`, exposes `reset` / `step` / `state` cleanly.
- POMDP wrapper in `vergil/core/pomdp.py` produces partial observations.
- Reward returned as scalar `float` per step; rich diagnostics in `info`.
- `openenv.yaml` declares the env; the demo Space is the discoverable URL.
- No reserved tool names used.
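The interface shape in the list above can be sketched with a dependency-free toy class. It deliberately does not subclass `gymnasium.Env` (the real env does), and the class name, observation fields, step cap, and toy reward rule are all hypothetical. Only the Gymnasium-style return signatures of `reset` and `step` are carried over.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyCommitmentEnv:
    """Stand-in for illustration; the real env subclasses gymnasium.Env."""
    pending: list = field(default_factory=lambda: ["dinner", "vendor_call"])
    trust: dict = field(default_factory=lambda: {"alice": 1.0})
    _steps: int = 0

    def reset(self, seed: Optional[int] = None):
        """Gymnasium-style reset: returns (observation, info)."""
        self.pending = ["dinner", "vendor_call"]
        self.trust = {"alice": 1.0}
        self._steps = 0
        return self._observe(), {}

    def step(self, action: dict):
        """Returns (observation, reward, terminated, truncated, info)."""
        self._steps += 1
        reward = 0.0
        target = action.get("target")
        if target in self.pending and action.get("action") != "do_nothing":
            self.pending.remove(target)
            # Toy rule, assumed: accepting pays more than declining.
            reward = 1.0 if action.get("action") == "accept" else 0.5
        terminated = not self.pending
        truncated = self._steps >= 10
        info = {"steps": self._steps, "pending_left": len(self.pending)}
        return self._observe(), float(reward), terminated, truncated, info

    def state(self) -> dict:
        """Full (unmasked) state, as opposed to the partial observation."""
        return {"pending": list(self.pending), "trust": dict(self.trust),
                "steps": self._steps}

    def _observe(self) -> dict:
        # Partial view: only the next pending item plus trust scores.
        return {"visible": self.pending[:1], "trust": dict(self.trust)}


env = ToyCommitmentEnv()
obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(
    {"action": "accept", "target": "dinner"})
print(reward, terminated, info)
```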
8. Repository layout
vergil/ # the env + agent code
  core/ # CDG, env, POMDP wrapper, types
  agent/ # prompt formatting, reward function (key file)
  curriculum/ # 4-stage curriculum + failure-topology DB
  api/ # FastAPI server (powers this Space)
scripts/
  sft_data_generator.py # expert-oracle rollouts → SFT data
  train_vergil_sota.py # SFT + GRPO + push (TRL)
  eval_vergil.py # heuristic vs. trained eval harness
  hf_job_train.sh # job entrypoint (used by HF Jobs)
  hf_jobs_launch.py # local-side job submitter
scenarios/ # 12 hand-crafted JSON scenarios
configs/ # accelerate config (4× L4 DDP)
frontend/ # built React UI (served by FastAPI)
frontend-react/ # React source
app.py # demo Space entrypoint
Dockerfile # demo Space container
License
Apache-2.0.