VERGIL — Commitment Dependency Graph Engine

OpenEnv Hackathon · India 2026 — Submission

An OpenEnv-compatible environment that teaches an LLM to manage a graph of interlocking real-world commitments under partial observability, capacity limits and stakeholder trust dynamics.

Themes addressed: #2 (Super) Long-Horizon Planning & Instruction Following  ·  #3.2 Personalized Tasks (Executive Assistant)

Quick links (judges start here)

(Blog / 2-min video links — drop them here once recorded.)

Submission checklist (hackathon minimums)

  • Uses OpenEnv — env subclasses gymnasium.Env, exposes reset / step / state; openenv.yaml declared; no reserved tool names used. POMDP wrapper in vergil/core/pomdp.py.
  • Training script using HF TRL: scripts/train_vergil_sota.py (TRL SFTTrainer + GRPOTrainer, runnable in Colab or HF Jobs).
  • Hosted on Hugging Face Spaces — this Space (thekrishdshah/vergil-sota-trainer).
  • Trained, with reward + loss plots from a real run — pushed to the model repo; also embedded on the /training dashboard.
  • README explains problem, env, results — this file + /training.
  • Mini-blog or < 2-min video: attach link here.

1. The problem we're targeting

Most RL-for-LLM environments score each task in isolation. Real personal assistants (and most professional schedulers) live in a commitment dependency graph (CDG) where:

  • accepting a task changes the feasibility of every other task,
  • promises break when a prerequisite slips, cascading through downstream edges,
  • counter-proposing a deadline can save trust at the cost of completion-rate,
  • and "do nothing" is sometimes the optimal action — but only when the schedule is genuinely blocked.
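The cascade described in the second bullet can be pictured as a downstream traversal of the graph. The following is a minimal sketch, not VERGIL's actual CDG code: the `edges` mapping (prerequisite to dependents) and the `downstream_of` helper are hypothetical names chosen for illustration.

```python
from collections import deque


def downstream_of(edges: dict[str, list[str]], slipped: str) -> list[str]:
    """Return every commitment reachable from `slipped`, in BFS order.

    `edges` maps a prerequisite to the commitments that depend on it
    (an assumed representation, used only for this sketch).
    """
    seen: set[str] = set()
    order: list[str] = []
    queue = deque(edges.get(slipped, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        queue.extend(edges.get(node, []))
    return order


# Toy graph: the report feeds the review, which feeds a launch and a client call.
edges = {
    "draft_report": ["team_review"],
    "team_review": ["product_launch", "client_call"],
    "product_launch": ["press_briefing"],
}
at_risk = downstream_of(edges, "draft_report")
```

If `draft_report` slips, every node in `at_risk` is a promise that may break, which is why a single early accept can be expensive many steps later.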

VERGIL captures these tensions in an OpenEnv-compatible Gym-style env. The agent's observation is a partially observable view of the CDG plus a trust score per stakeholder. Its action space is:

{accept, decline, counter_propose, do_nothing} × node_id
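As a rough illustration, the product above can be enumerated directly. This sketch assumes node ids are strings and that only pending nodes are valid targets (consistent with the format check in the reward section); `ActionType`, `Action` and `legal_actions` are hypothetical names, not identifiers from the VERGIL codebase.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class ActionType(str, Enum):
    ACCEPT = "accept"
    DECLINE = "decline"
    COUNTER_PROPOSE = "counter_propose"
    DO_NOTHING = "do_nothing"


@dataclass(frozen=True)
class Action:
    kind: ActionType
    node_id: str  # assumed to be a string id of a pending commitment


def legal_actions(pending: list[str]) -> list[Action]:
    """Cartesian product of the four action types with the pending node ids."""
    return [Action(kind, node) for kind, node in product(ActionType, pending)]


actions = legal_actions(["n1", "n2"])  # 4 action types x 2 nodes
```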

2. Why this maps to two hackathon themes

Theme #2 — Long-horizon planning. Each commitment has prerequisites, deadlines and durations that ripple downstream. The agent must reason about trajectories tens of steps long, where a single bad accept early on cascades through the graph and tanks fulfillment many steps later. The reward is sparse in time (final fulfillment) but rich in shape (trust deltas, feasibility) — exactly the deep, multi-step reasoning with sparse/delayed rewards the theme calls out.

Theme #3.2 — Personalized tasks. The agent is your over-committed self, managing real-world delegations: dinner conflicts, work overlap, vendor deadlines. We embed it as a backend so it could plug into a real EA-style product.

3. Reward design (10% of judging)

The v1 model collapsed to always-accept because the original reward was trivially gameable. The v2 reward in vergil/agent/rewards.py adds three correctives:

R(s, a) = R_env(s, a)                                     # honest signal
        + λ_fmt · 1[parseable JSON, valid action label, target ∈ pending]
        + λ_cap · CapacityPressure(s, a)                  # shaping
        + λ_div · GroupDiversity(a; group)                # anti-collapse
  • Format penalty — agent's response must be parseable JSON with a valid action label and a target ∈ the PENDING set.
  • Capacity-pressure shaping — pushes decline / counter_propose when accepting would break the calendar's 85% buffer.
  • Group-diversity bonus — within each GRPO group of N rollouts of the same prompt, under-represented actions get a small bonus so the advantage estimator can't lock in a degenerate policy.
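The three terms can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the contents of vergil/agent/rewards.py: the JSON keys `action` and `target`, the reading of the 85% buffer as a utilization threshold, the ±1 shaping values, the rarity-based bonus, and the λ weights are all placeholders.

```python
import json
from collections import Counter

VALID_ACTIONS = {"accept", "decline", "counter_propose", "do_nothing"}


def parse_action(response, pending):
    """Return (action, target) if the response is well-formed, else None."""
    try:
        data = json.loads(response)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(data, dict):
        return None
    action, target = data.get("action"), data.get("target")
    if not isinstance(action, str) or not isinstance(target, str):
        return None
    if action not in VALID_ACTIONS or target not in pending:
        return None
    return action, target


def capacity_pressure(action, load_after_accept, buffer=0.85):
    """Shaping that only fires when accepting would exceed the buffer."""
    if load_after_accept <= buffer:
        return 0.0
    if action in ("decline", "counter_propose"):
        return 1.0
    return -1.0 if action == "accept" else 0.0


def group_diversity(action, group_actions):
    """Bonus that grows as `action` gets rarer within its GRPO group."""
    if not group_actions:
        return 0.0
    return 1.0 - Counter(group_actions)[action] / len(group_actions)


def total_reward(r_env, response, pending, load_after_accept, group_actions,
                 lam_fmt=0.5, lam_cap=0.3, lam_div=0.1):
    parsed = parse_action(response, pending)
    if parsed is None:
        return r_env  # malformed output earns no format credit and no shaping
    action, _ = parsed
    return (r_env
            + lam_fmt * 1.0
            + lam_cap * capacity_pressure(action, load_after_accept)
            + lam_div * group_diversity(action, group_actions))


group = ["accept", "accept", "accept", "decline"]
r = total_reward(1.0, '{"action": "decline", "target": "n2"}',
                 {"n1", "n2"}, 0.92, group)
```

In this toy group, the lone `decline` earns a larger diversity bonus than any of the three `accept` rollouts, so a group that has collapsed to one action still yields a non-zero advantage signal for the minority action.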

4. Training pipeline (single-L4 on HF Jobs)

Phase | Method                                                    | Purpose
------|-----------------------------------------------------------|-------------------------------------------
A     | SFT (LoRA r=32, α=64) on expert-oracle data               | Non-degenerate prior over all 4 actions
B     | GRPO with the hardened reward above                       | Refine the policy under capacity pressure
C     | Eval on 12 hand-crafted scenarios + 8 curriculum episodes | Heuristic baseline vs. trained, same RNG
D     | Push everything to the model repo                         | Adapter + plots + logs + tensorboard + eval

The training script is scripts/train_vergil_sota.py. The job entrypoint is scripts/hf_job_train.sh. Re-launch with:

# from a clean checkout, with $HF_TOKEN exported
python scripts/hf_jobs_launch.py \
  --flavor l4x1 --skip-eval 0 \
  --grpo-steps 80 --num-generations 4 --max-completion 256

The launcher auto-detects 1× vs. 4× L4 — multi-GPU uses accelerate launch with configs/accelerate_4xL4.yaml.
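The single- vs. multi-GPU branch could look roughly like the sketch below. How the real launcher detects the GPU count is not shown here, so `num_gpus` is simply passed in, and `build_train_command` is a hypothetical helper. The `--config_file` and `--num_processes` flags are standard `accelerate launch` options.

```python
def build_train_command(num_gpus: int,
                        script: str = "scripts/train_vergil_sota.py") -> list[str]:
    """Pick a plain Python run for one GPU, `accelerate launch` otherwise."""
    if num_gpus < 1:
        raise ValueError("at least one GPU is required")
    if num_gpus == 1:
        return ["python", script]
    return [
        "accelerate", "launch",
        "--config_file", "configs/accelerate_4xL4.yaml",
        "--num_processes", str(num_gpus),
        script,
    ]


single = build_train_command(1)
multi = build_train_command(4)
```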

5. Showing improvement (20% of judging)

All plots and metrics are pushed to the model repo at the end of every training run. They are mirrored on this Space's /training dashboard, which auto-refreshes every 60 s while a job is active:

  • Combined training curves (plots/training_curves.png) — SFT loss + GRPO reward + reward-component decomposition + action-distribution share over time, all on one image. The single image to start with.
  • 6-panel GRPO dashboard (plots/grpo_dashboard.png) — mean reward, policy loss, KL, learning rate, components, action share.
  • SFT loss / GRPO reward / GRPO KL as separate close-ups.
  • Eval comparison plots (eval/eval_compare/plots/) — trained vs. heuristic, side-by-side: per-scenario cumulative reward, action distribution, schedule-satisfiability curve on the simultaneous_infeasibility scenario.
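One way to keep a trained-vs-heuristic comparison fair is to replay identical seeds for both policies, so each pair of episodes sees the same random draws. The sketch below uses a toy episode and toy policies purely for illustration; it is not scripts/eval_vergil.py, and `run_episode`, `compare` and the reward rule are assumptions.

```python
import random
from typing import Callable, Iterable

Policy = Callable[[float], str]  # maps current calendar load to an action label


def run_episode(policy: Policy, seed: int, steps: int = 20) -> float:
    """Toy episode: accepting pays off only while load stays within 0.85."""
    rng = random.Random(seed)  # per-episode RNG, so both policies see the same loads
    total = 0.0
    for _ in range(steps):
        load = rng.random()
        if policy(load) == "accept":
            total += 1.0 if load <= 0.85 else -1.0
    return total


def compare(a: Policy, b: Policy, seeds: Iterable[int]) -> list[tuple[float, float]]:
    """Return paired cumulative rewards (a, b), one pair per seed."""
    return [(run_episode(a, s), run_episode(b, s)) for s in seeds]


def always_accept(load: float) -> str:
    return "accept"


def capacity_aware(load: float) -> str:
    return "accept" if load <= 0.85 else "decline"


pairs = compare(always_accept, capacity_aware, seeds=range(8))
```

Because the seeds are shared, any gap within a pair reflects the policies rather than luck in the scenario draw.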

6. How the env works (judge-friendly walkthrough)

The interactive demo on this Space lets a judge:

  1. Pick one of 12 hand-crafted scenarios (or a fresh curriculum draw).
  2. Watch the trained agent decide, with full reasoning visible.
  3. Toggle between the trained LoRA agent and the heuristic baseline to confirm the policy actually learned something non-trivial.
  4. Inspect trust scores, capacity pressure, and the live CDG.

Try it locally:

git clone https://huggingface.co/spaces/thekrishdshah/vergil-sota-trainer vergil
cd vergil
pip install -r requirements-space.txt
VERGIL_MODEL_PATH=thekrishdshah/vergil-sota-trainer python app.py
# open http://localhost:7860

7. OpenEnv compliance

  • Env subclasses gymnasium.Env, exposes reset / step / state cleanly.
  • POMDP wrapper in vergil/core/pomdp.py produces partial observations.
  • Reward returned as scalar float per step; rich diagnostics in info.
  • openenv.yaml declares the env; the demo Space is the discoverable URL.
  • No reserved tool names used.
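The interface described above can be illustrated with a dependency-free stand-in. The real env subclasses gymnasium.Env; this toy class only mirrors the reset / step / state surface and the Gymnasium return shapes (`reset` returns `(obs, info)`, `step` returns `(obs, reward, terminated, truncated, info)`). The class name, observation fields and reward rule are hypothetical.

```python
class ToyCommitmentEnv:
    """Stand-in that mirrors the reset / step / state surface (not the real env)."""

    def __init__(self, pending: list[str], max_steps: int = 10):
        self._initial = list(pending)
        self.max_steps = max_steps
        self.reset()

    def reset(self, seed=None):
        self._pending = list(self._initial)
        self._accepted: list[str] = []
        self._t = 0
        return self._observe(), {}

    def state(self) -> dict:
        """Full internal state, as opposed to the partial observation."""
        return {"pending": list(self._pending),
                "accepted": list(self._accepted),
                "t": self._t}

    def _observe(self) -> dict:
        # Partial view: the agent sees pending items, not the bookkeeping.
        return {"pending": list(self._pending)}

    def step(self, action: dict):
        self._t += 1
        kind, target = action.get("action"), action.get("target")
        reward = 0.0
        if kind in ("accept", "decline") and target in self._pending:
            self._pending.remove(target)
            if kind == "accept":
                self._accepted.append(target)
                reward = 1.0  # toy reward, chosen only for this sketch
        terminated = not self._pending
        truncated = self._t >= self.max_steps and not terminated
        info = {"action": kind, "target": target, "remaining": len(self._pending)}
        return self._observe(), float(reward), terminated, truncated, info


env = ToyCommitmentEnv(["n1", "n2"])
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step({"action": "accept", "target": "n1"})
```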

8. Repository layout

vergil/                 # the env + agent code
  core/                 # CDG, env, POMDP wrapper, types
  agent/                # prompt formatting, reward function (key file)
  curriculum/           # 4-stage curriculum + failure-topology DB
  api/                  # FastAPI server (powers this Space)
scripts/
  sft_data_generator.py # expert-oracle rollouts → SFT data
  train_vergil_sota.py  # SFT + GRPO + push (TRL)
  eval_vergil.py        # heuristic vs. trained eval harness
  hf_job_train.sh       # job entrypoint (used by HF Jobs)
  hf_jobs_launch.py     # local-side job submitter
scenarios/              # 12 hand-crafted JSON scenarios
configs/                # accelerate config (4× L4 DDP)
frontend/               # built React UI (served by FastAPI)
frontend-react/         # React source
app.py                  # demo Space entrypoint
Dockerfile              # demo Space container

License

Apache-2.0.
