VERGIL — Commitment Dependency Graph Engine

OpenEnv Hackathon · India 2026 — Submission

An OpenEnv-compatible environment that teaches an LLM to manage a graph of interlocking real-world commitments under partial observability, capacity limits and stakeholder trust dynamics.

Themes addressed: #2 (Super) Long-Horizon Planning & Instruction Following · #3.2 Personalized Tasks (Executive Assistant)

Quick links (judges start here)

Material	Link
Live demo (this Space)	https://huggingface.co/spaces/thekrishdshah/vergil-sota-trainer
Training dashboard	https://huggingface.co/spaces/thekrishdshah/vergil-sota-trainer/training
Trained model + plots + logs	https://huggingface.co/thekrishdshah/vergil-sota-trainer
Combined training curves PNG	https://huggingface.co/thekrishdshah/vergil-sota-trainer/blob/main/plots/training_curves.png
Eval comparison plots	https://huggingface.co/thekrishdshah/vergil-sota-trainer/tree/main/eval/eval_compare/plots
Source code (GitHub)	https://github.com/krishdshah/vergil
Reward function	vergil/agent/rewards.py
Training script (Colab-runnable)	scripts/train_vergil_sota.py
Eval harness	scripts/eval_vergil.py
OpenEnv manifest	openenv.yaml

(Blog / 2-min video links — drop them here once recorded.)

Submission checklist (hackathon minimums)

Uses OpenEnv — env subclasses gymnasium.Env, exposes reset / step / state; openenv.yaml declared; no reserved tool names used. POMDP wrapper in vergil/core/pomdp.py.
Training script using HF TRL — scripts/train_vergil_sota.py (TRL SFTTrainer + GRPOTrainer, runnable in Colab or HF Jobs).
Hosted on Hugging Face Spaces — this Space (thekrishdshah/vergil-sota-trainer).
Trained, with reward + loss plots from a real run — pushed to the model repo; also embedded on the /training dashboard.
README explains problem, env, results — this file + /training.
Mini-blog or < 2-min video — attach link here.

1. The problem we're targeting

Most RL-for-LLM environments score each task in isolation. Real personal assistants (and most professional schedulers) live in a commitment dependency graph (CDG) where:

accepting a task changes the feasibility of every other task,
promises break when a prerequisite slips, cascading through downstream edges,
counter-proposing a deadline can save trust at the cost of completion rate,
and "do nothing" is sometimes the optimal action — but only when the schedule is genuinely blocked.
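
The cascade described in the second bullet can be pictured as a reachability walk over the graph. The sketch below is only an illustration, not VERGIL's implementation: the `dependents` edge map, the `cascade` helper and the toy commitment names are all hypothetical.

```python
from collections import deque


def cascade(dependents: dict[str, list[str]], slipped: str) -> set[str]:
    """Return every commitment put at risk when `slipped` misses its slot.

    `dependents` maps a commitment to the commitments that require it.
    This structure is an assumption made for illustration.
    """
    at_risk: set[str] = set()
    queue = deque([slipped])
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, []):
            if child not in at_risk:
                at_risk.add(child)
                queue.append(child)
    return at_risk


# Toy graph: the draft feeds the review, which feeds both the launch and the demo.
graph = {
    "draft": ["review"],
    "review": ["launch", "demo"],
    "dinner": [],
}
print(sorted(cascade(graph, "draft")))
```

A slip on `draft` reaches three downstream commitments here, while a slip on `dinner` reaches none, which is the asymmetry the agent has to weigh before accepting anything.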

VERGIL captures these tensions in an OpenEnv-compatible Gym-style env. The agent's observation is a partially observable view of the CDG plus a trust score per stakeholder. Its action space is:

{accept, decline, counter_propose, do_nothing} × node_id
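
One way to read the product above is as a typed action record. The following is a rough sketch under assumptions: the `ActionType`, `Action` and `legal_actions` names are hypothetical, and restricting targets to pending nodes is inferred from the reward's "target ∈ pending" check rather than taken from the env code.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class ActionType(str, Enum):
    ACCEPT = "accept"
    DECLINE = "decline"
    COUNTER_PROPOSE = "counter_propose"
    DO_NOTHING = "do_nothing"


@dataclass(frozen=True)
class Action:
    kind: ActionType
    node_id: str  # hypothetical field name for the targeted commitment


def legal_actions(pending: list[str]) -> list[Action]:
    """Enumerate the Cartesian product of action types and pending node ids."""
    return [Action(kind, node) for kind, node in product(ActionType, pending)]


actions = legal_actions(["n1", "n2"])
print(len(actions))  # 4 action types x 2 pending nodes
```

The action count grows linearly with the number of pending commitments, so the hard part is the ordering of decisions, not the width of any single step.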

2. Why this maps to two hackathon themes

Theme #2 — Long-horizon planning. Each commitment has prerequisites, deadlines and durations that ripple downstream. The agent must reason about trajectories tens of steps long, where a single bad accept early on cascades through the graph and tanks fulfillment many steps later. The reward is sparse in time (final fulfillment) but rich in shape (trust deltas, feasibility) — exactly the deep, multi-step reasoning with sparse/delayed rewards the theme calls out.

Theme #3.2 — Personalized tasks. The agent is your over-committed self, managing real-world delegations: dinner conflicts, work overlap, vendor deadlines. We embed it as a backend so it could plug into a real EA-style product.

3. Reward design (10% of judging)

The v1 model collapsed to always-accept because the original reward was trivially gameable. The v2 reward in vergil/agent/rewards.py adds three correctives:

R(s, a) = R_env(s, a)                                     # honest signal
        + λ_fmt · 1[parseable JSON, valid action label, target ∈ pending]
        + λ_cap · CapacityPressure(s, a)                  # shaping
        + λ_div · GroupDiversity(a; group)                # anti-collapse

Format penalty: the agent's response must be parseable JSON with a valid action label and a target in the PENDING set; a response that fails any of these checks forfeits the λ_fmt term.
Capacity-pressure shaping: pushes decline / counter_propose when accepting would break the calendar's 85% buffer.
Group-diversity bonus: within each GRPO group of N rollouts of the same prompt, under-represented actions get a small bonus so the advantage estimator can't lock in a degenerate policy.
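
The three terms can be sketched in a few lines of Python. This is not the code in vergil/agent/rewards.py. It makes several assumptions: the JSON field names `action` and `target`, the λ values, the ±1 form of the capacity term, the "1 minus share" form of the diversity bonus, and the choice to drop all shaping for malformed output.

```python
import json
from collections import Counter

VALID_ACTIONS = {"accept", "decline", "counter_propose", "do_nothing"}


def parse_action(response: str, pending: set[str]):
    """Return (action, target) if the response passes the format check, else None."""
    try:
        obj = json.loads(response)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    action, target = obj.get("action"), obj.get("target")
    if not isinstance(action, str) or not isinstance(target, str):
        return None
    if action not in VALID_ACTIONS or target not in pending:
        return None
    return action, target


def capacity_pressure(action: str, load_after_accept: float, buffer: float = 0.85) -> float:
    """Assumed shape: reward backing off, penalize accepting, once the buffer would break."""
    if load_after_accept <= buffer:
        return 0.0
    if action in ("decline", "counter_propose"):
        return 1.0
    return -1.0 if action == "accept" else 0.0


def group_diversity(action: str, group_actions: list[str]) -> float:
    """Assumed shape: the rarer the action within its GRPO group, the larger the bonus."""
    if not group_actions:
        return 0.0
    return 1.0 - Counter(group_actions)[action] / len(group_actions)


def shaped_reward(r_env, response, pending, load_after_accept, group_actions,
                  lam_fmt=0.1, lam_cap=0.2, lam_div=0.05):
    parsed = parse_action(response, pending)
    if parsed is None:
        return r_env  # malformed output earns none of the added terms
    action, _ = parsed
    return (r_env
            + lam_fmt
            + lam_cap * capacity_pressure(action, load_after_accept)
            + lam_div * group_diversity(action, group_actions))


group = ["accept", "accept", "accept", "decline"]
declined = shaped_reward(0.0, '{"action": "decline", "target": "n1"}', {"n1"}, 0.9, group)
accepted = shaped_reward(0.0, '{"action": "accept", "target": "n1"}', {"n1"}, 0.9, group)
print(declined > accepted)
```

With the calendar at 90% load and three of four rollouts accepting, the lone decline scores higher than the accepts under these assumed weights, which is the pressure against always-accept that the correctives are meant to apply.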

4. Training pipeline (single-L4 on HF Jobs)

Phase	Method	Purpose
A	SFT (LoRA r=32, α=64) on expert-oracle data	Non-degenerate prior over all 4 actions
B	GRPO with the hardened reward above	Refine the policy under capacity pressure
C	Eval on 12 hand-crafted scenarios + 8 curriculum episodes	Heuristic baseline vs. trained, same RNG
D	Push everything to the model repo	Adapter + plots + logs + tensorboard + eval
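
To see why Phase B needs some spread of rewards inside a group, here is a sketch of a group-relative advantage in its common textbook form (reward minus group mean, divided by group standard deviation). The `group_advantages` helper and the `eps` value are assumptions, and the exact normalization inside TRL's GRPOTrainer may differ.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize each rollout's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of the same prompt that all earn the same reward.
collapsed = group_advantages([1.0, 1.0, 1.0, 1.0])
# Same group, but one rollout is scored lower than the rest.
mixed = group_advantages([1.0, 1.0, 1.0, 0.2])

print(collapsed)
print([round(a, 3) for a in mixed])
```

When every rollout in a group earns the same reward, every advantage is zero and the group contributes no gradient signal, so a collapsed policy has nothing pushing it back out.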

The training script is scripts/train_vergil_sota.py. The job entrypoint is scripts/hf_job_train.sh. Re-launch with:

# from a clean checkout, with $HF_TOKEN exported
python scripts/hf_jobs_launch.py \
  --flavor l4x1 --skip-eval 0 \
  --grpo-steps 80 --num-generations 4 --max-completion 256

The launcher auto-detects 1× vs. 4× L4 — multi-GPU uses accelerate launch with configs/accelerate_4xL4.yaml.

5. Showing improvement (20% of judging)

All plots and metrics are pushed to the model repo at the end of every training run. They are mirrored on this Space's /training dashboard, which auto-refreshes every 60 s while a job is active:

Combined training curves (plots/training_curves.png) — SFT loss + GRPO reward + reward-component decomposition + action-distribution share over time, all on one image. The single image to start with.
6-panel GRPO dashboard (plots/grpo_dashboard.png) — mean reward, policy loss, KL, learning rate, components, action share.
SFT loss / GRPO reward / GRPO KL as separate close-ups.
Eval comparison plots (eval/eval_compare/plots/) — trained vs. heuristic, side-by-side: per-scenario cumulative reward, action distribution, schedule-satisfiability curve on the simultaneous_infeasibility scenario.
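
The "same RNG" comparison between trained and heuristic policies can be illustrated with a toy paired evaluation. Nothing below comes from scripts/eval_vergil.py: the `run_episode` and `compare` helpers, the toy "offers" task and both policies are hypothetical stand-ins.

```python
import random
from typing import Callable, Iterable

Policy = Callable[[list[float]], int]


def run_episode(policy: Policy, seed: int, steps: int = 8) -> float:
    """Play one toy episode; the seed alone determines what the policy is shown."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        offers = [rng.random() for _ in range(3)]
        total += offers[policy(offers)]
    return total


def compare(a: Policy, b: Policy, seeds: Iterable[int]) -> list[tuple[float, float]]:
    """Evaluate both policies on identical episodes, one pair per seed."""
    return [(run_episode(a, s), run_episode(b, s)) for s in seeds]


def first(offers: list[float]) -> int:
    return 0  # baseline stand-in: always take the first offer


def greedy(offers: list[float]) -> int:
    return offers.index(max(offers))  # "trained" stand-in: take the best offer


pairs = compare(first, greedy, seeds=range(8))
print(sum(g > f for f, g in pairs), "of", len(pairs), "episodes improved")
```

Because each policy sees exactly the same draws for a given seed, any per-episode difference is attributable to the policy rather than to scenario luck.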

6. How the env works (judge-friendly walkthrough)

The interactive demo on this Space lets a judge:

Pick one of 12 hand-crafted scenarios (or a fresh curriculum draw).
Watch the trained agent decide, with full reasoning visible.
Toggle between the trained LoRA agent and the heuristic baseline to confirm the policy actually learned something non-trivial.
Inspect trust scores, capacity pressure, and the live CDG.

Try it locally:

git clone https://huggingface.co/spaces/thekrishdshah/vergil-sota-trainer vergil
cd vergil
pip install -r requirements-space.txt
VERGIL_MODEL_PATH=thekrishdshah/vergil-sota-trainer python app.py
# open http://localhost:7860

7. OpenEnv compliance

Env subclasses gymnasium.Env, exposes reset / step / state cleanly.
POMDP wrapper in vergil/core/pomdp.py produces partial observations.
Reward returned as scalar float per step; rich diagnostics in info.
openenv.yaml declares the env; the demo Space is the discoverable URL.
No reserved tool names used.
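
For readers unfamiliar with the interface, the shape of reset / step / state can be sketched with a stand-in class. This toy deliberately does not import gymnasium or subclass gymnasium.Env, so it is not the VERGIL env. The class name, observation fields, toy reward and the choice to make `state` a method are all assumptions. Only the Gymnasium-style return signatures (`obs, info` from reset; `obs, reward, terminated, truncated, info` from step) follow the public API.

```python
import random
from typing import Any, Optional


class ToyCommitmentEnv:
    """Stand-in with Gymnasium-style reset/step signatures plus a state accessor."""

    def __init__(self, n_commitments: int = 3) -> None:
        self.n_commitments = n_commitments
        self._rng = random.Random()
        self._pending: list[str] = []
        self._accepted: list[str] = []

    def reset(self, seed: Optional[int] = None, options: Optional[dict] = None):
        if seed is not None:
            self._rng.seed(seed)
        self._pending = [f"n{i}" for i in range(self.n_commitments)]
        self._rng.shuffle(self._pending)
        self._accepted = []
        return self._observe(), {}

    def step(self, action: dict[str, str]):
        target = action["target"]
        valid = target in self._pending
        if valid:
            self._pending.remove(target)
            if action["action"] == "accept":
                self._accepted.append(target)
        reward = 1.0 if valid else -1.0  # scalar float per step (toy values)
        terminated = not self._pending
        info = {"valid": valid, "accepted": list(self._accepted)}  # diagnostics
        return self._observe(), reward, terminated, False, info

    def state(self) -> dict[str, Any]:
        """Full state, as opposed to the partial observation."""
        return {"pending": list(self._pending), "accepted": list(self._accepted)}

    def _observe(self) -> dict[str, Any]:
        # Partial view: only the next pending commitment is visible.
        return {"visible": self._pending[:1], "n_pending": len(self._pending)}


env = ToyCommitmentEnv()
obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(
    {"action": "accept", "target": obs["visible"][0]}
)
print(reward, terminated, env.state())
```

The split between `_observe` and `state` mirrors the role the POMDP wrapper plays: the agent sees a slice, while diagnostics and tooling can inspect everything.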

8. Repository layout

vergil/                 # the env + agent code
  core/                 # CDG, env, POMDP wrapper, types
  agent/                # prompt formatting, reward function (key file)
  curriculum/           # 4-stage curriculum + failure-topology DB
  api/                  # FastAPI server (powers this Space)
scripts/
  sft_data_generator.py # expert-oracle rollouts → SFT data
  train_vergil_sota.py  # SFT + GRPO + push (TRL)
  eval_vergil.py        # heuristic vs. trained eval harness
  hf_job_train.sh       # job entrypoint (used by HF Jobs)
  hf_jobs_launch.py     # local-side job submitter
scenarios/              # 12 hand-crafted JSON scenarios
configs/                # accelerate config (4× L4 DDP)
frontend/               # built React UI (served by FastAPI)
frontend-react/         # React source
app.py                  # demo Space entrypoint
Dockerfile              # demo Space container

License

Apache-2.0.
