---
title: ExecAssist
emoji: 📧
colorFrom: indigo
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: mit
tags:
  - openenv
  - rl
  - executive-assistant
  - grpo
  - trl
---

# ExecAssist

**A 0.5B-parameter model, trained for 90 minutes on a free Colab T4, beats an untuned 120B-parameter frontier model on this environment by 2.4×.** Built for the OpenEnv Hackathon (April 2026), Theme #3.2: Personalized Tasks.

ExecAssist is an OpenEnv environment where AI agents learn to manage email and calendar the way a human executive assistant does. The agent reads incoming requests, writes professional replies, finds calendar slots that don't clash, and proposes alternatives when they do. Three tasks at increasing difficulty, three independent reward graders, and four anti-reward-hacking penalties that fired during real training (we have logs).

- Live environment: https://devanshudon-exec-assist.hf.space
- Mini-blog: https://devanshudon-exec-assist.hf.space/blog
- Training notebook: `train_colab.ipynb`
- Video walkthrough: Watch on YouTube


πŸ† Headline result

Trained Qwen2.5-0.5B-Instruct with TRL GRPO for 3 epochs (270 steps, ~90 min on free Colab T4):

| Task   | Baseline (untrained 0.5B) | Trained (GRPO) | Improvement |
|--------|---------------------------|----------------|-------------|
| Easy   | 0.345                     | 0.995          | +188%       |
| Medium | 0.227                     | 0.745          | +228%       |
| Hard   | 0.249                     | 0.737          | +196%       |

Nine out of ten samples on the easy task scored a perfect 1.0 after training. The model learned the structure of the task, not just statistics. As a separate sanity check, we ran an untuned Nemotron 120B through the standard inference.py baseline (via OpenRouter) and it scored 0.337 average across the same three tasks. After 90 minutes of GRPO, a model 240Γ— smaller is hitting 0.83 average on the same environment.

![Training results: reward curve with 10- and 30-step moving averages (Q1 mean 0.390 to Q4 mean 0.648); baseline vs. trained per task with error bars and improvement percentages; reward variance decreasing as the policy converges](training_results.png)

*Top: GRPO reward over 270 steps with moving averages and quartile mean reference lines. Bottom-left: baseline vs. trained, n=10 per task, error bars are standard deviation. Bottom-right: reward variance over training as a convergence proxy. Lower variance means the policy is stabilizing.*


## Why this environment exists

Three specific capability gaps motivated this build.

1. Frontier LLMs are bad at structured calendar reasoning. Ask any production agent built on a 100B+ model to find a 30-minute slot next week that doesn't conflict with your standups and is during working hours. Watch it fail. The reasoning is short, the spec is precise, and the failure modes are interesting. ExecAssist isolates this failure mode into something tractable: the scheduling-correctness grader checks four hard constraints (no double-booking, within working hours, appropriate duration, all participants included). The trained model goes from satisfying about 25% of those to about 95%.

2. Multi-objective rewards are where reward hacking actually happens. A single scalar reward like "the user was happy" gets gamed in obvious ways. A weighted sum of multiple independent graders plus named penalties is much harder to game, but only if you actually verify it. We have direct evidence from the training logs that GRPO tried to hack four different reward signals: outputting JSON only with no email body, scheduling outside working hours, using generic templated phrasing, and missing meeting details entirely. Each penalty fired during early steps and disappeared as training progressed. Most submissions claim "anti-hacking penalties." Few can show them firing.

3. Small RL'd model beats large untuned model, on a real task, in 90 minutes, on free hardware. The 240Γ— compute ratio between Qwen-0.5B and Nemotron-120B is the headline. The deeper claim is that task-specific RL with composable rewards is a real path to deploying small models on structured personal-task workflows. That's a workshop-paper-shaped argument, and ExecAssist is a clean demonstration of it.


## Tasks

| Task   | Scenario | Description | Reward weighting |
|--------|----------|-------------|------------------|
| Easy   | 1 email, clear availability | Draft polite reply, book meeting in open slot | 50% email + 50% scheduling |
| Medium | 1 email, calendar conflict | Identify conflict, propose 2–3 alternatives, explain professionally | 30% email + 40% conflict + 30% scheduling |
| Hard   | 3 emails, multi-party coordination, priority conflicts | Prioritize, reschedule, notify all parties | 34% email + 33% scheduling + 33% conflict |

All scores are deterministic and bounded to [0, 1]. Scenarios are randomized at every `/reset` call.


## Environment design

### Observation space

```
{
  "task": "easy" | "medium" | "hard",
  "description": str,
  "emails": [{"sender", "subject", "body", "priority", "timestamp"}, ...],
  "calendar": {
    "existing_meetings": [{"id", "participants", "start_time", "end_time", "subject", "priority"}, ...],
    "working_hours": {"monday": "9-17", ...},
    "executive_name": str
  },
  "contacts": {email: {"name", "email", "timezone", "title"}, ...},
  "action_required": str
}
```

### Action space

```
{
  "email_reply": str,
  "calendar_action": "book" | "propose_alternatives" | "reschedule" | "decline",
  "meeting_details": {
    "participants": [str, ...],
    "start_time": "ISO-8601",
    "end_time": "ISO-8601",
    "subject": str,
    "location": str | None,
    "proposed_alternatives": [...] | None
  }
}
```

### Reward function (multiple independent graders)

| Component | Range | What it checks |
|-----------|-------|----------------|
| Email quality | 0–1 | Politeness markers, greeting/closing, sufficient detail (20+ words), professional tone, optional LLM judge for nuance |
| Scheduling correctness | 0–1 | No double-booking, within working hours, appropriate duration (15 min to 2 hrs), all participants included |
| Conflict resolution | 0–1 | Recognizes conflicts, proposes 2–3 alternatives, explains professionally, prioritizes correctly |
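
To make the scheduling-correctness row concrete, here is a minimal sketch of those four hard constraints as plain checks over the observation and action schemas above. It is illustrative only: the function name, the equal-weight partial credit, and the field handling are simplifications of the real grader in `server/data.py`.

```python
# Illustrative sketch of the four scheduling constraints described above.
from datetime import datetime

def scheduling_score(details, calendar, requested_participants):
    """Score the four hard scheduling constraints with equal partial credit."""
    start = datetime.fromisoformat(details["start_time"])
    end = datetime.fromisoformat(details["end_time"])

    # 1. No double-booking: the slot must not overlap any existing meeting.
    no_clash = all(
        end <= datetime.fromisoformat(m["start_time"])
        or start >= datetime.fromisoformat(m["end_time"])
        for m in calendar["existing_meetings"]
    )

    # 2. Within working hours for that weekday (e.g. "monday": "9-17").
    day = start.strftime("%A").lower()
    open_h, close_h = map(int, calendar["working_hours"][day].split("-"))
    in_hours = (open_h <= start.hour + start.minute / 60
                and end.hour + end.minute / 60 <= close_h)

    # 3. Appropriate duration: between 15 minutes and 2 hours.
    minutes = (end - start).total_seconds() / 60
    good_duration = 15 <= minutes <= 120

    # 4. All requested participants are included in the booking.
    all_present = set(requested_participants) <= set(details["participants"])

    return sum([no_clash, in_hours, good_duration, all_present]) / 4
```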

### Anti-reward-hacking penalties

- Short email (< 20 words): −0.30
- Missing `meeting_details`: −0.40
- Generic / templated phrasing: −0.10
- Overly long email (> 1500 chars): −0.15

These are here because GRPO will absolutely find shortcuts if you leave them open. During training the model briefly collapsed to a single short safe response. The penalties plus KL regularization fixed it cleanly.

A note on rubric design. The reward is composed from independent scoring functions, one per dimension (email quality, scheduling correctness, conflict resolution), plus four named penalty checks. Each function returns a value in [0, 1] (or a negative penalty) and is mixed by the task-specific weighting shown in the Tasks table. Structurally this is a composable rubric. Any individual grader can be swapped, reweighted, or audited in isolation.
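
A minimal sketch of that composition in plain Python follows. The graders and the templated-phrasing check below are trivial placeholders (the real ones live in `server/data.py`), and clipping the penalty-adjusted total to [0, 1] is an assumption made to match the bounded scores described above.

```python
# Sketch of the composable rubric: independent graders, task-specific weights,
# and the four named penalties. All graders here are placeholder stand-ins.

TASK_WEIGHTS = {
    "easy":   {"email": 0.50, "scheduling": 0.50, "conflict": 0.00},
    "medium": {"email": 0.30, "scheduling": 0.30, "conflict": 0.40},
    "hard":   {"email": 0.34, "scheduling": 0.33, "conflict": 0.33},
}

def email_quality(action, scenario):            # placeholder grader
    return 1.0 if "thank" in action.get("email_reply", "").lower() else 0.5

def scheduling_correctness(action, scenario):   # placeholder grader
    return 1.0 if action.get("meeting_details") else 0.0

def conflict_resolution(action, scenario):      # placeholder grader
    return 1.0 if action.get("calendar_action") == "propose_alternatives" else 0.5

def looks_templated(reply):                     # placeholder heuristic
    return reply.strip().lower().startswith("dear sir or madam")

def penalty(action):
    reply = action.get("email_reply", "")
    p = 0.0
    if len(reply.split()) < 20:
        p -= 0.30                               # short email
    if not action.get("meeting_details"):
        p -= 0.40                               # missing meeting_details
    if looks_templated(reply):
        p -= 0.10                               # generic / templated phrasing
    if len(reply) > 1500:
        p -= 0.15                               # overly long email
    return p

def total_reward(task, action, scenario):
    w = TASK_WEIGHTS[task]
    score = (w["email"] * email_quality(action, scenario)
             + w["scheduling"] * scheduling_correctness(action, scenario)
             + w["conflict"] * conflict_resolution(action, scenario))
    return max(0.0, min(1.0, score + penalty(action)))   # bounded to [0, 1]
```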

We checked at submission time whether OpenEnv exposes a `Rubric` base class to subclass directly. Running `from openenv import Rubric` against the published `openenv-core` package raises `ImportError`, so the class isn't available yet. The plain-Python implementation here produces the same composable, auditable behavior at the function level.


## API endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/reset?task=easy\|medium\|hard` | POST | Start new episode, returns observation |
| `/step` | POST | Submit action, returns observation/reward/done/info |
| `/state` | GET | Current state |
| `/tasks` | GET | List all tasks |
| `/health` | GET | Health check |
| `/metadata` | GET | Environment info |
| `/schema` | GET | Action / observation / state schemas |

Full interactive docs: https://devanshudon-exec-assist.hf.space/docs
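
For a quick feel of the API, here is a rough single-episode sketch over plain HTTP with `requests`. Field names follow the schemas above, but the email text, timestamps, and the exact response wrapping of `/step` are illustrative; treat the interactive `/docs` page as the source of truth.

```python
# One illustrative episode against the live Space (response shapes per /docs).
import requests

BASE = "https://devanshudon-exec-assist.hf.space"

obs = requests.post(f"{BASE}/reset", params={"task": "easy"}).json()
print(obs["action_required"])

action = {
    "email_reply": (
        "Hi,\n\nThanks for reaching out. I'd be glad to set up a 30-minute call "
        "on Tuesday at 10:00; I've sent a calendar invite and am happy to adjust "
        "if another time works better.\n\nBest regards,\nAlex"
    ),
    "calendar_action": "book",
    "meeting_details": {
        "participants": ["sender@example.com"],
        "start_time": "2026-04-14T10:00:00",
        "end_time": "2026-04-14T10:30:00",
        "subject": "Intro call",
        "location": None,
        "proposed_alternatives": None,
    },
}

result = requests.post(f"{BASE}/step", json=action).json()
print(result["reward"], result["done"])
```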


## Setup & usage

### Run the environment locally

```bash
git clone https://huggingface.co/spaces/DevanshuDon/exec-assist
cd exec-assist
pip install -r requirements.txt
uvicorn server.app:app --port 8000
# open http://127.0.0.1:8000/docs
```

### Reproduce the baseline

```bash
export APIBASEURL=https://openrouter.ai/api/v1
export MODELNAME=nvidia/nemotron-3-super-120b-a12b:free
export HFTOKEN=your-openrouter-key
python inference.py
```

Expected output (structured `[START]` / `[STEP]` / `[END]` logs, as required):

```
[START] task=easy env=exec-assist model=...
[STEP] step=1 action=assistant(easy) reward=0.32 done=true error=null
[END] success=false steps=1 score=0.315 rewards=0.32
```

### Run the trained model

Open `train_colab.ipynb` in Google Colab, set the runtime to a T4 GPU, then Run All. Total time is around 50 minutes including evaluation. Outputs `training_results.png` and `results.json`.

### Docker

```bash
docker build -t exec-assist .
docker run -p 7860:7860 exec-assist
```

## Training pipeline

Stack: TRL `GRPOTrainer` plus Hugging Face Transformers, Qwen2.5-0.5B-Instruct, free Colab T4.

Approach: pre-collect 90 scenarios from the deployed HF Space (30 each across easy / medium / hard) into a `Dataset`, with each scenario stored in a dataset column. The reward function receives the scenario as a kwarg and scores deterministically, with no API calls during training. This decouples training from environment latency and means the loop runs at GPU speed instead of HTTP speed.
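
A sketch of that setup, assuming TRL's convention that extra dataset columns are forwarded to the reward function as keyword arguments. The prompt formatting and action parsing below are simplified stand-ins for the notebook's code, and `total_reward` refers to the rubric composition sketched earlier.

```python
# Illustrative scenario pre-collection + reward function for TRL GRPO.
import json
import requests
from datasets import Dataset

BASE = "https://devanshudon-exec-assist.hf.space"

def build_prompt(obs):
    # Simplified prompt: the notebook's actual template differs.
    return (f"You are an executive assistant. {obs['action_required']}\n"
            f"Emails: {json.dumps(obs['emails'])}\n"
            f"Calendar: {json.dumps(obs['calendar'])}\n"
            "Reply with a professional email and a JSON action.")

rows = []
for task in ("easy", "medium", "hard"):
    for _ in range(30):
        obs = requests.post(f"{BASE}/reset", params={"task": task}).json()
        rows.append({"prompt": build_prompt(obs),
                     "task": task,
                     "scenario": json.dumps(obs)})   # kept as a column for reward time

train_dataset = Dataset.from_list(rows)

def reward_fn(completions, task, scenario, **kwargs):
    # TRL forwards the extra dataset columns (task, scenario) alongside the generations.
    rewards = []
    for completion, t, scen in zip(completions, task, scenario):
        try:
            action = json.loads(completion[completion.index("{"):])  # crude action extraction
        except ValueError:
            action = {}
        rewards.append(total_reward(t, action, json.loads(scen)))
    return rewards
```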

Hyperparameters (the version that actually works):

```python
GRPOConfig(
    learning_rate=1e-6,           # critical, 5e-6 caused collapse
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_generations=8,            # diversity within group
    num_train_epochs=3,
    beta=0.1,                     # KL penalty, prevents mode collapse
    fp16=False, bf16=False,       # fp32 for stable gradients
    gradient_checkpointing=True,
)
```
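
The trainer wiring that pairs with this config is standard TRL. A sketch, assuming a recent TRL release where `GRPOTrainer` accepts `reward_funcs` and a plain `datasets.Dataset`; `config`, `train_dataset`, and `reward_fn` refer to the blocks above, and the notebook remains the authoritative version.

```python
# Sketch of the trainer setup pairing the config above with the pre-collected
# dataset and reward function sketched earlier.
from trl import GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",   # TRL also accepts an already-loaded model object
    args=config,                          # the GRPOConfig above (with an output_dir set)
    train_dataset=train_dataset,          # 90 pre-collected scenarios
    reward_funcs=reward_fn,               # extra dataset columns arrive as kwargs
)
trainer.train()
trainer.save_model("exec-assist-grpo")
```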

The collapse and the fix. The first run (1 epoch, `lr=5e-6`, no KL beta) collapsed hard: the trained model scored exactly 0.2 on every prompt regardless of input. It had found a single safe response and stopped exploring. Adding the KL term (`beta=0.1`), dropping the learning rate by 5×, and bumping `num_generations` from 4 to 8 produced the clean training curve shown above.

Anti-reward-hacking observations during training. GRPO went after several signals before the penalties pinned it down. It emitted JSON with no email body (caught by the short-email penalty). It proposed booking times outside working hours (caught by the scheduling check). It repeated the prompt back as a "reply" (caught by the generic-phrasing detector). Each penalty fired during early steps and disappeared as training progressed. That's what a well-designed multi-grader rubric is supposed to do.


## Architecture note

The environment is implemented as a FastAPI application that exposes the OpenEnv-spec endpoints (`/reset`, `/step`, `/state`, `/tasks`, `/health`, `/metadata`, `/schema`) directly. We checked whether OpenEnv exposes an `Environment` base class to subclass: running `from openenv import Environment` against the published `openenv-core` package raises `ImportError`, so the class isn't available yet. FastAPI gives us complete control over the JSON-over-HTTP interface, which is what the spec actually requires.
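
For a rough picture of that shape, here is a simplified sketch of the FastAPI wiring. It is not the actual `server/app.py`: `generate_scenario` and `score` below are trivial stand-ins for the scenario generation and composable rubric in `server/data.py`, and episode state handling is reduced to a bare dictionary.

```python
# Rough shape of the FastAPI wiring (simplified; server/app.py holds the real
# scenario generation, scoring, and per-episode state).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="ExecAssist")

class Action(BaseModel):
    email_reply: str
    calendar_action: str
    meeting_details: dict | None = None

def generate_scenario(task: str) -> dict:       # stand-in for server/data.py
    return {"task": task, "description": "placeholder", "emails": [],
            "calendar": {"existing_meetings": [], "working_hours": {},
                         "executive_name": "Alex"},
            "contacts": {}, "action_required": "Reply and book a meeting."}

def score(action: Action, obs: dict) -> float:  # stand-in for the composable rubric
    return 0.0

STATE: dict = {}                                # current episode

@app.post("/reset")
def reset(task: str = "easy"):
    STATE["obs"] = generate_scenario(task)
    return STATE["obs"]

@app.post("/step")
def step(action: Action):
    reward = score(action, STATE["obs"])
    return {"observation": STATE["obs"], "reward": reward, "done": True, "info": {}}

@app.get("/health")
def health():
    return {"status": "ok"}
```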

The client (`client.py`) does extend `openenv.EnvClient` (which *is* exposed in the published package) and provides the standard Gym-style typed interface. Any code that uses an `EnvClient` to talk to this Space will work without modification. Client/server separation is preserved: the client imports typed models only, never server internals.


## Repository structure

```
exec-assist/
├── server/
│   ├── app.py            # FastAPI app + environment logic + landing page + blog endpoint
│   ├── models.py         # Pydantic Action/Observation/State models
│   └── data.py           # Scenario generation, scoring functions, LLM judge
├── client.py             # EnvClient wrapper (Gym-style)
├── inference.py          # Baseline inference (required, structured logs)
├── train_colab.ipynb     # GRPO training notebook
├── training_results.png  # Training curves + baseline-vs-trained
├── results.json          # Raw evaluation data + 270-step training log
├── blog_post.md          # Mini-blog write-up (also live at /blog)
├── openenv.yaml          # OpenEnv manifest
├── Dockerfile            # Python 3.10, port 7860
├── requirements.txt
└── README.md             # This file
```

## Notes for reviewers

A few things worth pointing out for anyone evaluating this:

- The 270-step training log in `results.json` is the actual `trainer.state.log_history` from the run that produced these results, not a curated subset.
- The `inference.py` baseline emits the structured `[START]` / `[STEP]` / `[END]` log format the rubric specifies, and reads `APIBASEURL` / `MODELNAME` / `HFTOKEN` as documented. The 0.337 average is reproducible.
- The training notebook (`train_colab.ipynb`) ships with the working hyperparameters, not the broken first attempt: `lr=1e-6`, `beta=0.1`, 3 epochs. Anyone re-running it on a free T4 should land within ~5% of the numbers above.
- The Dockerfile builds cleanly from a fresh clone (verified). Python 3.10 because `openenv-core>=0.2.0` requires it.
- GRPO loss values were logged to TensorBoard during training but weren't exported to the published `results.json` because of Colab session limits. The reward signal is the primary training metric for RLVR-style training (which GRPO is), and the variance plot in the figure above serves as a convergence diagnostic showing the policy stabilizing over time.
- Architecture decisions and tradeoffs (FastAPI-direct vs. `Environment` base class, plain Python vs. `Rubric` class) are discussed in the two architecture notes above. Both base classes were verified not to be exposed in the published `openenv-core` package at submission time.
- The training notebook ships with `report_to='wandb'` enabled for experiment tracking. Run `wandb login` once before executing the training cell to log the run to your W&B account. Loss, reward, KL, and gradient norms are all tracked there in real time. The original training session's W&B run wasn't retained due to Colab session limits, but anyone re-running the notebook will get a fresh tracked run.

## Author

Devanshu (@DevanshuDon). Built for OpenEnv Hackathon, April 2026.