train/ – SFT + GRPO Training Pipeline
This directory holds the training notebooks for the AWS RL agent. Heavy logic for the GRPO loop lives at the repo root in train_grpo.py; the notebooks here are thin drivers that you can run end-to-end on Colab.
The training pipeline has two stages:
```
┌─ data/sft/ ─────────────────────────────────────────────┐
│ 1,500 train · 150 val rows                              │
│ 5 trajectory types                                      │
└─┬───────────────────────────────────────────────────────┘
  │
┌─▼───────────────────────────────────────────────────────┐
│ STAGE 1 – Supervised Fine-Tuning (train_sft_lora.ipynb) │
│ Qwen2.5-Coder-3B-Instruct + LoRA r=8/16/32 (Optuna)     │
│ → SFT adapter                                           │
└─┬───────────────────────────────────────────────────────┘
  │ Sizzing/aws-rl-sft-qwen25coder3b-adapter
┌─▼───────────────────────────────────────────────────────┐
│ STAGE 2 – GRPO RL (train_grpo_lora.ipynb)               │
│ G=8 parallel rollouts · multi-turn · reward = env return│
│ Optuna over (lr, β, G, T, top_p, lora_r, max_turns)     │
└─────────────────────────────────────────────────────────┘
```
The two stages are intentionally separable: the SFT adapter is published to the Hugging Face Hub so anyone can pull it and start GRPO without re-running SFT.
Table of contents
- SFT stage – supervised LoRA
- GRPO stage – reinforcement learning
- Optuna hyperparameter search
- Multi-turn rollouts + parallel envs
- Training modes (CLI)
- How to run
- Logging and artifacts
- Reproducing results
- Files in this directory
1. SFT stage – supervised LoRA
train/train_sft_lora.ipynb – primary SFT notebook.
Why SFT before GRPO?
Two reasons – both showed up in our base-model evaluation (data/sft/MODEL_EVALUATION.md):
- Format-locking. Even strong coder models occasionally wrap commands in markdown fences or quotes. SFT removes that surface noise in one epoch.
- Bootstrap the GRPO reward signal. GRPO with a base model that's only 41% exact-match starts from a low-density reward landscape. Fine-tuning on canonical commands first raises the baseline so GRPO can spend its compute on optimization, not search.
Base model
| Choice | unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit |
|---|---|
| Why | Highest exact-match (41%) of 11 candidates we benchmarked, fastest viable inference (3.1 s/call), tightest output (86 chars). Full reasoning in data/sft/MODEL_EVALUATION.md. |
| Loader | Unsloth's 4-bit quantized variant – fits comfortably on a single 24 GB GPU, with 2× faster training kernels |
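Loading it with Unsloth is a one-liner (a minimal sketch; the notebook pins its own versions and exact kwargs):

```python
from unsloth import FastLanguageModel

# Illustrative load of the 4-bit base model used for SFT.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit",
    max_seq_length=512,   # matches the SFT max sequence length below
    load_in_4bit=True,
)
```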
LoRA config
```python
from peft import LoraConfig

# Sample r first so lora_alpha can be derived from it.
lora_r = trial.suggest_categorical("lora_r", [8, 16, 32])
lora_config = LoraConfig(
    r=lora_r,
    lora_alpha=lora_r * trial.suggest_categorical("lora_alpha_mul", [1, 2, 4]),
    lora_dropout=trial.suggest_float("lora_dropout", 0.005, 0.031),
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```
- Only attention projections are adapted – MLP / output heads stay frozen, keeping the trainable parameter count tiny (~10–40 M depending on rank).
- `lora_alpha = r × multiplier` keeps the effective scaling stable across rank variations during the Optuna search.
Optimization
| Hyperparameter | Value / Range |
|---|---|
| Optimizer | AdamW (Unsloth's fused implementation) |
| Learning rate | [1e-4, 5e-4] log-scale (Optuna) |
| Schedule | Cosine annealing |
| Warmup ratio | {0.03, 0.1} (Optuna; best 0.1) |
| Batch size | 2 per GPU |
| Epochs | 2 |
| Max sequence length | 512 |
| Packing | Disabled (we keep chat-template separators intact) |
| Loss masking | Assistant-only (user message tokens are masked from the loss) |
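Wired into TRL, the table above corresponds roughly to the following SFTConfig (a hedged sketch using TRL ~0.18 field names; output_dir is a placeholder and the notebook remains authoritative):

```python
from trl import SFTConfig

sft_args = SFTConfig(
    output_dir="outputs/sft-example",  # placeholder
    learning_rate=4.03e-4,             # best Optuna trial; search range [1e-4, 5e-4]
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    per_device_train_batch_size=2,
    num_train_epochs=2,
    max_seq_length=512,
    packing=False,                     # keep chat-template separators intact
)
```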
Dataset
data/sft/aws_rl_sft.train.jsonl – 1,500 examples. Format:
```json
{
  "messages": [
    {"role": "system", "content": "You are an AWS cloud engineer..."},
    {"role": "user", "content": "TASK: ...\n\nCURRENT OBSERVATION:\nProgress: 0.00 ..."},
    {"role": "assistant", "content": "aws s3 mb s3://my-app-data"}
  ],
  "difficulty": "intermediate",
  "source": "success_first_step",
  "task_id": 42
}
```
The dataset is a careful mix of 5 trajectory types (success, multi-step continuation, failure recovery, verification, hint usage). Full generation methodology in data/README.md.
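For quick inspection, the split loads with the datasets library (illustrative; any JSONL reader works):

```python
from datasets import load_dataset

ds = load_dataset("json", data_files="data/sft/aws_rl_sft.train.jsonl", split="train")
print(len(ds))                            # 1,500 examples
print(ds[0]["messages"][-1]["content"])   # the canonical AWS CLI command
```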
Training graphs
A reference SFT run achieved validation loss 0.052 after 188 training steps with the best Optuna trial; the plots exported from that run live in docs/figures/.
2. GRPO stage – reinforcement learning
The core trainer lives at train_grpo.py (1,283 LOC). Notebooks call into it:
- train/train_grpo_lora.ipynb – clean
- train/train_grpo_lora_with_outputs.ipynb – with execution outputs preserved
- aws_rl_env_colab.ipynb – Colab driver wrapping the entire pipeline
What GRPO is, briefly
GRPO (Group Relative Policy Optimization) is the algorithm introduced by DeepSeekMath and adopted by TRL ≥ 0.18. Unlike PPO, GRPO does not train a critic. Instead:
- For one prompt (here, one curriculum-picked task), generate G completions
- Score each with the reward function(s)
- Compute the group-relative advantage: (reward_i − group_mean) / group_std
- Backpropagate the policy gradient with that advantage
- Apply a KL penalty toward the SFT reference model (coefficient β) to prevent drift
This is dramatically simpler than PPO (no value head, no GAE), more sample-efficient for verifier-style rewards, and a natural fit for our setup – the AWS RL env is the reward function.
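The advantage computation is small enough to show inline (a sketch of the idea, not TRL's exact implementation):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Advantage of each of the G completions of one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Rewards for G=8 rollouts of the same task (illustrative numbers).
print(group_relative_advantages(torch.tensor([0.9, 0.1, 0.5, 0.5, 0.0, 1.0, 0.3, 0.7])))
```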
TRL GRPOTrainer config
From train_grpo.py:_build_grpo_config():
| Parameter | Default value | Notes |
|---|---|---|
| `learning_rate` | 5e-6 | Optuna range [1e-6, 1e-4] log-scale |
| `beta` (KL coefficient) | 0.04 | Optuna range [0.0, 0.1] |
| `num_generations` (G) | 8 | Optuna {4, 8} |
| `temperature` | 0.9 | Optuna [0.7, 1.0] |
| `top_p` | 0.95 | Optuna [0.85, 0.98] |
| `per_device_train_batch_size` | 1 | |
| `gradient_accumulation_steps` | 8 | Effective batch 8 |
| `gradient_checkpointing` | True | `use_reentrant=False`; a VRAM optimization |
| `max_completion_length` | 256 | Per-turn; one AWS CLI command fits comfortably |
| `max_prompt_length` | 2048 | Holds task + history + observation |
| `loss_type` | "dapo" | DAPO loss (Decoupled Clip and Dynamic Sampling Policy Optimization); the TRL default for GRPO in recent versions |
| `mask_truncated_completions` | True | Drop samples that hit `max_completion_length` |
| `warmup_ratio` | 0.05 | |
| `lr_scheduler_type` | "cosine" | |
| `max_grad_norm` | 1.0 | |
| `use_vllm` | False | Plain `model.generate()`; vLLM integration is future work |
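In code, the defaults above map onto a GRPOConfig roughly like this (a hedged sketch against a recent TRL; `_build_grpo_config()` in train_grpo.py is the source of truth, and output_dir is a placeholder):

```python
from trl import GRPOConfig

grpo_args = GRPOConfig(
    output_dir="outputs/aws-rl-grpo-example",  # placeholder
    learning_rate=5e-6,
    beta=0.04,
    num_generations=8,
    temperature=0.9,
    top_p=0.95,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
    max_completion_length=256,
    max_prompt_length=2048,
    loss_type="dapo",
    mask_truncated_completions=True,
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    max_grad_norm=1.0,
)
```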
Reward functions (TRL convention)
Three reward functions are registered, summed by GRPO:
```python
reward_funcs=[reward_task, reward_achieved, reward_progress]
```
- `reward_task(completions, **kwargs)` – episode return (sum of per-step env rewards). The dominant signal.
- `reward_achieved(completions, **kwargs)` – 1.0 if `task.task_achieved` at end of episode, else 0.0. Sparse but unambiguous.
- `reward_progress(completions, **kwargs)` – final `partial_progress` ∈ [0, 1]. Densifies the credit assignment for partial completions.
The env's reward shaping (see server/README.md §8) does most of the work – these three TRL functions are a thin façade.
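The TRL convention is simply `f(completions, **kwargs) -> list[float]`, one score per completion. A hedged sketch of the shape (the kwarg names carrying episode statistics here are assumptions, not the real column names):

```python
def reward_achieved(completions, episode_achieved=None, **kwargs):
    # 1.0 if the env reported task_achieved at end of episode, else 0.0
    return [1.0 if a else 0.0 for a in episode_achieved]

def reward_progress(completions, episode_progress=None, **kwargs):
    # final partial_progress in [0, 1], densifying credit assignment
    return [float(p) for p in episode_progress]
```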
Episode = one rollout
- Each rollout runs up to `MAX_TURNS=6` sequential AWS CLI commands
- Each command's stdout/stderr/progress is fed back as the user message for the next turn (see `build_user_prompt()` and `format_observation()` in train_grpo.py)
- The episode terminates on `task_achieved`, max turns, or `max_total_tokens` (the per-episode token budget)
- Token sequences (prompt_ids, completion_ids, logprobs) are accumulated across turns, so GRPO assigns the episode-level reward to the full multi-turn token sequence, not just the last turn (sketched below)
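Schematically, one rollout looks like this (a simplified sketch; the real logic is `rollout_one_episode()` in train_grpo.py, with `build_user_prompt()` / `format_observation()` handling the message plumbing):

```python
def rollout(env, task, generate, max_turns=6):
    obs = env.reset(task=task)
    total_reward, transcript = 0.0, []
    for _ in range(max_turns):
        command = generate(obs)                # one AWS CLI command per turn
        obs, reward, done = env.step(command)  # stdout/stderr/progress feed the next turn
        total_reward += reward
        transcript.append((command, obs))
        if done:                               # task_achieved or budget hit
            break
    return total_reward, transcript
```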
Curriculum integration
One trainer step:

```
1. task = curriculum.next_task()          # one task per GRPO step
2. results = pool.run_group(task, ...)    # G rollouts on that task
3. mean_r = sum(group_rewards) / G
4. curriculum.record_result(task, achieved=any_achieved, reward=mean_r)
5. trainer applies group-relative advantages   # standard GRPO
```
The curriculum drives task selection – every rollout in a group runs the same task, forced through env.reset(task=task). This matches GRPO's group-relative semantics (you need the same prompt across the group to compute the baseline correctly).
Full curriculum mechanics (priority scoring, mastery, spaced rep, tier promotion) live in server/README.md §7.
Training graphs
A reference GRPO run trained 35 steps with the best Optuna config (lr=1.6e-5, β=0.0021, T=0.99). Per-step training signals (extracted from the run's trainer_state.json) are mirrored into docs/figures/:
Notable signals from the run:
| Signal | Value |
|---|---|
| `env_reward/mean` | 0.31 (mean over 16 reward-logged steps), max 0.94, min 0.13 |
| `kl` | 0.15 (mean); KL stays small despite tiny β |
| `completion_length` | 87 tokens (mean); the agent emits compact AWS CLI commands |
| Format compliance | 100% (`format_reward/mean` = 1.0 every step) |
The multi-step end-to-end re-eval after GRPO is plotted in docs/figures/ as well. Those plots are produced by plot_rewards() reading reward_log.csv written by EpisodeLogger, plus the post-hoc plots generated during the GRPO notebook run.
3. Optuna hyperparameter search
Search space
| Parameter | Range | Reason |
|---|---|---|
| `learning_rate` | [1e-6, 1e-4] log | GRPO is sensitive to LR; log-scale is the right prior |
| `beta` | [0.0, 0.1] | KL coefficient. 0 = pure RL (drift risk), 0.1 = anchored to SFT |
| `num_generations` | {4, 8} | Group size. Larger → tighter advantage estimates but slower |
| `temperature` | [0.7, 1.0] | Exploration knob |
| `top_p` | [0.85, 0.98] | Nucleus sampling |
| `lora_r` | {8, 16, 32} | Adapter capacity |
| `lora_alpha_mul` | {1, 2, 4} | `lora_alpha = lora_r × multiplier` |
| `max_turns` | {4, 6, 8} | Episode length cap |
Objective
```
objective = 0.7 × achieved_rate + 0.3 × mean_progress
```

Calculated on the held-out validation tasks at the end of each trial. Weighting achieved_rate higher matches the project goal – actual task completion matters more than partial progress.
Sampler
optuna.samplers.TPESampler(seed=42) – a Tree-structured Parzen Estimator. In our experience TPE makes far better use of a tiny budget (~6 trials) on this 8-dimensional space than random search would.
Persisted to outputs/.../optuna.db (SQLite), so trials can be resumed if a Colab session disconnects.
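Recreating the study is standard Optuna (the storage path below is illustrative):

```python
import optuna

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=42),
    storage="sqlite:///outputs/example/optuna.db",  # resumable across disconnects
    study_name="aws-rl-grpo",
    load_if_exists=True,
)
```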
Frozen validation set
pick_validation_task_ids(k_per_tier=2, seed=42) picks 2 tasks per tier (≈10 tasks total) at the start of training. The same set is used by every Optuna trial and the final post-training eval – no benchmark leakage between trials.
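A hedged sketch of what such a helper does (the real `pick_validation_task_ids()` lives in train_grpo.py; the tier grouping here is an assumption):

```python
import random

def pick_validation_task_ids(tasks_by_tier, k_per_tier=2, seed=42):
    rng = random.Random(seed)
    val_ids = []
    for tier in sorted(tasks_by_tier):                    # deterministic tier order
        val_ids.extend(rng.sample(tasks_by_tier[tier], k_per_tier))
    return val_ids                                        # ≈ 2 × n_tiers task ids
```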
SFT-stage Optuna results (6 trials)
The SFT-stage Optuna run explored a 5-parameter space (lora_r, lora_alpha_mul, lora_dropout, learning_rate, warmup_ratio). 6 trials, validation loss as objective (lower = better):
| Trial | r | Ξ± | dropout | lr | warmup | val_loss |
|---|---|---|---|---|---|---|
| 0 | 16 | 16 | 0.006 | 4.03e-4 | 0.10 | 0.0523 ← best |
| 1 | 16 | 16 | 0.030 | 2.33e-4 | 0.03 | 0.0790 |
| 2 | 8 | 32 | 0.020 | 2.29e-4 | 0.03 | 0.0587 |
| 3 | 8 | 16 | 0.030 | 1.17e-4 | 0.03 | 0.1199 |
| 4 | 16 | 16 | 0.031 | 2.31e-4 | 0.03 | 0.0793 |
| 5 | 8 | 32 | 0.009 | 1.37e-4 | 0.10 | 0.0828 |
```
{
  "best_value": 0.052,
  "best_params": {
    "lora_r": 16,
    "lora_alpha_mul": 1,        // → lora_alpha = 16
    "lora_dropout": 0.005808,
    "learning_rate": 4.03e-4,
    "warmup_ratio": 0.1
  }
}
```
The study visualizations live in docs/figures/ (optuna_*.png).
GRPO-stage Optuna results (4 trials)
The GRPO-stage Optuna run explored a 3-parameter space (learning_rate, beta, temperature). 4 trials, single-step env reward as objective (higher = better):
| Trial | lr | Ξ² | T | env_reward | success |
|---|---|---|---|---|---|
| 0 | varied | varied | varied | 0.473 | 25.0% |
| 1 | varied | varied | varied | 0.469 | 25.0% |
| 2 | varied | varied | varied | 0.469 | 25.0% |
| 3 | 1.60e-5 | 0.0021 | 0.99 | 0.552 | 33.3% ← best |
The winning GRPO config uses a much smaller learning rate (1.6e-5, vs 4.0e-4 for SFT) and a tiny KL coefficient (β=0.0021) – both expected for an RL phase that is only correcting the SFT-bootstrapped policy, not retraining it.
4. Multi-turn rollouts + parallel envs
This section is a quick overview – the full mechanics, including the three pool layers and asyncio orchestration, are in scripts/README.md.
MultiTurnEnvPool
train_grpo.py:MultiTurnEnvPool – owns a background thread running an asyncio loop, opens N WebSocket sessions on startup, and exposes a synchronous run_group(task, ...) API.
- One pool instance lives for the duration of training
- `run_group()` calls `asyncio.gather()` over `rollout_one_episode(env, task, ...)` for each of the N envs – every rollout runs the same task in its own MiniStack (see the server-side pool in server/README.md §6)
- Returns a list of `{prompt_ids, completion_ids, logprobs, task_reward, task_achieved, final_progress, num_steps, transcript, task_id, difficulty}` dicts (usage sketch below)
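Typical usage from the trainer side (a sketch; the constructor arguments are assumptions, not the verified signature):

```python
from train_grpo import MultiTurnEnvPool

pool = MultiTurnEnvPool(env_url="http://localhost:8000", pool_size=8)
task = curriculum.next_task()                  # one task per GRPO step
results = pool.run_group(task, max_turns=6)    # blocks until all G rollouts finish
rewards = [r["task_reward"] for r in results]  # feeds the group-relative advantages
```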
Why parallelism matters here
GRPO's group-relative advantage requires G rollouts before any gradient. Run serially, MAX_TURNS=6 turns × ~50 ms per env step ≈ 300 ms per rollout, so G=8 rollouts cost ~2.4 s of env time per training step. With parallel rollouts that drops to ~300 ms (the slowest of the 8). The model forward pass dominates, exactly as desired.
Generation lock
Because the policy lives on a single GPU, model.generate() calls across the asyncio.gather group are serialised behind a _GENERATE_LOCK (threading.Lock). The env step calls (the slow part) happily overlap. This is the single non-obvious detail that makes the parallel rollout approach actually work.
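A toy demonstration of the pattern (no GPU or env involved; just the lock + asyncio.gather shape):

```python
import asyncio
import threading

_GENERATE_LOCK = threading.Lock()      # the policy lives on one GPU

def fake_generate(prompt: str) -> str:
    with _GENERATE_LOCK:               # serialise the model.generate() stand-in
        return "aws s3 ls"

async def one_rollout(i: int) -> str:
    cmd = await asyncio.to_thread(fake_generate, f"task {i}")
    await asyncio.sleep(0.05)          # env.step() – this part overlaps freely
    return cmd

async def main() -> list[str]:
    return await asyncio.gather(*(one_rollout(i) for i in range(8)))

print(asyncio.run(main()))
```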
5. Training modes (CLI)
```bash
# Optuna search only – produces best_cfg.json
python train_grpo.py --mode optuna --n-trials 6 --trial-max-steps 30

# Train once with explicit hyperparams (no search)
python train_grpo.py --mode train \
    --env-url http://localhost:8000 \
    --num-generations 8 --max-turns 6 --max-steps 200

# Search → train: Optuna trials, then a full-length run with the best config
python train_grpo.py --mode full --n-trials 6 --max-steps 200
```
All modes write to outputs/aws-rl-grpo-<TIMESTAMP>/.
6. How to run
Prerequisites
- A running env server: `make run` from the repo root (starts MiniStack + FastAPI on http://localhost:8000)
- For pool size > 1: `AWS_RL_ENV_POOL_SIZE=8 make run`
- A GPU with ≥ 24 GB VRAM (A10, T4×2, A100, L4 all confirmed working)
- A Hugging Face token (`HF_TOKEN`) if you want to push the trained adapter
Local
```bash
# 1. Start the env server in one terminal
AWS_RL_ENV_POOL_SIZE=8 make run

# 2. Run training in another terminal
python train_grpo.py --mode full --n-trials 6 --max-steps 200
```
Colab
The notebook aws_rl_env_colab.ipynb wraps the full pipeline (env URL config, HF login, val set, Optuna, training, plotting, optional push-to-Hub):
- GRPO end-to-end driver – aws_rl_env_colab.ipynb
- SFT-only – train/train_sft_lora.ipynb
- GRPO-only – train/train_grpo_lora.ipynb
Note: the Colab notebooks expect the env server to be reachable. Two options:
- HF Space tunnel: deploy the env to your own HF Space and point `ENV_URL` at it (see the main README's deployment section)
- ngrok: run the env locally and expose it via ngrok / cloudflared so Colab can reach it
7. Logging and artifacts
Reference training runs (numbers baked into this documentation)
The headline numbers and plots in this repo come from two reference training runs we performed end-to-end:
- SFT reference run – 188 SFT steps with the best Optuna trial. Achieved val loss 0.052 (best of 6 trials). Post-SFT eval delta: format 33% → 100%, exact 39% → 89%, latency 2.03s → 1.40s. The training curves, Optuna plots, and eval comparisons from this run live in docs/figures/ (sft_loss_curve.png, optuna_*.png, base_vs_sft_success.png, …).
- GRPO reference run – 35 GRPO steps with the best Optuna trial. Achieved single-step env reward 0.55 (best of 4 trials). Multi-step eval (n≈108): success 86.8% → 86.2%, beginner +3.8 pp, intermediate +6.0 pp, expert flat at 22%. The training signals, by-tier breakdowns, and qualitative rollouts from this run also live in docs/figures/ (grpo_final_per_step.png, grpo_reward_curve.png, sft_vs_grpo_*.png, qualitative_rollouts.png, …).
The raw training-output directories (TRL checkpoints, optimizer states, exported adapters totalling ~330 MB) are not committed. The metrics, hyperparameters, and visualizations they produced are preserved inline in this README and as PNGs under docs/figures/.
GRPO output layout
Each GRPO run writes to a fresh outputs/aws-rl-grpo-<TIMESTAMP>/:
| File | Written by | Contents |
|---|---|---|
| `reward_log.csv` | `EpisodeLogger` | One row per rollout: step, rollout_idx, task_id, difficulty, task_reward, task_achieved, final_progress, num_steps, tier, tier_success_rate, timestamp |
| `transcripts.jsonl` | `EpisodeLogger` | Same rows + the full multi-turn transcript per rollout (commands, outputs, rewards) |
| `optuna.db` | Optuna | SQLite study (resumable) |
| `best_cfg.json` | `optuna_search()` | Final winning hyperparameters |
| `trial_NNN/` | `_run_one_trial()` | Per-trial trainer checkpoints + trial_metrics.json |
| `val_task_ids.json` | Notebook driver | Frozen held-out validation set (for reproducibility) |
| `post_train_val.json` | Notebook §10 | Final post-training validation metrics |
| `reward_plot.png` | `plot_rewards()` | Group mean reward + per-tier scatter |
| `<adapter_dir>/` | TRL `GRPOTrainer.save` | Trained LoRA adapter (adapter_config.json, adapter_model.safetensors, etc.) |
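reward_log.csv is trivially post-processable, e.g. to recompute the per-step group mean (the path placeholder matches the layout above):

```python
import pandas as pd

df = pd.read_csv("outputs/aws-rl-grpo-<TIMESTAMP>/reward_log.csv")
per_step = df.groupby("step")["task_reward"].mean()  # group mean reward per GRPO step
print(per_step.tail())
```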
Push to HF Hub:
```python
from huggingface_hub import create_repo, upload_folder

create_repo("your-org/aws-rl-grpo-qwen25coder3b", exist_ok=True, private=False)
upload_folder(folder_path=str(OUTPUT_DIR), repo_id="your-org/aws-rl-grpo-qwen25coder3b")
```
8. Reproducing results
Actual SFT result
SFT (188 steps, best Optuna trial, ~30 min on A10):

```
best val_loss    : 0.052
best lora_r      : 16
best lora_alpha  : 16   (alpha_mul=1)
best lora_dropout: 0.0058
best lr          : 4.03e-4
best warmup      : 0.10

Held-out eval (post-SFT, same prompts as base):
format_pct    : 33.3% → 100.0%  (+66.7 pp)
exact_pct     : 38.9% → 88.9%   (+50.0 pp)
service_pct   : 77.8% → 88.9%   (+11.1 pp)
operation_pct : 61.1% → 88.9%   (+27.8 pp)
avg_latency   : 2.03s → 1.40s   (−0.63 s)
avg_len       : 85.8 → 74.7 chars (tighter outputs)
```
Every target from data/sft/MODEL_EVALUATION.md §11 is met or exceeded.
Actual GRPO result
GRPO (35 steps from the best Optuna trial, ~1.5 hr on A10):

```
best lr          : 1.60e-5
best beta        : 0.0021
best temperature : 0.99
num_generations  : 8

Per-step training signals (16 reward-logged steps):
env_reward (mean): 0.31   max: 0.94   min: 0.13
KL to SFT ref    : 0.15 mean   (small β = 0.0021 keeps drift in check)
format_reward    : 1.00 every step   (perfect format compliance)
completion length: 87 tokens mean    (compact AWS CLI commands)
```

Multi-step end-to-end eval (n≈108 episodes):

```
                      Base+SFT   Base+SFT+GRPO      Δ
overall_success        86.8%        86.2%        −0.5 pp
overall_reward         0.883        0.877        −0.006
beginner_success       96.2%       100.0%        +3.8 pp ✓
intermediate_success   81.0%        87.0%        +6.0 pp ✓
warmup_success         96.0%        90.2%        −5.8 pp
expert_success         22.2%        22.2%        flat (bottleneck)
drift_repair           22.2%        22.2%        flat
destructive_fail       15.1%        14.7%        −0.4 pp
steps_to_solve          1.45         1.55        +0.10
```
Honest reading. A 35-step GRPO run from a strong SFT starting point (already 86.8% success) is short by RL standards. It preserves the SFT gains and modestly improves the middle tiers, but does not crack the expert-tier ceiling – the 22% expert / 22% drift-repair numbers stay flat because 35 GRPO steps × G=8 = 280 rollouts contain too few expert episodes, with the curriculum focusing primarily on warmup/beginner/intermediate.
Variance comes mostly from Optuna trial composition. The published SFT adapter (Sizzing/aws-rl-sft-qwen25coder3b-adapter) is the SFT result; the GRPO adapter is regenerated per run from the trainer's output directory.
9. Files in this directory
| File | Purpose |
|---|---|
| train_sft_lora.ipynb | Stage 1 – supervised LoRA fine-tuning |
| train_grpo_lora.ipynb | Stage 2 – GRPO RL training (clean) |
| train_grpo_lora_with_outputs.ipynb | Same notebook with cell outputs preserved |
Heavy logic referenced from these notebooks:
- train_grpo.py – the `MultiTurnEnvPool`, GRPO config, Optuna search, `plot_rewards`, and the `run_training` entry point
- aws_rl_env_colab.ipynb – Colab driver that imports from train_grpo.py
See also
- Main README
- data/README.md – dataset generation, base-model selection
- data/sft/MODEL_EVALUATION.md – full 11-model benchmark
- scripts/README.md – parallel-rollout architecture deep-dive
- server/README.md – environment internals (curriculum, reward shaping, anti-hacking)
- compare/README.md – base vs SFT comparison harness
















