
train/ – SFT + GRPO Training Pipeline

← back to main README

This directory holds the training notebooks for the AWS RL agent. Heavy logic for the GRPO loop lives at the repo root in train_grpo.py; the notebooks here are thin drivers that you can run end-to-end on Colab.

The training pipeline has two stages:

                      ┌────────── data/sft/ ──────────┐
                      │  1,500 train · 150 val rows   │
                      │  5 trajectory types           │
                      └───────────────┬───────────────┘
                                      │
   ┌──────────────────────────────────▼──────────────────────────────────┐
   │  STAGE 1 – Supervised Fine-Tuning  (train_sft_lora.ipynb)           │
   │  Qwen2.5-Coder-3B-Instruct + LoRA r=8/16/32 (Optuna) → SFT adapter  │
   └──────────────────────────────────┬──────────────────────────────────┘
                                      │ Sizzing/aws-rl-sft-qwen25coder3b-adapter
   ┌──────────────────────────────────▼──────────────────────────────────┐
   │  STAGE 2 – GRPO RL                  (train_grpo_lora.ipynb)         │
   │  G=8 parallel rollouts · multi-turn · reward = env return           │
   │  Optuna over (lr, β, G, T, top_p, lora_r, max_turns)                │
   └─────────────────────────────────────────────────────────────────────┘

The two stages are intentionally separable: the SFT adapter is published to the Hugging Face Hub so anyone can pull it and start GRPO without re-running SFT.


Table of contents

  1. SFT stage – supervised LoRA
  2. GRPO stage – reinforcement learning
  3. Optuna hyperparameter search
  4. Multi-turn rollouts + parallel envs
  5. Training modes (CLI)
  6. How to run
  7. Logging and artifacts
  8. Reproducing results
  9. Files in this directory

1. SFT stage – supervised LoRA

train/train_sft_lora.ipynb – primary SFT notebook.

Why SFT before GRPO?

Two reasons – both showed up in our base-model evaluation (data/sft/MODEL_EVALUATION.md):

  1. Format-locking. Even strong coder models occasionally wrap commands in markdown fences or quotes. SFT removes that surface noise in one epoch.
  2. Bootstrap the GRPO reward signal. GRPO with a base model that's only 41% exact-match starts from a low-density reward landscape. Pre-training on canonical commands raises the baseline so GRPO can spend its compute on optimization, not search.

Base model

| | |
|---|---|
| Choice | unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit |
| Why | Highest exact-match (41%) of the 11 candidates we benchmarked, fastest viable inference (3.1 s/call), tightest output (86 chars). Full reasoning in data/sft/MODEL_EVALUATION.md. |
| Loader | Unsloth's 4-bit quantized variant – fits comfortably on a single 24 GB GPU, with 2× faster training kernels |

LoRA config

```python
from peft import LoraConfig

# Rank is sampled first so lora_alpha can be tied to it below.
r = trial.suggest_categorical("lora_r", [8, 16, 32])

lora_config = LoraConfig(
    r              = r,
    lora_alpha     = r * trial.suggest_categorical("lora_alpha_mul", [1, 2, 4]),
    lora_dropout   = trial.suggest_float("lora_dropout", 0.005, 0.031),
    bias           = "none",
    task_type      = "CAUSAL_LM",
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
)
```
  • Only attention projections are adapted β€” MLP / output heads stay frozen, keeping the trainable parameter count tiny (~10–40 M depending on rank).
  • lora_alpha = r Γ— multiplier keeps the effective scaling stable across rank variations during the Optuna search.

Optimization

| Hyperparameter | Value / Range |
|---|---|
| Optimizer | AdamW (Unsloth's fused implementation) |
| Learning rate | [1e-4, 5e-4] log-scale (Optuna) |
| Schedule | Cosine annealing |
| Warmup ratio | {0.03, 0.1} (Optuna; best 0.1) |
| Batch size | 2 per GPU |
| Epochs | 2 |
| Max sequence length | 512 |
| Packing | Disabled (we keep chat-template separators intact) |
| Loss masking | Assistant-only (user message tokens are masked from the loss) |
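A minimal sketch of the assistant-only masking, assuming Unsloth's train_on_responses_only helper; the exact marker strings below assume the Qwen2.5 ChatML template and are not taken from the notebook:

```python
from unsloth.chat_templates import train_on_responses_only

# Zero out the loss on everything except assistant turns. Marker strings
# assume the Qwen2.5 ChatML template; adjust if the notebook uses another.
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|im_start|>user\n",
    response_part="<|im_start|>assistant\n",
)
```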

Dataset

data/sft/aws_rl_sft.train.jsonl – 1,500 examples. Format:

```json
{
  "messages": [
    {"role": "system", "content": "You are an AWS cloud engineer..."},
    {"role": "user", "content": "TASK: ...\n\nCURRENT OBSERVATION:\nProgress: 0.00 ..."},
    {"role": "assistant", "content": "aws s3 mb s3://my-app-data"}
  ],
  "difficulty": "intermediate",
  "source": "success_first_step",
  "task_id": 42
}
```

The dataset is a careful mix of 5 trajectory types (success, multi-step continuation, failure recovery, verification, hint usage). Full generation methodology in data/README.md.
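To sanity-check what the SFT trainer actually sees, a quick sketch that loads the JSONL and renders one example through the model's chat template (paths and model id from above):

```python
import json
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit")

with open("data/sft/aws_rl_sft.train.jsonl") as f:
    rows = [json.loads(line) for line in f]

# Render the messages into the exact string the trainer tokenizes
print(tok.apply_chat_template(rows[0]["messages"], tokenize=False))
```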

Training graphs

A reference SFT run achieved validation loss 0.052 after 188 training steps with the best Optuna trial. The plots below were exported from that run into docs/figures/.

Figure: SFT loss curve (docs/figures/sft_loss_curve.png).


2. GRPO stage – reinforcement learning

The core trainer lives at train_grpo.py (1,283 LOC); the notebooks are thin drivers that call into it.

What GRPO is, briefly

GRPO (Group Relative Policy Optimization) is the algorithm introduced by DeepSeekMath and adopted by TRL ≥ 0.18. Unlike PPO, GRPO does not train a critic. Instead:

  1. For one prompt (here, one curriculum-picked task), generate G completions
  2. Score each with the reward function(s)
  3. Compute the group-relative advantage: (reward_i − group_mean) / group_std
  4. Backpropagate the policy gradient with that advantage
  5. Apply a KL penalty to the SFT reference model (coefficient β) to prevent drift

This is dramatically simpler than PPO (no value head, no GAE), more sample-efficient for verifier-style rewards, and a natural fit for our setup – the AWS RL env is the reward function.
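Step 3 in code form – a minimal sketch of the group-relative advantage (this mirrors what TRL computes internally; it is not code from train_grpo.py):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: shape (G,) – summed rewards for G completions of one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# e.g. episode returns for a G=8 group on one task
print(group_relative_advantages(torch.tensor([0.9, 0.1, 0.4, 0.4, 0.0, 0.7, 0.2, 0.3])))
```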

TRL GRPOTrainer config

From train_grpo.py:_build_grpo_config():

| Parameter | Default value | Notes |
|---|---|---|
| learning_rate | 5e-6 | Optuna range [1e-6, 1e-4] log-scale |
| beta (KL coefficient) | 0.04 | Optuna range [0.0, 0.1] |
| num_generations (G) | 8 | Optuna {4, 8} |
| temperature | 0.9 | Optuna [0.7, 1.0] |
| top_p | 0.95 | Optuna [0.85, 0.98] |
| per_device_train_batch_size | 1 | |
| gradient_accumulation_steps | 8 | Effective batch 8 |
| gradient_checkpointing | True | use_reentrant=False – VRAM optimization |
| max_completion_length | 256 | Per-turn; one AWS CLI command fits comfortably |
| max_prompt_length | 2048 | Holds task + history + observation |
| loss_type | "dapo" | Decoupled Clip and Dynamic Sampling Policy Optimization; the default in recent TRL |
| mask_truncated_completions | True | Drop samples that hit max_completion_length |
| warmup_ratio | 0.05 | |
| lr_scheduler_type | "cosine" | |
| max_grad_norm | 1.0 | |
| use_vllm | False | Plain model.generate() – vLLM integration is future work |
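A sketch of how these defaults map onto TRL's GRPOConfig (field names per recent TRL; output_dir is a placeholder, and the real _build_grpo_config() may set more than shown):

```python
from trl import GRPOConfig

cfg = GRPOConfig(
    output_dir="outputs/aws-rl-grpo",        # placeholder
    learning_rate=5e-6,
    beta=0.04,                               # KL coefficient to the SFT reference
    num_generations=8,                       # G
    temperature=0.9,
    top_p=0.95,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,           # effective batch 8
    gradient_checkpointing=True,
    max_completion_length=256,
    max_prompt_length=2048,
    loss_type="dapo",
    mask_truncated_completions=True,
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    max_grad_norm=1.0,
    use_vllm=False,
)
```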

Reward functions (TRL convention)

Three reward functions are registered, summed by GRPO:

```python
reward_funcs=[reward_task, reward_achieved, reward_progress]
```

  - reward_task(completions, **kwargs) → episode return (sum of per-step env rewards). The dominant signal.
  - reward_achieved(completions, **kwargs) → 1.0 if task.task_achieved at end of episode, else 0.0. Sparse but unambiguous.
  - reward_progress(completions, **kwargs) → final partial_progress ∈ [0, 1]. Densifies the credit assignment for partial completions.

The env's reward shaping (see server/README.md §8) does most of the work – these three TRL functions are a thin façade.
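For reference, the TRL reward-function convention these follow: each callable receives the batch of completions plus any extra columns via **kwargs and returns one float per completion. The episodes list below is a hypothetical kwarg standing in for the per-rollout env stats the rollout code would attach:

```python
# Sketch only: "episodes" is an assumed kwarg carrying per-rollout env stats.
def reward_achieved(completions, **kwargs):
    return [1.0 if ep["task_achieved"] else 0.0 for ep in kwargs["episodes"]]

def reward_progress(completions, **kwargs):
    return [float(ep["final_progress"]) for ep in kwargs["episodes"]]
```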

Episode = one rollout

  • Each rollout runs up to MAX_TURNS=6 sequential AWS CLI commands
  • Each command's stdout/stderr/progress is fed back as the user message for the next turn (see build_user_prompt() and format_observation() in train_grpo.py)
  • The episode terminates on task_achieved, max turns, or max_total_tokens (per-episode token budget)
  • Token sequences (prompt_ids, completion_ids, logprobs) are accumulated across turns, so GRPO assigns the episode-level reward to the full multi-turn token sequence β€” not just the last turn

Curriculum integration

One GRPO trainer step, as Python pseudocode:

```python
task = curriculum.next_task()                          # one task per GRPO step
results = pool.run_group(task, ...)                    # G rollouts on that task
mean_r = sum(r["task_reward"] for r in results) / len(results)
any_achieved = any(r["task_achieved"] for r in results)
curriculum.record_result(task, achieved=any_achieved, reward=mean_r)
# ...then the trainer applies group-relative advantages (standard GRPO)
```

The curriculum drives task selection – every rollout in a group runs the same task, forced through env.reset(task=task). This matches GRPO's group-relative semantics: you need the same prompt across the group to compute the baseline correctly.

Full curriculum mechanics (priority scoring, mastery, spaced rep, tier promotion) live in server/README.md §7.

Training graphs

A reference GRPO run trained 35 steps with the best Optuna config (lr=1.6e-5, β=0.0021, T=0.99). Per-step training signals (extracted from the run's trainer_state.json) are mirrored into docs/figures/:

Figures (docs/figures/): final per-step training signals · env reward over training · success by tier (multi-step) · reward by tier (multi-step).

Notable signals from the run:

| Signal | Value |
|---|---|
| env_reward/mean | 0.31 (mean over 16 reward-logged steps); max 0.94, min 0.13 |
| kl | 0.15 (mean) – KL stays small despite tiny β |
| completion_length | 87 tokens (mean) – the agent emits compact AWS CLI commands |
| Format compliance | 100% (format_reward/mean = 1.0 every step) |

Multi-step end-to-end re-eval after GRPO:

Figure: SFT vs GRPO multi-step metrics grid (docs/figures/).

These are produced by plot_rewards() reading reward_log.csv written by EpisodeLogger, plus the post-hoc plots generated during the GRPO notebook run.


3. Optuna hyperparameter search

train_grpo.py:optuna_search()

Search space

| Parameter | Range | Reason |
|---|---|---|
| learning_rate | [1e-6, 1e-4] log | GRPO is sensitive to LR; log-scale is the right prior |
| beta | [0.0, 0.1] | KL coefficient. 0 = pure RL (drift risk), 0.1 = anchored to SFT |
| num_generations | {4, 8} | Group size. Larger → tighter advantage estimates but slower |
| temperature | [0.7, 1.0] | Exploration knob |
| top_p | [0.85, 0.98] | Nucleus sampling |
| lora_r | {8, 16, 32} | Adapter capacity |
| lora_alpha_mul | {1, 2, 4} | lora_alpha = lora_r × multiplier |
| max_turns | {4, 6, 8} | Episode length cap |

Objective

`objective = 0.7 × achieved_rate + 0.3 × mean_progress`

Calculated on the held-out validation tasks at the end of each trial. Weighting achieved_rate higher matches the project goal β€” actual task completion matters more than partial progress.
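As a sketch (run_trial_and_eval is a hypothetical stand-in for the per-trial train + validation-eval loop):

```python
def objective(trial):
    # run_trial_and_eval: hypothetical helper that trains one trial and
    # evaluates it on the frozen held-out validation tasks
    achieved_rate, mean_progress = run_trial_and_eval(trial)
    return 0.7 * achieved_rate + 0.3 * mean_progress   # Optuna maximizes this
```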

Sampler

optuna.samplers.TPESampler(seed=42) – Tree-structured Parzen Estimator. TPE outperforms random search on 8-dim spaces with ~6 trials in our experience.

Persisted to outputs/.../optuna.db (SQLite), so trials can be resumed if a Colab session disconnects.
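A sketch of the study setup with that sampler and SQLite storage (the path is assumed):

```python
import optuna

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=42),
    storage="sqlite:///outputs/aws-rl-grpo/optuna.db",  # path assumed
    study_name="grpo_search",
    load_if_exists=True,  # lets a disconnected Colab session resume trials
)
study.optimize(objective, n_trials=6)
```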

Frozen validation set

pick_validation_task_ids(k_per_tier=2, seed=42) picks 2 tasks per tier (≈10 tasks total) at the start of training. The same set is used by every Optuna trial and the final post-training eval – no benchmark leakage between trials.
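A sketch of how such a frozen picker can work (tasks_by_tier is an assumed mapping of tier → task ids; the real pick_validation_task_ids lives in train_grpo.py):

```python
import random

def pick_validation_task_ids(tasks_by_tier, k_per_tier=2, seed=42):
    rng = random.Random(seed)  # fixed seed => identical set for every trial
    return sorted(
        tid
        for tier_tasks in tasks_by_tier.values()
        for tid in rng.sample(tier_tasks, k_per_tier)
    )
```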

SFT-stage Optuna results (6 trials)

The SFT-stage Optuna run explored a 5-parameter space (lora_r, lora_alpha_mul, lora_dropout, learning_rate, warmup_ratio). 6 trials, validation loss as objective (lower = better):

| Trial | r | α | dropout | lr | warmup | val_loss |
|---|---|---|---|---|---|---|
| 0 | 16 | 16 | 0.006 | 4.03e-4 | 0.10 | 0.0523 ★ |
| 1 | 16 | 16 | 0.030 | 2.33e-4 | 0.03 | 0.0790 |
| 2 | 8 | 32 | 0.020 | 2.29e-4 | 0.03 | 0.0587 |
| 3 | 8 | 16 | 0.030 | 1.17e-4 | 0.03 | 0.1199 |
| 4 | 16 | 16 | 0.031 | 2.31e-4 | 0.03 | 0.0793 |
| 5 | 8 | 32 | 0.009 | 1.37e-4 | 0.10 | 0.0828 |

Figure: SFT Optuna trial comparison table (docs/figures/).

```json
{
  "best_value": 0.052,
  "best_params": {
    "lora_r": 16,
    "lora_alpha_mul": 1,
    "lora_dropout": 0.005808,
    "learning_rate": 4.03e-4,
    "warmup_ratio": 0.1
  }
}
```

(lora_alpha_mul = 1 → lora_alpha = 16.)

Visualized:

Figures (docs/figures/): Optuna parameter importances · optimization history · parallel coordinate plot · slice plot · trial training curves.

GRPO-stage Optuna results (4 trials)

The GRPO-stage Optuna run explored a 3-parameter space (learning_rate, beta, temperature). 4 trials, single-step env reward as objective (higher = better):

| Trial | lr | β | T | env_reward | success |
|---|---|---|---|---|---|
| 0 | varied | varied | varied | 0.473 | 25.0% |
| 1 | varied | varied | varied | 0.469 | 25.0% |
| 2 | varied | varied | varied | 0.469 | 25.0% |
| 3 | 1.60e-5 | 0.0021 | 0.99 | 0.552 | 33.3% ★ |

Figures (docs/figures/): GRPO Optuna trial comparison · importances · parallel coordinate · hparams · trial curves.

The winning GRPO config uses a much smaller learning rate (1.6e-5, vs 4.0e-4 for SFT) and a tiny KL coefficient (β=0.0021) – both expected for an RL phase that is only correcting the SFT-bootstrapped policy, not retraining it.


4. Multi-turn rollouts + parallel envs

This section is a quick overview – the full mechanics, including the three pool layers and asyncio orchestration, are in scripts/README.md.

MultiTurnEnvPool

train_grpo.py:MultiTurnEnvPool – owns a background thread running an asyncio loop, opens N WebSocket sessions on startup, and exposes a synchronous run_group(task, ...) API.

  • One pool instance lives for the duration of training
  • run_group() calls asyncio.gather() over rollout_one_episode(env, task, ...) for each of the N envs β€” every rollout runs the same task in its own MiniStack (see server-side pool in server/README.md Β§6)
  • Returns a list of {prompt_ids, completion_ids, logprobs, task_reward, task_achieved, final_progress, num_steps, transcript, task_id, difficulty}

Why parallelism matters here

GRPO's group-relative advantage requires G rollouts before any gradient. Run serially, at MAX_TURNS=6 turns × ~50 ms per env step each rollout takes ~300 ms, so G=8 rollouts cost ~2.4 s of env time per training step. With parallel rollouts that drops to ~300 ms (the slowest of the 8). The model forward pass dominates, exactly as desired.

Generation lock

Because the policy lives on a single GPU, model.generate() calls across the asyncio.gather group are serialised behind a _GENERATE_LOCK (threading.Lock). The env step calls – the slow part – happily overlap. This is the single non-obvious detail that makes the parallel rollout approach actually work.
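In sketch form (the lock name matches the README; the wrapper itself is illustrative):

```python
import asyncio
import threading

_GENERATE_LOCK = threading.Lock()

async def locked_generate(model, **gen_kwargs):
    def _gen():
        with _GENERATE_LOCK:              # one GPU => one generate() at a time
            return model.generate(**gen_kwargs)
    # run in a worker thread so concurrent env.step() awaits keep overlapping
    return await asyncio.to_thread(_gen)
```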


5. Training modes (CLI)

```bash
# Optuna search only – produces best_cfg.json
python train_grpo.py --mode optuna --n-trials 6 --trial-max-steps 30

# Train once with explicit hyperparams (no search)
python train_grpo.py --mode train \
    --env-url http://localhost:8000 \
    --num-generations 8 --max-turns 6 --max-steps 200

# Search → train: Optuna trials, then a full-length run with the best config
python train_grpo.py --mode full --n-trials 6 --max-steps 200
```

All modes write to outputs/aws-rl-grpo-<TIMESTAMP>/.


6. How to run

Prerequisites

  • A running env server: make run from the repo root (starts MiniStack + FastAPI on http://localhost:8000)
  • For pool size > 1: AWS_RL_ENV_POOL_SIZE=8 make run
  • A GPU with β‰₯ 24 GB VRAM (A10, T4Γ—2, A100, L4 all confirmed working)
  • HuggingFace token (HF_TOKEN) if you want to push the trained adapter

Local

```bash
# 1. Start the env server in one terminal
AWS_RL_ENV_POOL_SIZE=8 make run

# 2. Run training in another terminal
python train_grpo.py --mode full --n-trials 6 --max-steps 200
```

Colab

The notebook aws_rl_env_colab.ipynb wraps the full pipeline (env URL config, HF login, val set, Optuna, training, plotting, optional push-to-Hub):

  - GRPO end-to-end driver – aws_rl_env_colab.ipynb
  - SFT-only – train/train_sft_lora.ipynb
  - GRPO-only – train/train_grpo_lora.ipynb

Note: the Colab notebooks expect the env server to be reachable. Two options:

  1. HF Space tunnel: deploy the env to your own HF Space and point ENV_URL at it (see main README's deployment section)
  2. ngrok: run the env locally and expose it via ngrok / cloudflared so Colab can reach it

7. Logging and artifacts

Reference training runs (numbers baked into this documentation)

The headline numbers and plots in this repo come from two reference training runs we performed end-to-end:

  • SFT reference run β€” 188 SFT steps with the best Optuna trial. Achieved val loss 0.052 (best of 6 trials). Post-SFT eval delta: format 33% β†’ 100%, exact 39% β†’ 89%, latency 2.03s β†’ 1.40s. The training curves, Optuna plots, and eval comparisons from this run live in docs/figures/ (sft_loss_curve.png, optuna_*.png, base_vs_sft_success.png, …).
  • GRPO reference run β€” 35 GRPO steps with the best Optuna trial. Achieved single-step env reward 0.55 (best of 4 trials). Multi-step eval (nβ‰ˆ108): success 86.8% β†’ 86.2%, beginner +3.8 pp, intermediate +6.0 pp, expert flat at 22%. The training signals, by-tier breakdowns, and qualitative rollouts from this run also live in docs/figures/ (grpo_final_per_step.png, grpo_reward_curve.png, sft_vs_grpo_*.png, qualitative_rollouts.png, …).

The raw training-output directories (TRL checkpoints, optimizer states, exported adapters totalling ~330 MB) are not committed. The metrics, hyperparameters, and visualizations they produced are preserved inline in this README and as PNGs under docs/figures/.

GRPO output layout

Each GRPO run writes to a fresh outputs/aws-rl-grpo-<TIMESTAMP>/:

| File | Written by | Contents |
|---|---|---|
| reward_log.csv | EpisodeLogger | One row per rollout: step, rollout_idx, task_id, difficulty, task_reward, task_achieved, final_progress, num_steps, tier, tier_success_rate, timestamp |
| transcripts.jsonl | EpisodeLogger | Same rows plus the full multi-turn transcript per rollout (commands, outputs, rewards) |
| optuna.db | Optuna | SQLite study (resumable) |
| best_cfg.json | optuna_search() | Final winning hyperparameters |
| trial_NNN/ | _run_one_trial() | Per-trial trainer checkpoints + trial_metrics.json |
| val_task_ids.json | Notebook driver | Frozen held-out validation set (for reproducibility) |
| post_train_val.json | Notebook §10 | Final post-training validation metrics |
| reward_plot.png | plot_rewards() | Group mean reward + per-tier scatter |
| `<adapter_dir>/` | TRL GRPOTrainer.save | Trained LoRA adapter (adapter_config.json, adapter_model.safetensors, etc.) |
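For quick post-hoc analysis, reward_log.csv loads straight into pandas (column names from the table above; the run directory is a placeholder):

```python
import pandas as pd

df = pd.read_csv("outputs/aws-rl-grpo-<TIMESTAMP>/reward_log.csv")  # fill in run dir
print(df.groupby("step")["task_reward"].mean().tail())   # group-mean reward per step
print(df.groupby("tier")["task_achieved"].mean())        # success rate by tier
```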

Push to HF Hub:

```python
from huggingface_hub import create_repo, upload_folder

# OUTPUT_DIR is the run directory, e.g. outputs/aws-rl-grpo-<TIMESTAMP>/
create_repo("your-org/aws-rl-grpo-qwen25coder3b", exist_ok=True, private=False)
upload_folder(folder_path=str(OUTPUT_DIR), repo_id="your-org/aws-rl-grpo-qwen25coder3b")
```

8. Reproducing results

Actual SFT result

```
SFT (188 steps, best Optuna trial, ~30 min on A10):
  best val_loss    : 0.052
  best lora_r      : 16
  best lora_alpha  : 16  (alpha_mul=1)
  best lora_dropout: 0.0058
  best lr          : 4.03e-4
  best warmup      : 0.10

Held-out eval (post-SFT, same prompts as base):
  format_pct       : 33.3%  →  100.0%   (+66.7 pp)
  exact_pct        : 38.9%  →   88.9%   (+50.0 pp)
  service_pct      : 77.8%  →   88.9%   (+11.1 pp)
  operation_pct    : 61.1%  →   88.9%   (+27.8 pp)
  avg_latency      :  2.03s →    1.40s  (−0.63s)
  avg_len          :  85.8  →   74.7    (tighter outputs)
```
Every target from data/sft/MODEL_EVALUATION.md §11 is met or exceeded.

Actual GRPO result

```
GRPO (35 steps from best Optuna trial, ~1.5 hr on A10):
  best lr          : 1.60e-5
  best beta        : 0.0021
  best temperature : 0.99
  num_generations  : 8

Per-step training signals (16 reward-logged steps):
  env_reward (mean): 0.31      max: 0.94      min: 0.13
  KL to SFT ref    : 0.15 mean (small β = 0.0021 keeps drift in check)
  format_reward    : 1.00 every step (perfect format compliance)
  completion length: 87 tokens mean (compact AWS CLI commands)
```

Multi-step end-to-end eval (n≈108 episodes):

```
                       Base+SFT     Base+SFT+GRPO     Δ
  overall_success      86.8%        86.2%             −0.5 pp
  overall_reward       0.883        0.877             −0.006
  beginner_success     96.2%        100.0%            +3.8 pp ✓
  intermediate_success 81.0%        87.0%             +6.0 pp ✓
  warmup_success       96.0%        90.2%             −5.8 pp
  expert_success       22.2%        22.2%             flat (bottleneck)
  drift_repair         22.2%        22.2%             flat
  destructive_fail     15.1%        14.7%             −0.4 pp
  steps_to_solve       1.45         1.55              +0.10
```

Honest reading. A 35-step GRPO run from a strong SFT starting point (already 86.8% success) is short by RL standards. It preserves the SFT gains and modestly improves the middle tiers, but does not crack the expert-tier ceiling – the 22% expert / 22% drift-repair numbers stay flat because there are too few expert episodes in 35 GRPO steps × G=8 = 280 rollouts, with the curriculum focusing primarily on warmup/beginner/intermediate.

Variance comes mostly from Optuna trial composition. The published SFT adapter (Sizzing/aws-rl-sft-qwen25coder3b-adapter) is the SFT result; the GRPO adapter regenerates per-run from the trainer's output directory.


9. Files in this directory

| File | Purpose |
|---|---|
| train_sft_lora.ipynb | Stage 1 – supervised LoRA fine-tuning |
| train_grpo_lora.ipynb | Stage 2 – GRPO RL training (clean) |
| train_grpo_lora_with_outputs.ipynb | Same notebook with cell outputs preserved |

Heavy logic referenced from these notebooks:

  • train_grpo.py β€” the MultiTurnEnvPool, GRPO config, Optuna search, plot_rewards, and the run_training entry point
  • aws_rl_env_colab.ipynb β€” Colab driver that imports from train_grpo.py

See also