# compare/ – Base Model vs SFT Adapter Benchmark

← back to main README

This directory holds the side-by-side benchmark that answers the only question that ultimately matters: did SFT actually make the model better at the task?

The benchmark compares the base Qwen2.5-Coder-3B-Instruct against our published SFT adapter `Sizzing/aws-rl-sft-qwen25coder3b-adapter` under two evaluation modes: a fast static dataset eval and a slow live-environment eval. Both write structured metrics so the deltas are explicit.

*Figure: dataset comparison, base vs SFT (per-row scores).*

*Figure: RL-env comparison, base vs SFT (per-episode rewards).*


## Table of contents

  1. What's compared
  2. Two evaluation modes
  3. Methodology
  4. Metrics reported
  5. How to run
  6. Reading the results
  7. Files in this directory

## 1. What's compared

|               | Base                                         | SFT                                                  |
|---------------|----------------------------------------------|------------------------------------------------------|
| Model         | `unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit` | Same base + LoRA adapter                             |
| Adapter       | None                                         | `Sizzing/aws-rl-sft-qwen25coder3b-adapter`           |
| Training data | Pretraining + Qwen instruction tuning        | + 1,500 rows from `data/sft/aws_rl_sft.train.jsonl`  |
| Inference     | Same prompt template, same temperature       | Identical                                            |

The only variable is the LoRA adapter. Same base, same prompts, same decoding parameters, same evaluation set.
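
Concretely, the SFT configuration is just the base model with one adapter wrap. A minimal sketch of that setup, assuming the standard `transformers` + `peft` loading path (the notebook itself may load via Unsloth; the variable names here are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit"
SFT_ADAPTER_REPO = "Sizzing/aws-rl-sft-qwen25coder3b-adapter"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# Base configuration: the 4-bit base model, nothing else.
base_model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map="auto")

# SFT configuration: the *same* base weights plus the LoRA adapter.
sft_model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map="auto")
sft_model = PeftModel.from_pretrained(sft_model, SFT_ADAPTER_REPO)
```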


## 2. Two evaluation modes

The notebook runs two separate evaluations because they answer different questions:

### Dataset eval (static)

| | |
|---|---|
| **Question** | Does the model emit the canonical command for held-out prompts, one-shot? |
| **Speed**    | Fast (~minutes) |
| **Needs**    | HF token + dataset access; no env server |
| **Source**   | `data/sft/aws_rl_sft.val.jsonl` (150 held-out rows) |
| **Verifies** | Format correctness + command-token match against canonical |

This is the same kind of pattern-matching benchmark as `data/sft/MODEL_EVALUATION.md`: fast and deterministic. Useful as a regression check.

### RL env eval (live)

| | |
|---|---|
| **Question** | Can the model actually solve a task end-to-end against a live environment? |
| **Speed**    | Slow (~tens of minutes per model) |
| **Needs**    | Dataset eval above + a running env server (HF Space or local) |
| **Source**   | Same val tasks, but exercised through `client.AwsRlEnv` round-trips |
| **Verifies** | Multi-step task completion, partial progress, reward shaping, hint usage |

This is closer to what training optimizes for. A model can score well on dataset eval (right command on step 1) but fail RL env eval (can't recover from a step 1 typo, can't continue past the first turn). Both signals matter.


## 3. Methodology

### Dataset eval

  1. Load the `Sizzing/aws-rl-sft` dataset from the HF Hub
  2. For each row in `val`, build the prompt from `messages[:-1]` (system + user, drop the assistant turn)
  3. Generate the model's response (`max_new_tokens=128`, deterministic decoding)
  4. Extract the AWS CLI line: strip markdown fences, find the first line starting with `aws`
  5. Score against `messages[-1].content` (the canonical assistant response):
     - Format OK (extracted line starts with `aws`)
     - Service match (same first word after `aws`)
     - Operation match (same first two words)
     - Exact match (full token-for-token equality)

This mirrors the methodology in `eval_lm_studio_models.py`; the same scoring functions are reused.
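
For concreteness, here is a minimal sketch of the extraction-and-scoring step described above. The real implementations live in `eval_lm_studio_models.py`; the helper names and exact token handling below are illustrative:

```python
def extract_aws_line(response: str) -> str | None:
    """Return the first line starting with `aws`, ignoring markdown fences."""
    for line in response.splitlines():
        line = line.strip().strip("`")
        if line.startswith("aws"):
            return line
    return None


def score(pred: str | None, canonical: str) -> dict:
    """Compare an extracted command against the canonical assistant response."""
    if pred is None:
        return dict.fromkeys(
            ["format_ok", "svc_match", "op_match", "exact_match"], False
        )
    p, c = pred.split(), canonical.split()
    return {
        "format_ok": True,              # line started with `aws`
        "svc_match": p[:2] == c[:2],    # `aws <service>`
        "op_match": p[:3] == c[:3],     # `aws <service> <operation>`
        "exact_match": p == c,          # full token-for-token equality
    }
```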

### RL env eval

  1. Connect to the running env at `ENV_BASE_URL` (default: an HF Space; can be overridden to local)
  2. For each val task, run a full episode (up to `MAX_STEPS=15` turns):
     - Build the prompt from system + task + observation history (matches `inference.py`)
     - Generate one AWS CLI command per turn
     - Step the environment, record `reward`, `task_achieved`, `partial_progress`
  3. Aggregate per-episode metrics

The agent loop is identical to the training-time `rollout_one_episode` in `train_grpo.py`: same prompt structure, same generation parameters, same termination logic. So the RL env eval is genuinely measuring what this model would do during a GRPO rollout.
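
To make the loop concrete, here is a sketch of one rollout. The `client.AwsRlEnv` method names and return shape are assumptions for illustration (the real client lives in the repo; `build_prompt` and `generate_one_command` are hypothetical helpers standing in for the notebook's prompt and generation code):

```python
from client import AwsRlEnv  # repo's env client; exact API assumed below

MAX_STEPS = 15

def run_episode(model, tokenizer, env: AwsRlEnv, task_id: str) -> dict:
    obs = env.reset(task_id=task_id)      # assumed: returns the initial observation
    history, total_reward, achieved = [], 0.0, False

    for step in range(MAX_STEPS):
        # Prompt = system + task + observation history (mirrors inference.py).
        prompt = build_prompt(task=obs, history=history)          # hypothetical
        command = generate_one_command(model, tokenizer, prompt)  # hypothetical

        result = env.step(command)        # assumed: dict with reward and flags
        total_reward += result["reward"]
        history.append((command, result["observation"]))

        if result.get("task_achieved"):
            achieved = True
            break

    return {"reward": total_reward, "steps": step + 1, "task_achieved": achieved}
```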


## 4. Metrics reported

### Dataset eval

| Metric | Definition |
|---|---|
| `format_ok` | % of responses where the extracted line starts with `aws` |
| `svc_match` | % matching the canonical service |
| `op_match` | % matching service + operation |
| `exact_match` | % matching the full canonical command token-for-token |

### RL env eval (per episode)

| Metric | Definition |
|---|---|
| `avg_episode_reward` | Mean total reward accumulated per episode (sum of step rewards) |
| `completion_rate` | % of episodes ending in `task_achieved=True` |
| `avg_steps_to_complete` | Mean steps used by completed episodes (lower = more efficient) |
| `avg_max_progress` | Mean of the highest `partial_progress` reached per episode |
| `hint_usage_rate` | % of episodes where the agent requested at least one hint |
| `format_failure_rate` | % of agent commands that failed the `aws` prefix gate |

The notebook produces per-tier breakdowns of all six metrics so you can see where SFT helped most (typically: warmup format-locking goes from ~85% → 100%; intermediate completion goes from a small base to a meaningful fraction).
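
Rolling episode records up into those six metrics takes only a few lines. A sketch, assuming each episode is a dict carrying the per-episode fields named above (field names are illustrative):

```python
def aggregate(episodes: list[dict]) -> dict:
    """Roll per-episode records up into the six reported metrics."""
    n = len(episodes)
    total_steps = sum(e["steps"] for e in episodes)
    completed = [e for e in episodes if e["task_achieved"]]
    return {
        "avg_episode_reward": sum(e["reward"] for e in episodes) / n,
        "completion_rate": len(completed) / n,
        "avg_steps_to_complete": (
            sum(e["steps"] for e in completed) / len(completed) if completed else None
        ),
        "avg_max_progress": sum(e["max_progress"] for e in episodes) / n,
        "hint_usage_rate": sum(e["hints_used"] > 0 for e in episodes) / n,
        "format_failure_rate": sum(e["format_failures"] for e in episodes) / total_steps,
    }
```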


## 5. How to run

### Prerequisites

  • HuggingFace token (HF_TOKEN) β€” needed to load the dataset and adapter
  • A running env server β€” either:
    • Your own HF Space deployment (set ENV_BASE_URL accordingly), or
    • Local server: make run from the repo root, then ENV_BASE_URL=http://localhost:8000
  • A GPU runtime (Colab T4 or better, A10/A100 ideal)

### Notebooks

| Notebook | Variant |
|---|---|
| `compare_base_vs_sft.ipynb` | Clean (no outputs) |
| `compare_base_vs_sft_with_outputs.ipynb` | Cell outputs preserved |

The two notebooks are functionally identical; the second has cell outputs preserved (18 display widgets, 26 stdout cells) for offline inspection.

### Running steps

  1. Open the notebook in Colab (or local Jupyter)
  2. Edit the CONFIG cell:

     ```python
     BASE_MODEL        = "unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit"
     SFT_ADAPTER_REPO  = "Sizzing/aws-rl-sft-qwen25coder3b-adapter"
     DATASET_REPO      = "Sizzing/aws-rl-sft"
     ENV_BASE_URL      = "https://your-hf-space.hf.space"   # or local
     ```

  3. Run all cells. Part 1 (dataset eval) finishes first; Part 2 (RL env eval) is the slow one.
  4. Compare the per-metric deltas between base and SFT.

## 6. Reading the results

### Actual numbers from the run

From the saved outputs of `compare_base_vs_sft_with_outputs.ipynb`:

#### Dataset eval

| Metric | Base | Base + SFT | Δ |
|---|---|---|---|
| `format_pct` | 33.3% | 100.0% | +66.7 pp |
| `format_after_extract_pct` | 100.0% | 100.0% | 0 |
| `exact_pct` | 38.9% | 88.9% | +50.0 pp |

#### RL env eval (live multi-step agent loop)

| Metric | Base | Base + SFT | Δ |
|---|---|---|---|
| `avg_episode_reward` | 1.187 | 2.011 | +0.824 |
| `reward_std` | 1.137 | 1.908 | +0.771 |
| `avg_steps` | 8.600 | 5.733 | −2.867 |
| `avg_reward_per_step` | 0.138 | 0.351 | +0.213 |

*Figure: RL-env eval, base vs SFT.*

The agent earns more reward per episode while taking fewer steps, which is exactly what good fine-tuning should produce. Reward-per-step jumps 2.5× because (a) the agent picks the right command more often (fewer wasted steps), and (b) format compliance is now perfect (no more `aws help` fallbacks).

### Per-tier success in the RL eval

From the notebook's per-rollout traces (3 episodes per tier × 5 tiers = 15 episodes per model):

| Tier | Base (rollouts ✓ / 3) | Base + SFT (rollouts ✓ / 3) |
|---|---|---|
| warmup | 3 | 3 |
| beginner | 3 | 3 |
| intermediate | 1 | 3 |
| advanced | 0 | 1 |
| expert | 0 | 2 |

SFT moves the success frontier up two tiers: the base model could not finish a single advanced or expert episode, while SFT completes 2 of 3 expert tasks (S3 lockdown, IAM least-privilege variants) within 5 steps.

### What counts as a meaningful delta?

The val set is small (150 rows; ~10 unique tasks per RL eval), so individual percentage points carry meaningful noise. Rules of thumb:

| Delta size | Significance |
|---|---|
| ±2 pp | Within noise; don't claim improvement |
| 5–10 pp | Likely real; look at the per-tier breakdown |
| >10 pp | Almost certainly real |

The deltas above (66.7 pp and 50.0 pp on the dataset eval; +0.82 reward and −2.9 steps on the RL eval) are well above the noise floor.
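
Those thresholds line up with a back-of-envelope binomial check (our own sanity check, not from the notebook): one standard error of a proportion p estimated from n samples is sqrt(p(1−p)/n).

```python
import math

def binomial_se_pp(p: float, n: int) -> float:
    """One standard error of a proportion, in percentage points."""
    return 100 * math.sqrt(p * (1 - p) / n)

# 150-row dataset eval: worst case (p = 0.5) is ~4.1 pp per model,
# so ±2 pp sits comfortably inside the noise band.
print(binomial_se_pp(0.5, 150))   # ~4.1
# 15-episode RL eval: far noisier, ~12.9 pp per model.
print(binomial_se_pp(0.5, 15))    # ~12.9
```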

### Going further with GRPO

Once the SFT adapter is in hand, the same comparison can be re-run against a GRPO adapter. Multi-step results from our reference GRPO run are documented in the main README §11; the short version is that GRPO@35-steps preserves SFT performance and modestly improves the middle tiers, while the expert tier remains the bottleneck.


## 7. Files in this directory

| File | Purpose |
|---|---|
| `compare_base_vs_sft.ipynb` | Side-by-side dataset + RL env benchmark (clean version) |
| `compare_base_vs_sft_with_outputs.ipynb` | Same notebook with cell outputs preserved (18 display widgets) |

## See also