# compare/ – Base Model vs SFT Adapter Benchmark
This directory holds the side-by-side benchmark that answers the only question that ultimately matters: did SFT actually make the model better at the task?
The benchmark compares the base Qwen2.5-Coder-3B-Instruct against our published SFT adapter `Sizzing/aws-rl-sft-qwen25coder3b-adapter` under two evaluation modes – a fast static dataset eval and a slow live-environment eval. Both write structured metrics so the deltas are explicit.
## Table of contents
- What's compared
- Two evaluation modes
- Methodology
- Metrics reported
- How to run
- Reading the results
- Files in this directory
## 1. What's compared

| | Base | SFT |
|---|---|---|
| Model | `unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit` | Same base + LoRA adapter |
| Adapter | None | `Sizzing/aws-rl-sft-qwen25coder3b-adapter` |
| Training data | Pretraining + Qwen instruction tuning | + 1,500 rows from `data/sft/aws_rl_sft.train.jsonl` |
| Inference | Same prompt template, same temperature | Identical |
The only variable is the LoRA adapter. Same base, same prompts, same decoding parameters, same evaluation set.
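To make "same everything except the adapter" concrete, here is a minimal sketch of how the two candidates could be instantiated, assuming a plain `transformers` + `peft` stack (the notebook itself may use Unsloth's loader instead):

```python
# Sketch only: assumes transformers, peft, bitsandbytes, and accelerate installed.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit"
SFT_ADAPTER_REPO = "Sizzing/aws-rl-sft-qwen25coder3b-adapter"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# Candidate 1: the base model, no adapter.
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map="auto")

# Candidate 2: identical base weights plus the published LoRA adapter.
sft = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map="auto")
sft = PeftModel.from_pretrained(sft, SFT_ADAPTER_REPO)

# Prompts, decoding parameters, and the eval set are shared downstream,
# so any metric delta is attributable to the adapter alone.
```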
## 2. Two evaluation modes
The notebook runs two separate evaluations because they answer different questions:
### Dataset eval (static)
| Question | Does the model emit the canonical command for held-out prompts, one-shot? |
|---|---|
| Speed | Fast (~minutes) |
| Needs | HF token + dataset access; no env server |
| Source | `data/sft/aws_rl_sft.val.jsonl` (150 held-out rows) |
| Verifies | Format correctness + command-token match against canonical |
This is the same kind of pattern-matching benchmark as `data/sft/MODEL_EVALUATION.md` – fast and deterministic. Useful as a regression check.

### RL env eval (live)
| Question | Can the model actually solve a task end-to-end against a live environment? |
|---|---|
| Speed | Slow (~tens of minutes per model) |
| Needs | Dataset eval above + a running env server (HF Space or local) |
| Source | Same val tasks, but exercised through `client.AwsRlEnv` round-trips |
| Verifies | Multi-step task completion, partial progress, reward shaping, hint usage |
This is closer to what training optimizes for. A model can score well on dataset eval (right command on step 1) but fail RL env eval (can't recover from a step 1 typo, can't continue past the first turn). Both signals matter.
## 3. Methodology

### Dataset eval
1. Load the `Sizzing/aws-rl-sft` dataset from HF Hub
2. For each row in `val`, build the prompt from `messages[:-1]` (system + user, drop assistant)
3. Generate the model's response (`max_new_tokens=128`, deterministic decoding)
4. Extract the AWS CLI line: strip markdown fences, find the first line starting with `aws`
5. Score against `messages[-1].content` (the canonical assistant response):
   - Format OK (extracted line starts with `aws`)
   - Service match (same first word after `aws`)
   - Operation match (same first two words)
   - Exact match (full token-for-token equality)
This mirrors the methodology in `eval_lm_studio_models.py`; the same scoring functions are reused.
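For illustration, a sketch of that extraction-and-scoring pipeline; the function names and fence handling here are assumptions, not the literal code from `eval_lm_studio_models.py`:

```python
def extract_aws_line(response: str) -> str | None:
    """Strip markdown fences and return the first line starting with 'aws'."""
    for line in response.replace("```", "\n").splitlines():
        line = line.strip()
        if line.startswith("aws "):
            return line
    return None


def score(extracted: str | None, canonical: str) -> dict:
    """Score one extracted command against the canonical assistant response."""
    if extracted is None:
        return {"format_ok": False, "svc": False, "op": False, "exact": False}
    got, want = extracted.split(), canonical.split()
    return {
        "format_ok": True,           # extracted line starts with 'aws'
        "svc": got[:2] == want[:2],  # 'aws <service>'
        "op": got[:3] == want[:3],   # 'aws <service> <operation>'
        "exact": got == want,        # full token-for-token equality
    }
```

Within this sketch the four checks nest (exact implies operation implies service implies format), so the post-extraction percentages can only decrease from format down to exact.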
### RL env eval
1. Connect to the running env at `ENV_BASE_URL` (default: an HF Space; can be overridden to local)
2. For each val task, run a full episode (up to `MAX_STEPS=15` turns):
   - Build the prompt from system + task + observation history (matches `inference.py`)
   - Generate one AWS CLI command per turn
   - Step the environment; record `reward`, `task_achieved`, `partial_progress`
3. Aggregate per-episode metrics
The agent loop is identical to the training-time `rollout_one_episode` in `train_grpo.py` – same prompt structure, same generation parameters, same termination logic. So the RL env eval is genuinely measuring "what would this model do during a GRPO rollout".
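Since `rollout_one_episode` itself lives in `train_grpo.py`, here is a self-contained sketch of the same loop; the `AwsRlEnv` method names and result fields are assumptions about the client API, not its documented surface:

```python
from client import AwsRlEnv  # the env client this repo ships

MAX_STEPS = 15

def rollout(generate_command, task_id: str, env_base_url: str) -> dict:
    """Run one episode. `generate_command` maps (obs, history) to one AWS CLI line.

    The reset()/step() signatures and result fields (observation, reward,
    task_achieved) are assumptions for illustration.
    """
    env = AwsRlEnv(base_url=env_base_url)
    obs = env.reset(task_id=task_id)  # task description + initial state
    history, total_reward, achieved, steps = [], 0.0, False, 0

    for steps in range(1, MAX_STEPS + 1):
        command = generate_command(obs, history)  # one command per turn
        result = env.step(command)
        history.append((command, result.observation))
        total_reward += result.reward
        if result.task_achieved:                  # terminate on success
            achieved = True
            break

    return {"reward": total_reward, "achieved": achieved, "steps": steps}
```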
## 4. Metrics reported

### Dataset eval
| Metric | Definition |
|---|---|
| `format_ok` | % of responses where the extracted line starts with `aws` |
| `svc_match` | % matching the canonical service |
| `op_match` | % matching service + operation |
| `exact_match` | % matching the full canonical command token-for-token |
### RL env eval (per episode)
| Metric | Definition |
|---|---|
| `avg_episode_reward` | Mean total reward accumulated per episode (sum of step rewards) |
| `completion_rate` | % of episodes ending in `task_achieved=True` |
| `avg_steps_to_complete` | Mean steps used by completed episodes (lower = more efficient) |
| `avg_max_progress` | Mean of the highest `partial_progress` reached per episode |
| `hint_usage_rate` | % of episodes where the agent requested at least one hint |
| `format_failure_rate` | % of agent commands that failed the `aws` prefix gate |
The notebook produces per-tier breakdowns of all six metrics so you can see where SFT helped most (typically: warmup format-locking goes from ~85% → 100%; intermediate completion goes from a small base to a meaningful fraction).
## 5. How to run

### Prerequisites
- HuggingFace token (`HF_TOKEN`) – needed to load the dataset and adapter
- A running env server – either:
  - your own HF Space deployment (set `ENV_BASE_URL` accordingly), or
  - a local server: `make run` from the repo root, then `ENV_BASE_URL=http://localhost:8000`
- A GPU runtime (Colab T4 or better; A10/A100 ideal)
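Before committing to the slow RL eval, it is worth confirming the env server is actually reachable. A minimal check that assumes nothing about the API beyond an HTTP server listening at `ENV_BASE_URL`:

```python
import requests

ENV_BASE_URL = "http://localhost:8000"  # or your HF Space URL

# Any HTTP response (even a 404) proves the server is up and routable.
resp = requests.get(ENV_BASE_URL, timeout=10)
print(resp.status_code)
```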
### Notebooks
| Notebook | Open in Colab |
|---|---|
| `compare_base_vs_sft.ipynb` (clean) | |
| `compare_base_vs_sft_with_outputs.ipynb` (with outputs) | |
The two notebooks are functionally identical; the second has cell outputs preserved (18 display widgets, 26 stdout cells) for offline inspection.
### Running steps
1. Open the notebook in Colab (or local Jupyter)
2. Edit the CONFIG cell:

   ```python
   BASE_MODEL = "unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit"
   SFT_ADAPTER_REPO = "Sizzing/aws-rl-sft-qwen25coder3b-adapter"
   DATASET_REPO = "Sizzing/aws-rl-sft"
   ENV_BASE_URL = "https://your-hf-space.hf.space"  # or local
   ```

3. Run all cells. Part 1 (dataset eval) finishes first; Part 2 (RL env eval) is the slow one.
4. Compare the per-metric deltas between base and SFT.
## 6. Reading the results

### Actual numbers from the run

From the saved outputs of `compare_base_vs_sft_with_outputs.ipynb`:

#### Dataset eval
| Metric | Base | Base + SFT | Δ |
|---|---|---|---|
| `format_pct` | 33.3% | 100.0% | +66.7 pp |
| `format_after_extract_pct` | 100.0% | 100.0% | 0 |
| `exact_pct` | 38.9% | 88.9% | +50.0 pp |
#### RL env eval (live multi-step agent loop)
| Metric | Base | Base + SFT | Δ |
|---|---|---|---|
| `avg_episode_reward` | 1.187 | 2.011 | +0.824 |
| `reward_std` | 1.137 | 1.908 | +0.771 |
| `avg_steps` | 8.600 | 5.733 | −2.867 |
| `avg_reward_per_step` | 0.138 | 0.351 | +0.213 |
The agent earns more reward per episode while taking fewer steps – exactly what good fine-tuning should produce. Reward-per-step jumps ~2.5× (0.138 → 0.351) because (a) the agent picks the right command more often (fewer wasted steps), and (b) format compliance is now perfect (no more `aws help` fallbacks).
### Per-tier success in the RL eval
From the notebook's per-rollout traces (3 episodes per tier × 5 tiers = 15 episodes per model):

| Tier | Base (rollouts completed / 3) | Base + SFT (rollouts completed / 3) |
|---|---|---|
| warmup | 3 | 3 |
| beginner | 3 | 3 |
| intermediate | 1 | 3 |
| advanced | 0 | 1 |
| expert | 0 | 2 |
SFT moves the success frontier up two tiers – the base model could not finish a single advanced or expert episode, while SFT completes 2 of 3 expert tasks (S3 lockdown, IAM least-privilege variants) within 5 steps.
### What counts as a meaningful delta?

The val set is small (150 rows; ~10 unique tasks per RL eval), so individual percentage-point deltas carry meaningful noise. Rules of thumb:
| Delta size | Significance |
|---|---|
| ±2 pp | Within noise – don't claim improvement |
| 5–10 pp | Likely real; check the per-tier breakdown |
| >10 pp | Almost certainly real |
The deltas above (66.7 pp and 50.0 pp on dataset eval; +0.82 reward and −2.9 steps on RL eval) are well above the noise floor.
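Those rules of thumb follow from simple binomial error on a 150-row set; a quick check, assuming independent rows (shared prompt templates would make the effective n smaller):

```python
import math

n = 150  # val rows in the dataset eval
for p in (0.1, 0.5, 0.9):
    se = math.sqrt(p * (1 - p) / n)  # standard error of a proportion
    print(f"p = {p:.1f}: 1 SE ~ {100 * se:.1f} pp")
# Near p = 0.5 one standard error is ~4 pp for a single proportion,
# which is why +/-2 pp reads as noise and >10 pp as almost certainly real.
```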
### Going further with GRPO

Once the SFT adapter is in hand, the same comparison can be re-run against a GRPO adapter. Multi-step results from our reference GRPO run are documented in the main README §11; the short version is that GRPO@35-steps preserves SFT performance and modestly improves the middle tiers, while the expert tier remains the bottleneck.
## 7. Files in this directory

| File | Purpose |
|---|---|
| `compare_base_vs_sft.ipynb` | Side-by-side dataset + RL env benchmark – clean version |
| `compare_base_vs_sft_with_outputs.ipynb` | Same notebook with cell outputs preserved (18 display widgets) |
## See also

- Main README – top-level overview, results section
- `data/README.md` – the dataset that drives this comparison
- `data/sft/MODEL_EVALUATION.md` – base-model selection benchmark (same scoring functions reused here)
- `train/README.md` – how the SFT adapter benchmarked here was produced
- `inference.py` – single-model agent loop (the prototype the RL eval mode is modeled after)


