
Model Evaluation - Picking the Best Base Model for SFT + GRPO on AWS RL Env

TL;DR

Train qwen2.5-coder-3b-instruct. It's the strongest candidate across every metric that matters for this task: highest exact-match rate, tightest outputs, and fast enough to not bottleneck GRPO rollouts. Full reasoning and per-model data below.


1. What this evaluation does

For each chat model loaded in LM Studio, we send 27 prompts drawn from our held-out validation split and measure how closely the model's output matches the canonical AWS CLI command that would solve the task. The goal is to pick the base model that:

  1. Starts strong - already understands AWS CLI syntax, so SFT can focus on task correctness instead of format-locking
  2. Has headroom - not so perfect that SFT overfits; not so weak that SFT can't help
  3. Is fast enough - GRPO generates G=8 rollouts per prompt × many prompts × many steps; inference cost compounds

This is a format-and-correctness screen. It does NOT measure:

  • Whether the model can run a multi-step task against the live env (that's a separate integration test)
  • Long-context behavior beyond ~500 tokens
  • Post-SFT performance (only base-model zero-shot)

2. Eval methodology

Prompts

  • Source: data/sft/aws_rl_sft.val.jsonl (150 rows)
  • Coverage: 3 examples per (tier, source) combo → 27 prompts per model
  • Combos cover: warmup + beginner + intermediate tiers × success_first_step + multi_step_continuation + failure_recovery + verification + hint_usage producers
  • Each prompt is sent exactly as inference.py would send it: system + user messages from the dataset, no assistant turn

Model invocation

  • Endpoint: LM Studio at http://localhost:1234/v1/chat/completions (OpenAI-compatible)
  • temperature: 0.0 (deterministic)
  • max_tokens: 120 (enough for any valid AWS command; truncates runaway prose)
  • timeout: 60s per call
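
For reference, a minimal sketch of one eval call against this endpoint (the endpoint and sampling settings match the list above; the helper name and its arguments are illustrative, not the actual eval script):

```python
# Minimal sketch of one eval call (assumes the OpenAI-compatible schema that
# LM Studio exposes; the helper name and arguments are illustrative).
import requests

ENDPOINT = "http://localhost:1234/v1/chat/completions"

def query_model(model_id: str, system_msg: str, user_msg: str) -> str:
    """Send one prompt exactly as inference.py would: system + user, no assistant turn."""
    payload = {
        "model": model_id,
        "messages": [
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_msg},
        ],
        "temperature": 0.0,   # deterministic decoding
        "max_tokens": 120,    # enough for any valid AWS command
    }
    resp = requests.post(ENDPOINT, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"].strip()
```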

Total budget

  • 11 chat models × 27 prompts = 297 API calls, completed in ~15 minutes

3. Metrics - what each column means

| Metric | What it measures | Why it matters |
|--------|------------------|----------------|
| fmt% | Raw model output starts with aws (no preamble, no fences, no prose) | Inference-time gate: inference.py:93 rejects anything that doesn't start with aws and replaces it with aws help. High fmt% = fewer wasted env steps. |
| +xtr% | After stripping markdown fences and leading prose, does the first aws ... line exist? | Measures "the model knows the answer but wraps it in junk." If +xtr% >> fmt%, the gap is all format noise: a simple regex in inference.py could recover most of it, or SFT can lock it cheaply. |
| exact% | Extracted command matches the canonical command token-for-token | The hardest metric: scored all the way down to exact flag values and escaping. This is the ceiling SFT has to reach. |
| svc% | Extracted command uses the same AWS service as canonical (e.g. both start with aws s3api) | Measures domain orientation: does the model know "this task calls for DynamoDB" even if it gets the exact operation wrong? |
| op% | Same AWS service AND same operation (e.g. both are aws s3api create-bucket) | Measures how close the model is to correct: it knows what to do, maybe not with which flags. This is the gap SFT closes most reliably. |
| lat | Mean seconds per call | Matters for GRPO rollout throughput. G=8 rollouts × 100 prompts × 5 steps = 4000 generations per training epoch. At 10s/call that's 11 hours; at 3s it's 3.3 hours. |
| len | Mean raw output length in characters | Proxy for verbosity. Lower = more concentrated signal for SFT loss; higher = the model likes to explain itself (bad for this task). |
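
A rough sketch of how these columns can be derived for a single prompt (the extraction and comparison rules below are illustrative assumptions, not the exact logic in data/eval_lm_studio_models.py):

```python
# Sketch of per-prompt scoring (illustrative only; not the actual eval script).
import re

def extract_aws_line(raw: str):
    """Return the first `aws ...` line after stripping fences and prose (+xtr%)."""
    text = re.sub(r"`{3}[a-zA-Z]*", "", raw)           # drop markdown code-fence markers
    for line in text.splitlines():
        line = line.strip().strip("'\"`").rstrip(".")  # strip quotes / trailing period
        if line.startswith("aws "):
            return line
    return None

def score(raw: str, canonical: str) -> dict:
    cmd = extract_aws_line(raw)
    cmd_tok = cmd.split() if cmd else []
    can_tok = canonical.split()
    return {
        "fmt":   raw.strip().startswith("aws "),     # fmt%: raw output passes the inference-time gate
        "xtr":   cmd is not None,                    # +xtr%: an aws line exists after cleanup
        "exact": cmd_tok == can_tok,                 # exact%: token-for-token match
        "svc":   len(cmd_tok) >= 2 and cmd_tok[:2] == can_tok[:2],   # svc%: same `aws <service>`
        "op":    len(cmd_tok) >= 3 and cmd_tok[:3] == can_tok[:3],   # op%: same service + operation
    }
```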

Symbols in per-call logs

  • ✓ - exact match with canonical command
  • ~ - format valid (after extraction) but content doesn't match canonical
  • ✗ - either no valid aws line or the output is malformed

4. Full results - 11 models × 27 prompts each

Model                                  n errs  fmt%  +xtr%  exact%  svc%   op%   lat   len
--------------------------------------------------------------------------------------------
qwen2.5-coder-3b-instruct             27    0   85%   100%     41%   70%   63%  3.1s   86  ⭐
qwen/qwen3-4b-2507                    27    0  100%   100%     33%   74%   59% 10.4s  108
qwen2.5-coder-1.5b-instruct           27    0   81%    85%     22%   48%   44%  2.5s  110
smollm2-1.7b-instruct                 27    0   63%    63%      7%   63%   37%  2.1s   87
smollm-360m-instruct                  27    0    0%    63%      0%   26%    7%  1.7s  402
smollm2-135m-instruct                 27    0    0%    59%      0%   15%    7%  1.1s  337
smollm-360m-instruct-v0.2             27    0    0%    56%      0%   15%    7%  2.2s  364
smollm2-360m-instruct                 27    0   52%    52%      0%   48%   33%  1.0s  137
smollm-1.7b-instruct-v0.2             27    0    0%    37%      0%   15%   11%  3.9s  342
smollm2-360m (base)                   27    0    0%     0%      0%    0%    0%  1.7s  390
deepseek-r1-distill-qwen-1.5b         27    0    0%     0%      0%    0%    0%  4.1s    0†

† DeepSeek-R1-Distill was truncated by max_tokens=120 during its <think>...</think> reasoning phase. We re-ran it separately with max_tokens=2048; see its per-model verdict in section 5 for the real numbers.

5. Per-model verdicts

⭐ qwen2.5-coder-3b-instruct - recommended

Evidence

  • exact% = 41% - highest of any model tested
  • op% = 63% - best service+operation recognition; it knows what most tasks need
  • len = 86 chars - tightest output in the test (even tighter than qwen3-4b at 108)
  • lat = 3.1s - 3.4× faster than qwen3-4b with better accuracy
  • Correctly handled aws cognito-idp create-user-pool --pool-name app-users (intermediate tier)
  • Correctly handled aws rds create-db-instance --db-instance-identifier app-database --engine mysql (a notoriously long command)

Weaknesses

  • fmt% = 85% (not 100%) - occasionally wraps commands in '...' quotes or adds a trailing period. SFT fixes this in one epoch.
  • Sometimes picks the wrong operation within the right service (e.g. create-user-pool-client instead of create-user-pool). Failure-recovery rows in your SFT dataset address this directly.

Training implications

  • Recommended LoRA config: r=8, α=16, 2 epochs, lr=2e-4 - the model is already strong enough that r=16 would memorize rather than generalize
  • Expected post-SFT performance: exact% > 75%, op% > 90%
  • Inference cost during GRPO: ~3× cheaper than qwen3-4b

qwen/qwen3-4b-2507 - strong runner-up

Evidence

  • fmt% = 100% - the only model that never produces preamble, quotes, or fences
  • exact% = 33%, svc% = 74% - still very good
  • lat = 10.4s - roughly 3× slower than qwen2.5-coder-3b due to 33% more parameters

Weaknesses

  • The latency is a real problem for GRPO at scale: 10s × G=8 rollouts × 100 prompts is about 2.2 hours of generation per pass over the prompt set
  • Lower op% than qwen2.5-coder-3b (59% vs 63%) despite being larger - suggests coder-tuning beats raw scale for this task

Verdict: use only if post-SFT evaluation on qwen2.5-coder-3b falls short of expectations. Otherwise the smaller coder model dominates.


qwen2.5-coder-1.5b-instruct - the speed play

Evidence

  • fmt% = 81%, +xtr% = 85%, exact% = 22%
  • lat = 2.5s - fastest of the viable candidates
  • 1.5B parameters - ~2× cheaper inference than the 3B

Weaknesses

  • 22% exact-match is a real accuracy gap from the 3B (41%)
  • Sometimes confuses related operations (e.g. put-secret-value instead of create-secret)

Verdict: keep as a fallback. If your GRPO budget is tight, the 2Γ— throughput might justify the accuracy hit β€” but only after confirming SFT can close the gap. Recommended only if you plan to run many thousands of GRPO episodes.


smollm2-1.7b-instruct - best of the SmolLMs, but not enough

Evidence

  • exact% = 7% (2/27 correct) - the only SmolLM variant above zero
  • svc% = 63% - knows which service most tasks target
  • Picks up service names fairly often but almost always with wrong operation or flags

Weaknesses

  • A 34-point accuracy gap to qwen2.5-coder-3b on the critical exact% metric
  • Frequent hallucinations: aws s3 mb s3://firehose-delivery/ --profile aws-dev-prod (made-up profile flag)

Verdict: not worth training. The post-SFT ceiling will be limited by the base model's sparse AWS knowledge.


smollm2-135m-instruct - surprising +xtr%, zero substance

Evidence

  • +xtr% = 59% - emits aws-prefixed lines more often than half the larger SmolLMs
  • exact% = 0%, op% = 7% - complete syntax salad behind the prefix

Example outputs

  • aws s3 ls --bucket=/path/to/s3 -o /path/to/s3-output.json -n notifications (hallucinated flags for list-topics task)
  • aws elastic describe-cache-clusters --cluster=my_elastiCache (wrong service name, fabricated flags)

Verdict: it produces convincing-looking CLI syntax but none of it is valid. A completely different failure mode from the 360M models (which dump prose), and equally useless.


smollm-360m-instruct / smollm-360m-instruct-v0.2 / smollm2-360m-instruct

All three fail similarly:

  • fmt% either 0% (dumps prose or Python code) or ~50% (emits quoted strings like "'aws s3 ls'")
  • exact% = 0% across the board
  • Outputs often include markdown code fences, step-by-step narration, or hallucinated boto3 code

Verdict: ineligible. Format instability makes SFT expensive and the base capability is absent.


smollm-1.7b-instruct-v0.2 - size doesn't save it

Evidence

  • Same parameter count as smollm2-1.7b-instruct but older / different training
  • +xtr% = 37% vs. 63% for smollm2-1.7b - the training difference matters more than scale
  • 0% exact match, 11% op match

Verdict: the newer smollm2-1.7b-instruct is strictly better; this variant has no role.


smollm2-360m (base, not instruct)

Evidence

  • 0% across every column
  • Echoes the prompt back verbatim

Verdict: base models without instruction tuning are architecturally wrong for a chat-format SFT setup. Skip.


deepseek-r1-distill-qwen-1.5b - wrong tool for this job

Original run (max_tokens=120)

  • 0% across the board, 0-char outputs
  • Cause: R1 models emit <think>...</think> reasoning blocks of 500-2000 tokens before their answer; a 120-token budget truncated every response mid-thinking.

Re-run (max_tokens=2048)

  • exact% = 0/27 (still zero)
  • avg latency = 16.0s (roughly 1.5× slower than qwen3-4b's 10.4s, due to thinking overhead)
  • 2 calls timed out at 60s
  • Typical outputs: aws s3 bucket-create --bucket data-pipeline (invented op), aws s3 topic --name Alerts (wrong service), aws iam checkRolePolicy (hallucinated op name)

Why it fails

  • R1-distill was trained on math and coding reasoning, not the AWS CLI
  • The <think> pattern doesn't summon domain knowledge that isn't in the base model
  • Qwen-1.5B's AWS knowledge is sparse; wrapping it in reasoning tokens doesn't add substance

Verdict: only useful if you specifically want GRPO-with-thinking from day one AND are willing to do heavier SFT. For this task, qwen2.5-coder-3b + emergent reasoning during GRPO (R1-Zero style) is the cleaner path.

6. How to read the gap between fmt% and +xtr%

This gap tells you what kind of SFT each model needs:

  • qwen/qwen3-4b-2507: fmt% = +xtr% = 100% → zero format-locking needed; SFT can focus entirely on task correctness
  • qwen2.5-coder-3b: 85% → 100% → small format tax (quotes, trailing punctuation); one epoch of SFT fixes it
  • smollm-360m-instruct: 0% → 63% → the model knows what to say but always wraps it in prose. A regex post-processor could salvage 63% without any training, but it's also cheap signal to SFT on
  • deepseek-r1-distill: 0% → 0% → format-broken even with a reasoning budget; not recoverable by regex

7. Overall ranking (for SFT + GRPO)

| Rank | Model | Train? | Reasoning |
|------|-------|--------|-----------|
| 1 | qwen2.5-coder-3b-instruct | ✅ | Best exact%, best op%, cleanest output, fast enough for GRPO |
| 2 | qwen/qwen3-4b-2507 | ⚠️ fallback | Perfect format but 3× slower and slightly worse content than #1 |
| 3 | qwen2.5-coder-1.5b-instruct | ⚠️ speed play | Strong for its size; train only if GRPO throughput is critical |
| 4 | smollm2-1.7b-instruct | ❌ | 34-point gap on exact% vs #1; ceiling too low |
| - | All smaller SmolLMs | ❌ | Format-broken, zero exact match, hallucinated syntax |
| - | smollm-1.7b-instruct-v0.2 | ❌ | Strictly dominated by smollm2-1.7b-instruct |
| - | deepseek-r1-distill-qwen-1.5b | ❌ | Wrong domain; latency roughly 1.5× worse than #2 |

8. Caveats & limitations

  • 27 prompts is a sample, not an exhaustive benchmark. The error bars on exact% are ±5-10 percentage points (see the quick check after this list). For close calls (like coder-3b vs qwen3-4b), rerun with --max-per-combo 5 or higher before making the final call.
  • LM Studio latency is serving-architecture-dependent. The 10s/call for qwen3-4b reflects Metal / llama.cpp on your local Mac. During actual training we'll run on CUDA via transformers (100ms forward pass) or vLLM (30ms), and the picture changes.
  • We only measure single-turn behavior. Multi-step task completion (does the model actually solve the episode end-to-end?) requires running against the live env. This eval predicts first-step performance, which correlates well but isn't the same thing.
  • R1-distill was tested twice: once with the default budget that truncated its thinking, once with max_tokens=2048. The results table in section 4 shows the truncated numbers; the real performance is the re-run reported in section 5.
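
The ±5-10 point figure above is roughly the one-sigma binomial standard error at n = 27; a quick check:

```python
# One-sigma binomial standard error at n = 27, in percentage points
# (example rates taken from the table in section 4).
import math

def stderr_pp(p: float, n: int = 27) -> float:
    return 100 * math.sqrt(p * (1 - p) / n)

print(f"exact% = 41%: +/- {stderr_pp(0.41):.1f} pp")   # ~9.5 pp
print(f"exact% = 22%: +/- {stderr_pp(0.22):.1f} pp")   # ~8.0 pp
print(f"exact% =  7%: +/- {stderr_pp(0.07):.1f} pp")   # ~4.9 pp
```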

9. Training implications - if you pick qwen2.5-coder-3b-instruct

  • LoRA: r=8, lora_alpha=16, target_modules=["q_proj","k_proj","v_proj","o_proj"], lora_dropout=0.05 - lower rank than the default because the base model is already strong (a config sketch follows this list)
  • Training: num_train_epochs=2, lr=2e-4, effective_batch=16, max_seq_length=512, lr_scheduler="cosine" - shorter than the plan for Llama-3.1-8B; don't over-train
  • Expected post-SFT: fmt% → 100%, op% → 90%+, exact% → 75%+
  • GRPO after SFT: ~3× cheaper rollouts than qwen3-4b, so more exploration per compute budget
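
A sketch of this configuration using peft and TRL's SFTTrainer. Treat it as a starting point rather than the project's actual training script: argument names shift slightly between trl versions, and the train-split path below is an assumption.

```python
# Sketch only: trl/peft argument names vary by version, and the train-file path
# below is assumed, not confirmed by this evaluation.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

train_ds = load_dataset("json", data_files="data/sft/aws_rl_sft.train.jsonl")["train"]

peft_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="outputs/qwen2.5-coder-3b-aws-sft",
    num_train_epochs=2,
    learning_rate=2e-4,
    per_device_train_batch_size=4,     # 4 x grad-accum 4 = effective batch 16
    gradient_accumulation_steps=4,
    max_seq_length=512,
    lr_scheduler_type="cosine",
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-Coder-3B-Instruct",  # HF id of the recommended base model
    args=args,
    train_dataset=train_ds,
    peft_config=peft_config,
)
trainer.train()
```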

10. Files produced by this evaluation

  • model_eval_full.json - the full baseline results for all 11 models (the baseline referenced in section 11)

11. How to rerun this evaluation post-SFT

After training, save the merged model to LM Studio and rerun:

.venv/bin/python data/eval_lm_studio_models.py \
    --max-per-combo 5 \
    --out data/sft/model_eval_postsft.json

Compare the exact% and op% deltas vs the baseline in model_eval_full.json. A successful SFT run should see:

  • exact%: 41% → 75%+
  • op%: 63% → 90%+
  • fmt%: 85% → 100%
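
A small sketch of that delta check. The JSON layout and the baseline file path below are assumptions (per-model summary rates keyed by metric name); adjust them to the real files.

```python
# Assumed layout: {"<model-name>": {"exact": 0.41, "op": 0.63, "fmt": 0.85, ...}, ...}
import json

MODEL = "qwen2.5-coder-3b-instruct"

def load_rates(path: str) -> dict:
    with open(path) as f:
        return json.load(f)[MODEL]

base = load_rates("data/sft/model_eval_full.json")      # baseline (path assumed)
post = load_rates("data/sft/model_eval_postsft.json")   # post-SFT rerun

for metric, target in [("exact", 0.75), ("op", 0.90), ("fmt", 1.00)]:
    delta = post[metric] - base[metric]
    status = "OK" if post[metric] >= target else "below target"
    print(f"{metric:>5}: {base[metric]:.0%} -> {post[metric]:.0%} ({delta:+.0%})  {status}")
```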

If those deltas don't land, something's wrong with the training, not the dataset.