
# data/ – SFT Dataset Generation & Base-Model Selection

← back to main README

This directory holds the SFT training corpus, the dataset generator that produced it, and the rigorous benchmark we used to pick the base model. Together they answer two questions a hackathon judge should be able to verify in under five minutes:

1. **What did we train on?** A 1,500-row synthetic SFT corpus with five trajectory types covering success, continuation, failure recovery, verification, and hint usage. (§1)
2. **Why this base model?** A reproducible 11-model benchmark across 27 held-out prompts. Qwen2.5-Coder-3B-Instruct wins on every metric that matters. (§5)

*Figure: top 4 candidate models on the held-out benchmark.*


## Table of contents

1. SFT dataset generation
2. Five trajectory types
3. Tier weighting
4. Dataset format & artifacts
5. Base-model selection – overview
6. Eval harness
7. HuggingFace publishing
8. Files in this directory

## 1. SFT dataset generation

`data/build_sft_dataset.py` – 27 KB, single-script generator.

### Approach

The dataset is synthetically generated but grounded in canonical solutions extracted from our integration test suite. Two design decisions worth flagging to judges:

### AST-based extraction, not pytest execution

Each `tests_tasks/test_<tier>_tasks.py` file has a top-level constant (`WARMUP_COMMANDS`, `BEGINNER_COMMANDS`, …) mapping `task_id` → canonical AWS CLI command. We extract these via Python's `ast` module; we do not execute the test file. Reasons:

1. pytest fixtures would spin up a MiniStack, hit AWS APIs, and add 30+ seconds of overhead per generation run.
2. Static extraction is deterministic, with no flake risk. The dataset is reproducible bit-for-bit given a seed.
3. The canonical solutions are intentionally simple constant declarations that `ast` can parse without import side effects.
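
A minimal sketch of this extraction, assuming the constants are plain dict literals (the real logic lives in `build_sft_dataset.py` and may differ in detail):

```python
import ast
from pathlib import Path

def extract_canonical_commands(test_file: Path) -> dict:
    """Collect task_id -> canonical-command pairs from a test module's
    top-level *_COMMANDS constants, without importing (executing) it."""
    tree = ast.parse(test_file.read_text())
    commands = {}
    for node in tree.body:                          # top-level statements only
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id.endswith("_COMMANDS"):
                    # Constant dict literals are safe to evaluate statically.
                    commands.update(ast.literal_eval(node.value))
    return commands
```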

### Plausible-output simulation

When generating multi-step continuations, we don't have a real MiniStack response to feed back into the user message; we have to fabricate one. The generator maps each AWS operation (`list-buckets`, `create-table`, `describe-instances`, …) to a JSON template, then interpolates the right resource names from the task. So an `aws s3api list-buckets` step in the user prompt history has output like:

{"Buckets":[{"Name":"my-app-data","CreationDate":"2026-04-15T..."}]}

…instead of the empty `{"Buckets":[]}` you'd get from a fresh MiniStack. This is the difference between the SFT model learning "first step: always answer with the canonical command" (degenerate) and "first step depends on what's already been done" (correct).
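
A sketch of this template-and-interpolate step; the table entries here are illustrative, and the real generator's templates in `build_sft_dataset.py` cover far more operations:

```python
import json

# Hypothetical template table: keys are AWS CLI operations, values build a
# plausible response dict from the task's resource names.
TEMPLATES = {
    "list-buckets": lambda p: {"Buckets": [
        {"Name": p["bucket"], "CreationDate": "2026-04-15T00:00:00Z"}]},
    "create-table": lambda p: {"TableDescription": {
        "TableName": p["table"], "TableStatus": "ACTIVE"}},
}

def simulate_output(operation: str, **params) -> str:
    """Fabricate a compact JSON response for one prior command."""
    return json.dumps(TEMPLATES[operation](params), separators=(",", ":"))

# simulate_output("list-buckets", bucket="my-app-data")
# -> '{"Buckets":[{"Name":"my-app-data","CreationDate":"2026-04-15T00:00:00Z"}]}'
```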

### Dynamic-ID filtering

Some tests reference resources whose IDs only exist at runtime: security groups (`sg-…`), subnets (`subnet-…`), VPCs (`vpc-…`), instance IDs (`i-…`). These commands cannot be deterministically captured by static extraction, so the generator skips any task whose canonical command contains those patterns (a sketch of the check follows below). The result: 72 unique tasks make it into the train split (out of 134 total tasks), all of which are deterministically reproducible.
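
A minimal version of the skip check; the exact pattern list in `build_sft_dataset.py` may differ:

```python
import re

# Runtime-only resource-ID prefixes named above; illustrative, not exhaustive.
DYNAMIC_ID = re.compile(r"\b(?:sg|subnet|vpc|i)-[0-9a-f]+\b")

def is_statically_reproducible(command: str) -> bool:
    """True if the canonical command contains no runtime-generated IDs."""
    return DYNAMIC_ID.search(command) is None
```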


## 2. Five trajectory types

The SFT corpus mixes five distinct trajectory shapes so the model learns to handle real multi-turn agent behavior, not just one-shot question answering. Actual proportions (from `data/sft/dataset_stats.json`):

| Source | Train % (target) | Rows | What the model sees |
|---|---|---|---|
| `success_first_step` | 55.1% (55%) | 826 | User: task description → assistant emits the canonical command |
| `multi_step_continuation` | 20.1% (20%) | 301 | User: task description + a baked-in history of N-1 prior commands and their outputs → assistant emits step N |
| `failure_recovery` | 15.5% (15%) | 232 | User: task description + step 1 of a wrong command and its simulated error → assistant emits the recovery command |
| `verification` | 4.5% (5%) | 67 | User: task already complete → assistant emits a read-only verification command |
| `hint_usage` | 4.9% (5%) | 74 | User: task description → assistant emits `aws help --task-hint` (the agent action that requests a hint) |

**Why include the last four sources at all?**

- `multi_step_continuation` trains continuation behavior. Without it, the model overfits to step 1 and degrades on later turns.
- `failure_recovery` teaches the model that a typo or wrong command is recoverable. The reward signal during GRPO is dense; the model needs to know what "try again" looks like.
- `verification` trains the model to recognize when a task is done and respond appropriately. Production agents must distinguish "do something" from "confirm it's done".
- `hint_usage` lets the model learn that `aws help --task-hint` is the in-environment way to request help, not just a literal CLI command.
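
All five shapes share the user-message layout shown concretely in §4. A sketch of how the continuation case might be assembled; the helper name and signature are hypothetical, but the field layout matches the row schema below:

```python
def render_user_message(task: str, history: list, progress: float,
                        achieved: bool, step: int) -> str:
    """Assemble the user turn for a multi_step_continuation row.
    `history` holds (command, output, reward) tuples for prior steps."""
    lines = [f"TASK: {task}", "", "PREVIOUS COMMANDS:"]
    for i, (cmd, output, reward) in enumerate(history, start=1):
        lines += [f"[{i}] $ {cmd}",
                  f"    output: {output}",
                  f"    reward: {reward:.2f}"]
    lines += ["", "---", "", "CURRENT OBSERVATION:",
              f"Progress: {progress:.2f}  Achieved: {achieved}  Step: {step}"]
    return "\n".join(lines)
```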

## 3. Tier weighting

`data/build_sft_dataset.py:54-60` – sampling weights:

| Tier | Weight | Train rows | Why |
|---|---|---|---|
| warmup | 0.50 | 456 | Most rows. Format-locks the model on the simplest possible `aws X list` pattern. |
| beginner | 0.30 | 378 | Single-resource creation; bread and butter. |
| intermediate | 0.15 | 666\* | Multi-step workflows. Actual count exceeds the target because each task contributes more rows via `multi_step_continuation`. |
| advanced | 0.05 | 0 | Cross-service architectures. Filtered out post-extraction (most have dynamic IDs). |
| expert | 0.00 | 0 | SRE / drift / security-posture. Intentionally excluded from SFT. |

**Why the expert tier is excluded from SFT.** The expert tasks (drift detection, security audits) have randomized state checks; there is no canonical command sequence. Trying to SFT on them would teach the model a particular fix script that is wrong on most episodes. These tasks are reserved for GRPO, where the env's `state_checks` reward signal handles the randomization correctly.

\* The intermediate row count exceeds the simple weight because the multi-step trajectory generator naturally produces multiple rows per task (one each for step 1, step 2, and so on).
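
The weighting itself is ordinary weighted sampling. A minimal sketch with a fixed seed (the `sample_tier` helper is illustrative; the weights mirror the table above):

```python
import random

# Weights from data/build_sft_dataset.py:54-60 (section 3's table).
TIER_WEIGHTS = {"warmup": 0.50, "beginner": 0.30, "intermediate": 0.15,
                "advanced": 0.05, "expert": 0.00}

def sample_tier(rng: random.Random) -> str:
    """Draw one tier according to the configured weights."""
    tiers, weights = zip(*TIER_WEIGHTS.items())
    return rng.choices(tiers, weights=weights, k=1)[0]

rng = random.Random(42)   # fixed seed -> bit-for-bit reproducible dataset
print(sample_tier(rng))
```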


## 4. Dataset format & artifacts

### JSONL chat-message schema

```json
{
  "messages": [
    {"role": "system", "content": "You are an AWS cloud engineer interacting with a real AWS environment via CLI..."},
    {"role": "user", "content": "TASK: Create an S3 bucket named my-app-data and enable versioning on it.\n\nPREVIOUS COMMANDS:\n[1] $ aws s3 mb s3://my-app-data\n    output: make_bucket: my-app-data\n    reward: 0.50\n\n---\n\nCURRENT OBSERVATION:\nProgress: 0.50  Achieved: False  Step: 2"},
    {"role": "assistant", "content": "aws s3api put-bucket-versioning --bucket my-app-data --versioning-configuration Status=Enabled"}
  ],
  "difficulty": "intermediate",
  "source": "multi_step_continuation",
  "task_id": 42
}
```

Every row carries `difficulty`, `source`, and `task_id` metadata, which is useful for filtering, ablations, and debugging.
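
For example, filtering a split by trajectory source is a one-liner over the JSONL (a minimal sketch; `load_rows` is a hypothetical helper, not part of the repo):

```python
import json

def load_rows(path: str, source: str = None):
    """Stream rows from a JSONL split, optionally filtered by `source`."""
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            if source is None or row["source"] == source:
                yield row

recoveries = list(load_rows("data/sft/aws_rl_sft.train.jsonl",
                            source="failure_recovery"))
```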

### Artifacts

`data/sft/`:

| File | Size | Rows | Unique tasks | Use |
|---|---|---|---|---|
| `aws_rl_sft.train.jsonl` | 2.2 MB | 1,500 | 72 | SFT training |
| `aws_rl_sft.val.jsonl` | 218 KB | 150 | 63 | SFT validation; basis for MODEL_EVALUATION.md |
| `aws_rl_sft.reserve.jsonl` | 294 KB | 200 | 66 | Held-out reserve for post-SFT regression checks |
| `dataset_stats.json` | 3.4 KB | – | – | Per-split source/tier/task breakdowns |
| `MODEL_EVALUATION.md` | 15 KB | – | – | Full model-selection writeup (§5) |
| `model_eval_full.json` | 209 KB | 297 | – | Per-call eval data (11 models × 27 prompts) |
| `deepseek_r1_rerun.json` | 5.3 KB | 27 | – | DeepSeek R1 re-run with `max_tokens=2048` |

## 5. Base-model selection – overview

This is the most rigorous decision in the whole project. Full reasoning, per-model verdicts, and methodology live in `data/sft/MODEL_EVALUATION.md`, a 270-line standalone report. Read it before judging the project's technical depth; it's what convinces us we're training the right thing.

The 30-second summary:

| Model | exact% | op% | fmt% | Latency | Verdict |
|---|---|---|---|---|---|
| qwen2.5-coder-3b-instruct | 41% | 63% | 85% | 3.1 s | ✅ Train this. Highest exact-match, fastest viable. |
| qwen/qwen3-4b-2507 | 33% | 59% | 100% | 10.4 s | Fallback. Perfect format, 3× slower. |
| qwen2.5-coder-1.5b-instruct | 22% | 44% | 81% | 2.5 s | Speed play if the GRPO budget is tight. |
| smollm2-1.7b-instruct | 7% | 37% | 63% | 2.1 s | ❌ Ceiling too low. |
| (7 more) | 0% | … | … | … | ❌ Format-broken or wrong domain. |

*Figure: per-model comparison of 5 quality metrics + latency.*

What the metrics mean:

- **fmt%**: the raw output starts with `aws` (no preamble, fences, or quotes). The agent's `inference.py:93` gate rejects everything else.
- **+xtr%**: fmt% after stripping markdown fences. The gap to fmt% means "the model knows the answer but wraps it in junk".
- **exact%**: the extracted command matches the canonical one token-for-token. The hardest metric.
- **svc%**: same AWS service as the canonical command. Domain orientation.
- **op%**: same service AND operation. The gap SFT closes most reliably.
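
A sketch of how these metrics can be computed from a raw completion and its canonical command. The exact stripping rules live in `eval_lm_studio_models.py` and `inference.py`, so treat this as an approximation:

```python
import re

def extract_command(raw: str):
    """Drop fence markers, then return the first line starting with `aws`."""
    text = re.sub(r"```[^\n]*", "", raw)
    for line in text.splitlines():
        line = line.strip().strip("`'\"")
        if line.startswith("aws "):
            return line
    return None

def score(raw: str, canonical: str) -> dict:
    cmd = extract_command(raw)
    tok, can = (cmd or "").split(), canonical.split()
    return {
        "fmt":   raw.startswith("aws "),   # no preamble, fences, or quotes
        "xtr":   cmd is not None,          # recoverable after stripping
        "exact": tok == can,               # token-for-token match
        "svc":   tok[:2] == can[:2],       # same AWS service
        "op":    tok[:3] == can[:3],       # same service AND operation
    }
```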

The full table (11 models, 9 metrics, per-call logs) is in `data/sft/model_eval_full.json` (297 records).


## 6. Eval harness

`data/eval_lm_studio_models.py` – 9.9 KB, reusable.

- Calls each chat model loaded in LM Studio at `http://localhost:1234/v1/chat/completions` (OpenAI-compatible API); see the sketch after this list
- Sends the same 27 held-out prompts to each model
- Extracts `aws ...` from the response (stripping fences / preamble)
- Compares against the canonical command from the val split
- Writes per-call detail + aggregate metrics to JSON
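
A minimal sketch of one call against LM Studio's OpenAI-compatible endpoint; the `ask` helper and its sampling parameters are illustrative, not the harness's exact settings:

```python
import requests

LM_STUDIO_URL = "http://localhost:1234/v1/chat/completions"

def ask(model: str, system: str, prompt: str) -> str:
    """One chat completion against a locally served LM Studio model."""
    resp = requests.post(LM_STUDIO_URL, timeout=120, json={
        "model": model,
        "messages": [{"role": "system", "content": system},
                     {"role": "user", "content": prompt}],
        "temperature": 0.0,      # deterministic for benchmarking
        "max_tokens": 256,
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```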

To re-run post-SFT:

```bash
.venv/bin/python data/eval_lm_studio_models.py \
    --max-per-combo 5 \
    --out data/sft/model_eval_postsft.json
```

A successful SFT run should see the following (predictions from MODEL_EVALUATION.md §11, actuals from our reference SFT run):

| Metric | Base | Target | Actual (post-SFT) |
|---|---|---|---|
| exact% | 39% | 75%+ | 88.9% ✅ |
| op% | 61% | 90%+ | 88.9% ≈ |
| svc% | 78% | – | 88.9% |
| fmt% | 33% | 100% | 100.0% ✅ |
| latency | 2.03 s | – | 1.40 s (faster) |

Every target from MODEL_EVALUATION.md is hit or essentially hit. Format compliance is now perfect; exact-match jumped 50 pp; the model is faster and tighter.

*Figures: base vs SFT comparison (eval metrics); single-step eval, base vs SFT.*


## 7. HuggingFace publishing

`data/upload_sft_to_hf.py` – pushes the JSONL splits to HuggingFace Hub:

| Split | Hub repo |
|---|---|
| train | `Sizzing/aws-rl-sft-qwen25coder3b-train` |
| val | `Sizzing/aws-rl-sft-qwen25coder3b-val` |
| reserve | `Sizzing/aws-rl-sft-qwen25coder3b-reserve` |
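
A minimal sketch of what the upload amounts to, using the `datasets` library; the script's actual flags and repo handling may differ:

```python
from datasets import load_dataset

# Push each JSONL split to its own Hub dataset repo (names from the table
# above). Assumes prior authentication, e.g. via `huggingface-cli login`.
for split, repo in {
    "train":   "Sizzing/aws-rl-sft-qwen25coder3b-train",
    "val":     "Sizzing/aws-rl-sft-qwen25coder3b-val",
    "reserve": "Sizzing/aws-rl-sft-qwen25coder3b-reserve",
}.items():
    ds = load_dataset("json",
                      data_files=f"data/sft/aws_rl_sft.{split}.jsonl",
                      split="train")
    ds.push_to_hub(repo)
```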

The trained SFT adapter (output of `train/train_sft_lora.ipynb`) is published separately at:

- `Sizzing/aws-rl-sft-qwen25coder3b-adapter`

GRPO training picks it up by setting `SFT_ADAPTER = "Sizzing/aws-rl-sft-qwen25coder3b-adapter"` in `aws_rl_env_colab.ipynb`.


## 8. Files in this directory

| File | Purpose |
|---|---|
| `build_sft_dataset.py` | Generator: AST extraction + 5 trajectory types + plausible outputs |
| `eval_lm_studio_models.py` | Base-model benchmark harness (LM Studio API) |
| `upload_sft_to_hf.py` | Push the SFT splits to HuggingFace |
| `sft/aws_rl_sft.train.jsonl` | 1,500 SFT training rows |
| `sft/aws_rl_sft.val.jsonl` | 150 validation rows |
| `sft/aws_rl_sft.reserve.jsonl` | 200 reserve rows |
| `sft/dataset_stats.json` | Per-split source / tier / task counts |
| `sft/MODEL_EVALUATION.md` | The base-model selection report (read this) |
| `sft/model_eval_full.json` | Per-call eval data (11 models × 27 prompts) |
| `sft/deepseek_r1_rerun.json` | R1 re-run with extended `max_tokens` |

## See also