`data/` — SFT Dataset Generation & Base-Model Selection


This directory holds the SFT training corpus, the dataset generator that produced it, and the rigorous benchmark we used to pick the base model. Together they answer two questions a hackathon judge should be able to verify in under five minutes:

What did we train on? A 1,500-row synthetic SFT corpus with five trajectory types covering success, continuation, failure recovery, verification, and hint usage. (§1)
Why this base model? A reproducible 11-model benchmark across 27 held-out prompts. Qwen2.5-Coder-3B-Instruct wins on every metric that matters. (§5)

SFT dataset generation
Five trajectory types
Tier weighting
Dataset format & artifacts
Base-model selection — overview
Eval harness
HuggingFace publishing
Files in this directory

1. SFT dataset generation

data/build_sft_dataset.py — 27 KB, single-script generator.

Approach

The dataset is synthetically generated but grounded in canonical solutions extracted from our integration test suite. The design decisions worth flagging to judges:

AST-based extraction, not pytest execution

Each tests_tasks/test_<tier>_tasks.py file has a top-level constant (WARMUP_COMMANDS, BEGINNER_COMMANDS, …) mapping task_id → canonical AWS CLI command. We extract these via Python's ast module — we do not execute the test file. Reasons:

pytest fixtures would spin up a MiniStack, hit AWS APIs, and add 30+ seconds of overhead per generation run.
Static extraction is deterministic — no flake risk. The dataset is reproducible bit-for-bit given a seed.
The canonical solutions are intentionally simple constant declarations that AST can parse without import side effects.
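A minimal sketch of static extraction, assuming each constant is a plain dict literal at module top level. The sample source string and the `extract_commands` helper are hypothetical stand-ins, not the code in build_sft_dataset.py.

```python
import ast

# Hypothetical stand-in for a tests_tasks/test_<tier>_tasks.py file.
SAMPLE_TEST_SOURCE = '''
import pytest

WARMUP_COMMANDS = {
    1: "aws s3api list-buckets",
    2: "aws dynamodb list-tables",
}

def test_something(ministack):
    raise RuntimeError("never executed by the extractor")
'''


def extract_commands(source: str, suffix: str = "_COMMANDS") -> dict[int, str]:
    """Return task_id -> command from top-level dict constants, without running the file."""
    commands: dict[int, str] = {}
    for node in ast.parse(source).body:  # top-level statements only
        if not isinstance(node, ast.Assign):
            continue
        for target in node.targets:
            if isinstance(target, ast.Name) and target.id.endswith(suffix):
                # literal_eval accepts only literals, so no test code is executed.
                commands.update(ast.literal_eval(node.value))
    return commands


extracted = extract_commands(SAMPLE_TEST_SOURCE)
print(extracted)
```

Because the file is parsed and never imported, the `pytest` import and the fixture-dependent test function in the sample have no effect on the result.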

Plausible-output simulation

When generating multi-step continuations, we don't have a real MiniStack response to feed back into the user message — we have to fabricate one. The generator maps each AWS operation (list-buckets, create-table, describe-instances, …) to a JSON template, then interpolates the right resource names from the task. So an aws s3api list-buckets step in the user prompt history has output like:

{"Buckets":[{"Name":"my-app-data","CreationDate":"2026-04-15T..."}]}

…instead of the empty {"Buckets":[]} you'd get from a fresh MiniStack. This is the difference between the SFT model learning "first step, always answer with the canonical command" (degenerate) and "first step depends on what's already been done" (correct).
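The template-and-interpolate idea can be sketched as follows. The `OUTPUT_TEMPLATES` map, the placeholder names, and both helpers are illustrative assumptions; the real generator's templates may look different.

```python
import json

# Hypothetical operation -> output template map (assumption, not the real table).
OUTPUT_TEMPLATES = {
    "list-buckets": {"Buckets": [{"Name": "{bucket}", "CreationDate": "2026-04-15T..."}]},
    "list-tables": {"TableNames": ["{table}"]},
}


def fill(template, names: dict[str, str]):
    """Recursively substitute {placeholders} in every string of a JSON-like structure."""
    if isinstance(template, str):
        return template.format(**names)
    if isinstance(template, list):
        return [fill(item, names) for item in template]
    if isinstance(template, dict):
        return {key: fill(value, names) for key, value in template.items()}
    return template


def simulate_output(command: str, names: dict[str, str]) -> str:
    """Fabricate a plausible JSON response for the operation in `command`."""
    operation = command.split()[2]  # assumes the shape "aws <service> <operation> ..."
    template = OUTPUT_TEMPLATES.get(operation, {})
    return json.dumps(fill(template, names), separators=(",", ":"))


simulated = simulate_output("aws s3api list-buckets", {"bucket": "my-app-data"})
print(simulated)
```

Operations without a template fall back to an empty JSON object in this sketch; how the actual generator handles unmapped operations is not stated here.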

Dynamic-ID filtering

Some tests reference resources whose IDs only exist at runtime — security groups (sg-…), subnets (subnet-…), VPCs (vpc-…), instance IDs (i-…). These commands cannot be deterministically captured by static extraction. The generator skips any task whose canonical command contains those patterns. The result: 72 unique tasks make it into the train split (out of 134 total tasks), all of which are deterministically reproducible.
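A hedged sketch of the filter, using the four prefixes named above. The hex-suffix pattern is an assumption about how runtime IDs appear in the canonical commands; the generator's actual patterns may be broader or narrower.

```python
import re

# Prefixes come from the text; the 8-17 hex-character suffix is an assumption.
DYNAMIC_ID = re.compile(r"\b(?:sg|subnet|vpc|i)-[0-9a-f]{8,17}\b")


def has_dynamic_id(command: str) -> bool:
    """True if the command references a resource ID that only exists at runtime."""
    return DYNAMIC_ID.search(command) is not None


def keep_static(commands: dict[int, str]) -> dict[int, str]:
    """Drop every task whose canonical command contains a runtime-only ID."""
    return {task_id: cmd for task_id, cmd in commands.items() if not has_dynamic_id(cmd)}


candidates = {
    1: "aws s3api list-buckets",
    2: "aws ec2 delete-security-group --group-id sg-0123456789abcdef0",
}
print(keep_static(candidates))
```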

2. Five trajectory types

The SFT corpus mixes five distinct trajectory shapes so the model learns to handle real multi-turn agent behavior, not just one-shot question answering. Actual proportions (from data/sft/dataset_stats.json):

Source	Train pct (target)	Train rows	What the model sees
`success_first_step`	55.1% (55%)	826	User → Task description → assistant emits the canonical command
`multi_step_continuation`	20.1% (20%)	301	User → Task description + a baked-in history of N-1 prior commands and their outputs → assistant emits step N
`failure_recovery`	15.5% (15%)	232	User → Task description + step 1 of a wrong command and its simulated error → assistant emits the recovery command
`verification`	4.5% (5%)	67	User → Task already complete → assistant emits a read-only verification command
`hint_usage`	4.9% (5%)	74	User → Task description → assistant emits `aws help --task-hint` (the agent action that requests a hint)

Why include the last four sources at all?

multi_step_continuation trains continuation behavior. Without it, the model overfits to step 1 and degrades on later turns.
failure_recovery teaches the model that a typo / wrong command is recoverable. The reward signal during GRPO is dense — the model needs to know what "try again" looks like.
verification trains the model to recognize when a task is done and respond appropriately. Production agents must distinguish "do something" from "confirm it's done".
hint_usage lets the model learn that aws help --task-hint is the in-environment way to request help, not just a literal CLI command.
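One way realized proportions end up close to, but not exactly on, their targets (55.1% against 55%, for example) is independent weighted draws from a seeded RNG. The sketch below assumes that mechanism; `sample_sources` is a hypothetical helper and the seed value is arbitrary.

```python
import random
from collections import Counter

# Target mix from the table above.
SOURCE_TARGETS = {
    "success_first_step": 0.55,
    "multi_step_continuation": 0.20,
    "failure_recovery": 0.15,
    "verification": 0.05,
    "hint_usage": 0.05,
}


def sample_sources(n_rows: int, seed: int) -> list[str]:
    """Draw one trajectory source per row; the same seed gives the same draw."""
    rng = random.Random(seed)  # private RNG, independent of global random state
    names = list(SOURCE_TARGETS)
    weights = list(SOURCE_TARGETS.values())
    return rng.choices(names, weights=weights, k=n_rows)


draw = sample_sources(1500, seed=0)
counts = Counter(draw)
for name in SOURCE_TARGETS:
    print(f"{name}: {counts[name]} rows ({counts[name] / len(draw):.1%})")
```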

3. Tier weighting

data/build_sft_dataset.py:54-60 — sampling weights:

Tier	Weight	Train rows	Why
warmup	0.50	456	Highest sampling weight. Format-locks the model on the simplest possible "aws X list" pattern.
beginner	0.30	378	Single-resource creation — bread and butter.
intermediate	0.15	666 *	Multi-step workflows. Note actual count > target because each task contributes more rows via multi_step_continuation.
advanced	0.05	0	Cross-service architectures. Filtered out post-extraction (most have dynamic IDs).
expert	0.00	0	SRE / drift / security-posture. Intentionally excluded from SFT.

Why expert tier is excluded from SFT. The expert tasks (drift detection, security audits) have randomized state checks — there is no canonical command sequence. Trying to SFT on them would teach the model a particular fix script that is wrong on most episodes. These tasks are reserved for GRPO, where the env's state_checks reward signal handles the randomization correctly.

* Intermediate row count exceeds the simple weight because the multi-step trajectory generator naturally produces multiple rows per task (one for step 1, step 2, etc.).
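The row multiplication behind that footnote can be illustrated with a small sketch. `rows_for_task` is a hypothetical helper showing one row per step; the realized shares are computed from the train-row counts in the table.

```python
def rows_for_task(task_id: int, steps: list[str]) -> list[dict]:
    """Emit one row per step: the target is step N, the history is steps 1..N-1."""
    return [
        {"task_id": task_id, "history": steps[:n], "target": steps[n]}
        for n in range(len(steps))
    ]


single = rows_for_task(1, ["aws s3api list-buckets"])
multi = rows_for_task(42, [
    "aws s3 mb s3://my-app-data",
    "aws s3api put-bucket-versioning --bucket my-app-data "
    "--versioning-configuration Status=Enabled",
])
print(len(single), len(multi))

# Realized tier shares, from the train-row counts in the table above.
TRAIN_ROWS = {"warmup": 456, "beginner": 378, "intermediate": 666}
total = sum(TRAIN_ROWS.values())
realized = {tier: round(rows / total, 3) for tier, rows in TRAIN_ROWS.items()}
print(realized)
```

A two-step task yields two rows where a one-step task yields one, which is how a tier with a 0.15 weight can end up with 44.4% of the train rows.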

4. Dataset format & artifacts

JSONL chat-message schema

{
  "messages": [
    {"role": "system", "content": "You are an AWS cloud engineer interacting with a real AWS environment via CLI..."},
    {"role": "user", "content": "TASK: Create an S3 bucket named my-app-data and enable versioning on it.\n\nPREVIOUS COMMANDS:\n[1] $ aws s3 mb s3://my-app-data\n    output: make_bucket: my-app-data\n    reward: 0.50\n\n---\n\nCURRENT OBSERVATION:\nProgress: 0.50  Achieved: False  Step: 2"},
    {"role": "assistant", "content": "aws s3api put-bucket-versioning --bucket my-app-data --versioning-configuration Status=Enabled"}
  ],
  "difficulty": "intermediate",
  "source": "multi_step_continuation",
  "task_id": 42
}

Every row carries the difficulty, source, and task_id metadata — useful for filtering, ablations, and debugging.
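A short sketch of filtering on that metadata. It reads from an in-memory sample so it runs without the dataset; `read_jsonl` is a hypothetical helper, and the sample rows use empty `messages` lists for brevity.

```python
import io
import json
from collections import Counter


def read_jsonl(handle):
    """Yield one dict per non-empty line of a JSONL stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory stand-in for a split file such as data/sft/aws_rl_sft.train.jsonl.
sample = io.StringIO(
    '{"messages": [], "difficulty": "intermediate", "source": "multi_step_continuation", "task_id": 42}\n'
    '{"messages": [], "difficulty": "warmup", "source": "success_first_step", "task_id": 1}\n'
)

rows = list(read_jsonl(sample))
by_source = Counter(row["source"] for row in rows)
intermediate_only = [row for row in rows if row["difficulty"] == "intermediate"]
print(by_source, len(intermediate_only))
```

To run this against a real split, replace the `io.StringIO` sample with `open("data/sft/aws_rl_sft.train.jsonl", encoding="utf-8")`.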

Artifacts

data/sft/:

File	Size	Rows	Unique tasks	Use
aws_rl_sft.train.jsonl	2.2 MB	1,500	72	SFT training
aws_rl_sft.val.jsonl	218 KB	150	63	SFT validation; basis for MODEL_EVALUATION.md
aws_rl_sft.reserve.jsonl	294 KB	200	66	Held-out reserve for post-SFT regression checks
dataset_stats.json	3.4 KB	—	—	Per-split source/tier/task breakdowns
MODEL_EVALUATION.md	15 KB	—	—	Full model-selection writeup (§5)
model_eval_full.json	209 KB	297	—	Per-call eval data (11 models × 27 prompts)
deepseek_r1_rerun.json	5.3 KB	27	—	DeepSeek R1 re-run with `max_tokens=2048`

5. Base-model selection — overview

This is the most rigorous decision in the whole project. Full reasoning, per-model verdicts, and methodology live in data/sft/MODEL_EVALUATION.md, a 270-line standalone report. Read it before judging the project's technical depth; it's what convinces us we're training the right thing.

The 30-second summary:

Model	exact%	op%	fmt%	Latency	Verdict
qwen2.5-coder-3b-instruct	41%	63%	85%	3.1s	✅ Train this. Highest exact, fastest viable.
qwen/qwen3-4b-2507	33%	59%	100%	10.4s	Fallback. Perfect format, 3× slower.
qwen2.5-coder-1.5b-instruct	22%	44%	81%	2.5s	Speed play if GRPO budget tight.
smollm2-1.7b-instruct	7%	37%	63%	2.1s	❌ Ceiling too low.
(7 more)	0%	…	…	…	❌ Format-broken or wrong domain.

What the metrics mean:

fmt%: raw output starts with aws (no preamble, fences, or quotes). The agent's inference.py:93 gate rejects everything else.
+xtr%: fmt% after stripping markdown fences. The gap between +xtr% and fmt% measures how often the model knows the answer but wraps it in junk.
exact%: extracted command matches canonical token-for-token. The hardest metric.
svc%: same AWS service as canonical. Domain orientation.
op%: same service AND operation. The gap SFT closes most reliably.
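The per-response checks behind these metrics can be sketched as below. This is an interpretation of the definitions above, not the harness's code: shell-style tokenization via `shlex` and the first-line extraction rule are assumptions.

```python
import shlex

TICKS = "`" * 3  # markdown fence marker


def strip_fences(raw: str) -> str:
    """Drop markdown fence lines and return the first remaining non-empty line."""
    lines = [line.strip() for line in raw.strip().splitlines()]
    lines = [line for line in lines if line and not line.startswith(TICKS)]
    return lines[0] if lines else ""


def tokens(command: str) -> list[str]:
    """Tokenize like a shell; fall back to whitespace split on unbalanced quotes."""
    try:
        return shlex.split(command)
    except ValueError:
        return command.split()


def score(raw: str, canonical: str) -> dict[str, bool]:
    """Per-response booleans; averaging them over prompts gives the percentages."""
    extracted = strip_fences(raw)
    got, want = tokens(extracted), tokens(canonical)
    return {
        "fmt": raw.startswith("aws "),        # raw output, no cleanup
        "xtr": extracted.startswith("aws "),  # after stripping fences
        "exact": got == want,                 # token-for-token match
        "svc": got[:2] == want[:2],           # "aws <service>"
        "op": got[:3] == want[:3],            # "aws <service> <operation>"
    }


canonical = "aws s3api list-buckets"
fenced = TICKS + "bash\naws s3api list-buckets\n" + TICKS
print(score(fenced, canonical))
```

A fenced but otherwise correct answer fails `fmt` and passes every other check, which is exactly the gap that +xtr% is meant to expose.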

The full table (11 models, 9 metrics, per-call logs) is in data/sft/model_eval_full.json — 297 records.

6. Eval harness

data/eval_lm_studio_models.py — 9.9 KB, reusable.

Calls each chat model loaded in LM Studio at http://localhost:1234/v1/chat/completions (OpenAI-compatible API)
Sends the same 27 held-out prompts to each model
Extracts aws ... from the response (stripping fences / preamble)
Compares against the canonical command from the val split
Writes per-call detail + aggregate metrics to JSON
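The request side of those steps might look like the sketch below, using the standard OpenAI-compatible chat-completions request and response shape. The helper names, `temperature=0`, and the `max_tokens` default are assumptions, not values taken from eval_lm_studio_models.py.

```python
import json
import urllib.request

URL = "http://localhost:1234/v1/chat/completions"  # LM Studio endpoint from the text


def build_payload(model: str, system: str, user: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-compatible chat request (sampling settings are assumptions)."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": 0,
        "max_tokens": max_tokens,
    }


def parse_reply(body: dict) -> str:
    """Pull the assistant text out of a chat-completions response."""
    return body["choices"][0]["message"]["content"]


def query(model: str, system: str, user: str, timeout: float = 60.0) -> str:
    """POST one prompt to the local server and return the raw model output."""
    request = urllib.request.Request(
        URL,
        data=json.dumps(build_payload(model, system, user)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_reply(json.load(response))


payload = build_payload("qwen2.5-coder-3b-instruct", "You are an AWS cloud engineer.", "TASK: List all S3 buckets.")
print(json.dumps(payload, indent=2))
```

`query` needs LM Studio running with a model loaded; `build_payload` and `parse_reply` can be exercised offline.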

To re-run post-SFT:

.venv/bin/python data/eval_lm_studio_models.py \
    --max-per-combo 5 \
    --out data/sft/model_eval_postsft.json

A successful SFT run should hit the targets below. Targets are the predictions from MODEL_EVALUATION.md §11; actuals come from our reference SFT run:

Metric	Base	Target	Actual (post-SFT)
`exact%`	39%	75%+	88.9% ✅
`op%`	61%	90%+	88.9% ≈
`svc%`	78%	—	88.9%
`fmt%`	33%	100%	100.0% ✅
latency	2.03s	—	1.40s (faster)

exact% and fmt% meet their targets, and op% lands 1.1 pp short of its 90% target. Format compliance is now perfect, exact-match rose about 50 pp (39% to 88.9%), and latency fell from 2.03s to 1.40s.

7. HuggingFace publishing

data/upload_sft_to_hf.py — pushes the JSONL splits to HuggingFace Hub:

Split	Hub repo
train	`Sizzing/aws-rl-sft-qwen25coder3b-train`
val	`Sizzing/aws-rl-sft-qwen25coder3b-val`
reserve	`Sizzing/aws-rl-sft-qwen25coder3b-reserve`

The trained SFT adapter (output of train/train_sft_lora.ipynb) is published separately at:

Sizzing/aws-rl-sft-qwen25coder3b-adapter

GRPO training picks it up by setting SFT_ADAPTER = "Sizzing/aws-rl-sft-qwen25coder3b-adapter" in aws_rl_env_colab.ipynb.

8. Files in this directory

File	Purpose
build_sft_dataset.py	Generator — AST extraction + 5 trajectory types + plausible outputs
eval_lm_studio_models.py	Base-model benchmark harness (LM Studio API)
upload_sft_to_hf.py	Push the SFT splits to HuggingFace
sft/aws_rl_sft.train.jsonl	1,500 SFT training rows
sft/aws_rl_sft.val.jsonl	150 validation rows
sft/aws_rl_sft.reserve.jsonl	200 reserve rows
sft/dataset_stats.json	Per-split source / tier / task counts
sft/MODEL_EVALUATION.md	The base-model selection report (read this)
sft/model_eval_full.json	Per-call eval data (11 models × 27 prompts)
sft/deepseek_r1_rerun.json	R1 re-run with extended `max_tokens`
