---
title: AWS RL Environment Server
emoji: 🔥
colorFrom: pink
colorTo: pink
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
---
# AWS Cloud Operations: RL Environment & Training Pipeline
Cloud agents fail in production not because they don't know the commands, but because state drifts, services hiccup, and reward signals get gamed. We built an environment that simulates all three: 120+ AWS tasks under chaos and drift, an 8-layer anti-reward-hacking stack, and an adversarial curriculum that targets the agent's own weak spots. After SFT → GRPO on a single GPU with 8 parallel rollouts, format compliance hit 100%, exact-match jumped 39% → 89%, and intermediate-tier success climbed 81% → 87%.
| Resource | Link |
|---|---|
| Live demo | sizzing-aws-rl-env.hf.space/web (try the playground in a browser) |
| API docs | sizzing-aws-rl-env.hf.space/docs (Swagger), /redoc |
| HF Space | huggingface.co/spaces/Sizzing/aws_rl_env |
| SFT adapter | Sizzing/aws-rl-sft-qwen25coder3b-adapter |
| Dataset | Sizzing/aws-rl-sft |
## Table of contents
- What this is & why it matters
- Highlights: full feature inventory
- Architecture
- Live demo & Quick Start
- Run on Colab
- Action / Observation spec
- Curriculum & Reward (overview)
- Training pipeline (SFT → GRPO)
- Parallel rollout architecture
- MiniStack: vendored & customized
- Results & Benchmarks
- Repository map
- Configuration & Running
- Testing
- Tech stack
- Links
- Acknowledgments
## 1. What this is & why it matters

Modern AI agents are increasingly asked to operate cloud infrastructure: provisioning resources, fixing misconfigurations, responding to drift. Training such agents needs (a) a realistic environment, (b) reliable reward signals, and (c) enough scale to make RL feasible. Existing options force a hard tradeoff: real AWS costs hundreds of dollars per training run and cannot be reset; toy emulators don't behave like production AWS.
This project closes that gap. We built:
- An OpenEnv-compatible RL environment that speaks real AWS CLI semantics. The agent sends `aws s3 mb …`, `aws iam create-role …`, and so on: the exact same commands a human SRE would type.
- A vendored, customized MiniStack simulator that responds with production-equivalent JSON, runs locally for zero cost, supports 34 AWS services, and exposes a single-call state-introspection endpoint we added so the grader has cheap ground-truth access.
- A 120+ task curriculum across 5 tiers (warmup → expert) with adaptive selection, mastery tracking, spaced repetition, chaos injection, and drift-detection scenarios, every feature designed to keep the reward signal honest and prevent the agent from gaming it.
- A complete SFT → GRPO training pipeline: a 1,500-row synthetic dataset spanning 5 trajectory shapes, an 11-model base benchmark, LoRA fine-tuning, and TRL GRPO with multi-turn rollouts and Optuna hyperparameter search.
- An 8-way parallel-rollout architecture: server-side MiniStack pool, client-side `GrpoPool`, in-process `MultiTurnEnvPool`, three coordinated layers that let G=8 concurrent rollouts run on one GPU without state contamination.
Everything is reproducible: the dataset is generated by a deterministic script, the model selection is documented end-to-end, training entry points run on Colab, and the env runs locally in a single Docker container with no external network requirement.
## 2. Highlights: full feature inventory
This is the complete surface area of the project. Each entry links to deeper documentation in the corresponding sub-README.
### Environment & Curriculum

- 120+ tasks across 5 tiers plus drift scenarios: warmup (25), beginner (25), intermediate (25), advanced (25), expert (24), drift (9). YAML-defined task spec per tier.
- Curriculum learning with priority scoring: `score = novelty + weakness − recency + spaced_rep_bonus` drives task selection.
- Mastery tracking: sliding 10-episode window, 0.7 threshold, 0.85 exponential decay, supports un-graduation (see the sketch after this list).
- Spaced repetition: graduated tasks resurface at intervals `[3, 6, 12, 24, 48]` to prevent forgetting.
- Tier promotion: standard (min episodes + success rate) plus fast-track (3 consecutive ≥90% episodes).
- Strategy pattern, simulator vs real AWS: `BACKEND_TYPE=simulator` (default) or `aws`, no code fork.
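
A minimal sketch of the mastery rule described in the list above (the 10-episode window, 0.7 threshold, and 0.85 decay come from the bullet; the class and method names are illustrative, not the server's actual API):

```python
from collections import deque

class MasteryTracker:
    """Illustrative only: sliding-window mastery with exponential decay."""

    def __init__(self, window: int = 10, threshold: float = 0.7, decay: float = 0.85):
        self.window = window
        self.threshold = threshold
        self.decay = decay
        self.results: dict[int, deque] = {}  # task_id -> recent episode rewards

    def record(self, task_id: int, reward: float) -> None:
        self.results.setdefault(task_id, deque(maxlen=self.window)).append(reward)

    def skill(self, task_id: int) -> float:
        # Exponentially decayed mean: recent episodes outweigh older ones.
        history = self.results.get(task_id, deque())
        if not history:
            return 0.0
        weights = [self.decay ** i for i in range(len(history) - 1, -1, -1)]
        return sum(w * r for w, r in zip(weights, history)) / sum(weights)

    def graduated(self, task_id: int) -> bool:
        # Can flip back to False if recent rewards drop ("un-graduation").
        return self.skill(task_id) >= self.threshold
```

Graduated tasks then enter the spaced-repetition queue, starting at the shortest interval.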
### Reward shaping

- Five grading strategies: command-match (warmup), resource-creation (beginner), multi-step (intermediate), multi-step+services (advanced), state-checks (expert).
- Dense partial-progress signal: clamped to `[0.0, 0.99]`; `1.0` reserved for verified completion.
- Rollback penalty: `−0.1` per `(create-X, …, delete-X)` pair.
- Idempotency bonus: `+0.02` for graceful "already exists" retry.
- Hint decay: three-level progressive hints with `0.85^n` reward multiplier.
- Chaos survival bonus: `×1.05` if the agent completes a chaotic task.
### Resilience & adversarial features

- Chaos injection: silent mid-episode mutations, tier-scaled probabilities (10/20/30%) on services the task is touching.
- Drift detection: 6 expert tasks, 2–3 random mutations from a per-task pool, randomized per episode (no memorization).
- Security-posture audit tasks: S3 public bucket lockdown, IAM least-privilege, Lambda secret rotation.
- 8-layer anti-reward-hacking: ground-truth verification, dedup, grader invisibility, command allow-list, no-credit-for-reads, monotonic progress, exact resource-name validation, final state checks.
### Training pipeline

- Synthetic SFT dataset (1,500 rows) with 5 trajectory types: success / multi-step continuation / failure recovery / verification / hint usage.
- Rigorous base-model selection: 11 models × 27 prompts; Qwen2.5-Coder-3B-Instruct wins.
- LoRA SFT: `r ∈ {8,16,32}`, `lora_alpha = r × multiplier`, attention-only adaptation.
- GRPO RL via TRL: group-relative advantages, KL to SFT reference, `dapo` loss, no critic.
- Multi-turn rollouts: up to `MAX_TURNS=6`, observation fed back as next-turn user message.
- Optuna hyperparameter search: TPE sampler over 8-dim space, frozen held-out validation set.
- HuggingFace integration: adapter + dataset published to Hub, OpenEnv Space deployment.
### Parallel rollout architecture

- Server-side MiniStack pool: `MiniStackPool` (`server/app.py`), free-list of ports, lock-guarded acquire/release.
- Client-side GrpoPool: async-native, all-or-nothing connect, `asyncio.gather` for concurrent rollouts.
- In-process MultiTurnEnvPool: sync API, owns a background asyncio loop, used by the trainer.
- 8 isolated rollouts on one server: proof in `scripts/TestMultipleConnects.ipynb`.
### Vendored simulator

- MiniStack as git subtree: vendored at `aws_infra/` (commit `2c38c0b`). 34 AWS services. MIT.
- Custom `/_ministack/state` endpoint: added in commit `a648c3a`; returns full infra inventory in one call.
- Upstream sync workflow: periodic `git subtree pull`; isolated patches keep conflicts minimal.
### Operations & deployment

- OpenEnv-compliant: `/reset`, `/step`, `/state`, `/schema`, `/ws` HTTP + WebSocket endpoints.
- Web playground UI: `/web` route, 40 AWS service icons, Jinja2 + JS frontend.
- Docker-first deployment: multi-stage build; the container ships the server + N MiniStack instances + AWS CLI.
- Comprehensive test suite: 10 unit tests + 6 tier-integration suites covering 133 tasks.
## 3. Architecture

```
┌────────────────────────────────── Docker container ──────────────────────────────────
│
│  FastAPI server (port 8000)
│  ├── OpenEnv router     /reset /step /state /schema /ws /health
│  ├── Web playground     /web  (Jinja2 + 40 AWS icon SVGs)
│  ├── env_factory        per-WS-session AwsRlEnvironment instance
│  │                      (acquires a MiniStack port from MiniStackPool)
│  └── Services
│        Curriculum · TaskGrader · ResourceVerifier · ChaosEngine · DriftEngine
│        HintProvider · EpisodeTracker · EnvironmentDesigner · EnvironmentStrategy
│
│  MiniStack instances    :4566 :4567 :4568 … :4566+POOL_SIZE-1
│  (vendored at aws_infra/, started by the Dockerfile entrypoint)
│
└───────────────────────────────────────────────────────────────────────────────────────
         ▲                                ▲
         │ HTTP/WS                        │ AWS CLI subprocess
         │                                │ (AWS_ENDPOINT_URL=http://localhost:4566+i)
         │                                │
┌────────┴──────────┐           ┌─────────┴─────────┐
│      RL Agent     │           │ AWS CLI commands  │
│  the agent emits  │           │    (client.py)    │
└───────────────────┘           └───────────────────┘
```
### Episode lifecycle

- `reset()`: wipes simulator state, picks the next task from the curriculum, runs `setup_commands`, applies drift if applicable, returns the initial observation.
- `step(action)`: validates the command (must start with `aws`), intercepts hint requests, executes via the strategy, records in the tracker, grades with shaped reward, optionally injects chaos, returns the observation.
- Hint: the agent sends `aws help --task-hint`; it is intercepted before reaching MiniStack, returns the next-level hint, and increments `hints_used` (which decays final reward by `0.85^n`).
- Termination: `task_achieved=True` or `step_count >= MAX_STEPS` (default 15).

Full mechanics in server/README.md.
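
Putting the lifecycle together, a minimal agent loop over the §4 client might look like this (`choose_command` is a placeholder policy; the observation fields follow the models in §6):

```python
from aws_rl_env import AwsRlAction, AwsRlEnv

def choose_command(observation) -> str:
    # Placeholder policy: a real agent would condition on the task description
    # and the previous command_output / error fields.
    return "aws s3 ls"

env = AwsRlEnv(base_url="http://localhost:8000")
result = env.reset()                        # new task, fresh simulator state
print(result.observation.task.description)

while not result.done:                      # done at task_achieved or MAX_STEPS
    command = choose_command(result.observation)
    result = env.step(AwsRlAction(command=command))
    print(f"step={result.observation.step_count} "
          f"reward={result.reward} progress={result.observation.partial_progress}")
```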
## 4. Live demo & Quick Start

### Try it in a browser

The hosted playground lets you click around any task without writing code.
### Python client

```python
from aws_rl_env import AwsRlAction, AwsRlEnv

with AwsRlEnv.from_docker_image("aws-rl-env:latest") as env:
    result = env.reset()
    print(f"Task: {result.observation.task.description}")
    result = env.step(AwsRlAction(command="aws s3 mb s3://my-bucket"))
    print(f"Reward: {result.reward}, Done: {result.done}")
```
Or against a running server:

```python
env = AwsRlEnv(base_url="http://localhost:8000")
result = env.reset()
result = env.step(AwsRlAction(command="aws s3 ls"))
```
### WebSocket API

```python
import asyncio
import json

import websockets

async def main():
    async with websockets.connect("wss://sizzing-aws-rl-env.hf.space/ws") as ws:
        await ws.send(json.dumps({"type": "reset"}))
        obs = json.loads(await ws.recv())
        await ws.send(json.dumps({"type": "step", "data": {"command": "aws s3 ls"}}))
        obs = json.loads(await ws.recv())

asyncio.run(main())
```
### Local Docker

```bash
make docker-build        # build the image
make docker-run          # foreground; serves on :8000
make docker-run-detach   # background
make docker-health       # liveness probe
```
For training (8-way parallel rollouts):
```bash
AWS_RL_ENV_POOL_SIZE=8 make run
```
## 5. Run on Colab

The full pipeline is reproducible on a Colab GPU runtime. Drop your `HF_TOKEN` into Colab Secrets, set `ENV_BASE_URL` to your HF Space (or to a local server via ngrok), and run.
| Notebook | What it does | Open in Colab |
|---|---|---|
| train/train_sft_lora.ipynb | Stage 1: SFT LoRA fine-tuning of Qwen2.5-Coder-3B | https://colab.research.google.com/drive/1dm9sDaLxHX6s9zEG_SC0FQcKWKkc3TfL?usp=sharing |
| train/train_grpo_lora.ipynb | Stage 2: GRPO RL training with multi-turn rollouts | https://colab.research.google.com/drive/1NwiOM0h_JpXXGRxfY_xZtDiaigvIaKjx?usp=sharing |
| compare/compare_base_vs_sft.ipynb | Side-by-side: base model vs SFT adapter (dataset + RL env) | https://colab.research.google.com/drive/17406aiad8h4nAphV42vVNZ-a5SzZMIre?usp=sharing |
## 6. Action / Observation spec

The full Pydantic data models are kept inline so any reader can wire up an agent without leaving this page. Source: `models.py`.

### Action

```python
class AwsRlAction(Action):
    command: str  # AWS CLI command, e.g. "aws s3 ls"
```

The environment validates that `command` starts with `aws`; anything else is rejected with `success=False`.
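
For example, continuing with the §4 client (a sketch; the exact error text is not specified here):

```python
# A non-AWS command never reaches the simulator; the validator rejects it.
result = env.step(AwsRlAction(command="ls -la"))
assert result.observation.command_success is False
print(result.observation.error)  # validator message; commands must start with "aws"
```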
### Observation

```python
class AwsRlObservation(Observation):
    episode_id: EpisodeID
    step_count: StepCount
    command_success: bool    # exit code == 0
    command_output: str      # stdout from the AWS CLI invocation
    error: str               # stderr (empty if success)
    task: TaskInfo | None    # masked task definition (no success criteria)
    task_achieved: bool
    partial_progress: float  # current task progress in [0.0, 1.0]
    hints_used: int          # cumulative hint count this episode
    hint_text: str           # most recent hint text (if any)
```
### State

```python
class AwsRlState(State):
    current_task: Task | None       # full task assigned for the episode
    tracker: TrackerState           # episode tracker snapshot
    infra_state: dict               # AWS infrastructure state keyed by service name
    chaos_occurred: bool            # whether chaos was injected this episode
    current_tier: str               # agent's current difficulty tier

class TrackerState:
    step_count: int                 # steps taken this episode
    hints_used: int                 # hints requested this episode
    progress: float                 # current partial progress [0.0, 1.0]
    commands_executed: list[str]    # commands executed this episode
    credited_operations: list[str]  # (operation, resource) pairs that earned credit
```
### Task definitions

```python
class Task:
    task_id: TaskID
    difficulty: TaskDifficulty           # warmup | beginner | intermediate | advanced | expert
    description: str                     # human-readable goal
    success_criteria: SuccessCriteria
    setup_commands: list[SetupCommand]   # pre-provision for SRE tasks
    desired_state_spec: str | None       # natural-language desired end state (drift tasks)
    possible_drifts: list[SetupCommand]  # pool of mutations for DriftEngine

class TaskInfo:
    """Agent-visible subset of Task: masks success_criteria, setup_commands, and possible_drifts."""
    task_id: TaskID
    difficulty: TaskDifficulty
    description: str
    desired_state_spec: str | None

class SuccessCriteria:
    command_contains: str | None                 # warmup/beginner
    operation: str | None                        # warmup/beginner
    resource_exists: ResourceExistsCheck | None  # beginner
    steps: list[StepCriteria]                    # intermediate/advanced/expert
    services: list[AwsService]                   # advanced/expert
    state_checks: list[StateCheck]               # expert (ground truth)
```
### Curriculum config

```python
class TierConfig:
    min_episodes: int            # minimum episodes before promotion
    advance_rate: float          # tier success rate threshold (0.6 - 1.0)
    mastery_window: int          # sliding window size (default: 10)
    mastery_threshold: float     # per-task graduation threshold (default: 0.7)
    fast_track_rate: float       # early promotion threshold (default: 0.9)
    chaos_probability: float     # probability of chaos injection per step

class SpacedRepState:
    interval: int                # episodes until next re-test (3 → 48)
    last_graduated_episode: int  # when last graduated
```
## 7. Curriculum & Reward (overview)

The curriculum and reward stack is the heart of the project. This section is the elevator pitch; the full mechanics (priority scoring math, anti-reward-hacking layers, chaos engine, drift engine) live in server/README.md.

### Priority scoring (one-formula task selection)

```python
score = novelty_bonus      # +100 if never attempted
      + weakness_weight    # +50 × (1 − task_success_rate)
      + spaced_rep_bonus   # +30 if a graduated task is "due" for re-test
      - recency_penalty    # −20 if attempted in the last 2 episodes
```

Exploration, weakness-targeting, anti-forgetting, and variety are all balanced by one weighted sum.
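
A runnable sketch of that selection rule, with the weights taken from the comments above (the `stats` lookups are illustrative placeholders, not the curriculum's actual API):

```python
def priority_score(task_id: int, stats, episode: int) -> float:
    """Illustrative only; `stats` is any object exposing the lookups below."""
    score = 0.0
    if stats.attempts(task_id) == 0:
        score += 100.0                                       # novelty bonus
    else:
        score += 50.0 * (1.0 - stats.success_rate(task_id))  # weakness weight
        if episode - stats.last_attempt(task_id) <= 2:
            score -= 20.0                                    # recency penalty
    if stats.graduated(task_id) and stats.due_for_retest(task_id, episode):
        score += 30.0                                        # spaced-repetition bonus
    return score

# The curriculum then picks the argmax over candidate tasks in the current tier.
```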
### Reward shaping

```python
# clamp(x, lo, hi) == max(lo, min(x, hi))
if task_achieved:
    reward = 1.0
    if survived_chaos:
        reward *= 1.05                   # chaos survival bonus
else:
    reward = partial_progress * 0.8      # ≤ 0.8 from steps alone
    if progress_increased:
        reward += 0.1                    # dense progress signal
    if command_failed:
        reward *= 0.5                    # error penalty
    reward -= 0.1 * rollback_count       # waste penalty
    reward += 0.02 * idempotent_retries  # graceful retry bonus
    reward = clamp(reward, 0.0, 0.99)    # 1.0 reserved for completion

reward *= 0.85 ** hints_used             # hint decay applied last
```
The agent's loss surface is intentionally narrow: only doing the task earns full reward, and every reward-hacking shortcut we identified during design has a defense layer (full list in server/README.md §9).
## 8. Training pipeline (SFT → GRPO)
The training pipeline runs in two stages, both reproducible on Colab. Full detail in train/README.md.
```
            ┌────────── data/sft/ ──────────┐
            │  1,500 train · 150 val rows   │
            │       5 trajectory types      │
            └───────────────┬───────────────┘
                            ▼
  STAGE 1: Supervised Fine-Tuning          train/train_sft_lora.ipynb
  Qwen2.5-Coder-3B-Instruct + LoRA r=8/16/32 (Optuna) → SFT adapter
                            │
                            │  Sizzing/aws-rl-sft-qwen25coder3b-adapter
                            ▼
  STAGE 2: GRPO RL                         train/train_grpo_lora.ipynb
  G=8 parallel rollouts · multi-turn · reward = env return
  Optuna over (lr, β, G, T, top_p, lora_r, max_turns)
```
### Numbers worth knowing

| Setting | Value |
|---|---|
| Base model | unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit, picked via the 11-model evaluation in data/sft/MODEL_EVALUATION.md |
| SFT LoRA | r ∈ {8,16,32}, lora_alpha = r × multiplier, target = attention only, dropout [0.005, 0.031] |
| GRPO config | G=8, β=0.04, lr=5e-6, T=0.9, top_p=0.95, max_turns=6, loss=dapo |
| Optuna search | TPE sampler, 6 trials × 30 GRPO steps, frozen 10-task held-out val set |
| Final training | 200 GRPO steps with best config |
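
To make the GRPO row concrete, here is a minimal sketch of how those numbers map onto TRL's `GRPOConfig`/`GRPOTrainer` (not the project's `train_grpo.py`: the one-prompt dataset and zero reward function are placeholders, `max_turns` lives in the rollout loop rather than in TRL, and the real run continues from the SFT adapter rather than the raw base):

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

config = GRPOConfig(
    output_dir="grpo-aws-rl",
    learning_rate=5e-6,   # lr from the table
    beta=0.04,            # KL coefficient toward the reference policy
    num_generations=8,    # G = 8 rollouts per prompt
    temperature=0.9,      # T
    top_p=0.95,
    loss_type="dapo",     # per the table; requires a TRL release that ships it
    max_steps=200,        # length of the final training run
)

def env_return_reward(completions, **kwargs):
    # Placeholder: the real trainer scores each completion by running a
    # multi-turn episode through the environment and using the episode return.
    return [0.0 for _ in completions]

trainer = GRPOTrainer(
    model="unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit",  # base model from the table
    reward_funcs=env_return_reward,
    args=config,
    train_dataset=Dataset.from_dict(
        {"prompt": ["Create an S3 bucket named demo-bucket"]}
    ),
)
trainer.train()
```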
### Training graphs
## 9. Parallel rollout architecture
GRPO needs G rollouts on the same task per training step. We run all G in parallel with state isolation guaranteed. Three coordinated pool layers make it work:
```
        Trainer (G=8 generations needed per step)
                         │
           ┌─────────────┴─────────────┐
           ▼                           ▼
  MultiTurnEnvPool (in-process)    GrpoPool
  (train_grpo.py)                  (scripts/grpo_pool.py)
  sync API                         async API
           │                           │
           └─ 8 WebSocket connections ─┘
                         │
                         ▼
             FastAPI server :8000
             + OpenEnv max_concurrent_envs=8
                         │
                         ▼
       MiniStackPool (free-list, lock-guarded)
       acquire(port) on connect, release on disconnect
                         │
                         ▼
      8 isolated MiniStack instances :4566..:4573
```
Wall-clock impact: an 8-rollout × 6-turn episode runs in ~300 ms of env time vs ~2.4 s sequential. Full mechanics, including the all-or-nothing connect protocol that prevents pool-slot leakage on flake, are in scripts/README.md.
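
The client-side fan-out is ordinary asyncio. A minimal sketch of the pattern (illustrative names, not the actual `scripts/grpo_pool.py` API; `envs` is assumed to hold 8 already-connected async env handles):

```python
import asyncio

async def policy(observation) -> str:
    # Stub: the trainer samples this from the model being trained.
    return "aws s3 ls"

async def run_rollout(env) -> float:
    """One multi-turn episode on one pooled env; returns the episode return."""
    result = await env.reset()
    episode_return = 0.0
    while not result.done:
        result = await env.step(await policy(result.observation))
        episode_return += result.reward
    return episode_return

async def collect_group(envs: list) -> list[float]:
    # G rollouts share wall-clock instead of stacking up sequentially;
    # each env owns its own MiniStack port, so state never crosses over.
    return await asyncio.gather(*(run_rollout(env) for env in envs))
```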
## 10. MiniStack: vendored & customized
The simulator powering the env is vendored as a git subtree at aws_infra/, not pulled as a black-box dependency. We forked it because we needed:
- A custom `/_ministack/state` JSON endpoint so the grader can read the entire infra inventory in one HTTP call instead of iterating 20+ list APIs per grading pass. Added in commit `a648c3a` ("feat: Add support for service state retrieval and action listing across multiple AWS services").
- A reproducible build with no runtime network requirement: the Docker image bundles a specific MiniStack revision.
- The freedom to extend service coverage on demand.
Custom commits live as small, isolated patches so periodic upstream syncs (af2e945, 579597b) replay cleanly. To inspect:
```bash
git show a648c3a                  # the state-endpoint diff
git log --oneline -- aws_infra/   # only the aws_infra subtree history
```
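
To see what the grader consumes, a sketch of querying the endpoint on a local instance (only the path comes from the commit above; the response shape in the comment is an assumption):

```python
import httpx

# One call replaces 20+ per-service list APIs during a grading pass.
state = httpx.get("http://localhost:4566/_ministack/state").json()
# Assumed shape: {"s3": {...}, "iam": {...}, ...} keyed by service name,
# matching the infra_state field of AwsRlState in §6.
print(sorted(state.keys()))
```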
Full subtree workflow + commit-by-commit detail in server/README.md §5. Upstream MiniStack docs (81 KB) are preserved at aws_infra/README.md.
## 11. Results & Benchmarks

### Base-model selection

We evaluated 11 chat models on 27 held-out prompts. Qwen2.5-Coder-3B-Instruct wins on every metric that matters: 41% exact match (highest), 63% operation match (highest), 3.1 s/call (3× faster than the 4B runner-up). Full report:

data/sft/MODEL_EVALUATION.md: 270-line writeup, per-model verdicts, methodology
### Base vs SFT: actual results
After running the SFT pipeline end-to-end, the eval delta on the same held-out prompts is striking:
| Metric | Base | Post-SFT | Delta |
|---|---|---|---|
| `format_pct` | 33.3% | 100.0% | +66.7 pp |
| `exact_pct` | 38.9% | 88.9% | +50.0 pp |
| `service_pct` | 77.8% | 88.9% | +11.1 pp |
| `operation_pct` | 61.1% | 88.9% | +27.8 pp |
| `avg_len` | 85.8 | 74.7 | −11 chars (tighter) |
Every target from data/sft/MODEL_EVALUATION.md §11 is met or exceeded. Format compliance is now perfect; the model never wraps commands in fences or quotes after SFT. Exact-match jumped from 39% to 89%; the agent now emits the canonical command for ~9 of every 10 prompts.
The richer two-mode benchmark (dataset eval + live RL env eval) is in compare/compare_base_vs_sft.ipynb; methodology in compare/README.md.
### SFT training curves

### Optuna SFT search

The best SFT trial (out of 6) used lora_r=16, lora_alpha=16, dropout=0.0058, lr=4.03e-4, warmup=0.1; see train/README.md §3 for the full Optuna study table.
### GRPO results (live multi-step env eval)

After 35 GRPO steps on top of the SFT adapter (best Optuna config: lr=1.6e-5, β=0.0021, T=0.99), we re-evaluated end-to-end on 100+ episodes:
| Metric | Base + SFT | Base + SFT + GRPO | Δ |
|---|---|---|---|
| Overall success rate | 86.8% | 86.2% | −0.5 pp |
| Overall mean reward | 0.883 | 0.877 | −0.006 |
| Beginner success | 96.2% | 100.0% | +3.8 pp |
| Intermediate success | 81.0% | 87.0% | +6.0 pp |
| Warmup success | 96.0% | 90.2% | −5.8 pp |
| Expert success | 22.2% | 22.2% | flat |
| Drift repair rate | 22.2% | 22.2% | flat |
| Destructive-action fail rate | 15.1% | 14.7% | −0.4 pp |
| Steps to solve | 1.45 | 1.55 | +0.10 |
Honest reading: the 35-step GRPO run preserves the SFT gains and modestly improves the middle tiers (beginner +3.8 pp, intermediate +6.0 pp), but it does not crack the expert-tier bottleneck (22% success on SRE / drift / security-posture tasks). With longer GRPO runs and more curriculum exposure to expert tasks, this is the next gain to chase.
### GRPO training curves

Per-step training signals from the final 35-step GRPO run, and the Optuna search across 4 trials that picked the final config (figures in docs/figures/).
### Qualitative rollouts (post-GRPO)

One sample episode per tier (figures in docs/figures/).
## 12. Repository map
| Path | Purpose | Sub-README |
|---|---|---|
| server/ | OpenEnv FastAPI server, env logic, services, web playground | server/README.md |
| train/ | SFT and GRPO training notebooks | train/README.md |
| data/ | SFT dataset, base-model selection, eval harness | data/README.md · MODEL_EVALUATION.md |
| compare/ | Base vs SFT side-by-side benchmark | compare/README.md |
| scripts/ | Parallel-rollout architecture + multi-connection demo | scripts/README.md |
| aws_infra/ | Vendored MiniStack simulator (git subtree) | aws_infra/README.md |
| tests/, tests_tasks/ | Unit + tier-integration test suites | (see §14) |
| models.py | Pydantic data models for action/observation/task | (inline §6) |
| client.py | OpenEnv HTTP/WebSocket client wrapper | – |
| inference.py | Single-model agent loop (matches the RL eval mode of compare/) | – |
| train_grpo.py | GRPO trainer (1,283 LOC): MultiTurnEnvPool, Optuna, plotting | (see train/README.md) |
| aws_rl_env_colab.ipynb | Colab driver for the full training pipeline | – |
| docs/figures/ | All README graphs and screenshots | – |
## 13. Configuration & Running

### Docker (recommended)

```bash
make docker-build        # build the image
make docker-run          # foreground on :8000
make docker-run-detach   # background
make docker-health       # liveness probe
```
### OpenEnv deployment

```bash
make openenv-validate    # validate config
make openenv-build       # build environment
make openenv-push        # push to HuggingFace Spaces
```
### Environment variables

| Variable | Default | Description |
|---|---|---|
| `AWS_INFRA_URL` | `http://localhost:4566` | MiniStack endpoint (used when POOL_SIZE=1) |
| `AWS_RL_ENV_POOL_SIZE` | `1` | Server-side MiniStack pool size; set to 8 for GRPO training |
| `AWS_RL_ENV_MINISTACK_BASE_PORT` | `4566` | First MiniStack port; pool covers [BASE, BASE + POOL_SIZE) |
| `BACKEND_TYPE` | `simulator` | `simulator` (MiniStack) or `aws` (real AWS, no pool) |
| `AWS_ACCESS_KEY_ID` | `test` | AWS credentials (any value works for the simulator) |
| `AWS_SECRET_ACCESS_KEY` | `test` | AWS credentials (any value works for the simulator) |
| `AWS_DEFAULT_REGION` | `us-east-1` | AWS region |
| `MAX_STEPS` | `15` | Max steps per episode |
| `API_BASE_URL` | – | LLM API endpoint for inference.py |
| `MODEL_NAME` | – | LLM model name for inference.py |
| `HF_TOKEN` | – | HuggingFace token (dataset/adapter access, push) |
| `TEMPERATURE` | `0.7` | LLM sampling temperature |
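
For example, to launch the container with an 8-way pool and shorter episodes (a sketch; `aws-rl-env:latest` is the image tag from the Quick Start, and the `-e` flags are plain Docker):

```bash
docker run -p 8000:8000 \
  -e AWS_RL_ENV_POOL_SIZE=8 \
  -e AWS_RL_ENV_MINISTACK_BASE_PORT=4566 \
  -e BACKEND_TYPE=simulator \
  -e MAX_STEPS=10 \
  aws-rl-env:latest
```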
### Curriculum stats API

```python
curriculum.get_stats()
# {
#   "episode_count": 42,
#   "tier": "intermediate",
#   "tier_episodes": 12,
#   "tier_success_rate": 0.75,
#   "graduated_tasks": [0, 2, 4],
#   "weak_spots": [11, 12],
#   "skill_profile": {0: 0.95, 1: 0.8, ...},
#   "spaced_rep_due": [0, 2],
#   "avg_reward_last_10": 0.65
# }
```
## 14. Testing
The test suite covers both isolated unit logic and end-to-end task execution against MiniStack.
### Unit tests (tests/)

```bash
pytest tests/ -v
```
| File | Covers |
|---|---|
| test_aws_rl_env_environment.py | Environment lifecycle, reset/step semantics, reward integration |
| test_task_grader.py | All 5 grading strategies, partial progress, penalties, bonuses |
| test_resource_verifier.py | Per-service ground-truth verification (20+ services) |
| test_episode_tracker.py | Command parsing, dedup, monotonic progress, rollback detection |
| test_episode_context.py | Per-episode context lifecycle |
| test_drift_engine.py | Random drift selection, mutation application |
| test_hint_provider.py | Three-level progressive hints, decay computation |
| test_environment_designer.py | Setup-command provisioning |
| test_pool.py | Server-side MiniStackPool acquire/release, exhaustion |
| test_grpo_pool.py | Client-side GrpoPool connect/close, all-or-nothing rollback |
### Tier integration tests (tests_tasks/)

```bash
pytest tests_tasks/ -v
```

133 tasks exercised end-to-end:
| File | Tasks |
|---|---|
| test_warmup_tasks.py | 25 |
| test_beginner_tasks.py | 25 |
| test_intermediate_tasks.py | 25 |
| test_advanced_tasks.py | 25 |
| test_expert_tasks.py | 24 |
| test_drift_tasks.py | 9 |
| Total | 133 |
These tests double as the source of truth for the canonical solutions used by the SFT dataset generator (extracted via AST; see data/README.md §1).
## 15. Tech stack

- Python 3.12, `uv` for dependency management, multi-stage Docker
- FastAPI, OpenEnv (HTTP + WebSocket env protocol), uvicorn
- TRL ≥ 0.21 (`GRPOTrainer`, `GRPOConfig`)
- PEFT (LoRA), Unsloth (4-bit quantized base, fused training kernels)
- Transformers ≥ 4.45, datasets ≥ 2.20, HuggingFace Hub ≥ 0.24
- Optuna ≥ 3.6 (TPE sampler, SQLite study storage)
- asyncio + websockets + httpx (parallel rollout orchestration)
- MiniStack (vendored at aws_infra/, 34 AWS services)
- AWS CLI v2 (subprocess invocation against MiniStack endpoint)
- matplotlib, plotly (training curves, Optuna visualizations)
- pytest (16 test files, ~250 KB of test code)
## 16. Links
- Live demo: sizzing-aws-rl-env.hf.space/web
- HF Space: huggingface.co/spaces/Sizzing/aws_rl_env
- API docs: /docs Β· /redoc
- SFT adapter: Sizzing/aws-rl-sft-qwen25coder3b-adapter
- GRPO adapter: Sizzing/aws-rl-grpo-qwen25coder3b-adapter
- Dataset: Sizzing/aws-rl-sft
- GitHub: github.com/udaykiranpadhy/aws-rl-env
## 17. Acknowledgments

- Meta, HuggingFace, Unsloth, and Scalar for organizing the hackathon and providing mentors to clarify doubts.
- MiniStack, vendored at aws_infra/. Upstream license preserved. Custom modifications attributable to commits `a648c3a`, `a00e981`; periodic upstream syncs `af2e945`, `579597b`.
- OpenEnv: environment protocol and Python client framework.
- TRL (HuggingFace): `GRPOTrainer` implementation.
- Unsloth: 4-bit quantized model loaders + fused training kernels.
- Google Colab for the GPU infrastructure used to train the models.
- AWS service icons in server/static/img/aws/, used in the web playground.
## Sub-README index
For deep technical detail on any subsystem:
- server/README.md: environment internals (curriculum, reward shaping, anti-hacking, chaos, drift, MiniStack-fork detail)
- train/README.md: SFT + GRPO training pipeline (LoRA config, Optuna search, multi-turn rollouts)
- scripts/README.md: parallel-rollout architecture (3 pool layers, all-or-nothing connect, concurrency safety)
- data/README.md: dataset generation (5 trajectory types, AST extraction) + base-model selection summary
- data/sft/MODEL_EVALUATION.md: full 11-model benchmark report
- compare/README.md: base vs SFT comparison harness
- aws_infra/README.md: vendored MiniStack upstream documentation (81 KB)