---
title: AWS RL Environment Server
emoji: 🥇
colorFrom: pink
colorTo: pink
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

AWS Cloud Operations — RL Environment & Training Pipeline

Cloud agents fail in production not because they don't know the commands, but because state drifts, services hiccup, and reward signals get gamed. We built an environment that simulates all three: 120+ AWS tasks under chaos and drift, an 8-layer anti-reward-hacking stack, and an adversarial curriculum that targets the agent's own weak spots. After SFT → GRPO on a single GPU with 8 parallel rollouts, format compliance hit 100%, exact-match jumped 39% → 89%, and intermediate-tier success climbed 81% → 87%.


Table of contents

  1. What this is & why it matters
  2. Highlights — full feature inventory
  3. Architecture
  4. Live demo & Quick Start
  5. Run on Colab
  6. Action / Observation spec
  7. Curriculum & Reward (overview)
  8. Training pipeline (SFT → GRPO)
  9. Parallel rollout architecture
  10. MiniStack: vendored & customized
  11. Results & Benchmarks
  12. Repository map
  13. Configuration & Running
  14. Testing
  15. Tech stack
  16. Links
  17. Acknowledgments

1. What this is & why it matters

Modern AI agents are increasingly asked to operate cloud infrastructure — provisioning resources, fixing misconfigurations, responding to drift. Training such agents needs (a) a realistic environment, (b) reliable reward signals, and (c) enough scale to make RL feasible. Existing options force a hard tradeoff: real AWS costs hundreds of dollars per training run and is impossible to reset; toy emulators don't behave like production AWS.

This project closes that gap. We built:

  1. An OpenEnv-compatible RL environment that speaks real AWS CLI semantics. The agent sends aws s3 mb …, aws iam create-role …, and so on — the exact same commands a human SRE would type.
  2. A vendored, customized MiniStack simulator that responds with production-equivalent JSON, runs locally for zero cost, supports 34 AWS services, and exposes a single-call state-introspection endpoint we added so the grader has cheap ground-truth access.
  3. A 120+ task curriculum across 5 tiers (warmup → expert) with adaptive selection, mastery tracking, spaced repetition, chaos injection, and drift-detection scenarios — every feature designed to keep the reward signal honest and prevent the agent from gaming it.
  4. A complete SFT → GRPO training pipeline. A 1,500-row synthetic dataset spanning 5 trajectory shapes, an 11-model base benchmark, LoRA fine-tuning, and TRL GRPO with multi-turn rollouts and Optuna hyperparameter search.
  5. An 8-way parallel-rollout architecture. Server-side MiniStack pool, client-side GrpoPool, in-process MultiTurnEnvPool — three coordinated layers that let G=8 concurrent rollouts run on one GPU without state contamination.

Everything is reproducible: the dataset is generated by a deterministic script, the model selection is documented end-to-end, training entry points run on Colab, and the env runs locally in a single Docker container with no external network requirement.


2. Highlights — full feature inventory

This is the complete surface area of the project. Each entry links to deeper documentation in the corresponding sub-README.

Environment & Curriculum

Reward shaping

Resilience & adversarial features

  • Chaos injection β€” silent mid-episode mutations, tier-scaled probabilities (10/20/30%) on services the task is touching.
  • Drift detection β€” 6 expert tasks, 2–3 random mutations from a per-task pool, randomized per episode (no memorization).
  • Security-posture audit tasks β€” S3 public bucket lockdown, IAM least-privilege, Lambda secret rotation.
  • 8-layer anti-reward-hacking β€” ground-truth verification, dedup, grader invisibility, command allow-list, no-credit-for-reads, monotonic progress, exact resource-name validation, final state checks.

Training pipeline

Parallel rollout architecture

Vendored simulator

Operations & deployment


3. Architecture

System architecture

┌──────────────────────────────────── Docker container ──────────────────────────────────┐
│                                                                                        │
│   FastAPI server  (port 8000)                                                          │
│   ├── OpenEnv router       /reset  /step  /state  /schema  /ws  /health                │
│   ├── Web playground       /web  (Jinja2 + 40 AWS icon SVGs)                           │
│   ├── env_factory          per-WS-session AwsRlEnvironment instance                    │
│   │                        (acquires a MiniStack port from MiniStackPool)              │
│   └── Services                                                                         │
│       Curriculum · TaskGrader · ResourceVerifier · ChaosEngine · DriftEngine           │
│       HintProvider · EpisodeTracker · EnvironmentDesigner · EnvironmentStrategy        │
│                                                                                        │
│                                                                                        │
│   MiniStack instances    :4566  :4567  :4568  …  :4566+POOL_SIZE-1                     │
│   (vendored at aws_infra/, started by the Dockerfile entrypoint)                       │
│                                                                                        │
└─────────────────────────────────────────────────────────────────────────────────────────┘
                ▲                                  ▲
                │ HTTP/WS                          │ AWS CLI subprocess
                │                                  │ (AWS_ENDPOINT_URL=http://localhost:4566+i)
                │                                  │
        ┌───────┴───────────┐              ┌───────┴───────────┐
        │   RL Agent        │              │   AWS CLI         │
        │   (client.py)     │              │   (subprocess)    │
        └───────────────────┘              └───────────────────┘

Episode lifecycle

  1. reset() — wipes simulator state, picks next task from the curriculum, runs setup_commands, applies drift if applicable, returns initial observation.
  2. step(action) — validates the command (must start with aws ), intercepts hint requests, executes via the strategy, records in tracker, grades with shaped reward, optionally injects chaos, returns observation.
  3. Hint — agent sends aws help --task-hint; intercepted before reaching MiniStack; returns next-level hint, increments hints_used (which decays final reward by 0.85^n).
  4. Termination — task_achieved=True or step_count >= MAX_STEPS (default 15).
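
A minimal agent loop over this lifecycle, using the Python client from §4 (my_policy is a hypothetical stand-in for the model):

from aws_rl_env import AwsRlAction, AwsRlEnv

env = AwsRlEnv(base_url="http://localhost:8000")
result = env.reset()                          # wipe state, assign task, maybe drift
print(result.observation.task.description)

while not result.done:                        # ends on success or MAX_STEPS
    command = my_policy(result.observation)   # hypothetical: returns an "aws ..." string
    result = env.step(AwsRlAction(command=command))
    # a command of "aws help --task-hint" would be intercepted here
    # and decay the final reward by 0.85 per hint

print(result.reward)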

Full mechanics are in server/README.md.


4. Live demo & Quick Start

Try it in a browser

The hosted playground lets you click around any task without writing code:

Hugging Face Spaces Playground

Python client

from aws_rl_env import AwsRlAction, AwsRlEnv

with AwsRlEnv.from_docker_image("aws-rl-env:latest") as env:
    result = env.reset()
    print(f"Task: {result.observation.task.description}")

    result = env.step(AwsRlAction(command="aws s3 mb s3://my-bucket"))
    print(f"Reward: {result.reward}, Done: {result.done}")

Or against a running server:

env = AwsRlEnv(base_url="http://localhost:8000")
result = env.reset()
result = env.step(AwsRlAction(command="aws s3 ls"))

WebSocket API

import asyncio, json
import websockets

async def main():
    async with websockets.connect("wss://sizzing-aws-rl-env.hf.space/ws") as ws:
        await ws.send(json.dumps({"type": "reset"}))
        obs = json.loads(await ws.recv())

        await ws.send(json.dumps({"type": "step", "data": {"command": "aws s3 ls"}}))
        obs = json.loads(await ws.recv())

asyncio.run(main())

Local Docker

make docker-build           # build the image
make docker-run             # foreground; serves on :8000
make docker-run-detach      # background
make docker-health          # liveness probe

For training (8-way parallel rollouts):

AWS_RL_ENV_POOL_SIZE=8 make run

5. Run on Colab

The full pipeline is reproducible on a Colab GPU runtime. Drop your token into Colab Secrets, set ENV_BASE_URL to your HF Space (or local with ngrok), and run.

Replace each <!-- TODO --> with the Colab badge URL once published.


6. Action / Observation spec

The full Pydantic data models — kept inline so any reader can wire up an agent without leaving this page. Source: models.py.

Action

class AwsRlAction(Action):
    command: str   # AWS CLI command, e.g. "aws s3 ls"

The environment validates that command starts with aws ; anything else is rejected with success=False.

Observation

class AwsRlObservation(Observation):
    episode_id: EpisodeID
    step_count: StepCount
    command_success: bool          # exit code == 0
    command_output: str            # stdout from the AWS CLI invocation
    error: str                     # stderr (empty if success)
    task: TaskInfo | None          # masked task definition (no success criteria)
    task_achieved: bool
    partial_progress: float        # current task progress in [0.0, 1.0]
    hints_used: int                # cumulative hint count this episode
    hint_text: str                 # most recent hint text (if any)

State

class AwsRlState(State):
    current_task: Task | None      # full task assigned for the episode
    tracker: TrackerState          # episode tracker snapshot
    infra_state: dict              # AWS infrastructure state keyed by service name
    chaos_occurred: bool           # whether chaos was injected this episode
    current_tier: str              # agent's current difficulty tier

class TrackerState:
    step_count: int                # steps taken this episode
    hints_used: int                # hints requested this episode
    progress: float                # current partial progress [0.0, 1.0]
    commands_executed: list[str]   # commands executed this episode
    credited_operations: list[str] # (operation, resource) pairs that earned credit

Task definitions

class Task:
    task_id: TaskID
    difficulty: TaskDifficulty       # warmup | beginner | intermediate | advanced | expert
    description: str                 # human-readable goal
    success_criteria: SuccessCriteria
    setup_commands: list[SetupCommand]      # pre-provision for SRE tasks
    desired_state_spec: str | None          # natural-language desired end state (drift tasks)
    possible_drifts: list[SetupCommand]     # pool of mutations for DriftEngine

class TaskInfo:
    """Agent-visible subset of Task β€” masks success_criteria, setup_commands, and possible_drifts."""
    task_id: TaskID
    difficulty: TaskDifficulty
    description: str
    desired_state_spec: str | None

class SuccessCriteria:
    command_contains: str | None                   # warmup/beginner
    operation: str | None                          # warmup/beginner
    resource_exists: ResourceExistsCheck | None    # beginner
    steps: list[StepCriteria]                      # intermediate/advanced/expert
    services: list[AwsService]                     # advanced/expert
    state_checks: list[StateCheck]                 # expert (ground truth)

Curriculum config

class TierConfig:
    min_episodes: int          # minimum episodes before promotion
    advance_rate: float        # tier success rate threshold (0.6 - 1.0)
    mastery_window: int        # sliding window size (default: 10)
    mastery_threshold: float   # per-task graduation threshold (default: 0.7)
    fast_track_rate: float    # early promotion threshold (default: 0.9)
    chaos_probability: float   # probability of chaos injection per step

class SpacedRepState:
    interval: int                  # episodes until next re-test (3 → 48)
    last_graduated_episode: int    # when last graduated
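
The exact re-test schedule is defined server-side; one plausible reading of the 3 → 48 range is a doubling schedule, sketched here purely as an illustration:

def next_interval(interval: int) -> int:
    # Assumed: 3 -> 6 -> 12 -> 24 -> 48 episodes between re-tests,
    # growing on each successful re-test (capped at 48).
    return min(interval * 2, 48)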

7. Curriculum & Reward (overview)

The curriculum and reward stack is the heart of the project. This section is the elevator pitch; the full mechanics — priority scoring math, anti-reward-hacking layers, chaos engine, drift engine — live in server/README.md.

Priority scoring (one-formula task selection)

score = novelty_bonus          # +100 if never attempted
      + weakness_weight        # +50 × (1 − task_success_rate)
      + spaced_rep_bonus       # +30 if a graduated task is "due" for re-test
      − recency_penalty        # −20 if attempted in the last 2 episodes

Exploration, weakness-targeting, anti-forgetting, and variety — all balanced by one weighted sum.
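
The same formula as a Python sketch; the field names are assumptions, and the real selector lives in the Curriculum service:

from dataclasses import dataclass

@dataclass
class TaskStats:
    attempts: int = 0
    success_rate: float = 0.0
    graduated: bool = False
    due_episode: int = 0             # spaced-rep re-test due point
    last_attempt_episode: int = -10

def priority(stats: TaskStats, episode: int) -> float:
    score = 0.0
    if stats.attempts == 0:
        score += 100                                    # novelty bonus
    score += 50 * (1 - stats.success_rate)              # weakness weight
    if stats.graduated and episode >= stats.due_episode:
        score += 30                                     # spaced-rep "due" bonus
    if episode - stats.last_attempt_episode <= 2:
        score -= 20                                     # recency penalty
    return score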

Reward shaping

def shape_reward(t) -> float:
    """Shaped per-step reward; `t` bundles the tracker fields used below."""
    if t.task_achieved:
        reward = 1.0
        if t.survived_chaos:
            reward *= 1.05                       # chaos survival bonus
    else:
        reward = t.partial_progress * 0.8        # ≤ 0.8 from steps alone
        if t.progress_increased:
            reward += 0.1                        # dense progress signal
        if t.command_failed:
            reward *= 0.5                        # error penalty
        reward -= 0.1 * t.rollback_count         # waste penalty
        reward += 0.02 * t.idempotent_retries    # graceful retry bonus
        reward = max(0.0, min(reward, 0.99))     # clamp: 1.0 reserved for completion
    return reward * 0.85 ** t.hints_used         # hint decay applied last

The agent's loss surface is intentionally narrow: only doing the task earns full reward, and every reward-hacking shortcut we identified during design has a defense layer (full list in server/README.md §9).

Curriculum progression: 5 tiers, priority scoring formula, mastery + spaced rep + fast-track


8. Training pipeline (SFT → GRPO)

The training pipeline runs in two stages, both reproducible on Colab. Full detail in train/README.md.

                      ┌────────── data/sft/ ──────────┐
                      │  1,500 train · 150 val rows   │
                      │  5 trajectory types           │
                      └───────────────┬───────────────┘
                                      ▼
   STAGE 1 — Supervised Fine-Tuning   train/train_sft_lora.ipynb
   Qwen2.5-Coder-3B-Instruct + LoRA r=8/16/32 (Optuna) → SFT adapter
                                      │
                                      │ Sizzing/aws-rl-sft-qwen25coder3b-adapter
                                      ▼
   STAGE 2 — GRPO RL                  train/train_grpo_lora.ipynb
   G=8 parallel rollouts · multi-turn · reward = env return
   Optuna over (lr, β, G, T, top_p, lora_r, max_turns)

Numbers worth knowing

| Item | Value |
|---|---|
| Base model | unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit, picked via a thorough model evaluation (data/sft/MODEL_EVALUATION.md) |
| SFT LoRA | r ∈ {8, 16, 32}, lora_alpha = r × multiplier, target = attention only, dropout [0.005, 0.031] |
| GRPO config | G=8, β=0.04, lr=5e-6, T=0.9, top_p=0.95, max_turns=6, loss=dapo |
| Optuna search | TPE sampler, 6 trials × 30 GRPO steps, frozen 10-task held-out val set |
| Final training | 200 GRPO steps with best config |
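
For orientation, the GRPO row translates roughly into a TRL GRPOConfig like the sketch below; argument names follow TRL's API, but the authoritative config (including the multi-turn rollout wiring in train_grpo.py) is in the notebook:

from trl import GRPOConfig

config = GRPOConfig(
    output_dir="grpo-aws-rl",   # hypothetical output path
    num_generations=8,          # G: parallel rollouts per prompt
    beta=0.04,                  # KL coefficient toward the SFT reference
    learning_rate=5e-6,
    temperature=0.9,
    top_p=0.95,
    loss_type="dapo",
    max_steps=200,              # final run length
)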

Training graphs

Embed once the notebooks are executed: SFT loss curve · GRPO mean reward over training · per-rollout reward by curriculum tier · Optuna parameter importance


9. Parallel rollout architecture

GRPO needs G rollouts on the same task per training step. We run all G in parallel with state isolation guaranteed. Three coordinated pool layers make it work:

                        Trainer (G=8 generations needed per step)
                                        │
                   ┌────────────────────┼────────────────────┐
                   ▼                    ▼                    ▼
            MultiTurnEnvPool        GrpoPool            (in-process)
            (train_grpo.py)         (scripts/grpo_pool.py)
            sync API                async API
                   │                    │
                   └─────── 8 WebSocket connections ────────┘
                                        │
                                        ▼
                            FastAPI server  :8000
                            + OpenEnv max_concurrent_envs=8
                                        │
                                        ▼
                            MiniStackPool (free-list, lock-guarded)
                            acquire(port) on connect, release on disconnect
                                        │
                                        ▼
                    8 isolated MiniStack instances :4566..:4573

Wall-clock impact: an 8-rollout × 6-turn episode runs in ~300 ms of env time vs ~2.4 s sequential. Full mechanics, including the all-or-nothing connect protocol that prevents pool-slot leakage on flake, are in scripts/README.md.
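
A minimal sketch of the free-list semantics behind MiniStackPool (simplified; the real pool adds the all-or-nothing connect protocol):

import threading

class PortPool:
    def __init__(self, base_port: int = 4566, size: int = 8):
        self._free = list(range(base_port, base_port + size))
        self._lock = threading.Lock()

    def acquire(self) -> int:
        # Called when a WebSocket session connects.
        with self._lock:
            if not self._free:
                raise RuntimeError("MiniStack pool exhausted")
            return self._free.pop(0)

    def release(self, port: int) -> None:
        # Called when the session disconnects.
        with self._lock:
            self._free.append(port)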

Parallel rollout: 3 coordinated pool layers


10. MiniStack: vendored & customized

The simulator powering the env is vendored as a git subtree at aws_infra/, not pulled as a black-box dependency. We forked it because we needed:

  1. A custom /_ministack/state JSON endpoint so the grader can read the entire infra inventory in one HTTP call instead of iterating 20+ list APIs per grading pass. Added in commit a648c3a "feat: Add support for service state retrieval and action listing across multiple AWS services".
  2. A reproducible build with no runtime network requirement β€” the Docker image bundles a specific MiniStack revision.
  3. The freedom to extend service coverage on demand.

Custom commits live as small, isolated patches so periodic upstream syncs (af2e945, 579597b) replay cleanly. To inspect:

git show a648c3a               # the state-endpoint diff
git log --oneline -- aws_infra/  # only the aws_infra subtree history
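
The custom state endpoint can also be exercised directly; a sketch (the JSON layout shown is an assumption, not a documented contract):

import httpx

# One HTTP call returns the whole infra inventory for grading,
# instead of 20+ per-service list calls.
state = httpx.get("http://localhost:4566/_ministack/state").json()
print(sorted(state))   # e.g. per-service sections such as "s3", "iam", ...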

Full subtree workflow + commit-by-commit detail in server/README.md §5. Upstream MiniStack docs (81 KB) are preserved at aws_infra/README.md.


11. Results & Benchmarks

Base-model selection

We evaluated 11 chat models on 27 held-out prompts. Qwen2.5-Coder-3B-Instruct wins on every metric that matters: 41% exact match (highest), 63% operation match (highest), 3.1 s/call (3× faster than the 4B runner-up). Full report:

data/sft/MODEL_EVALUATION.md — 270-line writeup, per-model verdicts, methodology

Top 4 candidate models on the held-out benchmark

Base vs SFT — actual results

After running the SFT pipeline end-to-end, the eval delta on the same held-out prompts is striking:

| Metric | Base | Post-SFT | Delta |
|---|---|---|---|
| format_pct | 33.3% | 100.0% | +66.7 pp |
| exact_pct | 38.9% | 88.9% | +50.0 pp |
| service_pct | 77.8% | 88.9% | +11.1 pp |
| operation_pct | 61.1% | 88.9% | +27.8 pp |
| avg_len | 85.8 | 74.7 | −11 chars (tighter) |

Base vs SFT eval-metrics comparison

Every target from data/sft/MODEL_EVALUATION.md §11 is met or exceeded. Format compliance is now perfect; the model never wraps commands in fences or quotes after SFT. Exact-match jumped from 39% to 89% — the agent now emits the canonical command for ~9 of every 10 prompts.

The richer two-mode benchmark (dataset eval + live RL env eval) is in compare/compare_base_vs_sft.ipynb; methodology in compare/README.md.

Dataset comparison: base vs SFT (per-row scores) · RL env comparison: base vs SFT (per-episode rewards)

SFT training curves

SFT loss curve over training

Optuna SFT search

The best SFT trial (out of 6) used lora_r=16, lora_alpha=16, dropout=0.0058, lr=4.03e-4, warmup=0.1 — see train/README.md §3 for the full Optuna study table.

Optuna parameter importances · Optuna optimization history

GRPO results (live multi-step env eval)

After 35 GRPO steps on top of the SFT adapter (best Optuna config: lr=1.6e-5, β=0.0021, T=0.99), we re-evaluated end-to-end on 100+ episodes:

| Metric | Base + SFT | Base + SFT + GRPO | Δ |
|---|---|---|---|
| Overall success rate | 86.8% | 86.2% | −0.5 pp |
| Overall mean reward | 0.883 | 0.877 | −0.006 |
| Beginner success | 96.2% | 100.0% | +3.8 pp |
| Intermediate success | 81.0% | 87.0% | +6.0 pp |
| Warmup success | 96.0% | 90.2% | −5.8 pp |
| Expert success | 22.2% | 22.2% | flat |
| Drift repair rate | 22.2% | 22.2% | flat |
| Destructive-action fail rate | 15.1% | 14.7% | −0.4 pp |
| Steps to solve | 1.45 | 1.55 | +0.10 |

SFT vs GRPO metrics grid · SFT vs GRPO by tier

Honest reading: the 35-step GRPO run preserves the SFT gains and modestly improves the middle tiers (beginner +3.8 pp, intermediate +6.0 pp) — but does not crack the expert-tier bottleneck (22% success on SRE / drift / security-posture tasks). With longer GRPO runs and more curriculum exposure to expert tasks, this is the next gain to chase.

GRPO training curves

Per-step training signals from the final 35-step GRPO run:

GRPO final per-step training signals · GRPO env reward over training

Optuna search across 4 trials picked the final config:

GRPO Optuna trial comparison · GRPO Optuna parameter importances · GRPO Optuna optimization history

Qualitative rollouts (post-GRPO)

One sample episode per tier:

Qualitative rollouts on representative tasks


12. Repository map

| Path | Purpose | Sub-README |
|---|---|---|
| server/ | OpenEnv FastAPI server, env logic, services, web playground | server/README.md |
| train/ | SFT and GRPO training notebooks | train/README.md |
| data/ | SFT dataset, base-model selection, eval harness | data/README.md · MODEL_EVALUATION.md |
| compare/ | Base vs SFT side-by-side benchmark | compare/README.md |
| scripts/ | Parallel-rollout architecture + multi-connection demo | scripts/README.md |
| aws_infra/ | Vendored MiniStack simulator (git subtree) | aws_infra/README.md |
| tests/, tests_tasks/ | Unit + tier-integration test suites | (see §14) |
| models.py | Pydantic data models for action/observation/task | (inline §6) |
| client.py | OpenEnv HTTP/WebSocket client wrapper | — |
| inference.py | Single-model agent loop (matches RL eval mode of compare/) | — |
| train_grpo.py | GRPO trainer (1,283 LOC) — MultiTurnEnvPool, Optuna, plotting | (see train/README.md) |
| aws_rl_env_colab.ipynb | Colab driver for the full training pipeline | — |
| docs/figures/ | All README graphs and screenshots | — |

13. Configuration & Running

Docker (recommended)

make docker-build          # build the image
make docker-run            # foreground on :8000
make docker-run-detach     # background
make docker-health         # liveness probe

OpenEnv deployment

make openenv-validate      # validate config
make openenv-build         # build environment
make openenv-push          # push to HuggingFace Spaces

Environment variables

| Variable | Default | Description |
|---|---|---|
| AWS_INFRA_URL | http://localhost:4566 | MiniStack endpoint (used when POOL_SIZE=1) |
| AWS_RL_ENV_POOL_SIZE | 1 | Server-side MiniStack pool size; set to 8 for GRPO training |
| AWS_RL_ENV_MINISTACK_BASE_PORT | 4566 | First MiniStack port; pool covers [BASE, BASE + POOL_SIZE) |
| BACKEND_TYPE | simulator | simulator (MiniStack) or aws (real AWS, no pool) |
| AWS_ACCESS_KEY_ID | test | AWS credentials (any value works for the simulator) |
| AWS_SECRET_ACCESS_KEY | test | AWS credentials (any value works for the simulator) |
| AWS_DEFAULT_REGION | us-east-1 | AWS region |
| MAX_STEPS | 15 | Max steps per episode |
| API_BASE_URL | — | LLM API endpoint for inference.py |
| MODEL_NAME | — | LLM model name for inference.py |
| HF_TOKEN | — | HuggingFace token (dataset/adapter access, push) |
| TEMPERATURE | 0.7 | LLM sampling temperature |
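
How the pool variables compose, as a small sketch (illustrative; the actual parsing lives in the server code):

import os

pool_size = int(os.environ.get("AWS_RL_ENV_POOL_SIZE", "1"))
base_port = int(os.environ.get("AWS_RL_ENV_MINISTACK_BASE_PORT", "4566"))
# The pool covers [BASE, BASE + POOL_SIZE): e.g. 8 instances on :4566..:4573.
ports = list(range(base_port, base_port + pool_size))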

Curriculum stats API

curriculum.get_stats()
# {
#   "episode_count": 42,
#   "tier": "intermediate",
#   "tier_episodes": 12,
#   "tier_success_rate": 0.75,
#   "graduated_tasks": [0, 2, 4],
#   "weak_spots": [11, 12],
#   "skill_profile": {0: 0.95, 1: 0.8, ...},
#   "spaced_rep_due": [0, 2],
#   "avg_reward_last_10": 0.65
# }

14. Testing

The test suite covers both isolated unit logic and end-to-end task execution against MiniStack.

Unit tests — tests/

pytest tests/ -v

| File | Covers |
|---|---|
| test_aws_rl_env_environment.py | Environment lifecycle, reset/step semantics, reward integration |
| test_task_grader.py | All 5 grading strategies, partial progress, penalties, bonuses |
| test_resource_verifier.py | Per-service ground-truth verification (20+ services) |
| test_episode_tracker.py | Command parsing, dedup, monotonic progress, rollback detection |
| test_episode_context.py | Per-episode context lifecycle |
| test_drift_engine.py | Random drift selection, mutation application |
| test_hint_provider.py | Three-level progressive hints, decay computation |
| test_environment_designer.py | Setup-command provisioning |
| test_pool.py | Server-side MiniStackPool acquire/release, exhaustion |
| test_grpo_pool.py | Client-side GrpoPool connect/close, all-or-nothing rollback |

Tier integration tests — tests_tasks/

pytest tests_tasks/ -v

134 tasks exercised end-to-end.

These tests double as the source of truth for canonical solutions used by the SFT dataset generator (extracted via AST — see data/README.md §1).
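
A sketch of what the AST extraction could look like (illustrative; the real generator is documented in data/README.md §1):

import ast

def extract_canonical_commands(test_source: str) -> list[str]:
    # Walk a test module and collect string literals that are CLI commands.
    tree = ast.parse(test_source)
    return [
        node.value
        for node in ast.walk(tree)
        if isinstance(node, ast.Constant)
        and isinstance(node.value, str)
        and node.value.startswith("aws ")
    ]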


15. Tech stack

  • Python 3.12, uv for dependency management, multi-stage Docker
  • FastAPI, OpenEnv (HTTP + WebSocket env protocol), uvicorn
  • TRL β‰₯ 0.21 (GRPOTrainer, GRPOConfig)
  • PEFT (LoRA), Unsloth (4-bit quantized base, fused training kernels)
  • Transformers β‰₯ 4.45, datasets β‰₯ 2.20, HuggingFace Hub β‰₯ 0.24
  • Optuna β‰₯ 3.6 (TPE sampler, SQLite study storage)
  • asyncio + websockets + httpx (parallel rollout orchestration)
  • MiniStack (vendored at aws_infra/, 34 AWS services)
  • AWS CLI v2 (subprocess invocation against MiniStack endpoint)
  • matplotlib, plotly (training curves, Optuna visualizations)
  • pytest (16 test files, ~250 KB of test code)

16. Links


17. Acknowledgments

  • Meta,HuggingFace,UnslothAndScalar for Organising hackathon and providing mentors to clarify the doubts.
  • MiniStack β€” vendored at aws_infra/. Upstream license preserved. Custom modifications attributable to commits a648c3a, a00e981; periodic upstream syncs af2e945, 579597b.
  • OpenEnv β€” environment protocol and Python client framework.
  • TRL (HuggingFace) β€” GRPOTrainer implementation.
  • Unsloth β€” 4-bit quantized model loaders + fused training kernels.
  • Google Colab for providing their infrastructure to train models.
  • AWS service icons in server/static/img/aws/ β€” used in the web playground.

Sub-README index

For deep technical detail on any subsystem:

  • server/README.md β€” environment internals (curriculum, reward shaping, anti-hacking, chaos, drift, MiniStack-fork detail)
  • train/README.md β€” SFT + GRPO training pipeline (LoRA config, Optuna search, multi-turn rollouts)
  • scripts/README.md β€” parallel-rollout architecture (3 pool layers, all-or-nothing connect, concurrency safety)
  • data/README.md β€” dataset generation (5 trajectory types, AST extraction) + base-model selection summary
  • data/sft/MODEL_EVALUATION.md β€” full 11-model benchmark report
  • compare/README.md β€” base vs SFT comparison harness
  • aws_infra/README.md β€” vendored MiniStack upstream documentation (81 KB)

Small Video Explanation