vedkdev committed
Commit 761f203 · verified · 1 Parent(s): e9f7c2a

Upload folder using huggingface_hub
02-deployment.md ADDED
@@ -0,0 +1,427 @@
+ # 2. Deploying an OpenEnv environment
+
+ This section covers deploying OpenEnv environments locally, on clusters, and on Hugging Face Spaces.
+
+ **Contents:**
+ - [HF Spaces are the infrastructure for OpenEnv environments](#hf-spaces-are-the-infrastructure-for-openenv-environments)
+ - [Local Development with Uvicorn](#local-development-with-uvicorn)
+ - [Docker Deployment](#docker-deployment)
+ - [Deploy with CLI](#deploy-with-cli)
+ - [Demo: Deploying to Hugging Face Spaces](#demo-deploying-to-hugging-face-spaces)
+
+ ## HF Spaces are the infrastructure for OpenEnv environments
+
+ Every HF Space provides three things that OpenEnv environments need:
+
+ | Component | What it provides | How to access | Used as |
+ |-----------|------------------|---------------|-----------|
+ | **Server** | Running environment endpoint | `https://<username>-<space-name>.hf.space` | Agent and public API |
+ | **Repository** | Installable Python package | `pip install git+https://huggingface.co/spaces/<username>/<space-name>` | Code and client |
+ | **Registry** | Docker container image | `docker pull registry.hf.space/<username>-<space-name>:latest` | Deployment |
+
+ This means a single Space deployment gives you all the components you need to use an environment in training.
+
+ ### 1. Server: A running environment endpoint
+
+ When you deploy to HF Spaces, your environment runs as a server. The client connects via **WebSocket** (`/ws`) for a persistent session:
+
+ ```python
+ from echo_env import EchoEnv, EchoAction
+
+ # Connect directly to the running Space (WebSocket under the hood)
+ # Async (recommended):
+ async with EchoEnv(base_url="https://openenv-echo-env.hf.space") as client:
+     result = await client.reset()
+     result = await client.step(EchoAction(message="Hello"))
+
+ # Sync (using the .sync() wrapper):
+ with EchoEnv(base_url="https://openenv-echo-env.hf.space").sync() as client:
+     result = client.reset()
+     result = client.step(EchoAction(message="Hello"))
+ ```
+
+ **Endpoints available:**
+
+ | Endpoint | Protocol | Description |
+ |----------|----------|-------------|
+ | `/ws` | **WebSocket** | Persistent session (used by client) |
+ | `/health` | HTTP GET | Health check |
+ | `/reset` | HTTP POST | Reset environment (stateless) |
+ | `/step` | HTTP POST | Execute action (stateless) |
+ | `/state` | HTTP GET | Get current state |
+ | `/docs` | HTTP GET | OpenAPI documentation |
+ | `/web` | HTTP GET | Interactive web UI |
+
+ > **Note:** The Python client uses the `/ws` WebSocket endpoint by default. HTTP endpoints are available for debugging or stateless use cases.
+
+ **Example: Check if a Space is running**
+
+ ```bash
+ curl https://openenv-echo-env.hf.space/health
+ # {"status": "healthy"}
+ ```
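+
+ If you want to exercise the stateless HTTP endpoints directly, here is a minimal sketch using `requests`. The JSON payload shape is an assumption inferred from the typed models above; check `/docs` on the Space for the authoritative schema:
+
+ ```python
+ import requests
+
+ base = "https://openenv-echo-env.hf.space"
+
+ # Stateless reset, then a step with an EchoAction-shaped body (hypothetical payload)
+ print(requests.post(f"{base}/reset", timeout=30).json())
+ print(requests.post(f"{base}/step", json={"action": {"message": "Hello"}}, timeout=30).json())
+ ```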
+
+ ### 2. Repository: Installable Python package
+
+ Every Space is a Git repository. OpenEnv environments include a `pyproject.toml`, making them pip-installable directly from the Space URL.
+
+ ```bash
+ # Install client package from Space
+ pip install git+https://huggingface.co/spaces/openenv/echo-env
+ ```
+
+ This installs:
+ - **Client class** (`EchoEnv`) — Handles HTTP/WebSocket communication
+ - **Models** (`EchoAction`, `EchoObservation`) — Typed action and observation classes
+ - **Utilities** — Any helper functions the environment provides
+
+ **After installation:**
+
+ ```python
+ from echo_env import EchoEnv, EchoAction, EchoObservation
+
+ # Now you have typed classes for the environment
+ action = EchoAction(message="Hello")
+ ```
+
+ ### 3. Registry: Docker container image
+
+ Every Docker-based Space has a container registry. You can pull and run the environment locally.
+
+ ```bash
+ # Pull the image
+ docker pull registry.hf.space/openenv-echo-env:latest
+
+ # Run locally on port 8001
+ docker run -d -p 8001:8000 registry.hf.space/openenv-echo-env:latest
+ ```
+
+ **Find the registry URL for any Space:**
+
+ 1. Go to the Space page (e.g., [openenv/echo-env](https://huggingface.co/spaces/openenv/echo-env))
+ 2. Click **⋮** (three dots) → **"Run locally"**
+ 3. Copy the `docker run` command
+
+ ### Choosing an access method
+
+ | Method | Use when | Pros | Cons |
+ |--------|----------|------|------|
+ | **Server** | Quick testing, low volume | Zero setup | Network latency, rate limits |
+ | **Repository** | Need typed classes | Type safety, IDE support | Still need a server |
+ | **Docker** | Local dev, high throughput | Full control, no network | Requires Docker |
+
+ **Typical workflow:**
+
+ ```python
+ import asyncio
+ from echo_env import EchoEnv, EchoAction
+
+ async def main():
+     # Development: connect to remote Space
+     async with EchoEnv(base_url="https://openenv-echo-env.hf.space") as client:
+         result = await client.reset()
+
+     # Production: run locally for speed
+     # docker run -d -p 8001:8000 registry.hf.space/openenv-echo-env:latest
+     async with EchoEnv(base_url="http://localhost:8001") as client:
+         result = await client.reset()
+
+     # Or let the client manage Docker for you (see the Container Lifecycle table below)
+     client = await EchoEnv.from_hub("openenv/echo-env")  # Auto-pulls and runs
+     async with client:
+         result = await client.reset()
+
+ asyncio.run(main())
+
+ # For sync usage, use the .sync() wrapper:
+ with EchoEnv(base_url="http://localhost:8001").sync() as client:
+     result = client.reset()
+ ```
+
+ > **Reference:** [HF Spaces Documentation](https://huggingface.co/docs/hub/spaces) | [Environment Hub Collection](https://huggingface.co/collections/openenv/environment-hub)
+
+
+ ## Local Development with Uvicorn
+
+ The fastest way to iterate on environment logic is running directly with Uvicorn.
+
+ ### Clone and run the environment locally
+
+ ```bash
+ # Clone from HF Space
+ git clone https://huggingface.co/spaces/burtenshaw/openenv-benchmark
+ cd openenv-benchmark
+
+ # Install in editable mode
+ uv sync
+
+ # Start server
+ uv run server
+
+ # Run isolated from remote Space
+ uv run --isolated --project https://huggingface.co/spaces/burtenshaw/openenv-benchmark server
+ ```
+
+ ### Run Uvicorn directly
+
+ ```bash
+ # Full control over uvicorn options
+ uvicorn benchmark.server.app:app --host "$HOST" --port "$PORT" --workers "$WORKERS"
+
+ # With reload for development
+ uvicorn benchmark.server.app:app --host 0.0.0.0 --port 8000 --reload
+
+ # Multi-worker mode for better concurrency
+ uvicorn benchmark.server.app:app --host 0.0.0.0 --port 8000 --workers 4
+ ```
+
+ | Flag | Purpose |
+ |------|---------|
+ | `--reload` | Auto-restart on code changes |
+ | `--workers N` | Run N worker processes |
+ | `--log-level debug` | Verbose logging |
+
+ ## Docker Deployment
+
+ Docker provides isolation and reproducibility for production use.
+
+ ### Run the environment locally from the Space
+
+ ```bash
+ # Run the environment locally from the Space
+ docker run -d -p 8000:8000 registry.hf.space/openenv-echo-env:latest
+ ```
+
+ ### Build Image
+
+ ```bash
+ # Clone from HF Space
+ git clone https://huggingface.co/spaces/burtenshaw/openenv-benchmark
+ cd openenv-benchmark
+
+ # Using OpenEnv CLI (recommended)
+ openenv build -t openenv-benchmark:latest
+
+ # Or with Docker directly
+ docker build -t openenv-benchmark:latest -f server/Dockerfile .
+ ```
+
+ ### Run Container
+
+ ```bash
+ # Basic run
+ docker run -d -p 8000:8000 my-env:latest
+
+ # With environment variables
+ docker run -d -p 8000:8000 \
+   -e WORKERS=4 \
+   -e MAX_CONCURRENT_ENVS=100 \
+   my-env:latest
+
+ # Named container for easy management
+ docker run -d --name my-env -p 8000:8000 my-env:latest
+ ```
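+
+ Detached containers (`-d`) can take a few seconds to become ready. A small polling sketch against the `/health` endpoint shown earlier:
+
+ ```python
+ import time
+ import requests
+
+ # Poll /health until the container is ready to accept clients
+ for _ in range(60):
+     try:
+         if requests.get("http://localhost:8000/health", timeout=2).ok:
+             break
+     except requests.ConnectionError:
+         pass
+     time.sleep(1)
+ ```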
+
+ ### Connect from Python
+
+ ```python
+ import asyncio
+ from echo_env import EchoEnv, EchoAction
+
+ async def main():
+     # Async usage (recommended)
+     async with EchoEnv(base_url="http://localhost:8000") as client:
+         result = await client.reset()
+         result = await client.step(EchoAction(message="Hello"))
+         print(result.observation)
+
+     # From Docker image
+     client = await EchoEnv.from_docker_image("<local_docker_image>")
+     async with client:
+         result = await client.reset()
+         print(result.observation)
+
+ asyncio.run(main())
+
+ # Sync usage (using the .sync() wrapper)
+ with EchoEnv(base_url="http://localhost:8000").sync() as client:
+     result = client.reset()
+     result = client.step(EchoAction(message="Hello"))
+     print(result.observation)
+ ```
+
+ ### Container Lifecycle
+
+ | Method | Container | WebSocket | On `close()` |
+ |--------|-----------|-----------|--------------|
+ | `from_hub(repo_id)` | Starts | Connects | Stops container |
+ | `from_hub(repo_id, use_docker=False)` | None (UV) | Connects | Stops UV server |
+ | `from_docker_image(image)` | Starts | Connects | Stops container |
+ | `MyEnv(base_url=...)` | None | Connects | Disconnects only |
+
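+ The table above maps to code roughly as follows (a sketch; `EchoEnv` stands in for any generated client class):
+
+ ```python
+ # Client-managed container: from_hub() starts it, close() stops it
+ client = await EchoEnv.from_hub("openenv/echo-env")
+ async with client:              # exiting the block calls close()
+     result = await client.reset()
+
+ # Externally managed server: close() only disconnects the WebSocket
+ async with EchoEnv(base_url="http://localhost:8000") as client:
+     result = await client.reset()
+ ```
+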
+ **Find Docker commands for any Space:**
+
+ 1. Open the Space on the Hugging Face Hub
+ 2. Click the **⋮ (three dots)** menu
+ 3. Select **"Run locally"**
+ 4. Copy the provided `docker run` command
+
+ ## Deploy with CLI
+
+ ```bash
+ cd my_env
+
+ # Deploy to your namespace
+ openenv push
+
+ # Deploy to specific repo
+ openenv push --repo-id username/my-env
+
+ # Deploy as private
+ openenv push --repo-id username/my-env --private
+ ```
+
+ ### Space Configuration
+
+ The `openenv.yaml` manifest controls Space settings:
+
+ ```yaml
+ # openenv.yaml
+ name: my_env
+ version: "1.0.0"
+ description: My custom environment
+ ```
+
+ **Hardware options:**
+
+ | Tier | vCPU | RAM | Cost |
+ |------|------|-----|------|
+ | CPU Basic (Free) | 2 | 16 GB | Free |
+ | CPU Upgrade | 8 | 32 GB | $0.03/hr |
+
+ OpenEnv environments support configuration via environment variables:
+
+ | Variable | Default | Description |
+ |----------|---------|-------------|
+ | `WORKERS` | 4 | Uvicorn worker processes |
+ | `PORT` | 8000 | Server port |
+ | `HOST` | 0.0.0.0 | Bind address |
+ | `MAX_CONCURRENT_ENVS` | 100 | Max WebSocket sessions |
+ | `ENABLE_WEB_INTERFACE` | Auto | Enable web UI |
+
+ ### Environment-Specific Variables
+
+ Some environments have custom variables:
+
+ **TextArena:**
+ ```bash
+ TEXTARENA_ENV_ID=Wordle-v0
+ TEXTARENA_NUM_PLAYERS=1
+ TEXTARENA_MAX_TURNS=6
+ ```
+
+ **Coding Environment:**
+ ```bash
+ SANDBOX_TIMEOUT=30
+ MAX_OUTPUT_LENGTH=10000
+ ```
+
+ ## Demo: Deploying to Hugging Face Spaces
+
+ This demo walks through the full workflow: create an environment, test locally, deploy to HF Spaces, and use it.
+
+ ### Step 1: Initialize a new environment
+
+ ```bash
+ openenv init my_env
+ cd my_env
+ ```
+
+ This creates the standard OpenEnv structure:
+
+ ```
+ my_env/
+ ├── server/
+ │   ├── app.py          # FastAPI server
+ │   ├── environment.py  # Your environment logic
+ │   └── Dockerfile
+ ├── models.py           # Action/Observation types
+ ├── client.py           # HTTP client
+ ├── openenv.yaml        # Manifest
+ └── pyproject.toml
+ ```
+
+ ### Step 2: Run locally
+
+ ```bash
+ # Start the server
+ uv run server
+
+ # Or with uvicorn directly
+ uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
+ ```
+
+ Test the health endpoint:
+
+ ```bash
+ curl http://localhost:8000/health
+ # {"status": "healthy"}
+ ```
+
+ ### Step 3: Deploy to HF Spaces
+
+ ```bash
+ openenv push --repo-id username/my-env
+ ```
+
+ Your environment is now live at:
+ - Web UI: https://username-my-env.hf.space/web
+ - API Docs: https://username-my-env.hf.space/docs
+ - Health: https://username-my-env.hf.space/health
+
+ ```bash
+ curl https://username-my-env.hf.space/health
+ # {"status": "healthy"}
+ ```
+
+ ### Step 4: Install the environment
+
+ Using the `openenv/echo-env` Space as the example from here on:
+
+ ```bash
+ uv pip install git+https://huggingface.co/spaces/openenv/echo-env
+ ```
+
+ ### Step 5: Run locally via Docker (optional)
+
+ Pull and run the container from the HF registry, or open the ["Run locally" dialog in the browser](https://huggingface.co/spaces/openenv/echo-env?docker=true):
+
+ ```bash
+ # Pull from HF Spaces registry
+ docker pull registry.hf.space/openenv-echo-env:latest
+
+ # Run locally (the server listens on port 8000)
+ docker run -it -p 8000:8000 --platform=linux/amd64 \
+   registry.hf.space/openenv-echo-env:latest
+ ```
+
+ Now connect to your local instance:
+
+ ```python
+ import asyncio
+ from echo_env import EchoEnv, EchoAction
+
+ # Async (recommended)
+ async def main():
+     async with EchoEnv(base_url="http://localhost:8000") as env:
+         result = await env.reset()
+         print(result.observation)
+         result = await env.step(EchoAction(message="Hello"))
+         print(result.observation)
+
+ asyncio.run(main())
+
+ # Sync (using the .sync() wrapper)
+ with EchoEnv(base_url="http://localhost:8000").sync() as env:
+     result = env.reset()
+     print(result.observation)
+     result = env.step(EchoAction(message="Hello"))
+     print(result.observation)
+ ```
Dockerfile ADDED
@@ -0,0 +1,18 @@
+ FROM python:3.11-slim
+
+ # git and patch are needed at runtime (repo cloning and `patch --dry-run` checks)
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     git \
+     patch \
+     && rm -rf /var/lib/apt/lists/*
+
+ WORKDIR /app
+
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ COPY . .
+
+ EXPOSE 8000
+
+ ENV ENABLE_WEB_INTERFACE=true
+ CMD ["python", "-m", "server.app"]
GRADING.md ADDED
@@ -0,0 +1,253 @@
+ # FlakySleuth Grading: Exact Scoring Formulas
+
+ This document describes the **exact scoring logic implemented in code** for:
+ - Task 1: `classify` (`classify_flakiness`)
+ - Task 2: `root_cause` (`classify_root_cause`)
+ - Task 3: `fix_proposal` (`propose_fix`)
+
+ It also explains how per-step rewards are combined inside the environment.
+
+ ## Source of Truth
+
+ - `env/environment.py`
+ - `graders/__init__.py`
+ - `graders/task1_grader.py`
+ - `graders/task2_grader.py`
+ - `graders/task3_grader.py`
+ - `dataset/category_similarity.json`
+
+ ## 1) Dispatch: Which grader is used?
+
+ `graders/grade_action()` selects the grader by `task["task_type"]`:
+ - `classify` -> Task 1 grader
+ - `root_cause` -> Task 2 grader
+ - `fix_proposal` -> Task 3 grader
+ - anything else -> `0.0`
+
+ ## 2) Environment reward pipeline (applies to all tasks)
+
+ At each `env.step(action)`:
+
+ 1. If the action is terminal (`classify_flakiness`, `classify_root_cause`, `propose_fix`):
+    - compute `terminal_score = grade_action(action, task)`
+    - compute penalties
+    - final step reward (sketched in Python after this section):
+
+    ```text
+    reward = clamp(
+        cumulative_progress + terminal_score - late_penalty - wrong_dir_penalty,
+        0.0,
+        1.0
+    )
+    ```
+
+    Where:
+    - `late_penalty = max(0, step_count - 15) * 0.05`
+    - `wrong_dir_penalty = 0.2` only when:
+      - the action is `classify_flakiness`
+      - the predicted argument is `"stable"`
+      - the ground-truth label is `"flaky"`
+    - `done = True`
+
+ 2. If the action is non-terminal (exploration):
+    - compute `progress` from the exploration action
+    - update cumulative progress:
+
+    ```text
+    cumulative_progress = clamp(cumulative_progress + progress, 0.0, 0.30)
+    reward = progress
+    ```
+
+ 3. Timeout rule:
+    - if not already done and `step_count >= max_steps`, set `done = True`
+    - no additional terminal score is applied at timeout.
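+
+ A minimal Python sketch of the terminal-reward combination above (it mirrors the logic in `env/environment.py`):
+
+ ```python
+ def terminal_reward(cumulative_progress: float, terminal_score: float,
+                     step_count: int, wrong_direction: bool) -> float:
+     """Combine progress, grader score, and penalties as described above."""
+     late_penalty = max(0, step_count - 15) * 0.05
+     wrong_dir_penalty = 0.2 if wrong_direction else 0.0
+     raw = cumulative_progress + terminal_score - late_penalty - wrong_dir_penalty
+     return min(1.0, max(0.0, raw))  # clamp to [0, 1]
+ ```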
+
+ ## 3) Exploration progress rewards (exact values)
+
+ ### `read_file`
+ - file missing/unsafe -> `progress = -0.05`
+ - file already read in this episode -> `progress = 0.0`
+ - new file:
+   - if the file path contains `task["test_file"]` -> `0.07`
+   - else if the file ends with `.py` -> `0.03`
+   - else -> `0.01`
+
+ ### `search_code`
+ - if the query contains any flaky-signal tokens (`sleep`, `random`, `time`, `datetime`, `thread`, `asyncio`, `fixture`, `setup`, `teardown`, `global`, `shared`, `singleton`, `os.environ`, `socket`, `timeout`, `retry`, `mock`, `patch`) -> `0.04`
+ - otherwise -> `0.01`
+
+ ### `run_test`
+ - if the category is **not** one of `OD`, `OD-Brit`, `OD-Vic` -> `0.05`
+ - if the category is order-dependent (`OD`, `OD-Brit`, `OD-Vic`) -> `0.0`
+
+ ### unsupported action type
+ - `progress = -0.05`
+
+ ## 4) Task 1 scorer (`classify_flakiness`)
+
+ Binary exact-match scorer:
+
+ ```text
+ if action_type != "classify_flakiness": return 0.0
+ if predicted not in {"flaky","stable"}: return 0.0
+ truth = task["label"] (default "flaky")
+ terminal_score = 1.0 if predicted == truth else 0.0
+ ```
+
+ Notes:
+ - In the current dataset builder, rows are written with `label = "flaky"` by default.
+ - Predicting `"stable"` on flaky truth also triggers the environment's `wrong_dir_penalty = 0.2`.
+
+ ## 5) Task 2 scorer (`classify_root_cause`)
+
+ Matrix-based similarity scorer.
+
+ ### 5.1 Category normalization
+
+ Prediction and truth are normalized by:
+ - trimming whitespace
+ - replacing `_` with `-`
+ - replacing spaces with `-`
+ - uppercasing and mapping through canonical aliases:
+   - `OD-BRIT` -> `OD-Brit`
+   - `OD-VIC` -> `OD-Vic`
+   - etc.
+
+ If the normalized value is not in the valid set, the score is `0.0`.
+
+ The truth category is the **first** category if semicolon-separated:
+
+ ```text
+ raw_truth = str(task["category"]).split(";")[0]
+ ```
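+
+ A sketch of this normalization (the alias table is abbreviated to the entries listed above):
+
+ ```python
+ ALIASES = {"OD-BRIT": "OD-Brit", "OD-VIC": "OD-Vic"}  # the full table has more entries
+
+ def normalize_category(value: str) -> str:
+     # trim, unify separators, uppercase, then map through the aliases
+     v = value.strip().replace("_", "-").replace(" ", "-").upper()
+     return ALIASES.get(v, v)
+ ```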
+
+ ### 5.2 Similarity scoring
+
+ ```text
+ if predicted == truth: return 1.0
+ else return similarity[predicted,truth] or similarity[truth,predicted] or 0.0
+ ```
+
+ The similarity matrix is loaded from `dataset/category_similarity.json`.
+
+ Current non-identity similarity entries:
+ - `OD,OD-Brit`: `0.7`
+ - `OD,OD-Vic`: `0.7`
+ - `OD-Brit,OD-Vic`: `0.8`
+ - `OD,NIO`: `0.4`
+ - `OD,NDOI`: `0.3`
+ - `NOD,TD`: `0.6`
+ - `NOD,TZD`: `0.5`
+ - `NOD,NDOI`: `0.5`
+ - `TD,TZD`: `0.7`
+ - `NOD,ID`: `0.3`
+ - `UD,OD`: `0.2`
+ - `UD,NOD`: `0.2`
+ - `UD,NIO`: `0.2`
+ - `UD,TD`: `0.2`
+ - `UD,ID`: `0.2`
+
+ Any missing pair defaults to `0.0`.
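+
+ Because the matrix stores each pair once, the lookup is symmetric. A sketch against the JSON format shown later in this commit:
+
+ ```python
+ import json
+
+ def similarity_score(predicted: str, truth: str, matrix_path: str) -> float:
+     with open(matrix_path) as fp:
+         matrix = json.load(fp)  # keys look like "OD,OD-Brit"
+     if predicted == truth:
+         return 1.0
+     return matrix.get(f"{predicted},{truth}", matrix.get(f"{truth},{predicted}", 0.0))
+
+ # similarity_score("OD-Brit", "OD-Vic", "dataset/category_similarity.json") -> 0.8
+ ```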
+
+ ## 6) Task 3 scorer (`propose_fix`)
+
+ Hybrid weighted scorer:
+
+ ```text
+ if action_type != "propose_fix": return 0.0
+ if proposed_fix is empty: return 0.0
+
+ total = 0.35 * pattern_score + 0.25 * apply_score + 0.40 * judge_score
+ terminal_score = round(clamp(total, 0.0, 1.0), 4)
+ ```
+
+ ### 6.1 `pattern_score`
+
+ Category-specific keyword patterns are checked against the proposed diff.
+
+ For a category with a pattern list:
+
+ ```text
+ matches = number of patterns found (case-insensitive substring)
+ pattern_score = min(1.0, matches / max(1, len(patterns) * 0.4))
+ ```
+
+ If the category has no pattern list:
+ - `pattern_score = 0.5`
+
+ Current pattern lists:
+ - `TD`: `freeze_time`, `mock`, `patch`, `utcnow`, `datetime`, `monkeypatch`
+ - `TZD`: `timezone`, `utc`, `pytz`, `zoneinfo`, `tzinfo`, `UTC`
+ - `NOD`: `seed`, `mock`, `patch`, `deterministic`, `sorted`
+ - `NIO`: `setup`, `teardown`, `fixture`, `yield`, `cleanup`, `autouse`
+ - `ID`: `sorted(`, `list(`, `frozenset`, `OrderedDict`
+
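+ A sketch of this rule (for example, matching 2 of the 6 `TD` patterns gives `min(1.0, 2 / 2.4) ≈ 0.833`):
+
+ ```python
+ def pattern_score(diff_text: str, patterns: list[str]) -> float:
+     if not patterns:
+         return 0.5  # no pattern list for this category
+     text = diff_text.lower()
+     matches = sum(1 for p in patterns if p.lower() in text)
+     return min(1.0, matches / max(1, len(patterns) * 0.4))
+ ```
+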
+ ### 6.2 `apply_score` (`_check_diff_applies`)
+
+ ```text
+ if diff does not contain both '---' and '+++': return 0.0
+ if sandbox_root missing or not existing: return 0.3
+ else run: patch --dry-run -p1 -i <temp_patch>
+ return 1.0 if patch exit code == 0
+ return 0.0 otherwise
+ on exception: return 0.3
+ ```
+
+ ### 6.3 `judge_score` (`_llm_judge`)
+
+ LLM judge behavior:
+ - If no API key is available -> `judge_score = 0.5`
+ - Else sends a judge prompt asking for JSON `{"score": 0..10, "reason": ...}`
+ - Parses the integer score, clamps it to `[0,10]`, then scales it to `[0,1]`:
+
+ ```text
+ judge_score = clamp(int_score, 0, 10) / 10
+ ```
+
+ - On any judge exception / parse failure -> `judge_score = 0.5`
+
+ API/model resolution in the judge:
+ - API key preference: `API_KEY` -> `OPENROUTER_API_KEY` -> `OPENAI_API_KEY`
+ - Base URL:
+   - OpenRouter inferred -> `https://openrouter.ai/api/v1`
+   - else -> `https://api.openai.com/v1`
+ - Model default:
+   - OpenRouter base URL -> `qwen/qwen3.6-plus:free`
+   - else -> `gpt-4o-mini`
+
+ ## 7) Worked examples
+
+ ### Example A: Task 1, correct classification early
+
+ - `cumulative_progress = 0.05`
+ - `terminal_score = 1.0`
+ - `late_penalty = 0.0`
+ - `wrong_dir_penalty = 0.0`
+
+ ```text
+ reward = clamp(0.05 + 1.0 - 0 - 0, 0, 1) = 1.0
+ ```
+
+ ### Example B: Task 2, wrong category but some exploration
+
+ - `cumulative_progress = 0.05`
+ - `terminal_score = 0.0` (no similarity match)
+ - penalties = `0`
+
+ ```text
+ reward = clamp(0.05 + 0.0, 0, 1) = 0.05
+ ```
+
+ ### Example C: Task 3 with a weak fix and no API key
+
+ - `judge_score = 0.5` fallback
+ - `apply_score` and `pattern_score` depend on the diff contents
+ - the final weighted sum is then clamped and rounded to 4 decimals.
+
+ ## 8) Important implementation notes
+
+ - `cumulative_progress` is capped at `0.30` and never goes below `0.0`.
+ - The terminal reward can be reduced by the late penalty after step 15.
+ - Timeout does not invoke a grader; it only ends the episode.
+ - Dataset construction choices (especially `label` and category quality) heavily influence observed score behavior.
README.md CHANGED
@@ -1,10 +1,98 @@
  ---
- title: FlakyTestSleuthOpenEnvRL
- emoji: 😻
- colorFrom: pink
- colorTo: red
+ title: FlakySleuth Environment Server
+ emoji: "🔍"
+ colorFrom: blue
+ colorTo: indigo
  sdk: docker
  pinned: false
+ app_port: 8000
+ base_path: /web
+ tags:
+ - openenv
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # FlakySleuth Environment
+
+ OpenEnv-compatible RL environment for flaky-test investigation in real Python repos.
+
+ ## Setup
+
+ ```bash
+ python3 -m venv .venv
+ source .venv/bin/activate
+ pip install -r requirements.txt
+ ```
+
+ ## Build Dataset
+
+ Input: raw IDoFT CSV (e.g. `py-data.csv`)
+ Output: processed task CSV (`dataset/py_tasks.csv`)
+
+ ```bash
+ python dataset/build_dataset.py --input py-data.csv --output dataset/py_tasks.csv
+ ```
+
+ ### `dataset/build_dataset.py` flags
+
+ | Flag | Type | Default | Description |
+ |---|---|---|---|
+ | `--input` | `str` | `idoft/py-data.csv` | Path to raw IDoFT CSV |
+ | `--output` | `str` | `dataset/py_tasks.csv` | Output processed task CSV |
+ | `--validate-only` | bool | `False` | Validate schema + print summary only (no clone/fetch) |
+ | `--limit` | `int` | `None` | Process first N rows only |
+
+ Notes:
+ - Uses live GitHub fetch at exact SHAs.
+ - An optional `GITHUB_TOKEN` improves PR diff fetching/rate limits.
+
+ ## Run Server
+
+ ```bash
+ python -m server.app
+ ```
+
+ Quick check:
+ ```bash
+ curl -s http://localhost:8000/health
+ ```
+
+ ## Run Inference
+
+ Recommended (OpenRouter):
+
+ ```bash
+ export OPENROUTER_API_KEY=your_openrouter_api_key
+ export API_BASE_URL=https://openrouter.ai/api/v1
+ export MODEL_NAME=qwen/qwen3.6-plus:free
+
+ python inference.py --dataset-path dataset/py_tasks.csv --episodes-per-task 5
+ ```
+
+ ### `inference.py` flags
+
+ | Flag | Type | Default | Description |
+ |---|---|---|---|
+ | `--dataset-path` | `str` | `dataset/py_tasks.csv` | Processed task CSV used by env |
+ | `--episodes-per-task` | `int` | `5` | Episodes per selected task type |
+ | `--task-types` | `str` | `classify,root_cause,fix_proposal` | Comma-separated task types |
+ | `--no-progress` | bool | `False` | Disable progress bars |
+ | `--trace-agent` | bool | `False` | Print model output, action/tool call, and step results |
+ | `--trace-prompts` | bool | `False` | Also print prompts sent to the model |
+ | `--trace-max-chars` | `int` | `2500` | Max chars per traced block |
+
+ Trace to log:
+ ```bash
+ python inference.py \
+   --dataset-path dataset/py_tasks.csv \
+   --episodes-per-task 5 \
+   --task-types classify,root_cause \
+   --trace-agent --trace-prompts > agent_trace.log 2>&1
+ ```
+
+ ## OpenEnv CLI
+
+ ```bash
+ openenv/bin/openenv validate --json
+ openenv/bin/openenv build
+ openenv/bin/openenv push
+ ```
__init__.py ADDED
@@ -0,0 +1,11 @@
+ from client import FlakySleuthClient
+ from env.environment import FlakySleuthEnv
+ from env.models import FlakySleuthAction, FlakySleuthObservation, FlakySleuthReward
+
+ __all__ = [
+     "FlakySleuthClient",
+     "FlakySleuthEnv",
+     "FlakySleuthAction",
+     "FlakySleuthObservation",
+     "FlakySleuthReward",
+ ]
client.py ADDED
@@ -0,0 +1,34 @@
+ from __future__ import annotations
+
+ from dataclasses import dataclass
+ from typing import Any
+
+ import requests
+
+ from env.models import FlakySleuthAction
+
+
+ @dataclass
+ class FlakySleuthClient:
+     base_url: str
+     timeout_s: float = 30.0
+
+     def reset(self) -> dict[str, Any]:
+         response = requests.post(f"{self.base_url.rstrip('/')}/reset", timeout=self.timeout_s)
+         response.raise_for_status()
+         return response.json()
+
+     def step(self, action: FlakySleuthAction) -> dict[str, Any]:
+         payload = {"action": action.model_dump()}
+         response = requests.post(
+             f"{self.base_url.rstrip('/')}/step",
+             json=payload,
+             timeout=self.timeout_s,
+         )
+         response.raise_for_status()
+         return response.json()
+
+     def state(self) -> dict[str, Any]:
+         response = requests.get(f"{self.base_url.rstrip('/')}/state", timeout=self.timeout_s)
+         response.raise_for_status()
+         return response.json()
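+
+ # Usage sketch (assumes a FlakySleuth server running locally on port 8000):
+ #   client = FlakySleuthClient(base_url="http://localhost:8000")
+ #   obs = client.reset()
+ #   out = client.step(FlakySleuthAction(action_type="read_file", argument="tests/test_flaky.py"))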
dataset/build_dataset.py ADDED
@@ -0,0 +1,461 @@
+ """Offline dataset builder for FlakySleuth.
+
+ Examples:
+     # Validate schema and show category/status summary only
+     python dataset/build_dataset.py --input py-data.csv --validate-only
+
+     # Build full task CSV (requires network access for repo cloning)
+     export GITHUB_TOKEN=...
+     python dataset/build_dataset.py --input py-data.csv --output dataset/py_tasks.csv
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import csv
+ import os
+ import subprocess
+ import tempfile
+ from pathlib import Path
+ from urllib.parse import urlparse
+
+ import pandas as pd
+ import requests
+
+ try:
+     from tqdm import tqdm
+ except Exception:  # pragma: no cover
+     tqdm = None
+
+ TASK12_CATEGORIES = ["NOD", "TD", "TZD", "NIO", "ID", "OD", "OD-Brit", "OD-Vic"]
+ TASK3_CATEGORIES = ["TD", "TZD", "NOD", "NIO", "ID"]
+
+ PROJECT_URL_COL = "Project URL"
+ SHA_COL = "SHA Detected"
+ CATEGORY_COL = "Category"
+ STATUS_COL = "Status"
+ PR_LINK_COL = "PR Link"
+ NOTES_COL = "Notes"
+ TEST_NAME_ALIASES = [
+     "Pytest Test Name",
+     "Pytest Test Name (PathToFile::TestClass::TestMethod or PathToFile::TestMethod)",
+ ]
+ OUTPUT_COLUMNS = [
+     "repo_url",
+     "sha",
+     "test_name",
+     "test_file",
+     "category",
+     "label",
+     "status",
+     "pr_link",
+     "task_types",
+     "test_code",
+     "known_fix_diff",
+ ]
+
+
+ def _normalize_header(text: str) -> str:
+     return " ".join(str(text).strip().split())
+
+
+ def _resolve_test_name_column(columns: list[str]) -> str:
+     normalized = {_normalize_header(c): c for c in columns}
+     for alias in TEST_NAME_ALIASES:
+         key = _normalize_header(alias)
+         if key in normalized:
+             return normalized[key]
+     raise KeyError(
+         "Could not find pytest test-name column. Expected one of: "
+         + ", ".join(TEST_NAME_ALIASES)
+     )
+
+
+ def _parse_pr_link(pr_link: str) -> tuple[str, str] | None:
+     """Return (owner/repo, number) from URL or owner/repo#number."""
+     value = (pr_link or "").strip()
+     if not value or value.lower() == "nan":
+         return None
+
+     if value.startswith("http://") or value.startswith("https://"):
+         parsed = urlparse(value)
+         parts = [p for p in parsed.path.split("/") if p]
+         # Expected: /owner/repo/pull/number
+         if len(parts) >= 4 and parts[2] == "pull" and parts[3].isdigit():
+             return f"{parts[0]}/{parts[1]}", parts[3]
+         return None
+
+     if "#" in value:
+         repo, number = value.split("#", 1)
+         if repo.strip() and number.strip().isdigit():
+             return repo.strip(), number.strip()
+     return None
+
+
+ def _is_accepted_status(status: str) -> bool:
+     value = (status or "").strip().lower()
+     return value in {"accepted", "merged", "fixed"}
+
+
+ def _non_interactive_git_env() -> dict[str, str]:
+     env = os.environ.copy()
+     # Never block on credential prompts while iterating large public datasets.
+     env["GIT_TERMINAL_PROMPT"] = "0"
+     env["GCM_INTERACTIVE"] = "Never"
+     return env
+
+
+ def _has_value(value: str) -> bool:
+     text = str(value or "").strip().lower()
+     return text not in {"", "nan", "none"}
+
+
+ def _is_non_unmaintained_status(status: str) -> bool:
+     value = str(status or "").strip().lower()
+     return value not in {"", "nan", "none", "unmaintained"}
+
+
+ def _row_preference_rank(row_out: dict[str, str]) -> tuple[int, int, int]:
+     task_tokens = {t.strip() for t in str(row_out.get("task_types", "")).split(";") if t.strip()}
+     return (
+         1 if "fix_proposal" in task_tokens else 0,
+         1 if _has_value(str(row_out.get("pr_link", ""))) else 0,
+         1 if _is_non_unmaintained_status(str(row_out.get("status", ""))) else 0,
+     )
+
+
+ def fetch_test_code(repo_url: str, sha: str, pytest_test_name: str) -> tuple[str, str, str]:
+     test_file = pytest_test_name.split("::")[0]
+     git_env = _non_interactive_git_env()
+     with tempfile.TemporaryDirectory() as tmpdir:
+         try:
+             init = subprocess.run(
+                 ["git", "init", tmpdir],
+                 capture_output=True,
+                 text=True,
+                 check=False,
+                 timeout=20,
+                 env=git_env,
+                 stdin=subprocess.DEVNULL,
+             )
+             if init.returncode != 0:
+                 return "", "git_init_failed", (init.stderr or init.stdout or "").strip()[:200]
+
+             remote = subprocess.run(
+                 ["git", "-C", tmpdir, "remote", "add", "origin", repo_url],
+                 capture_output=True,
+                 text=True,
+                 check=False,
+                 timeout=10,
+                 env=git_env,
+                 stdin=subprocess.DEVNULL,
+             )
+             if remote.returncode != 0:
+                 return "", "git_remote_add_failed", (remote.stderr or remote.stdout or "").strip()[:200]
+
+             # Fetch only the requested commit for speed and correctness.
+             fetch = subprocess.run(
+                 ["git", "-C", tmpdir, "fetch", "--depth=1", "origin", sha],
+                 capture_output=True,
+                 text=True,
+                 check=False,
+                 timeout=90,
+                 env=git_env,
+                 stdin=subprocess.DEVNULL,
+             )
+             if fetch.returncode != 0:
+                 return "", "git_fetch_sha_failed", (fetch.stderr or fetch.stdout or "").strip()[:200]
+
+             checkout = subprocess.run(
+                 ["git", "-C", tmpdir, "checkout", "--detach", "FETCH_HEAD"],
+                 capture_output=True,
+                 text=True,
+                 check=False,
+                 timeout=30,
+                 env=git_env,
+                 stdin=subprocess.DEVNULL,
+             )
+             if checkout.returncode != 0:
+                 return "", "git_checkout_failed", (checkout.stderr or checkout.stdout or "").strip()[:200]
+         except subprocess.TimeoutExpired:
+             return "", "git_timeout", "timeout"
+
+         file_path = Path(tmpdir) / test_file
+         if not file_path.exists():
+             return "", "test_file_missing_at_sha", test_file
+         return file_path.read_text(encoding="utf-8", errors="replace")[:10000], "", ""
+
+
+ def fetch_pr_diff(pr_link: str, github_token: str) -> str:
+     parsed = _parse_pr_link(pr_link)
+     if not parsed:
+         return ""
+
+     repo, number = parsed
+     url = f"https://api.github.com/repos/{repo}/pulls/{number}"
+     headers = {
+         "Authorization": f"token {github_token}",
+         "Accept": "application/vnd.github.diff",
+     }
+     response = requests.get(url, headers=headers, timeout=15)
+     if response.status_code == 200:
+         return response.text[:3000]
+     return ""
+
+
+ def _validate_schema(input_csv: str) -> tuple[pd.DataFrame, str]:
+     df = pd.read_csv(input_csv)
+     df.columns = [_normalize_header(col) for col in df.columns]
+
+     missing = []
+     for required in [PROJECT_URL_COL, SHA_COL, CATEGORY_COL, STATUS_COL, PR_LINK_COL]:
+         if required not in df.columns:
+             missing.append(required)
+     if missing:
+         raise KeyError(f"Missing required columns: {missing}")
+
+     test_name_col = _resolve_test_name_column(list(df.columns))
+     return df, test_name_col
+
+
+ def _print_input_summary(df: pd.DataFrame, test_name_col: str) -> None:
+     print("Input schema check: OK")
+     print(f"Rows: {len(df)}")
+     print(f"Using test-name column: {test_name_col}")
+     print("Columns:", list(df.columns))
+     print("\nCategory distribution (top 20):")
+     print(df[CATEGORY_COL].fillna("").astype(str).value_counts().head(20))
+     print("\nStatus distribution:")
+     print(df[STATUS_COL].fillna("").astype(str).value_counts().head(20))
+
+
+ def build(
+     input_csv: str,
+     output_csv: str,
+     github_token: str,
+     *,
+     validate_only: bool = False,
+     limit: int | None = None,
+ ) -> None:
+     df, test_name_col = _validate_schema(input_csv)
+     _print_input_summary(df, test_name_col)
+     if validate_only:
+         return
+
+     total_rows = min(len(df), limit) if limit is not None else len(df)
+     print(
+         f"\nStarting build over {total_rows} rows "
+         f"(this can take a while: cloning repos + reading files + optional PR diff fetch)"
+     )
+
+     stats: dict[str, int] = {
+         "kept": 0,
+         "kept_unique": 0,
+         "skipped_missing_core_fields": 0,
+         "skipped_ud": 0,
+         "skipped_no_task_types": 0,
+         "skipped_test_code_fetch_failed": 0,
+         "skipped_test_code_fetch_git_fail": 0,
+         "skipped_test_code_fetch_file_missing": 0,
+         "fix_diff_fetched": 0,
+         "duplicate_key_rows_seen": 0,
+         "duplicate_key_replaced": 0,
+         "duplicate_key_kept_existing": 0,
+     }
+     fetch_fail_examples: list[dict[str, str]] = []
+     canonical_rows: dict[tuple[str, str, str], dict[str, str]] = {}
+     output_path = Path(output_csv)
+     output_path.parent.mkdir(parents=True, exist_ok=True)
+     iterator = df.iterrows()
+     if tqdm is not None:
+         iterator = tqdm(iterator, total=total_rows, desc="Building tasks", unit="row")
+
+     with output_path.open("w", encoding="utf-8", newline="") as out_fp:
+         writer = csv.DictWriter(out_fp, fieldnames=OUTPUT_COLUMNS, extrasaction="ignore")
+         writer.writeheader()
+         out_fp.flush()
+
+         processed = 0
+         for idx, (_, row) in enumerate(iterator, start=1):
+             if idx > total_rows:
+                 break
+             processed = idx
+
+             repo_url = str(row.get(PROJECT_URL_COL, "")).strip()
+             sha = str(row.get(SHA_COL, "")).strip()
+             test_name = str(row.get(test_name_col, "")).strip()
+             category_raw = str(row.get(CATEGORY_COL, "")).strip()
+             status = str(row.get(STATUS_COL, "")).strip()
+             pr_link = str(row.get(PR_LINK_COL, "")).strip()
+
+             if not repo_url or not sha or not test_name or not category_raw:
+                 stats["skipped_missing_core_fields"] += 1
+                 _update_progress(iterator, tqdm, stats)
+                 continue
+
+             category = category_raw.split(";")[0].strip()
+             if category == "UD":
+                 stats["skipped_ud"] += 1
+                 _update_progress(iterator, tqdm, stats)
+                 continue
+
+             task_types: list[str] = []
+             if category in TASK12_CATEGORIES:
+                 task_types.extend(["classify", "root_cause"])
+             if category in TASK3_CATEGORIES and _is_accepted_status(status) and _parse_pr_link(pr_link):
+                 task_types.append("fix_proposal")
+
+             if not task_types:
+                 stats["skipped_no_task_types"] += 1
+                 _update_progress(iterator, tqdm, stats)
+                 continue
+
+             test_code, fetch_reason, fetch_detail = fetch_test_code(repo_url, sha, test_name)
+             if not test_code:
+                 stats["skipped_test_code_fetch_failed"] += 1
+                 if fetch_reason in {
+                     "git_init_failed",
+                     "git_remote_add_failed",
+                     "git_fetch_sha_failed",
+                     "git_checkout_failed",
+                     "git_timeout",
+                 }:
+                     stats["skipped_test_code_fetch_git_fail"] += 1
+                 if fetch_reason == "test_file_missing_at_sha":
+                     stats["skipped_test_code_fetch_file_missing"] += 1
+                 if len(fetch_fail_examples) < 10:
+                     fetch_fail_examples.append(
+                         {
+                             "repo_url": repo_url,
+                             "sha": sha,
+                             "test_name": test_name,
+                             "reason": fetch_reason,
+                             "detail": fetch_detail,
+                         }
+                     )
+                 _update_progress(iterator, tqdm, stats)
+                 continue
+
+             known_fix_diff = ""
+             if "fix_proposal" in task_types and github_token:
+                 known_fix_diff = fetch_pr_diff(pr_link, github_token)
+                 if known_fix_diff:
+                     stats["fix_diff_fetched"] += 1
+
+             row_out = {
+                 "repo_url": repo_url,
+                 "sha": sha,
+                 "test_name": test_name,
+                 "test_file": test_name.split("::")[0],
+                 "category": category,
+                 "label": "flaky",
+                 "status": status,
+                 "pr_link": pr_link,
+                 "task_types": ";".join(task_types),
+                 "test_code": test_code,
+                 "known_fix_diff": known_fix_diff,
+             }
+             writer.writerow(row_out)
+             out_fp.flush()
+             stats["kept"] += 1
+
+             row_key = (
+                 row_out["repo_url"],
+                 row_out["sha"],
+                 row_out["test_name"],
+             )
+             if row_key not in canonical_rows:
+                 canonical_rows[row_key] = row_out
+             else:
+                 stats["duplicate_key_rows_seen"] += 1
+                 current = canonical_rows[row_key]
+                 if _row_preference_rank(row_out) > _row_preference_rank(current):
+                     canonical_rows[row_key] = row_out
+                     stats["duplicate_key_replaced"] += 1
+                 else:
+                     stats["duplicate_key_kept_existing"] += 1
+             _update_progress(iterator, tqdm, stats, processed, total_rows)
+
+     out = pd.DataFrame(list(canonical_rows.values()), columns=OUTPUT_COLUMNS)
+     stats["kept_unique"] = len(out)
+     out.to_csv(output_csv, index=False)
+
+     if tqdm is None:
+         print()
+
+     print("\nBuild summary:")
+     for key, value in stats.items():
+         print(f"  {key}: {value}")
+     print(f"Built {len(out)} task rows -> {output_csv}")
+     if fetch_fail_examples:
+         print("\nSample fetch failures (first 10):")
+         for i, sample in enumerate(fetch_fail_examples, start=1):
+             print(
+                 f"  {i}. reason={sample['reason']} "
+                 f"repo={sample['repo_url']} sha={sample['sha']} "
+                 f"test={sample['test_name']} detail={sample['detail']}"
+             )
+     if len(out):
+         print(out["category"].value_counts())
+         print(out["task_types"].value_counts())
+
+
+ def _update_progress(
+     iterator,
+     tqdm_mod,
+     stats: dict[str, int],
+     processed: int | None = None,
+     total_rows: int | None = None,
+ ) -> None:
+     if tqdm_mod is not None and hasattr(iterator, "set_postfix"):
+         iterator.set_postfix(
+             kept=stats["kept"],
+             miss=stats["skipped_missing_core_fields"],
+             ud=stats["skipped_ud"],
+             no_task=stats["skipped_no_task_types"],
+             fetch_fail=stats["skipped_test_code_fetch_failed"],
+         )
+         return
+
+     if processed is None or total_rows is None:
+         return
+     if processed == 1 or processed % 20 == 0 or processed == total_rows:
+         print(
+             f"\r[{processed}/{total_rows}] "
+             f"kept={stats['kept']} "
+             f"fetch_fail={stats['skipped_test_code_fetch_failed']} "
+             f"no_task={stats['skipped_no_task_types']}",
+             end="",
+             flush=True,
+         )
+
+
+ def main() -> None:
+     parser = argparse.ArgumentParser(description="Build FlakySleuth task dataset")
+     parser.add_argument("--input", default="idoft/py-data.csv", help="Path to IDoFT py-data.csv")
+     parser.add_argument("--output", default="dataset/py_tasks.csv", help="Output CSV path")
+     parser.add_argument(
+         "--validate-only",
+         action="store_true",
+         help="Validate input schema and print summary, without cloning/fetching.",
+     )
+     parser.add_argument(
+         "--limit",
+         type=int,
+         default=None,
+         help="Optional max input rows to process (useful for quick sanity checks).",
+     )
+     args = parser.parse_args()
+
+     github_token = os.environ.get("GITHUB_TOKEN", "")
+     build(
+         args.input,
+         args.output,
+         github_token,
+         validate_only=args.validate_only,
+         limit=args.limit,
+     )
+
+
+ if __name__ == "__main__":
+     main()
dataset/category_similarity.json ADDED
@@ -0,0 +1,17 @@
+ {
+     "OD,OD-Brit": 0.7,
+     "OD,OD-Vic": 0.7,
+     "OD-Brit,OD-Vic": 0.8,
+     "OD,NIO": 0.4,
+     "OD,NDOI": 0.3,
+     "NOD,TD": 0.6,
+     "NOD,TZD": 0.5,
+     "NOD,NDOI": 0.5,
+     "TD,TZD": 0.7,
+     "NOD,ID": 0.3,
+     "UD,OD": 0.2,
+     "UD,NOD": 0.2,
+     "UD,NIO": 0.2,
+     "UD,TD": 0.2,
+     "UD,ID": 0.2
+ }
dataset/fixtures/toy_project/src/math_utils.py ADDED
@@ -0,0 +1,6 @@
+ import random
+
+
+ def unstable_sum(values):
+     # Deliberately flaky fixture: the result depends on the shuffle order.
+     random.shuffle(values)
+     return values[0] + values[1]
dataset/fixtures/toy_project/tests/test_flaky.py ADDED
@@ -0,0 +1,7 @@
+ from src.math_utils import unstable_sum
+
+
+ def test_randomized_total():
+     # Passes only when the shuffle leaves 1 and 2 in the first two slots.
+     values = [1, 2, 3]
+     total = unstable_sum(values)
+     assert total == 3
env/__init__.py ADDED
@@ -0,0 +1,4 @@
+ from env.environment import FlakySleuthEnv
+ from env.models import FlakySleuthAction, FlakySleuthObservation, FlakySleuthReward
+
+ __all__ = ["FlakySleuthEnv", "FlakySleuthAction", "FlakySleuthObservation", "FlakySleuthReward"]
env/environment.py ADDED
@@ -0,0 +1,216 @@
+ from __future__ import annotations
+
+ from typing import Any
+
+ from env.models import FlakySleuthAction, FlakySleuthObservation
+ from env.sandbox import Sandbox
+ from env.task_loader import TaskLoader
+ from graders import grade_action
+
+ FLAKY_SIGNAL_PATTERNS = [
+     "sleep",
+     "random",
+     "time",
+     "datetime",
+     "thread",
+     "asyncio",
+     "fixture",
+     "setup",
+     "teardown",
+     "global",
+     "shared",
+     "singleton",
+     "os.environ",
+     "socket",
+     "timeout",
+     "retry",
+     "mock",
+     "patch",
+ ]
+
+ TERMINAL_ACTIONS = ("classify_flakiness", "classify_root_cause", "propose_fix")
+
+
+ class FlakySleuthEnv:
+     def __init__(self, dataset_path: str = "dataset/py_tasks.csv", max_steps: int = 20):
+         self.loader = TaskLoader(dataset_path)
+         self.sandbox: Sandbox | None = None
+         self.current_task: dict[str, Any] | None = None
+         self.step_count = 0
+         self.max_steps = max_steps
+         self.cumulative_progress = 0.0
+         self.files_read: set[str] = set()
+         self.episode_actions: list[FlakySleuthAction] = []
+
+     def reset(self) -> FlakySleuthObservation:
+         if self.sandbox:
+             self.sandbox.cleanup()
+
+         self.current_task = self.loader.sample()
+         self.current_task.setdefault("label", "flaky")
+
+         self.sandbox = Sandbox(self.current_task)
+         self.sandbox.setup()
+
+         self.current_task["sandbox_root"] = self.sandbox.tmpdir or ""
+         test_file = self.current_task.get("test_file", "")
+         if test_file and self.sandbox.tmpdir:
+             self.current_task["sandbox_test_path"] = f"{self.sandbox.tmpdir}/{test_file}"
+
+         self.step_count = 0
+         self.cumulative_progress = 0.0
+         self.files_read = set()
+         self.episode_actions = []
+
+         return self._make_obs()
+
+     def step(self, action: FlakySleuthAction):
+         if not self.current_task or not self.sandbox:
+             raise RuntimeError("Environment is not initialized. Call reset() first.")
+
+         self.step_count += 1
+         self.episode_actions.append(action)
+
+         tool_output: str | None = None
+         reward = 0.0
+         done = False
+         info: dict[str, Any] = {}
+
+         if action.action_type in TERMINAL_ACTIONS:
+             terminal_score = grade_action(action, self.current_task)
+             late_penalty = max(0, self.step_count - 15) * 0.05
+
+             wrong_dir_penalty = 0.0
+             if (
+                 action.action_type == "classify_flakiness"
+                 and action.argument.strip().lower() == "stable"
+                 and str(self.current_task.get("label", "flaky")).lower() == "flaky"
+             ):
+                 wrong_dir_penalty = 0.2
+
+             reward = min(
+                 1.0,
+                 max(
+                     0.0,
+                     self.cumulative_progress + terminal_score - late_penalty - wrong_dir_penalty,
+                 ),
+             )
+             done = True
+             info = {
+                 "terminal_score": terminal_score,
+                 "progress_score": self.cumulative_progress,
+                 "late_penalty": late_penalty,
+                 "task_type": self.current_task.get("task_type"),
+                 "category": self.current_task.get("category"),
+             }
+         else:
+             tool_output, progress = self._execute_exploration(action)
+             self.cumulative_progress = min(0.30, max(0.0, self.cumulative_progress + progress))
+             reward = progress
+
+         if not done and self.step_count >= self.max_steps:
+             done = True
+             info = {
+                 "terminal_score": 0.0,
+                 "progress_score": self.cumulative_progress,
+                 "late_penalty": max(0, self.step_count - 15) * 0.05,
+                 "timeout": True,
+                 "task_type": self.current_task.get("task_type"),
+                 "category": self.current_task.get("category"),
+             }
+
+         obs = self._make_obs(tool_output)
+         return obs, reward, done, info
+
+     def state(self) -> dict[str, Any]:
+         if not self.current_task:
+             return {
+                 "repo_url": None,
+                 "test_name": None,
+                 "task_type": None,
+                 "step_count": self.step_count,
+                 "files_read": [],
+                 "cumulative_progress": self.cumulative_progress,
+             }
+
+         return {
+             "repo_url": self.current_task.get("repo_url"),
+             "test_name": self.current_task.get("test_name"),
+             "task_type": self.current_task.get("task_type"),
+             "step_count": self.step_count,
+             "files_read": sorted(self.files_read),
+             "cumulative_progress": self.cumulative_progress,
+         }
+
+     def close(self) -> None:
+         if self.sandbox:
+             self.sandbox.cleanup()
+             self.sandbox = None
+
+     def _execute_exploration(self, action: FlakySleuthAction) -> tuple[str, float]:
+         assert self.current_task is not None
+         assert self.sandbox is not None
+
+         progress = 0.0
+         output = ""
+
+         if action.action_type == "read_file":
+             content = self.sandbox.read_file(action.argument)
+             if content is None:
+                 output = f"ERROR: File not found: {action.argument}"
+                 progress = -0.05
+             elif action.argument in self.files_read:
+                 output = content
+                 progress = 0.0
+             else:
+                 self.files_read.add(action.argument)
+                 output = content
+                 progress = self._file_relevance_reward(action.argument)
+
+         elif action.action_type == "search_code":
+             output = self.sandbox.grep(action.argument)
+             progress = self._search_relevance_reward(action.argument)
+
+         elif action.action_type == "run_test":
+             output = self.sandbox.run_test(self.current_task.get("test_name", ""))
+             category = str(self.current_task.get("category", "")).strip()
+             if category not in ("OD", "OD-Brit", "OD-Vic"):
+                 progress = 0.05
+         else:
+             output = f"ERROR: Unsupported action_type {action.action_type}"
+             progress = -0.05
+
+         return output, progress
+
+     def _file_relevance_reward(self, filepath: str) -> float:
+         assert self.current_task is not None
+
+         test_file = str(self.current_task.get("test_file", ""))
+         if test_file and test_file in filepath:
+             return 0.07
+         if filepath.endswith(".py"):
+             return 0.03
+         return 0.01
+
+     def _search_relevance_reward(self, pattern: str) -> float:
+         pattern_lower = pattern.lower()
+         if any(signal in pattern_lower for signal in FLAKY_SIGNAL_PATTERNS):
+             return 0.04
+         return 0.01
+
+     def _make_obs(self, tool_output: str | None = None) -> FlakySleuthObservation:
+         if not self.current_task:
+             raise RuntimeError("No current task available")
+
+         return FlakySleuthObservation(
+             repo_url=str(self.current_task.get("repo_url", "")),
+             test_name=str(self.current_task.get("test_name", "")),
+             test_code=str(self.current_task.get("test_code", ""))[:2000],
+             file_tree=self.sandbox.file_tree if self.sandbox else [],
+             tool_output=tool_output,
+             task_type=str(self.current_task.get("task_type", "classify")),
+             task_description=str(self.current_task.get("task_description", "Investigate the flaky test.")),
+             step_count=self.step_count,
+             done=False,
+             reward=None,
+         )
env/models.py ADDED
@@ -0,0 +1,42 @@
+ from __future__ import annotations
+
+ from typing import Any, Literal
+
+ from pydantic import BaseModel, Field
+
+ try:
+     from openenv.core.env_server.types import Action, Observation
+ except Exception:  # pragma: no cover
+     Action = BaseModel  # type: ignore[misc,assignment]
+     Observation = BaseModel  # type: ignore[misc,assignment]
+
+ TaskType = Literal["classify", "root_cause", "fix_proposal"]
+
+
+ class FlakySleuthObservation(Observation):
+     repo_url: str = Field(..., description="Repository URL or fixture reference")
+     test_name: str = Field(..., description="Pytest test identifier")
+     test_code: str = Field(..., description="Test source snippet")
+     file_tree: list[str] = Field(default_factory=list, description="Top-level file tree")
+     tool_output: str | None = Field(default=None, description="Result of the previous exploratory action")
+     task_type: TaskType = Field(..., description="Current task type")
+     task_description: str = Field(..., description="Instruction for the agent")
+     step_count: int = Field(default=0, description="Current episode step count")
+
+
+ class FlakySleuthAction(Action):
+     action_type: Literal[
+         "read_file",
+         "search_code",
+         "run_test",
+         "classify_flakiness",
+         "classify_root_cause",
+         "propose_fix",
+     ] = Field(..., description="Action to execute")
+     argument: str = Field(default="", description="Action argument")
+
+
+ class FlakySleuthReward(BaseModel):
+     score: float
+     breakdown: dict[str, Any]
+     explanation: str
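+
+ # Example (hypothetical values): a terminal classification action
+ #   FlakySleuthAction(action_type="classify_flakiness", argument="flaky")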
env/sandbox.py ADDED
@@ -0,0 +1,241 @@
+ from __future__ import annotations
+
+ import os
+ import shutil
+ import subprocess
+ import tempfile
+ from pathlib import Path
+
+
+ class Sandbox:
+     def __init__(self, task: dict):
+         self.task = task
+         self.tmpdir: str | None = None
+         self.file_tree: list[str] = []
+
+     def setup(self) -> None:
+         """Prepare a working copy of the repository for the episode."""
+         self.tmpdir = tempfile.mkdtemp(prefix="flakysleuth_")
+         repo_url = str(self.task.get("repo_url", "")).strip()
+         sha = str(self.task.get("sha", "")).strip()
+
+         try:
+             if repo_url.startswith("fixture://"):
+                 self._copy_fixture_repo(repo_url)
+             else:
+                 self._clone_repo(repo_url, sha)
+
+             self.file_tree = self._build_file_tree()
+         except Exception as exc:
+             self.cleanup()
+             raise RuntimeError(f"Sandbox setup failed: {exc}") from exc
+
+     def read_file(self, relative_path: str) -> str | None:
+         """Read a file relative to sandbox root. Returns None when not found/unsafe."""
+         if not self.tmpdir:
+             return None
+
+         root = os.path.abspath(self.tmpdir)
+         full_path = os.path.abspath(os.path.join(root, relative_path))
+
+         # Path traversal guard.
+         if os.path.commonpath([root, full_path]) != root:
+             return None
+         if not os.path.isfile(full_path):
+             return None
+
+         try:
+             with open(full_path, "r", encoding="utf-8", errors="replace") as handle:
+                 return handle.read()[:4000]
+         except Exception:
+             return None
+
+     def grep(self, pattern: str) -> str:
+         """Search .py files in repo, preferring ripgrep and falling back to grep."""
+         if not self.tmpdir:
+             return "ERROR: Sandbox not initialized"
+
+         rg_cmd = ["rg", "-n", "--glob", "*.py", pattern, "."]
+         grep_cmd = ["grep", "-RIn", "--include=*.py", pattern, "."]
+
+         try:
+             result = subprocess.run(
+                 rg_cmd,
+                 cwd=self.tmpdir,
+                 capture_output=True,
+                 text=True,
+                 timeout=10,
+             )
+         except FileNotFoundError:
+             # ripgrep not installed in runtime; fall back to POSIX grep.
+             try:
+                 result = subprocess.run(
+                     grep_cmd,
+                     cwd=self.tmpdir,
+                     capture_output=True,
+                     text=True,
+                     timeout=10,
+                 )
+             except FileNotFoundError:
+                 return (
+                     "Search error: neither 'rg' (ripgrep) nor 'grep' is installed in the "
+                     "runtime."
+                 )
+             except subprocess.TimeoutExpired:
+                 return "Search timed out"
+             except Exception as exc:
+                 return f"Search error: {exc}"
+         except subprocess.TimeoutExpired:
+             return "Search timed out"
+         except Exception as exc:
+             return f"Search error: {exc}"
+
+         try:
+             output = (result.stdout + result.stderr).strip()[:2000]
+             if output:
+                 return output
+             return f"No matches found for: {pattern}"
+         except Exception as exc:
+             return f"Search error: {exc}"
+
+     def run_test(self, pytest_test_name: str) -> str:
+         """Run a test for non-order-dependent categories."""
+         if not self.tmpdir:
+             return "ERROR: Sandbox not initialized"
+
+         category = str(self.task.get("category", "")).strip()
+         if category in ("OD", "OD-Brit", "OD-Vic"):
+             return (
+                 "Test execution skipped for order-dependent tests. "
+                 "Use read_file and search_code for static analysis. "
+                 "Look for shared state, missing cleanup, or global mutations."
+             )
+
+         try:
+             result = subprocess.run(
+                 [
+                     "python",
+                     "-m",
+                     "pytest",
+                     pytest_test_name,
+                     "--tb=short",
+                     "-x",
+                     "--timeout=30",
+                     "-q",
+                 ],
+                 cwd=self.tmpdir,
+                 capture_output=True,
+                 text=True,
+                 timeout=60,
+             )
+             output = (result.stdout + result.stderr).strip()[:2000]
+             return output or "Test completed with no output"
+         except subprocess.TimeoutExpired:
+             return "Test execution timed out (>60s)"
+         except Exception as exc:
+             return f"Test execution error: {exc}"
+
+     def cleanup(self) -> None:
+         if self.tmpdir and os.path.exists(self.tmpdir):
+             shutil.rmtree(self.tmpdir, ignore_errors=True)
+         self.tmpdir = None
+         self.file_tree = []
+
+     def _clone_repo(self, repo_url: str, sha: str) -> None:
+         if not repo_url:
+             raise ValueError("Missing repo_url")
+         assert self.tmpdir is not None
+
+         sha = (sha or "").strip()
+         # Robust path: fetch the exact commit directly (works even when not in shallow branch history).
+         if sha and sha.lower() != "nan":
152
+ init = subprocess.run(
153
+ ["git", "init", self.tmpdir],
154
+ capture_output=True,
155
+ text=True,
156
+ timeout=20,
157
+ )
158
+ if init.returncode != 0:
159
+ raise RuntimeError(f"git init failed: {init.stderr.strip()}")
160
+
161
+ remote = subprocess.run(
162
+ ["git", "-C", self.tmpdir, "remote", "add", "origin", repo_url],
163
+ capture_output=True,
164
+ text=True,
165
+ timeout=15,
166
+ )
167
+ if remote.returncode != 0:
168
+ raise RuntimeError(f"git remote add failed: {remote.stderr.strip()}")
169
+
170
+ fetch = subprocess.run(
171
+ ["git", "-C", self.tmpdir, "fetch", "--depth=1", "origin", sha],
172
+ capture_output=True,
173
+ text=True,
174
+ timeout=120,
175
+ )
176
+ if fetch.returncode != 0:
177
+ raise RuntimeError(
178
+ "git fetch exact sha failed: "
179
+ + (fetch.stderr.strip() or fetch.stdout.strip())
180
+ )
181
+
182
+ checkout = subprocess.run(
183
+ ["git", "-C", self.tmpdir, "checkout", "--detach", "FETCH_HEAD"],
184
+ capture_output=True,
185
+ text=True,
186
+ timeout=30,
187
+ )
188
+ if checkout.returncode != 0:
189
+ raise RuntimeError(
190
+ "git checkout fetched sha failed: "
191
+ + (checkout.stderr.strip() or checkout.stdout.strip())
192
+ )
193
+ return
194
+
195
+ # Fallback for rows without a SHA.
196
+ clone = subprocess.run(
197
+ ["git", "clone", "--depth=50", repo_url, self.tmpdir],
198
+ capture_output=True,
199
+ text=True,
200
+ timeout=120,
201
+ )
202
+ if clone.returncode != 0:
203
+ raise RuntimeError(
204
+ "git clone failed: " + (clone.stderr.strip() or clone.stdout.strip())
205
+ )
206
+
207
+ def _copy_fixture_repo(self, repo_url: str) -> None:
208
+ fixture_name = repo_url.replace("fixture://", "", 1).strip("/")
209
+ if not fixture_name:
210
+ raise ValueError("Fixture name missing in repo_url")
211
+
212
+ fixture_dir = (
213
+ Path(__file__).resolve().parent.parent
214
+ / "dataset"
215
+ / "fixtures"
216
+ / fixture_name
217
+ )
218
+ if not fixture_dir.exists():
219
+ raise FileNotFoundError(f"Fixture repo not found: {fixture_dir}")
220
+
221
+ assert self.tmpdir is not None
222
+ shutil.copytree(fixture_dir, self.tmpdir, dirs_exist_ok=True)
223
+
224
+ def _build_file_tree(self) -> list[str]:
225
+ assert self.tmpdir is not None
226
+ result: list[str] = []
227
+ for root, dirs, files in os.walk(self.tmpdir):
228
+ dirs[:] = [
229
+ d
230
+ for d in dirs
231
+ if not d.startswith(".")
232
+ and d not in ("node_modules", "__pycache__", ".git", "venv", ".tox")
233
+ ]
234
+ depth = root.replace(self.tmpdir, "").count(os.sep)
235
+ if depth <= 2:
236
+ for file_name in files:
237
+ rel_path = os.path.relpath(os.path.join(root, file_name), self.tmpdir)
238
+ result.append(rel_path)
239
+ if len(result) > 100:
240
+ break
241
+ return result[:100]
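A minimal usage sketch of the sandbox lifecycle; `fixture://demo_repo` and the file path below are hypothetical names, not fixtures shipped in this repo:

```python
# Minimal sketch; "fixture://demo_repo" is an invented fixture name.
from env.sandbox import Sandbox

task = {"repo_url": "fixture://demo_repo", "sha": "", "category": "TD"}
box = Sandbox(task)
try:
    box.setup()                                   # copies dataset/fixtures/demo_repo
    print(box.file_tree[:5])                      # top-level listing for the observation
    print(box.read_file("tests/test_clock.py"))   # None if the path is missing or unsafe
    print(box.grep("datetime"))                   # rg when available, else grep
finally:
    box.cleanup()                                 # always remove the temp working copy
```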
env/task_loader.py ADDED
@@ -0,0 +1,69 @@
+ from __future__ import annotations
+
+ import csv
+ import random
+ from pathlib import Path
+ from typing import Any
+
+
+ class TaskLoader:
+     def __init__(self, csv_path: str):
+         path = Path(csv_path)
+         if not path.exists():
+             raise FileNotFoundError(f"Task CSV not found: {csv_path}")
+
+         self.tasks: list[dict[str, Any]] = []
+         with path.open("r", encoding="utf-8", newline="") as handle:
+             reader = csv.DictReader(handle)
+             for row in reader:
+                 task_types = str(row.get("task_types", "")).split(";")
+                 for raw_type in task_types:
+                     task_type = raw_type.strip()
+                     if not task_type:
+                         continue
+                     entry = dict(row)
+                     entry["task_type"] = task_type
+                     self.tasks.append(entry)
+
+         if not self.tasks:
+             raise ValueError(f"No tasks loaded from {csv_path}")
+
+         self._forced_type: str | None = None
+
+     def sample(self) -> dict[str, Any]:
+         pool = self.tasks
+         if self._forced_type:
+             pool = [task for task in self.tasks if task["task_type"] == self._forced_type]
+         if not pool:
+             raise ValueError(f"No tasks available for task type: {self._forced_type}")
+
+         task = random.choice(pool).copy()
+         task["task_description"] = self._make_description(task)
+         return task
+
+     def force_task_type(self, task_type: str | None) -> None:
+         self._forced_type = task_type
+
+     def _make_description(self, task: dict[str, Any]) -> str:
+         task_type = task["task_type"]
+         if task_type == "classify":
+             return (
+                 "Investigate the given test and determine whether it is FLAKY or STABLE. "
+                 "Use read_file and search_code to gather evidence. "
+                 "When confident, call classify_flakiness with argument 'flaky' or 'stable'."
+             )
+         if task_type == "root_cause":
+             return (
+                 "This test is confirmed flaky. Identify its root cause category. "
+                 "Valid categories: OD, OD-Brit, OD-Vic, NIO, NOD, TD, TZD, ID, NDOI. "
+                 "Use read_file and search_code to find evidence. "
+                 "Call classify_root_cause with the category code when confident."
+             )
+         if task_type == "fix_proposal":
+             return (
+                 f"This test is confirmed flaky with root cause: {task.get('category', 'unknown')}. "
+                 "Propose a concrete fix as a unified diff. "
+                 "Use read_file and search_code to understand the code. "
+                 "Call propose_fix with a valid unified diff string."
+             )
+         return "Investigate the flaky test."
flakysleuth_build_plan.md ADDED
@@ -0,0 +1,1236 @@
+ # FlakySleuth — Comprehensive Round 1 Build Plan
+ ## Meta × PyTorch × Scaler OpenEnv Hackathon
+
+ ---
+
+ ## 0. What You Are Building (One Paragraph for Clarity)
+
+ You are building an **OpenEnv-compliant RL environment** called `FlakySleuthEnv`. It simulates a real software engineering task: investigating flaky tests in real Python GitHub repositories. An LLM agent is dropped into a sandboxed repo at a specific commit, given a test that is known to be flaky (sourced from the IDoFT dataset), and must use tool calls (read files, grep code, run tests) to investigate and produce a verdict. The environment scores the agent's verdict using deterministic graders (Tasks 1 and 2) and a hybrid programmatic + LLM judge grader (Task 3). You are NOT training any model. The submitted artifact is the environment itself — its graders, reward logic, OpenEnv spec compliance, Docker container, and a baseline `inference.py` script that proves it works.
+
+ ---
+
+ ## 1. Repository Structure
+
+ ```
+ flaky-sleuth-env/
+
+ ├── inference.py ← REQUIRED: must be named exactly this, in root
+ ├── openenv.yaml ← REQUIRED: OpenEnv spec metadata
+ ├── Dockerfile ← REQUIRED: must build and run
+ ├── requirements.txt
+ ├── README.md
+
+ ├── server.py ← FastAPI HTTP server (OpenEnv endpoints)
+
+ ├── env/
+ │   ├── __init__.py
+ │   ├── models.py ← All Pydantic models (Observation, Action, Reward)
+ │   ├── environment.py ← FlakySleuthEnv core class
+ │   ├── sandbox.py ← Git clone, file read, grep, run_test
+ │   └── task_loader.py ← Loads tasks from dataset CSV
+
+ ├── graders/
+ │   ├── __init__.py ← grade_action() dispatcher
+ │   ├── task1_grader.py ← Binary flaky/stable
+ │   ├── task2_grader.py ← Root cause category + similarity matrix
+ │   └── task3_grader.py ← Fix proposal: pattern + diff + LLM judge
+
+ ├── dataset/
+ │   ├── build_dataset.py ← OFFLINE SCRIPT: preprocess IDoFT → py_tasks.csv
+ │   ├── py_tasks.csv ← Final preprocessed task bank (committed to repo)
+ │   └── category_similarity.json ← Similarity matrix for Task 2 partial credit
+
+ └── tests/
+     └── test_compliance.py ← openenv validate compliance checks
+ ```
+
+ ---
+
+ ## 2. Data Pipeline (Do This First, Offline)
+
+ ### 2.1 Download the Raw Dataset
+
+ ```bash
+ git clone https://github.com/TestingResearchIllinois/idoft
+ # The file you need:
+ #   idoft/py-data.csv
+ ```
+
+ ### 2.2 Understand the CSV Columns
+
+ The `py-data.csv` has these columns; an illustrative row is sketched after the list below:
+ ```
+ Project URL | SHA Detected | Pytest Test Name | Category | Status | PR Link | Notes
+ ```
+
+ - **Project URL**: GitHub repo to clone
+ - **SHA Detected**: Exact commit to clone at (this is where the test IS flaky)
+ - **Pytest Test Name**: Format is `path/to/test_file.py::TestClass::test_method` or `path/to/test_file.py::test_method`
+ - **Category**: One of OD, OD-Brit, OD-Vic, NIO, NOD, UD, TD, TZD, ID, NDOI, NDOD, OSD (may be semicolon-separated for multiple)
+ - **Status**: Blank, Opened, Accepted, Rejected, etc.
+ - **PR Link**: Format `owner/repo#number` — only present when Status is Opened/Accepted
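+
+ To make the formats concrete, here is a purely illustrative row as `csv.DictReader` would parse it (every value below is invented, not taken from IDoFT):
+
+ ```python
+ # Hypothetical py-data.csv row (all values invented for illustration).
+ example_row = {
+     "Project URL": "https://github.com/example-org/example-lib",
+     "SHA Detected": "0123456789abcdef0123456789abcdef01234567",
+     "Pytest Test Name": "tests/test_cache.py::TestCache::test_expiry",
+     "Category": "TD;NOD",  # semicolon-separated when multiple categories apply
+     "Status": "Accepted",
+     "PR Link": "example-org/example-lib#42",
+     "Notes": "",
+ }
+ ```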
+
+ ### 2.3 Filter Rules Per Task
+
+ ```python
+ # Task 1 (classify): Use these categories — they have clear static signals
+ TASK1_CATEGORIES = ["NOD", "TD", "TZD", "NIO", "ID", "OD", "OD-Brit", "OD-Vic"]
+
+ # Task 2 (root cause): Same categories — agent must identify which one
+ TASK2_CATEGORIES = ["NOD", "TD", "TZD", "NIO", "ID", "OD", "OD-Brit", "OD-Vic"]
+ # Exclude "UD" (unknown — no ground truth to grade against)
+
+ # Task 3 (fix proposal): ONLY rows where a fix was accepted AND category is gradeable
+ TASK3_CATEGORIES = ["TD", "TZD", "NOD", "NIO", "ID"]
+ # Exclude: OD, OD-Brit, OD-Vic (cannot verify fix without multi-order execution)
+ # Exclude: UD (unknown cause = cannot score fix)
+ # Require: Status == "Accepted" AND PR Link is not empty
+ ```
+
+ ### 2.4 Build `py_tasks.csv` (the `build_dataset.py` script)
+
+ This script runs ONCE offline. It:
+ 1. Reads `idoft/py-data.csv`
+ 2. For each row, fetches the test source code by cloning the repo at SHA (or using GitHub raw API)
+ 3. For Task 3 rows (Status=Accepted), fetches the PR diff from GitHub API
+ 4. Outputs `dataset/py_tasks.csv`
+
+ ```python
+ # dataset/build_dataset.py
+
+ import pandas as pd
+ import requests
+ import subprocess
+ import tempfile
+ import os
+
+ GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]  # set this before running
+
+ def fetch_test_code(repo_url: str, sha: str, pytest_test_name: str) -> str:
+     """
+     Clone repo at SHA, extract the test function source code.
+     pytest_test_name format: path/to/test.py::TestClass::test_method
+     """
+     test_file = pytest_test_name.split("::")[0]
+     with tempfile.TemporaryDirectory() as tmpdir:
+         subprocess.run([
+             "git", "clone", repo_url, tmpdir  # full clone: a shallow clone may not contain the detected SHA
+         ], capture_output=True)
+         subprocess.run([
+             "git", "checkout", sha
+         ], cwd=tmpdir, capture_output=True)
+         filepath = os.path.join(tmpdir, test_file)
+         if not os.path.exists(filepath):
+             return ""
+         with open(filepath) as f:
+             return f.read()[:5000]  # cap at 5000 chars
+
+
+ def fetch_pr_diff(pr_link: str) -> str:
+     """
+     pr_link format: "owner/repo#number"
+     Returns unified diff string of the PR.
+     """
+     if not pr_link or "#" not in pr_link:
+         return ""
+     repo, number = pr_link.strip().split("#")
+     url = f"https://api.github.com/repos/{repo}/pulls/{number}"
+     headers = {
+         "Authorization": f"token {GITHUB_TOKEN}",
+         "Accept": "application/vnd.github.diff"
+     }
+     resp = requests.get(url, headers=headers, timeout=10)
+     if resp.status_code == 200:
+         return resp.text[:3000]  # cap diff size
+     return ""
+
+
+ def build():
+     df = pd.read_csv("idoft/py-data.csv")
+
+     # Strip stray whitespace from column names
+     df.columns = [c.strip() for c in df.columns]
+
+     rows = []
+     for _, row in df.iterrows():
+         repo_url = str(row.get("Project URL", "")).strip()
+         sha = str(row.get("SHA Detected", "")).strip()
+         test_name = str(row.get("Pytest Test Name", "")).strip()
+         category_raw = str(row.get("Category", "")).strip()
+         status = str(row.get("Status", "")).strip()
+         pr_link = str(row.get("PR Link", "")).strip()
+
+         # Skip rows with missing essentials
+         if not repo_url or not sha or not test_name or not category_raw:
+             continue
+
+         # Take primary category (first if semicolon-separated)
+         category = category_raw.split(";")[0].strip()
+
+         # Skip UD for Task 2 (no ground truth)
+         if category == "UD":
+             continue
+
+         # Determine task types this row is eligible for
+         task_types = []
+         if category in ["NOD", "TD", "TZD", "NIO", "ID", "OD", "OD-Brit", "OD-Vic"]:
+             task_types.append("classify")
+             task_types.append("root_cause")
+         if (category in ["TD", "TZD", "NOD", "NIO", "ID"]
+                 and status == "Accepted"
+                 and pr_link and pr_link != "nan"):
+             task_types.append("fix_proposal")
+
+         if not task_types:
+             continue
+
+         # Fetch test source code
+         test_code = fetch_test_code(repo_url, sha, test_name)
+         if not test_code:
+             continue
+
+         # Fetch fix diff for Task 3 eligible rows
+         known_fix_diff = ""
+         if "fix_proposal" in task_types:
+             known_fix_diff = fetch_pr_diff(pr_link)
+
+         rows.append({
+             "repo_url": repo_url,
+             "sha": sha,
+             "test_name": test_name,
+             "test_file": test_name.split("::")[0],
+             "category": category,
+             "status": status,
+             "pr_link": pr_link,
+             "task_types": ";".join(task_types),
+             "test_code": test_code,
+             "known_fix_diff": known_fix_diff,
+         })
+
+     out = pd.DataFrame(rows)
+     out.to_csv("dataset/py_tasks.csv", index=False)
+     print(f"Built {len(out)} task rows")
+     print(out["category"].value_counts())
+     print(out["task_types"].value_counts())
+
+ if __name__ == "__main__":
+     build()
+ ```
+
+ ### 2.5 Build `category_similarity.json`
+
+ ```json
+ {
+   "OD,OD-Brit": 0.7,
+   "OD,OD-Vic": 0.7,
+   "OD-Brit,OD-Vic": 0.8,
+   "OD,NIO": 0.4,
+   "OD,NDOI": 0.3,
+   "NOD,TD": 0.6,
+   "NOD,TZD": 0.5,
+   "NOD,NDOI": 0.5,
+   "TD,TZD": 0.7,
+   "NOD,ID": 0.3,
+   "UD,OD": 0.2,
+   "UD,NOD": 0.2,
+   "UD,NIO": 0.2,
+   "UD,TD": 0.2,
+   "UD,ID": 0.2
+ }
+ ```
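+
+ The keys are unordered category pairs, so any lookup must try both orderings. A minimal sketch of the symmetric lookup the Task 2 grader performs over this file:
+
+ ```python
+ # Minimal sketch of a symmetric lookup over the pair-keyed matrix above.
+ import json
+
+ with open("dataset/category_similarity.json") as f:
+     sim = json.load(f)
+
+ def similarity(a: str, b: str) -> float:
+     if a == b:
+         return 1.0  # exact match always scores full credit
+     return float(sim.get(f"{a},{b}", sim.get(f"{b},{a}", 0.0)))
+
+ assert similarity("OD-Brit", "OD") == 0.7   # pair ordering does not matter
+ assert similarity("TD", "ID") == 0.0        # unrelated families get no credit
+ ```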
+
+ ---
+
+ ## 3. Pydantic Models (`env/models.py`)
+
+ ```python
+ from pydantic import BaseModel
+ from typing import Literal, Optional, List
+
+ class FlakySleuthObservation(BaseModel):
+     repo_url: str
+     test_name: str
+     test_code: str
+     file_tree: List[str]
+     tool_output: Optional[str] = None
+     task_type: Literal["classify", "root_cause", "fix_proposal"]
+     task_description: str
+     step_count: int
+
+ class FlakySleuthAction(BaseModel):
+     action_type: Literal[
+         "read_file",
+         "search_code",
+         "run_test",
+         "classify_flakiness",
+         "classify_root_cause",
+         "propose_fix",
+     ]
+     argument: str
+
+ class FlakySleuthReward(BaseModel):
+     score: float
+     breakdown: dict
+     explanation: str
+ ```
+
+ ---
+
+ ## 4. Sandbox (`env/sandbox.py`)
+
+ The sandbox wraps a cloned git repo. It handles all filesystem operations.
+
+ ```python
+ import subprocess
+ import tempfile
+ import os
+ import shutil
+ from typing import Optional, List
+
+ class Sandbox:
+     def __init__(self, task: dict):
+         self.task = task
+         self.tmpdir: Optional[str] = None
+         self.file_tree: List[str] = []
+
+     def setup(self):
+         """Clone repo at the specific SHA. Called by env.reset()."""
+         self.tmpdir = tempfile.mkdtemp(prefix="flakysleuth_")
+         try:
+             # Shallow clone for speed (old SHAs may be absent; the shipped sandbox fetches the exact commit)
+             subprocess.run([
+                 "git", "clone", "--depth=50",
+                 self.task["repo_url"],
+                 self.tmpdir
+             ], capture_output=True, timeout=60, check=True)
+
+             # Checkout exact SHA where flakiness was detected
+             subprocess.run([
+                 "git", "checkout", self.task["sha"]
+             ], cwd=self.tmpdir, capture_output=True, timeout=30, check=True)
+
+             self.file_tree = self._build_file_tree()
+         except Exception as e:
+             self.cleanup()
+             raise RuntimeError(f"Sandbox setup failed: {e}")
+
+     def read_file(self, relative_path: str) -> Optional[str]:
+         """Read a file relative to repo root. Returns None if not found."""
+         full_path = os.path.normpath(os.path.join(self.tmpdir, relative_path))
+         # Security: keep path inside tmpdir (prefix check is coarse; the shipped sandbox uses os.path.commonpath)
+         if not full_path.startswith(self.tmpdir):
+             return None
+         if not os.path.isfile(full_path):
+             return None
+         try:
+             with open(full_path, "r", errors="replace") as f:
+                 return f.read()[:4000]  # cap to avoid huge files
+         except Exception:
+             return None
+
+     def grep(self, pattern: str) -> str:
+         """Grep for pattern across all .py files in the repo."""
+         if not self.tmpdir:
+             return "ERROR: Sandbox not initialized"
+         try:
+             result = subprocess.run(
+                 ["grep", "-rn", "--include=*.py", pattern, "."],
+                 cwd=self.tmpdir,
+                 capture_output=True,
+                 text=True,
+                 timeout=10
+             )
+             output = result.stdout[:2000]
+             return output if output else f"No matches found for: {pattern}"
+         except subprocess.TimeoutExpired:
+             return "Search timed out"
+         except Exception as e:
+             return f"Search error: {e}"
+
+     def run_test(self, pytest_test_name: str) -> str:
+         """
+         Run the specific test via pytest.
+         ONLY called for non-OD tasks.
+         """
+         if self.task["category"] in ("OD", "OD-Brit", "OD-Vic"):
+             return (
+                 "Test execution skipped for order-dependent tests. "
+                 "Use read_file and search_code to analyze static code structure instead. "
+                 "Look for: shared state, missing setUp/tearDown, module-scoped fixtures, global mutations."
+             )
+         try:
+             result = subprocess.run(
+                 ["python", "-m", "pytest", pytest_test_name,
+                  "--tb=short", "-x", "--timeout=30", "-q"],
+                 cwd=self.tmpdir,
+                 capture_output=True,
+                 text=True,
+                 timeout=60
+             )
+             output = (result.stdout + result.stderr)[:2000]
+             return output if output else "Test completed with no output"
+         except subprocess.TimeoutExpired:
+             return "Test execution timed out (>60s)"
+         except Exception as e:
+             return f"Test execution error: {e}"
+
+     def cleanup(self):
+         """Remove temp directory. Called after episode ends."""
+         if self.tmpdir and os.path.exists(self.tmpdir):
+             shutil.rmtree(self.tmpdir, ignore_errors=True)
+         self.tmpdir = None
+         self.file_tree = []
+
+     def _build_file_tree(self) -> List[str]:
+         """Return top-2-level file paths relative to repo root."""
+         result = []
+         for root, dirs, files in os.walk(self.tmpdir):
+             # Skip hidden dirs and common noise
+             dirs[:] = [d for d in dirs if not d.startswith(".")
+                        and d not in ("node_modules", "__pycache__", ".git", "venv", ".tox")]
+             depth = root.replace(self.tmpdir, "").count(os.sep)
+             if depth <= 2:
+                 for f in files:
+                     rel = os.path.relpath(os.path.join(root, f), self.tmpdir)
+                     result.append(rel)
+             if len(result) > 100:
+                 break
+         return result[:100]
+ ```
+
+ ---
+
+ ## 5. Task Loader (`env/task_loader.py`)
+
+ ```python
+ import pandas as pd
+ import random
+ from typing import Optional
+
+ class TaskLoader:
+     def __init__(self, csv_path: str):
+         df = pd.read_csv(csv_path)
+         # Expand task_types column into individual rows
+         rows = []
+         for _, row in df.iterrows():
+             for tt in str(row["task_types"]).split(";"):
+                 r = row.to_dict()
+                 r["task_type"] = tt.strip()
+                 rows.append(r)
+         self.tasks = rows
+         self._forced_type: Optional[str] = None
+
+     def sample(self) -> dict:
+         """Sample a random task, optionally filtered by type."""
+         pool = self.tasks
+         if self._forced_type:
+             pool = [t for t in self.tasks if t["task_type"] == self._forced_type]
+         task = random.choice(pool).copy()
+         task["task_description"] = self._make_description(task)
+         return task
+
+     def force_task_type(self, task_type: str):
+         """Force next sample() calls to return a specific task type."""
+         self._forced_type = task_type
+
+     def _make_description(self, task: dict) -> str:
+         tt = task["task_type"]
+         if tt == "classify":
+             return (
+                 "Investigate the given test and determine whether it is FLAKY or STABLE. "
+                 "Use read_file and search_code to gather evidence. "
+                 "When confident, call classify_flakiness with argument 'flaky' or 'stable'."
+             )
+         elif tt == "root_cause":
+             return (
+                 f"This test is confirmed flaky. Identify its root cause category. "
+                 f"Valid categories: OD, OD-Brit, OD-Vic, NIO, NOD, TD, TZD, ID, NDOI. "
+                 f"Use read_file and search_code to find evidence. "
+                 f"Call classify_root_cause with the category code when confident."
+             )
+         elif tt == "fix_proposal":
+             return (
+                 f"This test is confirmed flaky with root cause: {task['category']}. "
+                 f"Propose a concrete fix as a unified diff. "
+                 f"Use read_file and search_code to understand the code. "
+                 f"Call propose_fix with a valid unified diff string."
+             )
+         return "Investigate the flaky test."
+ ```
+
+ ---
+
+ ## 6. Core Environment (`env/environment.py`)
+
+ ```python
+ import random
+ from env.models import FlakySleuthObservation, FlakySleuthAction
+ from env.sandbox import Sandbox
+ from env.task_loader import TaskLoader
+ from graders import grade_action
+
+ FLAKY_SIGNAL_PATTERNS = [
+     "sleep", "random", "time", "datetime", "thread", "asyncio",
+     "fixture", "setUp", "tearDown", "global", "shared", "singleton",
+     "os.environ", "socket", "timeout", "retry", "mock", "patch"
+ ]
+
+ class FlakySleuthEnv:
+     def __init__(self, dataset_path: str = "dataset/py_tasks.csv"):
+         self.loader = TaskLoader(dataset_path)
+         self.sandbox: Sandbox | None = None
+         self.current_task: dict | None = None
+         self.step_count: int = 0
+         self.cumulative_progress: float = 0.0
+         self.files_read: set = set()
+         self.episode_actions: list = []
+
+     def reset(self) -> FlakySleuthObservation:
+         # Cleanup previous episode
+         if self.sandbox:
+             self.sandbox.cleanup()
+
+         # Sample new task
+         self.current_task = self.loader.sample()
+         self.sandbox = Sandbox(self.current_task)
+         self.sandbox.setup()
+
+         # Reset episode state
+         self.step_count = 0
+         self.cumulative_progress = 0.0
+         self.files_read = set()
+         self.episode_actions = []
+
+         return self._make_obs()
+
+     def step(self, action: FlakySleuthAction):
+         self.step_count += 1
+         self.episode_actions.append(action)
+         tool_output = None
+         reward = 0.0
+         done = False
+         info = {}
+
+         TERMINAL_ACTIONS = ("classify_flakiness", "classify_root_cause", "propose_fix")
+
+         if action.action_type in TERMINAL_ACTIONS:
+             # Grade terminal action
+             terminal_score = grade_action(action, self.current_task)
+
+             # Late step penalty: -0.05 per step beyond 15
+             late_penalty = max(0, (self.step_count - 15)) * 0.05
+
+             # Wrong-direction penalty for T1
+             wrong_dir_penalty = 0.0
+             if (action.action_type == "classify_flakiness"
+                     and action.argument.lower() == "stable"
+                     and self.current_task.get("label") == "flaky"):
+                 wrong_dir_penalty = 0.2
+
+             reward = min(1.0, max(0.0,
+                 self.cumulative_progress + terminal_score
+                 - late_penalty - wrong_dir_penalty
+             ))
+             done = True
+             info = {
+                 "terminal_score": terminal_score,
+                 "progress_score": self.cumulative_progress,
+                 "late_penalty": late_penalty,
+                 "task_type": self.current_task["task_type"],
+                 "category": self.current_task["category"],
+             }
+
+         else:
+             # Exploratory action
+             tool_output, progress = self._execute_exploration(action)
+             self.cumulative_progress = min(0.30, self.cumulative_progress + progress)
+             reward = progress
+
+         obs = self._make_obs(tool_output)
+         return obs, reward, done, info
+
+     def state(self) -> dict:
+         return {
+             "repo_url": self.current_task["repo_url"] if self.current_task else None,
+             "test_name": self.current_task["test_name"] if self.current_task else None,
+             "task_type": self.current_task["task_type"] if self.current_task else None,
+             "step_count": self.step_count,
+             "files_read": list(self.files_read),
+             "cumulative_progress": self.cumulative_progress,
+         }
+
+     def _execute_exploration(self, action: FlakySleuthAction):
+         progress = 0.0
+         output = ""
+
+         if action.action_type == "read_file":
+             content = self.sandbox.read_file(action.argument)
+             if content is None:
+                 output = f"ERROR: File not found: {action.argument}"
+                 progress = -0.05  # hallucination penalty
+             elif action.argument in self.files_read:
+                 output = content
+                 progress = 0.0  # no reward for re-read
+             else:
+                 self.files_read.add(action.argument)
+                 output = content
+                 progress = self._file_relevance_reward(action.argument)
+
+         elif action.action_type == "search_code":
+             output = self.sandbox.grep(action.argument)
+             progress = self._search_relevance_reward(action.argument)
+
+         elif action.action_type == "run_test":
+             output = self.sandbox.run_test(self.current_task["test_name"])
+             # Reward for actually running the test (shows initiative)
+             # But 0 if OD task (sandbox returns static message)
+             if self.current_task["category"] not in ("OD", "OD-Brit", "OD-Vic"):
+                 progress = 0.05
+
+         return output, progress
+
+     def _file_relevance_reward(self, filepath: str) -> float:
+         task = self.current_task
+         test_file = task.get("test_file", "")
+
+         if test_file and test_file in filepath:
+             return 0.07  # reading the actual test file
+         if any(filepath.endswith(ext) for ext in (".py",)):
+             return 0.03  # any python file
+         return 0.01  # non-python file (requirements, config, etc.)
+
+     def _search_relevance_reward(self, pattern: str) -> float:
+         pattern_lower = pattern.lower()
+         if any(sig.lower() in pattern_lower for sig in FLAKY_SIGNAL_PATTERNS):  # lowercase both sides so "setUp" can match
+             return 0.04  # searching for known flakiness signals
+         return 0.01  # generic search
+
+     def _make_obs(self, tool_output=None) -> FlakySleuthObservation:
+         task = self.current_task
+         return FlakySleuthObservation(
+             repo_url=task["repo_url"],
+             test_name=task["test_name"],
+             test_code=task.get("test_code", "")[:2000],
+             file_tree=self.sandbox.file_tree if self.sandbox else [],
+             tool_output=tool_output,
+             task_type=task["task_type"],
+             task_description=task["task_description"],
+             step_count=self.step_count,
+         )
+ ```
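+
+ As a sanity check on the shaping above, a worked terminal-reward computation (all inputs invented for illustration):
+
+ ```python
+ # Worked example of the terminal reward clamp (all inputs invented).
+ cumulative_progress = 0.22   # capped exploration credit (max 0.30)
+ terminal_score = 0.70        # grader output for the final verdict
+ step_count = 18              # three steps past the soft limit of 15
+
+ late_penalty = max(0, step_count - 15) * 0.05      # 0.15
+ wrong_dir_penalty = 0.0                            # not a flaky-called-stable miss
+ reward = min(1.0, max(0.0, cumulative_progress + terminal_score
+                       - late_penalty - wrong_dir_penalty))
+ print(reward)  # 0.77
+ ```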
+
+ ---
+
+ ## 7. Graders
+
+ ### 7.1 Dispatcher (`graders/__init__.py`)
+
+ ```python
+ from env.models import FlakySleuthAction
+ from graders.task1_grader import grade as grade_t1
+ from graders.task2_grader import grade as grade_t2
+ from graders.task3_grader import grade as grade_t3
+
+ def grade_action(action: FlakySleuthAction, task: dict) -> float:
+     tt = task["task_type"]
+     if tt == "classify":
+         return grade_t1(action, task)
+     elif tt == "root_cause":
+         return grade_t2(action, task)
+     elif tt == "fix_proposal":
+         return grade_t3(action, task)
+     return 0.0
+ ```
+
+ ### 7.2 Task 1 Grader (`graders/task1_grader.py`)
+
+ ```python
+ from env.models import FlakySleuthAction
+
+ def grade(action: FlakySleuthAction, task: dict) -> float:
+     """Binary classification: flaky or stable. Exact match only."""
+     if action.action_type != "classify_flakiness":
+         return 0.0
+
+     predicted = action.argument.strip().lower()
+     if predicted not in ("flaky", "stable"):
+         return 0.0
+
+     # All IDoFT rows are flaky; stable examples are synthetically added
+     # with label="stable" during dataset construction
+     ground_truth = task.get("label", "flaky")
+     return 1.0 if predicted == ground_truth else 0.0
+ ```
+
+ ### 7.3 Task 2 Grader (`graders/task2_grader.py`)
+
+ ```python
+ import json
+ import os
+ from env.models import FlakySleuthAction
+
+ # Load similarity matrix once at module level
+ _SIM_PATH = os.path.join(os.path.dirname(__file__),
+                          "..", "dataset", "category_similarity.json")
+ with open(_SIM_PATH) as f:
+     _RAW_SIM = json.load(f)
+
+ def _get_similarity(pred: str, true: str) -> float:
+     if pred == true:
+         return 1.0
+     key1 = f"{pred},{true}"
+     key2 = f"{true},{pred}"
+     return _RAW_SIM.get(key1, _RAW_SIM.get(key2, 0.0))
+
+ VALID_CATEGORIES = {
+     "OD", "OD-Brit", "OD-Vic", "NIO", "NOD",
+     "UD", "TD", "TZD", "ID", "NDOI", "NDOD", "OSD"
+ }
+
+ def grade(action: FlakySleuthAction, task: dict) -> float:
+     """
+     Root cause category classification.
+     Exact match = 1.0
+     Related category = partial credit via similarity matrix
+     Wrong family = 0.0
+     """
+     if action.action_type != "classify_root_cause":
+         return 0.0
+
+     predicted = action.argument.strip().upper()
+
+     # Handle common variations (NOTE: upper() breaks mixed-case names like "OD-Brit"; the shipped grader canonicalizes case)
+     predicted = predicted.replace(" ", "-")  # "OD Brit" → "OD-Brit"
+
+     if predicted not in VALID_CATEGORIES:
+         return 0.0  # invalid category string
+
+     # Take primary category from dataset (first if semicolon-separated)
+     true_category = str(task.get("category", "")).split(";")[0].strip().upper()
+
+     return _get_similarity(predicted, true_category)
+ ```
+
+ ### 7.4 Task 3 Grader (`graders/task3_grader.py`)
+
+ ```python
+ import subprocess
+ import tempfile
+ import os
+ import json
+ from openai import OpenAI
+ from env.models import FlakySleuthAction
+
+ CATEGORY_DESCRIPTIONS = {
+     "TD": "Time-Dependent: test fails due to reliance on wall-clock time",
+     "TZD": "Timezone-Dependent: test fails in different timezones",
+     "NOD": "Non-Deterministic: test fails due to randomness or non-determinism",
+     "NIO": "Non-Idempotent-Outcome: test passes first run but fails on second run",
+     "ID": "Implementation-Dependent: test fails due to language/runtime non-determinism (e.g. dict ordering)",
+ }
+
+ EXPECTED_FIX_PATTERNS = {
+     "TD": ["freeze_time", "mock", "patch", "utcnow", "datetime", "monkeypatch"],
+     "TZD": ["timezone", "utc", "pytz", "zoneinfo", "tzinfo", "UTC"],
+     "NOD": ["seed", "mock", "patch", "deterministic", "sorted"],
+     "NIO": ["setUp", "tearDown", "fixture", "yield", "cleanup", "autouse"],
+     "ID": ["sorted(", "list(", "frozenset", "OrderedDict"],
+ }
+
+ def grade(action: FlakySleuthAction, task: dict) -> float:
+     """
+     Fix proposal grader.
+     Component A: Pattern check — 0.35 weight
+     Component B: Diff applies — 0.25 weight
+     Component C: LLM judge — 0.40 weight
+     """
+     if action.action_type != "propose_fix":
+         return 0.0
+
+     proposed_fix = action.argument.strip()
+     if not proposed_fix:
+         return 0.0
+
+     category = str(task.get("category", "")).split(";")[0].strip().upper()
+     known_fix = task.get("known_fix_diff", "") or ""
+     test_code = task.get("test_code", "") or ""
+
+     # ── Component A: Pattern check ────────────────────────────────
+     patterns = EXPECTED_FIX_PATTERNS.get(category, [])
+     if patterns:
+         matches = sum(1 for p in patterns if p in proposed_fix)
+         pattern_score = min(1.0, matches / max(1, len(patterns) * 0.4))
+     else:
+         pattern_score = 0.5
+
+     # ── Component B: Diff applies cleanly ─────────────────────────
+     apply_score = _check_diff_applies(proposed_fix, task)
+
+     # ── Component C: LLM judge ────────────────────────────────────
+     judge_score = _llm_judge(proposed_fix, known_fix, category, test_code)
+
+     total = (0.35 * pattern_score) + (0.25 * apply_score) + (0.40 * judge_score)
+     return round(min(1.0, max(0.0, total)), 4)
+
+
+ def _check_diff_applies(fix: str, task: dict) -> float:
+     """Try a dry-run patch application against the test file in a temp copy."""
+     try:
+         test_file = task.get("test_file", "")
+         sandbox_path = task.get("sandbox_test_path", "")
+
+         if not sandbox_path or not os.path.exists(sandbox_path):
+             return 0.3  # can't verify, neutral-ish
+
+         with tempfile.NamedTemporaryFile(mode="w", suffix=".patch", delete=False) as f:
+             f.write(fix)
+             patch_path = f.name
+
+         result = subprocess.run(
+             ["patch", "--dry-run", "-p1", sandbox_path, patch_path],
+             capture_output=True, text=True, timeout=10
+         )
+         os.unlink(patch_path)
+         return 1.0 if result.returncode == 0 else 0.0
+     except Exception:
+         return 0.3  # can't verify, neutral
+
+
+ def _llm_judge(proposed: str, known: str, category: str, test_code: str) -> float:
+     """Call the LLM judge via OpenAI-compatible API."""
+     client = OpenAI(
+         api_key=os.environ.get("OPENAI_API_KEY", ""),
+         base_url=os.environ.get("API_BASE_URL", "https://api.openai.com/v1"),
+     )
+     model = os.environ.get("MODEL_NAME", "gpt-4o-mini")
+
+     cat_desc = CATEGORY_DESCRIPTIONS.get(category, f"Flakiness category: {category}")
+     known_section = f"Known accepted fix (from merged PR):\n```\n{known[:800]}\n```" if known else "Known fix: Not available"
+
+     prompt = f"""You are evaluating a proposed fix for a flaky Python test.
+
+ Flakiness category: {category}
+ What this means: {cat_desc}
+
+ Original flaky test code:
+ ```python
+ {test_code[:1000]}
+ ```
+
+ Proposed fix (unified diff):
+ ```
+ {proposed[:1000]}
+ ```
+
+ {known_section}
+
+ Score the proposed fix from 0 to 10:
+ - 0–2: Fix is wrong, irrelevant, or makes things worse
+ - 3–5: Fix partially addresses the issue but misses root cause
+ - 6–8: Fix correctly addresses root cause with minor issues
+ - 9–10: Fix is correct, clean, minimal, and addresses root cause completely
+
+ Respond ONLY with a JSON object and nothing else:
+ {{"score": <integer 0-10>, "reason": "<one sentence explanation>"}}"""
+
+     try:
+         resp = client.chat.completions.create(
+             model=model,
+             messages=[{"role": "user", "content": prompt}],
+             max_tokens=100,
+             temperature=0.0,
+         )
+         raw = resp.choices[0].message.content.strip()
+         # Strip markdown fences if present
+         raw = raw.replace("```json", "").replace("```", "").strip()
+         data = json.loads(raw)
+         score = int(data["score"])
+         return max(0.0, min(10.0, score)) / 10.0
+     except Exception:
+         return 0.5  # fallback neutral on any failure
+ ```
+
+ ---
+
+ ## 8. OpenEnv HTTP Server (`server.py`)
+
+ ```python
+ from fastapi import FastAPI
+ from env.models import FlakySleuthObservation, FlakySleuthAction
+ from env.environment import FlakySleuthEnv
+
+ app = FastAPI(title="FlakySleuth Environment")
+ env = FlakySleuthEnv()
+
+ @app.post("/reset")
+ def reset() -> FlakySleuthObservation:
+     return env.reset()
+
+ @app.post("/step")
+ def step(action: FlakySleuthAction):
+     obs, reward, done, info = env.step(action)
+     return {
+         "observation": obs.dict(),
+         "reward": reward,
+         "done": done,
+         "info": info,
+     }
+
+ @app.get("/state")
+ def state():
+     return env.state()
+
+ @app.get("/health")
+ def health():
+     return {"status": "ok"}
+
+ if __name__ == "__main__":
+     import uvicorn
+     uvicorn.run(app, host="0.0.0.0", port=7860)
+ ```
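+
+ A quick endpoint smoke test, sketched with `requests` (it assumes `python server.py` is already running locally; the action payload is just an example):
+
+ ```python
+ # Hedged smoke test; assumes the server above is running on localhost:7860.
+ import requests
+
+ base = "http://localhost:7860"
+ assert requests.get(f"{base}/health", timeout=5).json() == {"status": "ok"}
+
+ # reset() clones a repo, so allow a generous timeout.
+ obs = requests.post(f"{base}/reset", timeout=120).json()
+ print(obs["test_name"], obs["task_type"])
+
+ step = requests.post(
+     f"{base}/step",
+     json={"action_type": "search_code", "argument": "sleep"},
+     timeout=60,
+ ).json()
+ print(step["reward"], step["done"])
+ ```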
+
+ ---
+
+ ## 9. `openenv.yaml`
+
+ ```yaml
+ name: flaky-sleuth-env
+ version: 0.1.0
+ description: >
+   An RL environment where an LLM agent investigates flaky tests in real
+   Python GitHub repositories. The agent uses tool calls to read code,
+   search for patterns, and run tests — then produces a verdict (classify,
+   root cause, or fix). Tasks range from binary flakiness classification
+   to proposing concrete code fixes verified by a hybrid grader.
+
+ observation_type: FlakySleuthObservation
+ action_type: FlakySleuthAction
+ reward_range: [0.0, 1.0]
+
+ tasks:
+   - id: task1_classify
+     name: "Flaky vs. Stable Classification"
+     difficulty: easy
+     description: >
+       Given a test from a real Python repo, classify it as flaky or stable.
+       Agent must call classify_flakiness with argument 'flaky' or 'stable'.
+
+   - id: task2_root_cause
+     name: "Root Cause Category Identification"
+     difficulty: medium
+     description: >
+       Given a confirmed flaky test, identify the root cause category
+       (OD, NOD, TD, TZD, NIO, ID, etc.) via static code analysis.
+
+   - id: task3_fix_proposal
+     name: "Fix Proposal"
+     difficulty: hard
+     description: >
+       Given a confirmed flaky test and its root cause, propose a concrete
+       fix as a unified diff. Evaluated by pattern matching + LLM judge.
+
+ episode_max_steps: 20
+ baseline_script: inference.py
+
+ infra:
+   vcpu: 2
+   memory_gb: 8
+   max_inference_minutes: 20
+ ```
+
+ ---
+
+ ## 10. Baseline Inference Script (`inference.py`)
+
+ **CRITICAL:** Must be named exactly `inference.py` in the root directory. Must use OpenAI client. Must read `API_BASE_URL`, `MODEL_NAME`, `OPENAI_API_KEY` from environment variables.
+
+ ```python
+ """
+ FlakySleuth baseline inference script.
+
+ Required environment variables:
+     OPENAI_API_KEY — API key
+     API_BASE_URL — LLM endpoint (default: https://api.openai.com/v1)
+     MODEL_NAME — Model identifier (default: gpt-4o-mini)
+
+ Runs 5 episodes × 3 task types = 15 total episodes.
+ Prints average score per task type.
+ Must complete in under 20 minutes on vcpu=2, 8GB RAM.
+ """
+
+ import os
+ import json
+ from openai import OpenAI
+ from env.environment import FlakySleuthEnv
+ from env.models import FlakySleuthAction
+
+ # ── Configuration ──────────────────────────────────────────────────
+ API_KEY = os.environ.get("OPENAI_API_KEY", "")
+ API_BASE_URL = os.environ.get("API_BASE_URL", "https://api.openai.com/v1")
+ MODEL_NAME = os.environ.get("MODEL_NAME", "gpt-4o-mini")
+ EPISODES_PER_TASK = 5
+
+ client = OpenAI(api_key=API_KEY, base_url=API_BASE_URL)
+
+ # ── System prompt (teaches the model your tool interface) ──────────
+ SYSTEM_PROMPT = """You are a flaky test detective. You investigate Python tests in real GitHub repositories.
+
+ At each step, respond ONLY with a single valid JSON object — no explanation, no markdown, no extra text.
+
+ Available actions:
+
+ EXPLORATORY (use these to gather evidence):
+ {"action_type": "read_file", "argument": "relative/path/to/file.py"}
+ {"action_type": "search_code", "argument": "pattern_to_grep_for"}
+ {"action_type": "run_test", "argument": ""}
+
+ TERMINAL (use exactly one of these to end the episode):
+ {"action_type": "classify_flakiness", "argument": "flaky"}
+ {"action_type": "classify_flakiness", "argument": "stable"}
+ {"action_type": "classify_root_cause", "argument": "OD"}
+ {"action_type": "classify_root_cause", "argument": "NOD"}
+ {"action_type": "classify_root_cause", "argument": "TD"}
+ {"action_type": "classify_root_cause", "argument": "TZD"}
+ {"action_type": "classify_root_cause", "argument": "NIO"}
+ {"action_type": "classify_root_cause", "argument": "ID"}
+ {"action_type": "classify_root_cause", "argument": "OD-Brit"}
+ {"action_type": "classify_root_cause", "argument": "OD-Vic"}
+ {"action_type": "propose_fix", "argument": "--- a/path\\n+++ b/path\\n@@ ... @@\\n-old line\\n+new line"}
+
+ RULES:
+ 1. Always read the test file first before making a terminal decision.
+ 2. Search for flakiness signals: sleep, random, time, datetime, thread, os.environ, shared state.
+ 3. For order-dependent (OD) tests, run_test is disabled — use static analysis only.
+ 4. Call a terminal action only when you have enough evidence.
+ 5. Respond with ONLY valid JSON. Nothing else."""
+
+
+ def obs_to_prompt(obs) -> str:
+     return f"""TASK: {obs.task_description}
+
+ Repository: {obs.repo_url}
+ Test name: {obs.test_name}
+ Step: {obs.step_count}/20
+
+ Test source code:
+ ```python
+ {obs.test_code}
+ ```
+
+ Repository file tree (top-level):
+ {chr(10).join(obs.file_tree[:40])}
+
+ Result of your last action:
+ {obs.tool_output or "(No action taken yet — this is the start of the episode)"}
+
+ What is your next action? Respond with JSON only."""
+
+
+ def run_episode(env: FlakySleuthEnv) -> float:
+     obs = env.reset()
+     messages = [
+         {"role": "system", "content": SYSTEM_PROMPT},
+         {"role": "user", "content": obs_to_prompt(obs)},
+     ]
+     total_reward = 0.0
+
+     for step in range(20):
+         try:
+             resp = client.chat.completions.create(
+                 model=MODEL_NAME,
+                 messages=messages,
+                 max_tokens=400,
+                 temperature=0.0,
+             )
+             raw = resp.choices[0].message.content.strip()
+             messages.append({"role": "assistant", "content": raw})
+
+             # Parse action
+             clean = raw.replace("```json", "").replace("```", "").strip()
+             action_dict = json.loads(clean)
+             action = FlakySleuthAction(**action_dict)
+
+         except json.JSONDecodeError:
+             # Model produced non-JSON — inject correction message
+             messages.append({
+                 "role": "user",
+                 "content": "ERROR: Your response was not valid JSON. "
+                            "Respond ONLY with a JSON object as specified."
+             })
+             continue
+         except Exception as e:
+             print(f"  Step {step} error: {e}")
+             break
+
+         obs, reward, done, info = env.step(action)
+         total_reward += reward
+
+         if done:
+             print(f"  Terminal: {action.action_type}({action.argument[:50]}) "
+                   f"→ terminal={info.get('terminal_score', 0):.2f} "
+                   f"progress={info.get('progress_score', 0):.2f} "
+                   f"total={total_reward:.2f}")
+             break
+
+         messages.append({"role": "user", "content": obs_to_prompt(obs)})
+
+     return total_reward
+
+
+ def main():
+     env = FlakySleuthEnv()
+     results = {"classify": [], "root_cause": [], "fix_proposal": []}
+
+     for task_type in results.keys():
+         print(f"\n── Task type: {task_type} ──")
+         env.loader.force_task_type(task_type)
+         for ep in range(EPISODES_PER_TASK):
+             score = run_episode(env)
+             results[task_type].append(score)
+             print(f"  Episode {ep+1}: {score:.3f}")
+
+     print("\n══ BASELINE RESULTS ══")
+     for task_type, scores in results.items():
+         avg = sum(scores) / len(scores)
+         print(f"  {task_type:15s}: avg={avg:.3f} scores={[round(s,3) for s in scores]}")
+
+     overall = sum(s for scores in results.values() for s in scores)
+     overall /= sum(len(v) for v in results.values())
+     print(f"  {'OVERALL':15s}: avg={overall:.3f}")
+
+
+ if __name__ == "__main__":
+     main()
+ ```
+
+ ---
+
+ ## 11. Dockerfile
+
+ ```dockerfile
+ FROM python:3.11-slim
+
+ # Install git and patch (needed for sandbox)
+ RUN apt-get update && apt-get install -y \
+     git \
+     patch \
+     && rm -rf /var/lib/apt/lists/*
+
+ WORKDIR /app
+
+ # Copy requirements first for layer caching
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy everything else
+ COPY . .
+
+ # Expose port for HF Spaces
+ EXPOSE 7860
+
+ # Start FastAPI server
+ CMD ["python", "server.py"]
+ ```
+
+ ---
+
+ ## 12. `requirements.txt`
+
+ ```
+ fastapi>=0.110.0
+ uvicorn>=0.27.0
+ pydantic>=2.0.0
+ openai>=1.0.0
+ pandas>=2.0.0
+ gitpython>=3.1.0
+ pytest>=7.0.0
+ pytest-timeout>=2.0.0
+ requests>=2.31.0
+ ```
+
+ ---
+
+ ## 13. Build Order (Day-by-Day Sprint)
+
+ ```
+ DAY 1 — Data Foundation
+ ────────────────────────
+ □ Clone idoft repo, inspect py-data.csv manually
+ □ Run build_dataset.py offline (set GITHUB_TOKEN)
+ □ Verify py_tasks.csv has rows for all 3 task types
+ □ Manually inspect 5-10 rows to sanity check test_code and known_fix_diff
+ □ Build category_similarity.json
+
+ DAY 2 — Core Environment
+ ──────────────────────────
+ □ Implement env/models.py (Pydantic models)
+ □ Implement env/sandbox.py (clone, read_file, grep, run_test)
+ □ Test sandbox.py manually on 2-3 real repos
+ □ Implement env/task_loader.py
+ □ Implement env/environment.py (reset, step, state)
+ □ Write a quick smoke test: reset() → 3 steps → terminal action
+
+ DAY 3 — Graders
+ ────────────────
+ □ Implement graders/task1_grader.py
+ □ Implement graders/task2_grader.py + verify similarity matrix
+ □ Implement graders/task3_grader.py (pattern + diff + LLM judge)
+ □ Unit test all 3 graders with hardcoded inputs
+ □ Verify scores are always in [0.0, 1.0]
+
+ DAY 4 — Server + Spec Compliance
+ ──────────────────────────────────
+ □ Implement server.py (FastAPI: /reset, /step, /state, /health)
+ □ Write openenv.yaml
+ □ Run openenv validate — fix any errors
+ □ Build Dockerfile locally: docker build . && docker run -p 7860:7860
+ □ Test endpoints with curl
+
+ DAY 5 — Inference Script + Deploy
+ ────────────────────────────────────
+ □ Implement inference.py (ReAct loop, OpenAI client)
+ □ Run inference.py locally against real API
+ □ Verify it completes in <20 min, produces scores for all 3 task types
+ □ Deploy to Hugging Face Spaces
+ □ Verify HF Space returns 200 on health check and responds to reset()
+ □ Run pre-submission validation script
+
+ DAY 6 — Polish + Submit
+ ─────────────────────────
+ □ Write README (env description, observation/action spaces, setup)
+ □ Run full baseline one more time, record scores
+ □ Submit HF Space URL before April 8 11:59 PM IST
+ ```
+
+ ---
+
+ ## 14. Pre-Submission Checklist (from Official Spec)
+
+ ```
+ □ HF Space deploys and returns 200 on automated ping
+ □ reset() responds correctly
+ □ openenv validate passes (openenv.yaml + typed models + step/reset/state)
+ □ docker build succeeds on submitted repo
+ □ inference.py runs without error and produces scores
+ □ 3 tasks with graders, all scores in 0.0–1.0
+ □ API_BASE_URL, MODEL_NAME, OPENAI_API_KEY env vars defined
+ □ Inference script is named exactly inference.py in root directory
+ □ All LLM calls use OpenAI client with those env vars
+ □ Runtime < 20 min on vcpu=2, 8GB RAM
+ ```
+
+ ---
+
+ ## 15. Key Design Decisions Summary (for context)
+
+ | Decision | Choice | Reason |
+ |---|---|---|
+ | Language | Python only | Fast sandboxing, clean IDoFT data, no JVM overhead |
+ | Dataset | IDoFT py-data.csv + category codes | Real repos, ground truth categories, PR-linked fixes |
+ | OD tests in T3 | Excluded | Cannot verify fix without multi-order test execution |
+ | OD tests in T1/T2 | Included | Static code analysis is a valid proxy |
+ | T2 grader | Similarity matrix | Some wrong answers are more wrong than others |
+ | T3 grader | Hybrid (pattern + diff + LLM judge) | Pure string match unfair; pure LLM judge non-deterministic |
+ | Reward shaping | Step-level progress rewards | Prevents sparse reward, rewards good investigative behavior |
+ | Max steps | 20 | Balances exploration depth vs infra time constraints |
+ | Progress reward cap | 0.30 | Terminal score (0.70 max) dominates; exploration is supporting signal |
graders/__init__.py ADDED
@@ -0,0 +1,17 @@
+ from __future__ import annotations
+
+ from env.models import FlakySleuthAction
+ from graders.task1_grader import grade as grade_t1
+ from graders.task2_grader import grade as grade_t2
+ from graders.task3_grader import grade as grade_t3
+
+
+ def grade_action(action: FlakySleuthAction, task: dict) -> float:
+     task_type = task.get("task_type")
+     if task_type == "classify":
+         return grade_t1(action, task)
+     if task_type == "root_cause":
+         return grade_t2(action, task)
+     if task_type == "fix_proposal":
+         return grade_t3(action, task)
+     return 0.0
graders/task1_grader.py ADDED
@@ -0,0 +1,16 @@
+ from __future__ import annotations
+
+ from env.models import FlakySleuthAction
+
+
+ def grade(action: FlakySleuthAction, task: dict) -> float:
+     """Binary classification: flaky or stable. Exact match only."""
+     if action.action_type != "classify_flakiness":
+         return 0.0
+
+     predicted = action.argument.strip().lower()
+     if predicted not in ("flaky", "stable"):
+         return 0.0
+
+     ground_truth = str(task.get("label", "flaky")).strip().lower() or "flaky"
+     return 1.0 if predicted == ground_truth else 0.0
graders/task2_grader.py ADDED
@@ -0,0 +1,59 @@
from __future__ import annotations

import json
from pathlib import Path

from env.models import FlakySleuthAction

_SIM_PATH = Path(__file__).resolve().parent.parent / "dataset" / "category_similarity.json"
with _SIM_PATH.open("r", encoding="utf-8") as handle:
    _RAW_SIM = json.load(handle)

_CANONICAL = {
    "OD": "OD",
    "OD-BRIT": "OD-Brit",
    "OD-VIC": "OD-Vic",
    "NIO": "NIO",
    "NOD": "NOD",
    "UD": "UD",
    "TD": "TD",
    "TZD": "TZD",
    "ID": "ID",
    "NDOI": "NDOI",
    "NDOD": "NDOD",
    "OSD": "OSD",
}


VALID_CATEGORIES = set(_CANONICAL.values())


def _normalize_category(value: str) -> str:
    text = value.strip().replace("_", "-").replace(" ", "-")
    upper = text.upper()
    return _CANONICAL.get(upper, "")


def _get_similarity(predicted: str, truth: str) -> float:
    if predicted == truth:
        return 1.0
    key_a = f"{predicted},{truth}"
    key_b = f"{truth},{predicted}"
    return float(_RAW_SIM.get(key_a, _RAW_SIM.get(key_b, 0.0)))


def grade(action: FlakySleuthAction, task: dict) -> float:
    """Root cause category classification with matrix-based partial credit."""
    if action.action_type != "classify_root_cause":
        return 0.0

    predicted = _normalize_category(action.argument)
    if predicted not in VALID_CATEGORIES:
        return 0.0

    raw_truth = str(task.get("category", "")).split(";")[0]
    truth = _normalize_category(raw_truth)
    if truth not in VALID_CATEGORIES:
        return 0.0

    return _get_similarity(predicted, truth)
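
Illustration of the normalization and lookup above; note the module loads `dataset/category_similarity.json` at import time, and the actual partial-credit values live in that file, not here:

```python
from graders.task2_grader import _get_similarity, _normalize_category

print(_normalize_category("od_brit"))         # -> "OD-Brit" (case and separators folded)
print(_get_similarity("OD-Brit", "OD-Brit"))  # -> 1.0, exact match short-circuits
print(_get_similarity("OD-Brit", "OD-Vic"))   # -> matrix value; 0.0 if the pair is absent
```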
graders/task3_grader.py ADDED
@@ -0,0 +1,161 @@
from __future__ import annotations

import json
import os
import subprocess
import tempfile
from pathlib import Path

from openai import OpenAI

from env.models import FlakySleuthAction

CATEGORY_DESCRIPTIONS = {
    "TD": "Time-Dependent: fails due to wall-clock time assumptions",
    "TZD": "Timezone-Dependent: fails across timezone settings",
    "NOD": "Non-Deterministic: fails due to randomness/non-determinism",
    "NIO": "Non-Idempotent-Outcome: passes first run, fails on repeated run",
    "ID": "Implementation-Dependent: fails due to runtime implementation details",
}

EXPECTED_FIX_PATTERNS = {
    "TD": ["freeze_time", "mock", "patch", "utcnow", "datetime", "monkeypatch"],
    "TZD": ["timezone", "utc", "pytz", "zoneinfo", "tzinfo", "UTC"],
    "NOD": ["seed", "mock", "patch", "deterministic", "sorted"],
    "NIO": ["setup", "teardown", "fixture", "yield", "cleanup", "autouse"],
    "ID": ["sorted(", "list(", "frozenset", "OrderedDict"],
}


def grade(action: FlakySleuthAction, task: dict) -> float:
    """Hybrid fixer grader: pattern + dry-run apply + LLM judge."""
    if action.action_type != "propose_fix":
        return 0.0

    proposed_fix = action.argument.strip()
    if not proposed_fix:
        return 0.0

    category = str(task.get("category", "")).split(";")[0].strip().upper()
    known_fix = task.get("known_fix_diff", "") or ""
    test_code = task.get("test_code", "") or ""

    patterns = EXPECTED_FIX_PATTERNS.get(category, [])
    if patterns:
        matches = sum(
            1 for pattern in patterns if pattern.lower() in proposed_fix.lower()
        )
        pattern_score = min(1.0, matches / max(1, len(patterns) * 0.4))
    else:
        pattern_score = 0.5

    apply_score = _check_diff_applies(proposed_fix, task)
    judge_score = _llm_judge(proposed_fix, known_fix, category, test_code)

    total = (0.35 * pattern_score) + (0.25 * apply_score) + (0.40 * judge_score)
    return round(min(1.0, max(0.0, total)), 4)


def _check_diff_applies(diff_text: str, task: dict) -> float:
    if "+++" not in diff_text or "---" not in diff_text:
        return 0.0

    repo_root = str(task.get("sandbox_root", "")).strip()
    if not repo_root or not Path(repo_root).exists():
        return 0.3

    patch_path = None
    try:
        with tempfile.NamedTemporaryFile(
            mode="w", suffix=".patch", delete=False
        ) as handle:
            handle.write(diff_text)
            patch_path = handle.name

        result = subprocess.run(
            ["patch", "--dry-run", "-p1", "-i", patch_path],
            cwd=repo_root,
            capture_output=True,
            text=True,
            timeout=10,
        )
        return 1.0 if result.returncode == 0 else 0.0
    except Exception:
        return 0.3
    finally:
        if patch_path and os.path.exists(patch_path):
            os.unlink(patch_path)


def _llm_judge(proposed: str, known: str, category: str, test_code: str) -> float:
    openrouter_key = os.environ.get("OPENROUTER_API_KEY")
    openai_key = os.environ.get("OPENAI_API_KEY")
    raw_api_key = os.environ.get("API_KEY")
    api_key = (raw_api_key or openrouter_key or openai_key or "").strip()
    if not api_key:
        return 0.5

    using_openrouter = (openrouter_key and not raw_api_key and not openai_key) or (
        raw_api_key and raw_api_key.startswith("sk-or-") and not openai_key
    )

    default_base_url = (
        "https://openrouter.ai/api/v1"
        if using_openrouter
        else "https://api.openai.com/v1"
    )
    api_base_url = os.environ.get("API_BASE_URL", default_base_url)
    client = OpenAI(api_key=api_key, base_url=api_base_url)
    model = os.environ.get(
        "MODEL_NAME",
        "qwen/qwen3.6-plus:free"
        if api_base_url.startswith("https://openrouter.ai")
        else "gpt-4o-mini",
    )

    cat_desc = CATEGORY_DESCRIPTIONS.get(category, f"Flakiness category: {category}")
    if known:
        known_section = f"Known accepted fix (from merged PR):\n```\n{known[:800]}\n```"
    else:
        known_section = "Known fix: Not available"

    prompt = f"""You are evaluating a proposed fix for a flaky Python test.

Flakiness category: {category}
What this means: {cat_desc}

Original flaky test code:
```python
{test_code[:1000]}
```

Proposed fix (unified diff):
```
{proposed[:1000]}
```

{known_section}

Score the proposed fix from 0 to 10:
- 0-2: Fix is wrong, irrelevant, or harmful
- 3-5: Fix partially addresses the issue but misses root cause
- 6-8: Fix addresses root cause with minor issues
- 9-10: Fix is correct, minimal, and complete

Respond ONLY with JSON:
{{"score": <integer 0-10>, "reason": "<one sentence>"}}"""

    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=120,
            temperature=0.0,
        )
        raw = (response.choices[0].message.content or "").strip()
        raw = raw.replace("```json", "").replace("```", "").strip()
        payload = json.loads(raw)
        score = int(payload.get("score", 5))
        return max(0.0, min(10.0, score)) / 10.0
    except Exception:
        return 0.5
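
A worked example of the hybrid weighting in `grade` above (the component scores are illustrative):

```python
# 35% pattern heuristics, 25% dry-run applicability, 40% LLM judge:
pattern_score, apply_score, judge_score = 0.5, 1.0, 0.8
total = (0.35 * pattern_score) + (0.25 * apply_score) + (0.40 * judge_score)
print(round(min(1.0, max(0.0, total)), 4))  # -> 0.745; no single signal dominates
```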
inference.py ADDED
@@ -0,0 +1,298 @@
"""FlakySleuth compliance inference script."""

from __future__ import annotations

import argparse
import json
import os
from typing import Any

from openai import OpenAI

from env.environment import FlakySleuthEnv
from env.models import FlakySleuthAction, FlakySleuthObservation

HF_TOKEN = os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_HUB_TOKEN")
OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY")
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
RAW_API_KEY = os.environ.get("API_KEY")
API_KEY = RAW_API_KEY or HF_TOKEN or OPENROUTER_API_KEY or OPENAI_API_KEY or ""


def _looks_like_openrouter_key(key: str | None) -> bool:
    return bool(key and key.startswith("sk-or-"))


DEFAULT_BASE_URL = (
    "https://router.huggingface.co/v1"
    if (HF_TOKEN and not RAW_API_KEY and not OPENROUTER_API_KEY and not OPENAI_API_KEY)
    else (
        "https://openrouter.ai/api/v1"
        if (
            (OPENROUTER_API_KEY and not RAW_API_KEY and not OPENAI_API_KEY)
            or (_looks_like_openrouter_key(RAW_API_KEY) and not OPENAI_API_KEY)
        )
        else "https://api.openai.com/v1"
    )
)
API_BASE_URL = os.environ.get("API_BASE_URL", DEFAULT_BASE_URL)

DEFAULT_MODEL = (
    "openai/gpt-oss-120b:novita"
    if API_BASE_URL.startswith("https://router.huggingface.co")
    else ("qwen/qwen3.6-plus:free" if API_BASE_URL.startswith("https://openrouter.ai") else "gpt-4o-mini")
)
MODEL_NAME = os.environ.get("MODEL_NAME", DEFAULT_MODEL)

EPISODES_PER_TASK = 5
MAX_STEPS = 20
BENCHMARK_NAME = "flakysleuth"

client = OpenAI(api_key=API_KEY, base_url=API_BASE_URL)

SYSTEM_PROMPT = """You are a flaky test detective.

Respond ONLY with a single valid JSON object.

Exploration actions:
{"action_type": "read_file", "argument": "relative/path.py"}
{"action_type": "search_code", "argument": "pattern"}
{"action_type": "run_test", "argument": ""}

Terminal actions:
{"action_type": "classify_flakiness", "argument": "flaky"}
{"action_type": "classify_flakiness", "argument": "stable"}
{"action_type": "classify_root_cause", "argument": "OD"}
{"action_type": "classify_root_cause", "argument": "OD-Brit"}
{"action_type": "classify_root_cause", "argument": "OD-Vic"}
{"action_type": "classify_root_cause", "argument": "NIO"}
{"action_type": "classify_root_cause", "argument": "NOD"}
{"action_type": "classify_root_cause", "argument": "TD"}
{"action_type": "classify_root_cause", "argument": "TZD"}
{"action_type": "classify_root_cause", "argument": "ID"}
{"action_type": "propose_fix", "argument": "--- a/file.py\\n+++ b/file.py\\n@@ ... @@\\n-old\\n+new"}

Rules:
1. Read the test file first.
2. Search for flaky signals: random, time, sleep, shared state, env vars.
3. Run the test for non-order-dependent scenarios.
4. Call one terminal action when confident.
"""


def _single_line(text: str) -> str:
    return " ".join(str(text).split())


def log_start(task: str, env_name: str, model: str) -> None:
    print(f"[START] task={task} env={env_name} model={model}", flush=True)


def log_step(step: int, action: str, reward: float, done: bool, error: str | None) -> None:
    error_value = _single_line(error) if error else "null"
    done_value = str(bool(done)).lower()
    print(
        f"[STEP] step={step} action={_single_line(action)} "
        f"reward={reward:.2f} done={done_value} error={error_value}",
        flush=True,
    )


def log_end(success: bool, steps: int, score: float, rewards: list[float]) -> None:
    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
    print(
        f"[END] success={str(bool(success)).lower()} steps={steps} "
        f"score={score:.2f} rewards={rewards_str}",
        flush=True,
    )


def obs_to_prompt(obs: FlakySleuthObservation, max_steps: int) -> str:
    tree_preview = "\n".join(obs.file_tree[:40])
    return f"""TASK: {obs.task_description}

Repository: {obs.repo_url}
Test name: {obs.test_name}
Step: {obs.step_count}/{max_steps}

Test source code:
```python
{obs.test_code}
```

Repository file tree:
{tree_preview}

Last tool output:
{obs.tool_output or '(No action taken yet)'}

Return only JSON action."""


def heuristic_action(obs: FlakySleuthObservation) -> FlakySleuthAction:
    if obs.step_count == 0 and obs.file_tree:
        return FlakySleuthAction(action_type="read_file", argument=obs.file_tree[0])

    if obs.step_count < 2:
        return FlakySleuthAction(action_type="search_code", argument="random")

    if obs.task_type == "classify":
        return FlakySleuthAction(action_type="classify_flakiness", argument="flaky")
    if obs.task_type == "root_cause":
        return FlakySleuthAction(action_type="classify_root_cause", argument="NOD")
    return FlakySleuthAction(
        action_type="propose_fix",
        argument=(
            "--- a/src/math_utils.py\n"
            "+++ b/src/math_utils.py\n"
            "@@\n"
            "-def unstable_sum(values):\n"
            "-    random.shuffle(values)\n"
            "-    return values[0] + values[1]\n"
            "+def unstable_sum(values):\n"
            "+    ordered = sorted(values)\n"
            "+    return ordered[0] + ordered[1]\n"
        ),
    )


def llm_action(messages: list[dict[str, str]]) -> FlakySleuthAction | None:
    if not API_KEY:
        return None

    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=messages,
        max_tokens=400,
        temperature=0.0,
    )
    raw = (response.choices[0].message.content or "").strip()
    cleaned = raw.replace("```json", "").replace("```", "").strip()
    payload = json.loads(cleaned)
    return FlakySleuthAction.model_validate(payload)


def run_episode(
    env: FlakySleuthEnv,
    *,
    task_name: str,
    benchmark_name: str,
    max_steps: int,
) -> float:
    rewards: list[float] = []
    steps_taken = 0
    score = 0.0
    success = False

    log_start(task=task_name, env_name=benchmark_name, model=MODEL_NAME)

    try:
        obs = env.reset()
        messages: list[dict[str, str]] = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": obs_to_prompt(obs, max_steps)},
        ]

        for step_idx in range(1, max_steps + 1):
            try:
                action = llm_action(messages) or heuristic_action(obs)
            except Exception:
                action = heuristic_action(obs)

            obs, reward, done, info = env.step(action)
            rewards.append(float(reward or 0.0))
            steps_taken = step_idx

            step_error: str | None = None
            if isinstance(info, dict):
                last_action_error = info.get("last_action_error")
                if last_action_error:
                    step_error = str(last_action_error)

            log_step(
                step=step_idx,
                action=action.model_dump_json(),
                reward=float(reward or 0.0),
                done=bool(done),
                error=step_error,
            )

            if done:
                score = float(reward or 0.0)
                break

            messages.append({"role": "assistant", "content": action.model_dump_json()})
            messages.append({"role": "user", "content": obs_to_prompt(obs, max_steps)})

        score = min(max(score, 0.0), 1.0)
        success = score > 0.0
    except Exception:
        score = 0.0
        success = False
    finally:
        try:
            env.close()
        except Exception:
            pass
        log_end(success=success, steps=steps_taken, score=score, rewards=rewards)

    return score


def _parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run FlakySleuth compliance inference.")
    parser.add_argument(
        "--dataset-path",
        default="dataset/py_tasks.csv",
        help="Processed task CSV used by the environment.",
    )
    parser.add_argument(
        "--episodes-per-task",
        type=int,
        default=EPISODES_PER_TASK,
        help="Episodes per task type.",
    )
    parser.add_argument(
        "--task-types",
        default="classify,root_cause,fix_proposal",
        help="Comma-separated task types to run (classify,root_cause,fix_proposal).",
    )
    parser.add_argument(
        "--max-steps",
        type=int,
        default=MAX_STEPS,
        help="Max steps per episode.",
    )
    parser.add_argument(
        "--benchmark-name",
        default=BENCHMARK_NAME,
        help="Benchmark label for [START] lines.",
    )
    return parser.parse_args()


def main() -> None:
    args = _parse_args()
    env = FlakySleuthEnv(dataset_path=args.dataset_path, max_steps=args.max_steps)

    allowed_task_types = {"classify", "root_cause", "fix_proposal"}
    task_types = [t.strip() for t in args.task_types.split(",") if t.strip()]
    if not task_types:
        return

    for task_type in task_types:
        if task_type not in allowed_task_types:
            continue
        env.loader.force_task_type(task_type)
        for _ in range(args.episodes_per_task):
            run_episode(
                env,
                task_name=task_type,
                benchmark_name=args.benchmark_name,
                max_steps=args.max_steps,
            )


if __name__ == "__main__":
    main()
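
With the flags defined in `_parse_args` above, a quick local run is `python inference.py --task-types classify --episodes-per-task 1` (assuming `dataset/py_tasks.csv` is in place).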
inference_compliance.py ADDED
@@ -0,0 +1,188 @@
"""
Inference Script Example
===================================
MANDATORY
- Before submitting, ensure the following variables are defined in your environment configuration:
    API_BASE_URL        The API endpoint for the LLM.
    MODEL_NAME          The model identifier to use for inference.
    HF_TOKEN            Your Hugging Face / API key.
    LOCAL_IMAGE_NAME    The name of the local image to use for the environment if you are using
                        the from_docker_image() method.

- Defaults are set only for API_BASE_URL and MODEL_NAME
  (and should reflect your active inference setup):
    API_BASE_URL = os.getenv("API_BASE_URL", "<your-active-endpoint>")
    MODEL_NAME = os.getenv("MODEL_NAME", "<your-active-model>")

- The inference script must be named `inference.py` and placed in the root directory of the project.
- Participants must use the OpenAI client for all LLM calls, using the variables above.

STDOUT FORMAT
- The script must emit exactly three line types to stdout, in this order:

    [START] task=<task_name> env=<benchmark> model=<model_name>
    [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
    [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>

Rules:
- One [START] line at episode begin.
- One [STEP] line per step, immediately after env.step() returns.
- One [END] line after env.close(), always emitted (even on exception).
- reward and rewards are formatted to 2 decimal places.
- done and success are lowercase booleans: true or false.
- error is the raw last_action_error string, or null if none.
- All fields on a single line with no newlines within a line.
- Each task should return a score in [0, 1].

Example:
    [START] task=click-test env=miniwob model=Qwen3-VL-30B
    [STEP] step=1 action=click('123') reward=0.00 done=false error=null
    [STEP] step=2 action=fill('456','text') reward=0.00 done=false error=null
    [STEP] step=3 action=click('789') reward=1.00 done=true error=null
    [END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00
"""

import asyncio
import os
import textwrap
from typing import List, Optional

from openai import OpenAI

from my_env_v4 import MyEnvV4Action, MyEnvV4Env

# Docker image for from_docker_image(); the spec above calls this LOCAL_IMAGE_NAME.
IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME") or os.getenv("IMAGE_NAME")
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")

API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
TASK_NAME = os.getenv("MY_ENV_V4_TASK", "echo")
BENCHMARK = os.getenv("MY_ENV_V4_BENCHMARK", "my_env_v4")
MAX_STEPS = 8
TEMPERATURE = 0.7
MAX_TOKENS = 150
SUCCESS_SCORE_THRESHOLD = 0.1  # normalized score in [0, 1]

# Approximate max reward: reward = len(message) * 0.1, with MAX_TOKENS bounding message length.
_MAX_REWARD_PER_STEP = MAX_TOKENS * 0.1
MAX_TOTAL_REWARD = MAX_STEPS * _MAX_REWARD_PER_STEP

SYSTEM_PROMPT = textwrap.dedent(
    """
    You are interacting with a simple echo environment.
    Each turn you must send a message. The environment will echo it back.
    Reward is proportional to message length: reward = len(message) * 0.1
    Your goal is to maximize total reward by sending meaningful, substantive messages.
    Reply with exactly one message string — no quotes, no prefixes, just the message text.
    """
).strip()


def log_start(task: str, env: str, model: str) -> None:
    print(f"[START] task={task} env={env} model={model}", flush=True)


def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
    error_val = error if error else "null"
    done_val = str(done).lower()
    print(
        f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
        flush=True,
    )


def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
    print(f"[END] success={str(success).lower()} steps={steps} score={score:.2f} rewards={rewards_str}", flush=True)


def build_user_prompt(step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
    history_block = "\n".join(history[-4:]) if history else "None"
    return textwrap.dedent(
        f"""
        Step: {step}
        Last echoed message: {last_echoed!r}
        Last reward: {last_reward:.2f}
        Previous steps:
        {history_block}
        Send your next message.
        """
    ).strip()


def get_model_message(client: OpenAI, step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
    user_prompt = build_user_prompt(step, last_echoed, last_reward, history)
    try:
        completion = client.chat.completions.create(
            model=MODEL_NAME,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_prompt},
            ],
            temperature=TEMPERATURE,
            max_tokens=MAX_TOKENS,
            stream=False,
        )
        text = (completion.choices[0].message.content or "").strip()
        return text if text else "hello"
    except Exception as exc:
        print(f"[DEBUG] Model request failed: {exc}", flush=True)
        return "hello"


async def main() -> None:
    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)

    env = await MyEnvV4Env.from_docker_image(IMAGE_NAME)

    history: List[str] = []
    rewards: List[float] = []
    steps_taken = 0
    score = 0.0
    success = False

    log_start(task=TASK_NAME, env=BENCHMARK, model=MODEL_NAME)

    try:
        result = await env.reset()  # OpenEnv reset()
        last_echoed = result.observation.echoed_message
        last_reward = 0.0

        for step in range(1, MAX_STEPS + 1):
            if result.done:
                break

            message = get_model_message(client, step, last_echoed, last_reward, history)

            result = await env.step(MyEnvV4Action(message=message))
            obs = result.observation

            reward = result.reward or 0.0
            done = result.done
            error = None

            rewards.append(reward)
            steps_taken = step
            last_echoed = obs.echoed_message
            last_reward = reward

            log_step(step=step, action=message, reward=reward, done=done, error=error)

            history.append(f"Step {step}: {message!r} -> reward {reward:+.2f}")

            if done:
                break

        score = sum(rewards) / MAX_TOTAL_REWARD if MAX_TOTAL_REWARD > 0 else 0.0
        score = min(max(score, 0.0), 1.0)  # clamp to [0, 1]
        success = score >= SUCCESS_SCORE_THRESHOLD

    finally:
        try:
            await env.close()
        except Exception as e:
            print(f"[DEBUG] env.close() error (container cleanup): {e}", flush=True)
        log_end(success=success, steps=steps_taken, score=score, rewards=rewards)


if __name__ == "__main__":
    asyncio.run(main())
inference_debug.py ADDED
@@ -0,0 +1,606 @@
"""FlakySleuth baseline inference script.

Environment variables:
    Preferred:
        HF_TOKEN / HUGGINGFACE_HUB_TOKEN (or OPENROUTER_API_KEY / API_KEY)
        API_BASE_URL (optional, defaults to https://openrouter.ai/api/v1 for router-style keys)
        MODEL_NAME (optional, defaults to qwen/qwen3.6-plus:free on OpenRouter)

    Optional fallback:
        OPENAI_API_KEY
        API_BASE_URL (defaults to https://api.openai.com/v1 when OpenAI key is used)
        MODEL_NAME (defaults to gpt-4o-mini for OpenAI)
"""

from __future__ import annotations

import json
import os
import argparse
import time
from collections import defaultdict
from pathlib import Path
from typing import Any

from openai import OpenAI

try:
    from tqdm import tqdm
except Exception:  # pragma: no cover
    tqdm = None

from env.environment import FlakySleuthEnv
from env.models import FlakySleuthAction, FlakySleuthObservation

OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY")
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
HF_TOKEN = os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_HUB_TOKEN")
RAW_API_KEY = os.environ.get("API_KEY")
API_KEY = RAW_API_KEY or OPENROUTER_API_KEY or OPENAI_API_KEY or HF_TOKEN or ""


def _looks_like_openrouter_key(key: str | None) -> bool:
    return bool(key and key.startswith("sk-or-"))


DEFAULT_BASE_URL = (
    "https://router.huggingface.co/v1"
    if (
        HF_TOKEN
        and not RAW_API_KEY
        and not OPENROUTER_API_KEY
        and not OPENAI_API_KEY
    )
    else (
        "https://openrouter.ai/api/v1"
        if (
            (OPENROUTER_API_KEY and not RAW_API_KEY and not OPENAI_API_KEY)
            or (_looks_like_openrouter_key(RAW_API_KEY) and not OPENAI_API_KEY)
        )
        else "https://api.openai.com/v1"
    )
)
API_BASE_URL = os.environ.get("API_BASE_URL", DEFAULT_BASE_URL)

DEFAULT_MODEL = (
    "openai/gpt-oss-120b:novita"
    if API_BASE_URL.startswith("https://router.huggingface.co")
    else (
        "qwen/qwen3.6-plus:free"
        if API_BASE_URL.startswith("https://openrouter.ai")
        else "gpt-4o-mini"
    )
)
MODEL_NAME = os.environ.get("MODEL_NAME", DEFAULT_MODEL)
# Keep a conservative default to stay under common hackathon runtime limits.
EPISODES_PER_TASK = 2
MAX_STEPS = 20

client = OpenAI(api_key=API_KEY, base_url=API_BASE_URL)

SYSTEM_PROMPT = """You are a flaky test detective.

Respond ONLY with a single valid JSON object.

Exploration actions:
{"action_type": "read_file", "argument": "relative/path.py"}
{"action_type": "search_code", "argument": "pattern"}
{"action_type": "run_test", "argument": ""}

Terminal actions:
{"action_type": "classify_flakiness", "argument": "flaky"}
{"action_type": "classify_flakiness", "argument": "stable"}
{"action_type": "classify_root_cause", "argument": "OD"}
{"action_type": "classify_root_cause", "argument": "OD-Brit"}
{"action_type": "classify_root_cause", "argument": "OD-Vic"}
{"action_type": "classify_root_cause", "argument": "NIO"}
{"action_type": "classify_root_cause", "argument": "NOD"}
{"action_type": "classify_root_cause", "argument": "TD"}
{"action_type": "classify_root_cause", "argument": "TZD"}
{"action_type": "classify_root_cause", "argument": "ID"}
{"action_type": "propose_fix", "argument": "--- a/file.py\\n+++ b/file.py\\n@@ ... @@\\n-old\\n+new"}

Rules:
1. Read the test file first.
2. Search for flaky signals: random, time, sleep, shared state, env vars.
3. Run the test for non-order-dependent scenarios.
4. Call one terminal action when confident.
"""


def _to_single_line(text: str) -> str:
    return " ".join(str(text).split())


def _compliance_log_start(task: str, benchmark: str, model: str) -> None:
    print(f"[START] task={task} env={benchmark} model={model}", flush=True)


def _compliance_log_step(
    step: int,
    action: str,
    reward: float,
    done: bool,
    error: str | None,
) -> None:
    error_value = _to_single_line(error) if error else "null"
    print(
        f"[STEP] step={step} action={_to_single_line(action)} "
        f"reward={reward:.2f} done={str(bool(done)).lower()} error={error_value}",
        flush=True,
    )


def _compliance_log_end(success: bool, steps: int, score: float, rewards: list[float]) -> None:
    rewards_value = ",".join(f"{r:.2f}" for r in rewards)
    print(
        f"[END] success={str(bool(success)).lower()} steps={steps} "
        f"score={score:.2f} rewards={rewards_value}",
        flush=True,
    )


def obs_to_prompt(obs: FlakySleuthObservation) -> str:
    tree_preview = "\n".join(obs.file_tree[:40])
    return f"""TASK: {obs.task_description}

Repository: {obs.repo_url}
Test name: {obs.test_name}
Step: {obs.step_count}/{MAX_STEPS}

Test source code:
```python
{obs.test_code}
```

Repository file tree:
{tree_preview}

Last tool output:
{obs.tool_output or "(No action taken yet)"}

Return only JSON action."""


def heuristic_action(obs: FlakySleuthObservation) -> FlakySleuthAction:
    if obs.step_count == 0 and obs.file_tree:
        return FlakySleuthAction(action_type="read_file", argument=obs.file_tree[0])

    if obs.step_count < 2:
        return FlakySleuthAction(action_type="search_code", argument="random")

    if obs.task_type == "classify":
        return FlakySleuthAction(action_type="classify_flakiness", argument="flaky")
    if obs.task_type == "root_cause":
        return FlakySleuthAction(action_type="classify_root_cause", argument="NOD")
    return FlakySleuthAction(
        action_type="propose_fix",
        argument=(
            "--- a/src/math_utils.py\n"
            "+++ b/src/math_utils.py\n"
            "@@\n"
            "-def unstable_sum(values):\n"
            "-    random.shuffle(values)\n"
            "-    return values[0] + values[1]\n"
            "+def unstable_sum(values):\n"
            "+    ordered = sorted(values)\n"
            "+    return ordered[0] + ordered[1]\n"
        ),
    )


def llm_action(
    messages: list[dict[str, str]],
) -> tuple[FlakySleuthAction | None, dict[str, Any]]:
    meta: dict[str, Any] = {
        "attempted": False,
        "raw_output": "",
        "error": "",
    }
    if not API_KEY:
        return None, meta

    meta["attempted"] = True
    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=messages,
        max_tokens=400,
        temperature=0.0,
    )
    raw = (response.choices[0].message.content or "").strip()
    meta["raw_output"] = raw
    cleaned = raw.replace("```json", "").replace("```", "").strip()
    payload = json.loads(cleaned)
    return FlakySleuthAction.model_validate(payload), meta


def _clip_text(text: str, max_chars: int) -> str:
    if max_chars <= 0:
        return text
    if len(text) <= max_chars:
        return text
    remaining = len(text) - max_chars
    return f"{text[:max_chars]}\n...[truncated {remaining} chars]"


def _trace_print(
    enabled: bool,
    message: str,
    *,
    text: str | None = None,
    max_chars: int = 0,
) -> None:
    if not enabled:
        return
    print(message)
    if text is not None:
        print(_clip_text(text, max_chars))


def _format_duration(seconds: float) -> str:
    seconds = max(0.0, float(seconds))
    mins, secs = divmod(int(round(seconds)), 60)
    hrs, mins = divmod(mins, 60)
    if hrs > 0:
        return f"{hrs:d}h {mins:02d}m {secs:02d}s"
    return f"{mins:02d}m {secs:02d}s"


def run_episode(
    env: FlakySleuthEnv,
    *,
    print_terminal: bool = True,
    trace_agent: bool = False,
    trace_prompts: bool = False,
    trace_max_chars: int = 2000,
    episode_label: str = "",
    compliance_stdout: bool = False,
    benchmark_name: str = "flakysleuth",
    compliance_task_name: str | None = None,
) -> tuple[float, dict[str, Any]]:
    rewards: list[float] = []
    steps_taken = 0
    success = False
    episode_task_name = (compliance_task_name or episode_label.split(" ", 1)[0].strip() or "unknown")
    exploration_reward_total = 0.0
    final_episode_score = 0.0
    terminal_meta: dict[str, Any] = {}
    if compliance_stdout:
        _compliance_log_start(episode_task_name, benchmark_name, MODEL_NAME)
    try:
        obs = env.reset()

        initial_prompt = obs_to_prompt(obs)
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": initial_prompt},
        ]

        if not compliance_stdout:
            _trace_print(
                trace_agent,
                (
                    f"\n[trace] {episode_label} "
                    f"task={obs.task_type} repo={obs.repo_url} test={obs.test_name}"
                ).strip(),
            )
        if trace_prompts and not compliance_stdout:
            _trace_print(
                trace_agent,
                "[trace] system prompt:",
                text=SYSTEM_PROMPT,
                max_chars=trace_max_chars,
            )
            _trace_print(
                trace_agent,
                "[trace] initial user prompt:",
                text=initial_prompt,
                max_chars=trace_max_chars,
            )

        for step_idx in range(MAX_STEPS):
            action: FlakySleuthAction
            action_source = "heuristic"
            llm_meta: dict[str, Any] = {"attempted": False, "raw_output": "", "error": ""}
            try:
                candidate, llm_meta = llm_action(messages)
                if candidate is not None:
                    action = candidate
                    action_source = "llm"
                else:
                    action = heuristic_action(obs)
                    if llm_meta.get("attempted"):
                        llm_meta["error"] = (
                            "Model response unavailable, using heuristic fallback."
                        )
            except Exception as exc:
                llm_meta["error"] = str(exc)
                action = heuristic_action(obs)

            if trace_agent and not compliance_stdout:
                print(f"[trace] step={step_idx + 1} action_source={action_source}")
                if llm_meta.get("attempted"):
                    _trace_print(
                        True,
                        "[trace] raw model output:",
                        text=str(llm_meta.get("raw_output", "")),
                        max_chars=trace_max_chars,
                    )
                if llm_meta.get("error"):
                    print(f"[trace] llm_error={llm_meta['error']}")
                print(f"[trace] action={action.model_dump_json()}")

            obs, reward, done, info = env.step(action)
            rewards.append(reward)
            steps_taken = step_idx + 1

            step_error: str | None = None
            if isinstance(info, dict):
                raw_err = info.get("last_action_error")
                if raw_err:
                    step_error = str(raw_err)
            if not step_error and obs.tool_output and str(obs.tool_output).startswith("ERROR:"):
                step_error = str(obs.tool_output)

            if compliance_stdout:
                _compliance_log_step(
                    step=steps_taken,
                    action=action.model_dump_json(),
                    reward=reward,
                    done=done,
                    error=step_error,
                )

            if trace_agent and not compliance_stdout:
                print(
                    f"[trace] step_result reward={reward:.3f} done={done} "
                    f"step_count={obs.step_count}"
                )
                if obs.tool_output:
                    _trace_print(
                        True,
                        "[trace] tool_output:",
                        text=obs.tool_output,
                        max_chars=trace_max_chars,
                    )

            if done:
                # Terminal reward already includes cumulative progress + terminal score.
                final_episode_score = reward
                terminal_meta = {
                    "action_type": action.action_type,
                    "terminal_score": float(info.get("terminal_score", 0) or 0),
                    "progress_score": float(info.get("progress_score", 0) or 0),
                    "explore_sum": exploration_reward_total,
                    "episode_score": final_episode_score,
                }
                success = final_episode_score > 0.0
                if print_terminal:
                    print(
                        f"  Terminal: {action.action_type}({action.argument[:40]}) "
                        f"-> terminal={info.get('terminal_score', 0):.2f} "
                        f"progress={info.get('progress_score', 0):.2f} "
                        f"explore_sum={exploration_reward_total:.3f} "
                        f"episode_score={final_episode_score:.3f}"
                    )
                break

            exploration_reward_total += reward
            messages.append({"role": "assistant", "content": action.model_dump_json()})
            next_prompt = obs_to_prompt(obs)
            messages.append({"role": "user", "content": next_prompt})
            if trace_agent and trace_prompts and not compliance_stdout:
                _trace_print(
                    True,
                    f"[trace] next user prompt (step={step_idx + 1}):",
                    text=next_prompt,
                    max_chars=trace_max_chars,
                )
    except Exception as exc:
        terminal_meta["error"] = str(exc)
        success = False
        if not compliance_stdout:
            raise
    finally:
        if compliance_stdout:
            try:
                env.close()
            except Exception:
                pass
            _compliance_log_end(
                success=success,
                steps=steps_taken,
                score=min(max(final_episode_score, 0.0), 1.0),
                rewards=rewards,
            )

    return final_episode_score, terminal_meta


def _looks_like_placeholder_dataset(dataset_path: str) -> bool:
    path = Path(dataset_path)
    if not path.exists():
        return False
    try:
        text = path.read_text(encoding="utf-8", errors="replace")
    except Exception:
        return False
    return "fixture://" in text


def _parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run FlakySleuth baseline inference.")
    parser.add_argument(
        "--dataset-path",
        default="dataset/py_tasks.csv",
        help="Processed task CSV used by the environment.",
    )
    parser.add_argument(
        "--episodes-per-task",
        type=int,
        default=EPISODES_PER_TASK,
        help="Episodes per task type.",
    )
    parser.add_argument(
        "--task-types",
        default="classify,root_cause,fix_proposal",
        help="Comma-separated task types to run (classify,root_cause,fix_proposal).",
    )
    parser.add_argument(
        "--no-progress",
        action="store_true",
        help="Disable progress bars and print classic per-episode logs.",
    )
    parser.add_argument(
        "--trace-agent",
        action="store_true",
        help=(
            "Print detailed agent trace: model output, chosen action/tool call, and "
            "step results for every episode."
        ),
    )
    parser.add_argument(
        "--trace-prompts",
        action="store_true",
        help="When tracing, also print full prompts sent to the model.",
    )
    parser.add_argument(
        "--trace-max-chars",
        type=int,
        default=2500,
        help="Max chars per traced text block (prompt/model output/tool output).",
    )
    parser.add_argument(
        "--compliance-stdout",
        action="store_true",
        help=(
            "Emit strict compliance logs to stdout using only [START]/[STEP]/[END] lines "
            "for each episode."
        ),
    )
    parser.add_argument(
        "--benchmark-name",
        default="flakysleuth",
        help="Benchmark name used in [START] lines when --compliance-stdout is enabled.",
    )
    return parser.parse_args()


def main() -> None:
    run_start = time.perf_counter()
    args = _parse_args()
    env = FlakySleuthEnv(dataset_path=args.dataset_path)
    allowed_task_types = {"classify", "root_cause", "fix_proposal"}
    task_types = [t.strip() for t in args.task_types.split(",") if t.strip()]
    invalid = [t for t in task_types if t not in allowed_task_types]
    if invalid:
        raise ValueError(
            f"Invalid task type(s): {invalid}. "
            "Valid values: classify,root_cause,fix_proposal."
        )
    if not task_types:
        raise ValueError(
            "No task types selected. Pass --task-types with at least one value."
        )
    results: dict[str, list[float]] = defaultdict(list)

    if _looks_like_placeholder_dataset(args.dataset_path) and not args.compliance_stdout:
        print(
            "[warning] dataset appears to contain fixture rows (fixture://...). "
            "Build real dataset from py-data.csv for real evaluation."
        )

    use_progress = (tqdm is not None) and (not args.no_progress) and (not args.compliance_stdout)
    if args.trace_agent and use_progress and not args.compliance_stdout:
        print(
            "[info] --trace-agent enabled, disabling progress bars for readable trace logs."
        )
        use_progress = False
    overall_bar = None
    if use_progress:
        overall_bar = tqdm(
            total=len(task_types) * args.episodes_per_task,
            desc="All tasks",
            unit="ep",
            dynamic_ncols=True,
        )

    for task_type in task_types:
        task_start = time.perf_counter()
        if not args.compliance_stdout:
            print(f"\n-- Task type: {task_type} --")
        env.loader.force_task_type(task_type)
        task_bar = None
        if use_progress:
            task_bar = tqdm(
                total=args.episodes_per_task,
                desc=f"{task_type}",
                unit="ep",
                leave=False,
                dynamic_ncols=True,
            )
        for episode in range(args.episodes_per_task):
            score, meta = run_episode(
                env,
                print_terminal=(not use_progress) and (not args.compliance_stdout),
                trace_agent=args.trace_agent,
                trace_prompts=args.trace_prompts,
                trace_max_chars=args.trace_max_chars,
                episode_label=f"{task_type} ep={episode + 1}/{args.episodes_per_task}",
                compliance_stdout=args.compliance_stdout,
                benchmark_name=args.benchmark_name,
                compliance_task_name=task_type,
            )
            results[task_type].append(score)
            if use_progress and task_bar is not None:
                task_bar.update(1)
                task_avg = sum(results[task_type]) / len(results[task_type])
                task_bar.set_postfix(
                    score=f"{score:.3f}",
                    avg=f"{task_avg:.3f}",
                    term=f"{meta.get('terminal_score', 0):.2f}",
                )
                if overall_bar is not None:
                    overall_bar.update(1)
                    all_scores = [s for values in results.values() for s in values]
                    overall_avg = sum(all_scores) / len(all_scores)
                    overall_bar.set_postfix(task=task_type, avg=f"{overall_avg:.3f}")
            elif not args.compliance_stdout:
                print(f"  Episode {episode + 1}: {score:.3f}")
        if task_bar is not None:
            task_bar.close()
        task_elapsed = time.perf_counter() - task_start
        if not args.compliance_stdout:
            avg_task = sum(results[task_type]) / max(1, len(results[task_type]))
            print(
                f"  [time] task={task_type} elapsed={_format_duration(task_elapsed)} "
                f"avg_ep={task_elapsed / max(1, args.episodes_per_task):.2f}s "
                f"avg_score={avg_task:.3f}"
            )

    if overall_bar is not None:
        overall_bar.close()

    if args.compliance_stdout:
        return

    total_elapsed = time.perf_counter() - run_start
    print("\n== BASELINE RESULTS ==")
    all_scores: list[float] = []
    for task_type in task_types:
        scores = results[task_type]
        avg = sum(scores) / len(scores)
        all_scores.extend(scores)
        print(f"  {task_type:12s} avg={avg:.3f} scores={[round(s, 3) for s in scores]}")

    overall = sum(all_scores) / len(all_scores)
    print(f"  {'OVERALL':12s} avg={overall:.3f}")
    print(
        f"  {'RUNTIME':12s} total={_format_duration(total_elapsed)} "
        f"episodes={len(all_scores)} "
        f"avg_ep={(total_elapsed / max(1, len(all_scores))):.2f}s"
    )


if __name__ == "__main__":
    main()
models.py ADDED
@@ -0,0 +1,3 @@
from env.models import FlakySleuthAction, FlakySleuthObservation, FlakySleuthReward

__all__ = ["FlakySleuthAction", "FlakySleuthObservation", "FlakySleuthReward"]
openenv.yaml ADDED
@@ -0,0 +1,37 @@
spec_version: 1
name: flaky_sleuth
type: space
runtime: fastapi
app: server.app:app
port: 8000

version: 0.1.0
description: >
  An RL environment where an LLM agent investigates flaky tests in Python repositories.
  The agent uses tool-like actions to read files, search code, and run tests, then submits
  a terminal verdict for classification, root-cause detection, or fix proposal.

action_type: FlakySleuthAction
observation_type: FlakySleuthObservation
reward_range: [0.0, 1.0]
episode_max_steps: 20
baseline_script: inference.py

tasks:
  - id: task1_classify
    name: Flaky vs Stable Classification
    difficulty: easy
    description: Classify the target test as flaky or stable.
  - id: task2_root_cause
    name: Root Cause Category Identification
    difficulty: medium
    description: Predict flaky-test root-cause category (OD, NOD, TD, TZD, NIO, ID, etc.).
  - id: task3_fix_proposal
    name: Fix Proposal
    difficulty: hard
    description: Propose a concrete fix as unified diff for a known flaky test.

infra:
  vcpu: 2
  memory_gb: 8
  max_inference_minutes: 20
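
Per the checklist in §14, `openenv validate` checks this manifest together with the typed models and the `step`/`reset`/`state` endpoints, so the `action_type` and `observation_type` entries above should line up with the classes in `env/models.py`.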
pyproject.toml ADDED
@@ -0,0 +1,34 @@
[build-system]
requires = ["setuptools>=45", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "openenv-flaky-sleuth"
version = "0.1.0"
description = "FlakySleuth OpenEnv environment for flaky test investigation"
requires-python = ">=3.10"
dependencies = [
    "openenv-core[core]>=0.2.3",
    "fastapi>=0.110.0",
    "uvicorn>=0.27.0",
    "pydantic>=2.0.0",
    "openai>=1.0.0",
    "pandas>=2.0.0",
    "pytest>=7.0.0",
    "pytest-timeout>=2.0.0",
    "requests>=2.31.0",
    "tqdm>=4.66.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0.0",
    "pytest-cov>=4.0.0",
]

[project.scripts]
server = "server.app:main"

[tool.setuptools]
include-package-data = true
packages = ["env", "graders", "server"]
requirements.txt ADDED
@@ -0,0 +1,10 @@
fastapi>=0.110.0
uvicorn>=0.27.0
pydantic>=2.0.0
openai>=1.0.0
pandas>=2.0.0
pytest>=7.0.0
pytest-timeout>=2.0.0
requests>=2.31.0
tqdm>=4.66.0
openenv-core[core]>=0.2.3
server.py ADDED
@@ -0,0 +1,8 @@
"""Compatibility entrypoint for running the API as `python server.py`."""

from server.app import app, main

__all__ = ["app", "main"]

if __name__ == "__main__":
    main()
server/__init__.py ADDED
@@ -0,0 +1,3 @@
from server.app import app

__all__ = ["app"]
server/app.py ADDED
@@ -0,0 +1,102 @@
from __future__ import annotations

from typing import Any

from fastapi import Body, FastAPI, HTTPException
from pydantic import BaseModel, ValidationError

from env.environment import FlakySleuthEnv
from env.models import FlakySleuthAction, FlakySleuthObservation

app = FastAPI(title="FlakySleuth Environment")
env = FlakySleuthEnv()


class FlakySleuthState(BaseModel):
    repo_url: str | None = None
    test_name: str | None = None
    task_type: str | None = None
    step_count: int
    files_read: list[str]
    cumulative_progress: float


@app.post("/reset")
def reset() -> dict[str, Any]:
    observation = env.reset()
    return {
        "observation": observation.model_dump(),
        "reward": None,
        "done": False,
    }


@app.post("/step")
def step(payload: dict[str, Any] = Body(...)) -> dict[str, Any]:
    """Accept either {'action': {...}} or direct action payload."""
    try:
        action_payload = payload.get("action", payload)
        action = FlakySleuthAction.model_validate(action_payload)
    except ValidationError as exc:
        raise HTTPException(status_code=422, detail=exc.errors()) from exc

    try:
        observation, reward, done, info = env.step(action)
    except RuntimeError as exc:
        raise HTTPException(status_code=400, detail=str(exc)) from exc

    return {
        "observation": observation.model_dump(),
        "reward": reward,
        "done": done,
        "info": info,
    }


@app.get("/state")
def state() -> dict[str, Any]:
    return env.state()


@app.get("/schema")
def schema() -> dict[str, Any]:
    return {
        "action": FlakySleuthAction.model_json_schema(),
        "observation": FlakySleuthObservation.model_json_schema(),
        "state": FlakySleuthState.model_json_schema(),
    }


@app.get("/health")
def health() -> dict[str, str]:
    return {"status": "healthy"}


@app.get("/metadata")
def metadata() -> dict[str, str]:
    return {
        "name": "FlakySleuth Environment",
        "description": (
            "RL environment for flaky-test investigation in Python repositories."
        ),
    }


@app.post("/mcp")
def mcp(payload: dict[str, Any] = Body(default_factory=dict)) -> dict[str, Any]:
    request_id = payload.get("id")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "result": {"status": "ok"},
    }


def main(host: str = "0.0.0.0", port: int = 8000) -> None:
    import uvicorn

    uvicorn.run(app, host=host, port=port)


if __name__ == "__main__":
    main()
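
A minimal client sketch against the FastAPI app above, assuming it is serving locally (e.g. via `python server.py`):

```python
import requests

base = "http://localhost:8000"
print(requests.get(f"{base}/health").json())  # {"status": "healthy"}

obs = requests.post(f"{base}/reset").json()["observation"]
payload = {"action": {"action_type": "search_code", "argument": "random"}}
step = requests.post(f"{base}/step", json=payload).json()
print(step["reward"], step["done"])
```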
tests/test_compliance.py ADDED
@@ -0,0 +1,18 @@
from env.environment import FlakySleuthEnv
from env.models import FlakySleuthAction


def test_reset_and_step_smoke():
    env = FlakySleuthEnv(dataset_path="dataset/py_tasks.csv")
    obs = env.reset()

    assert obs.test_name
    assert obs.task_type in {"classify", "root_cause", "fix_proposal"}

    action = FlakySleuthAction(action_type="search_code", argument="random")
    next_obs, reward, done, info = env.step(action)

    assert isinstance(next_obs.file_tree, list)
    assert isinstance(reward, float)
    assert isinstance(done, bool)
    assert isinstance(info, dict)
uv.lock ADDED
File without changes