Shashaank committed on
Commit 1c8aca2 · 1 Parent(s): e4f09cc

Fix: Revise README for improved clarity and detail

Updated README to enhance clarity and detail about the AgentDebuggerEnv, including its purpose, architecture, tasks, and installation instructions.

Files changed (1): README.md +431 -54

README.md CHANGED
@@ -1,98 +1,475 @@
  # AgentDebuggerEnv 🐛

- > **Benchmarking Agentic Reasoning through the Iterative Hypothesis-Test-Fix Loop.**

- **AgentDebuggerEnv** is an OpenEnv-compliant benchmarking environment designed for the **Meta + PyTorch + HuggingFace OpenEnv Hackathon**. Unlike static code-repair benchmarks that only measure the final output, AgentDebuggerEnv evaluates the *cognitive trajectory* of an agent: how it forms hypotheses, interprets execution failures, and iterates toward a solution in a secure, live sandbox.

  ---

- ## 🚀 The Core Philosophy

- Traditional benchmarks (like HumanEval or MBPP) are "one-shot": the model sees a prompt and writes code. Real-world engineering is **iterative**.

- AgentDebuggerEnv forces agents to operate in a **live feedback loop**:
- 1. **Observe**: Analyze existing buggy code and initial test failures.
- 2. **Hypothesize**: Explicitly state a theory about the root cause (scored for accuracy).
- 3. **Act**: Submit a surgical fix or query the environment for more context.
- 4. **Verify**: Observe real-time `stdout/stderr` from a sandboxed test suite execution.

  ---

- ## 🛠️ Technical Architecture

- ### 1. Robust Security Sandbox
- Every submission is executed in a multi-layered isolated environment:
- * **AST Filtering**: An Abstract Syntax Tree (AST) pass blocks dangerous imports (`os`, `sys`, `subprocess`, etc.) and builtins before execution.
- * **Process Isolation**: Executes in a separate subprocess with hard memory (256MB) and time (10s) limits.
- * **Thread Safety**: A specialized "Concurrency Sandbox" allows multi-threaded tests for identifying race conditions while maintaining host security.

- ### 2. High-Fidelity Feedback
- Instead of binary `Pass/Fail` bits, the environment returns the **raw execution stream**. This allows agents to:
- * Read stack traces.
- * See partial progress (e.g., "6 passed, 2 failed").
- * Detect timeouts and resource exhaustion.

  ---

- ## 📝 Task Suite & Reasoning Challenges

- | Task | Difficulty | Reasoning Challenge | Why it's hard |
- | :--- | :--- | :--- | :--- |
- | **Easy** | 🟢 Easy | **Off-by-One** | Requires basic logic verification. The error message is high-signal. |
- | **Medium** | 🟡 Medium | **Red Herring** | The symptom (MD5 hashing error) manifests far from the root cause. Agent must trace data flow backward. |
- | **Hard** | 🔴 Hard | **Race Condition** | **Invisible to sequential tests.** The agent must reason that passing tests do *not* mean the code is correct, and design a concurrent stress test. |

  ---

- ## 📊 Professional Grading Methodology

- Our graders don't just check if the code works at the end. They score the **process**:

- * **Sequential Correctness (40%)**: Does the fix pass the original unit tests?
- * **Hidden Strength (30%)**: Does the fix survive a high-concurrency (1000-thread) stress test? (Hard task only).
- * **Hypothesis Accuracy (20%)**: Did the agent correctly identify the bug? (NLP-based keyword matching against ground truth).
- * **Efficiency Bonus (10%)**: Did the agent solve it within 5 attempts?

  ---

- ## ⚙️ Installation & Usage

- ### 📦 Local Setup
- ```bash
- git clone https://huggingface.co/spaces/shashaank0707/AgentDebugger-env
- cd AgentDebugger-env
- pip install -e .
  ```

- ### 🚢 Running the Environment
- ```bash
- # Start the FastAPI server
- uvicorn env.server:app --host 0.0.0.0 --port 8000
  ```

- ### 🤖 Running an Agent (OpenEnv Baseline)
  ```bash
  export API_BASE_URL="https://api.openai.com/v1"
  export MODEL_NAME="gpt-4o"
- export HF_TOKEN="your_openai_key"
  export ENV_BASE_URL="http://localhost:8000"
  python inference.py
  ```

  ---

- ## 🔗 OpenEnv API Compliance

- AgentDebuggerEnv implements the full OpenEnv specification:

- * `POST /reset`: Initialize a task (`{"task_id": "medium"}`).
- * `POST /step`: Submit an `Action` (supports `submit_fix`, `query_context`, `give_up`).
- * `GET /state`: Retrieve full episode history and current environment state.
- * `GET /health`: Standard health check for automated uptime monitoring.

  ---

- ## 📜 Metadata & License
- * **License**: [MIT](LICENSE)
- * **Author**: shashaank
- * **Hackathon**: Meta + PyTorch + HuggingFace OpenEnv 2024
  # AgentDebuggerEnv 🐛

+ > **A live, iterative debugging environment for benchmarking agentic reasoning in AI systems.**
+ > Submitted to the **Meta + PyTorch + HuggingFace OpenEnv Hackathon**.

+ [![HuggingFace Space](https://img.shields.io/badge/🤗%20HuggingFace-Space%20Live-yellow)](https://huggingface.co/spaces/shashaank0707/AgentDebugger-env)
+ [![OpenEnv Compliant](https://img.shields.io/badge/OpenEnv-Compliant-blue)](https://huggingface.co/spaces/shashaank0707/AgentDebugger-env)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
+ [![Python 3.10+](https://img.shields.io/badge/Python-3.10%2B-blue)](https://www.python.org/)
+ [![FastAPI](https://img.shields.io/badge/FastAPI-0.110-009688)](https://fastapi.tiangolo.com/)

  ---

+ ## The Problem with Existing Code Benchmarks

+ Benchmarks like HumanEval, MBPP, and even SWE-bench share a fundamental limitation: they are **one-shot evaluations**. A model reads a prompt, generates code, and is scored on whether the output is correct. This measures code generation ability — not debugging ability.

+ Real software engineering is not one-shot. It is **iterative**. A developer:

+ 1. Reads failing tests and error output
+ 2. Forms a hypothesis about the root cause
+ 3. Submits a fix
+ 4. Reads the new error output
+ 5. Updates their hypothesis
+ 6. Repeats — sometimes many times

+ No existing benchmark measures this loop. **AgentDebuggerEnv does.**

  ---

+ ## What Makes This Different from SWE-bench

+ SWE-bench gives an agent a static GitHub issue and measures only the correctness of the final patch. AgentDebuggerEnv differs along five dimensions:

+ | Dimension | SWE-bench | AgentDebuggerEnv |
+ |---|---|---|
+ | Evaluation target | Final patch quality | Full reasoning trajectory |
+ | Feedback | None — single shot | Real `stdout/stderr` after every fix attempt |
+ | Reward signal | Binary (pass/fail) | Dense — every step is scored |
+ | What's measured | Code generation | Hypothesis formation + iterative reasoning |
+ | Hard task | Applies existing patch | Must design a test to surface a hidden bug |

+ The iterative feedback loop is the core mechanic. Every `step()` call executes the agent's submitted code in a live sandbox and returns the actual test output; the agent must then update its theory and try again — exactly like a real developer at a terminal.

+ ---

+ ## Environment Overview

+ AgentDebuggerEnv is a fully OpenEnv-compliant environment exposing the standard three-method API:

+ ```
+ reset(task_id) → initial Observation
+ step(action)   → Observation, Reward, done, info
+ state()        → current internal state dict
+ ```

+ The environment is deployed as a containerized FastAPI server on HuggingFace Spaces, passes `openenv validate`, and includes a fully reproducible baseline inference script.

+ **Live Space:** https://huggingface.co/spaces/shashaank0707/AgentDebugger-env
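+
+ For orientation, a complete episode against this API fits in a few lines. The following is a minimal client sketch using the HTTP endpoints documented under "API Endpoints" below; `choose_fix()` is an illustrative stand-in for your model, and the JSON field names are assumed from the `(Observation, Reward, done, info)` contract above.
+
+ ```python
+ import requests
+
+ BASE = "http://localhost:8000"
+
+ def run_episode(choose_fix, task_id: str = "easy") -> float:
+     # choose_fix(observation) -> (fixed_code, hypothesis); supplied by your agent
+     obs = requests.post(f"{BASE}/reset", json={"task_id": task_id}).json()
+     cumulative, done = 0.0, False
+     while not done:
+         code, hypothesis = choose_fix(obs)
+         action = {"action_type": "submit_fix",
+                   "fixed_code": code,
+                   "hypothesis": hypothesis}  # omitting the hypothesis costs -0.10
+         result = requests.post(f"{BASE}/step", json=action).json()
+         # Response keys assumed from the step() contract documented above
+         obs = result["observation"]
+         done = result["done"]
+         cumulative = result["reward"]["cumulative_reward"]
+     return cumulative
+ ```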
 
 
 
 

  ---

+ ## Project Structure

+ ```
+ AgentDebuggerEnv/
+ ├── inference.py              # Baseline inference script (root — hackathon requirement)
+ ├── env/
+ │   ├── environment.py        # Core OpenEnv class: reset(), step(), state()
+ │   ├── models.py             # Pydantic v2 Observation, Action, Reward models
+ │   ├── sandbox.py            # AST-based sandboxed code execution
+ │   ├── server.py             # FastAPI server: /reset, /step, /state, /health, /tasks
+ │   ├── tasks/
+ │   │   ├── registry.py       # Task registry
+ │   │   ├── task_easy.py      # Off-by-one bug in binary search
+ │   │   ├── task_medium.py    # Red herring authentication bug
+ │   │   └── task_hard.py      # Concurrency race condition
+ │   └── graders/
+ │       ├── base_grader.py    # Abstract base grader
+ │       ├── grader_easy.py    # Standard test-pass + efficiency scoring
+ │       ├── grader_medium.py  # Red herring detection + score floor fix
+ │       └── grader_hard.py    # Sequential + concurrent stress test scoring
+ ├── server/
+ │   └── app.py                # Entry point alias for openenv validate
+ ├── tests/
+ │   ├── test_environment.py
+ │   ├── test_sandbox.py
+ │   └── test_graders.py
+ ├── openenv.yaml              # OpenEnv spec metadata
+ ├── Dockerfile
+ ├── requirements.txt
+ ├── pyproject.toml
+ ├── uv.lock                   # Reproducible dependency resolution
+ └── .gitignore
+ ```

  ---

+ ## Data Models

+ ### Observation

+ Everything the agent sees at each step. Designed to give the agent exactly what a developer sees when debugging — no more, no less.

+ ```python
+ from typing import Dict, List, Optional
+ from pydantic import BaseModel
+
+ class FixAttempt(BaseModel):
+     attempt_number: int        # 1-indexed
+     code_submitted: str        # Full code the agent submitted
+     hypothesis: str            # Agent's stated theory before this attempt
+     execution_output: str      # Full stdout + stderr from sandbox
+     tests_passed: int
+     tests_total: int
+     execution_time_ms: int
+     timed_out: bool
+
+ class Observation(BaseModel):
+     # Fixed for the episode
+     task_id: str               # "easy" | "medium" | "hard"
+     task_description: str
+     buggy_code: str            # Original broken code — always visible
+     test_suite: str            # Full test file — agent can read requirements
+     initial_error_output: str  # Sandbox output on the buggy code at reset()
+
+     # Changes each step
+     current_code: str          # Most recent submitted code
+     current_error_output: str  # Test output on current_code
+     tests_passed: int
+     tests_total: int
+     previous_attempts: List[FixAttempt]  # Full episode history
+
+     # Budget tracking
+     attempts_remaining: int
+     max_attempts: int
+     step_number: int
+     max_steps: int
+     done: bool
+     score_estimate: float      # Running grader estimate shown to agent
+     hint_used: bool
+ ```

+ ### Action

+ The agent submits exactly one action per step. Three types:

+ ```python
+ class Action(BaseModel):
+     action_type: str                    # "submit_fix" | "query_context" | "give_up"
+
+     # submit_fix — primary action
+     fixed_code: Optional[str] = None    # Complete corrected code file
+     hypothesis: Optional[str] = None    # REQUIRED — missing costs -0.10 reward
+
+     # query_context — request more information (first is free)
+     query_type: Optional[str] = None    # "function_signature" | "related_code"
+                                         # | "error_explanation" | "test_details"
+     query_target: Optional[str] = None
+
+     # give_up — explicit surrender, ends episode cleanly
+     final_diagnosis: Optional[str] = None
+ ```

+ ### Reward

+ Dense signal at every step — not just binary end-of-episode.

+ ```python
+ class Reward(BaseModel):
+     step_reward: float           # This step: -1.0 to +1.0
+     cumulative_reward: float     # Episode total so far
+     grader_score: float          # 0.0 during episode; official score on terminal step
+     breakdown: Dict[str, float]  # Itemized components for interpretability
+ ```

  ---

+ ## Reward Function

+ The reward function is designed so an RL agent receives meaningful signal at every step, not just when tests pass.

+ ### Step-Level Rewards

+ | Event | Reward | Reasoning |
+ |---|---|---|
+ | Fix increases tests passing | `+0.15 × (Δpassed / total)` | Scaled progress reward |
+ | Fix decreases tests passing | `-0.10 × (Δfailed / total)` | Regression penalty |
+ | Fix makes no change | `-0.05` | Stagnation penalty — discourages repetition |
+ | All tests pass | `+0.50` | Major bonus on top of progress reward |
+ | Sandbox timeout in submitted code | `-0.10` | Penalizes infinite loops |
+ | `submit_fix` without hypothesis | `-0.10` | Hypothesis is required |
+ | Repeated `query_context` calls | `-0.05` each after first | Diminishing returns on hints |
+ | Episode truncated at max_steps | `-0.20` | Penalizes indecision |
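+
+ Read as code, a `submit_fix` step composes these rules roughly as follows — a sketch of the table above, not the literal implementation (the function name and argument list are illustrative):
+
+ ```python
+ def step_reward_for_fix(prev_passed: int, passed: int, total: int,
+                         timed_out: bool, has_hypothesis: bool) -> float:
+     # Missing hypothesis: flat penalty (the fix is also not executed)
+     if not has_hypothesis:
+         return -0.10
+     delta = passed - prev_passed
+     if delta > 0:
+         reward = 0.15 * (delta / total)    # scaled progress reward
+     elif delta < 0:
+         reward = -0.10 * (-delta / total)  # regression penalty
+     else:
+         reward = -0.05                     # stagnation penalty
+     if timed_out:
+         reward += -0.10                    # infinite-loop penalty
+     if passed == total:
+         reward += 0.50                     # major all-tests-pass bonus
+     return reward
+ ```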

+ ### Episode-Level Grader Score (0.0 → 1.0)

  ```
+ grader_score = test_pass_ratio      × 0.60
+              + efficiency_bonus     × 0.20
+              + hypothesis_accuracy  × 0.15
+              + early_solve_bonus    × 0.05

+ where:
+   test_pass_ratio     = agent_best_tests_passed / tests_total
+                         (from agent submissions only — not initial buggy code)
+   efficiency_bonus    = max(0, (max_attempts - attempts_used) / max_attempts)
+   hypothesis_accuracy = fraction of hypotheses correctly identifying bug location
+   early_solve_bonus   = 1.0 if all tests pass within ceil(max_attempts / 3) attempts, else 0.0
  ```

+ **Score floor design:** `test_pass_ratio` is calculated only from the agent's submitted attempts — never from the initial buggy code run. This guarantees that a dummy agent that submits nothing scores 0.0, not an inflated baseline.
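+
+ A sketch of how a grader realizes this, including the score floor (names are illustrative; the real graders live in `env/graders/`):
+
+ ```python
+ import math
+
+ def grade_episode(attempts, tests_total, max_attempts, hypothesis_accuracy):
+     # Score floor: an agent that never submits anything scores exactly 0.0
+     if not attempts:
+         return 0.0
+     # Best pass count over the agent's own submissions — never the initial buggy run
+     best = max(a.tests_passed for a in attempts)
+     test_pass_ratio = best / tests_total
+     efficiency_bonus = max(0, (max_attempts - len(attempts)) / max_attempts)
+     early_cutoff = math.ceil(max_attempts / 3)
+     early_solve_bonus = 1.0 if any(a.tests_passed == tests_total
+                                    for a in attempts[:early_cutoff]) else 0.0
+     return (test_pass_ratio * 0.60
+             + efficiency_bonus * 0.20
+             + hypothesis_accuracy * 0.15
+             + early_solve_bonus * 0.05)
+ ```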

+ ---

+ ## Tasks

+ ### Task 1 — Easy: Off-by-One Bug

+ **Difficulty:** 🟢 Easy | **Max attempts:** 5 | **Max steps:** 8 | **Tests:** 8

+ A binary search implementation with a single-character bug: the while loop uses `left < right` instead of `left <= right`. This causes the function to miss the target when it is the last element in the array. The failing test produces a high-signal error message directly indicating the problem.
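+
+ A reconstruction of the planted bug, consistent with the description above and with the corrected version shown in the Quick API Test section below:
+
+ ```python
+ def binary_search(arr, target):
+     left, right = 0, len(arr) - 1
+     while left < right:  # BUG: never examines the final candidate; should be left <= right
+         mid = (left + right) // 2
+         if arr[mid] == target:
+             return mid
+         elif arr[mid] < target:
+             left = mid + 1
+         else:
+             right = mid - 1
+     return -1
+
+ # binary_search([1, 3, 5], 5) returns -1 instead of 2: the loop exits
+ # as soon as left == right, without ever checking that last position.
+ ```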

+ **Why it's easy:** The error message names the failing assertion. Reading the while condition reveals the bug. One to two iterations expected for any competent agent.

+ **What the grader checks:** Did the agent fix all 8 tests? Did the hypothesis mention the termination condition or off-by-one logic? Was it solved efficiently?

+ **Expected GPT-4o baseline:** ~0.85

+ ---

+ ### Task 2 — Medium: Red Herring Authentication Bug

+ **Difficulty:** 🟡 Medium | **Max attempts:** 7 | **Max steps:** 15 | **Tests:** 10 (6 pass, 4 fail on buggy code)

+ An authentication module with three interdependent functions: `hash_password`, `validate_password`, and `authenticate_user`. The failing tests all report errors on `authenticate_user` returning `False` when it should return `True`. However, `authenticate_user` is completely correct. So is `validate_password`. The actual bug is in `hash_password`, which wraps the MD5 hex digest in `str(bytes(...))` — producing a `"b'...'"` prefix that corrupts the hash string.
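+
+ A sketch of the buggy function, assuming the simplest shape consistent with that description (the actual task code lives in `env/tasks/task_medium.py`):
+
+ ```python
+ import hashlib
+
+ def hash_password(password: str) -> str:
+     digest = hashlib.md5(password.encode()).hexdigest()
+     # BUG: wraps the hex digest in str(bytes(...)), so the returned value
+     # becomes "b'5f4d...'" instead of the bare "5f4d..." the stored hashes use
+     return str(bytes(digest, "utf-8"))
+ ```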

+ The red herring: the error message names `authenticate_user`. Every surface-level reading of the error points to the wrong function. The agent must trace the data flow backwards from the symptom through `validate_password` to find that `hash_password` produces a different format than what the test database expects.

+ **Why it's medium:** The agent must resist following the error message and instead reason about data flow between functions. GPT-4o follows this red herring approximately 40% of the time.

+ **Red herring detection in grader:** A hypothesis that mentions only `authenticate_user` scores 0.0 for hypothesis accuracy. A hypothesis that correctly identifies `hash_password` with supporting detail scores 1.0.

+ **Expected GPT-4o baseline:** ~0.50

+ ---

+ ### Task 3 — Hard: Concurrency Race Condition

+ **Difficulty:** 🔴 Hard | **Max attempts:** 10 | **Max steps:** 25 | **Tests:** 8 (all 8 pass on buggy code)

+ A `ConnectionCounter` class used in a web server to track active connections. It uses `threading.Lock` and appears to be correctly implemented. All 8 sequential unit tests pass. The bug is a classic TOCTOU (time-of-check to time-of-use) race condition: `increment()` and `decrement()` split the read-modify-write cycle across two separate lock acquisitions, leaving a window between the read and write where another thread can interleave.

+ ```python
+ def increment(self):
+     with self._lock:
+         current = self.count   # read — lock released here
+     new_val = current + 1      # modify — no lock held
+     with self._lock:
+         self.count = new_val   # write — race window exploited
+ ```

+ The agent must: (1) recognize that 8/8 passing tests do not prove correctness for concurrent code, (2) reason about thread interleaving, (3) design a concurrent stress test that surfaces the race, (4) fix the atomicity issue by collapsing the read-modify-write into a single lock scope, as sketched below, and (5) verify the fix passes both the original tests and a 1000-thread concurrent stress test.
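+
+ The shape of a correct fix for step (4) — the entire read-modify-write cycle inside one lock acquisition:
+
+ ```python
+ def increment(self):
+     with self._lock:
+         self.count += 1  # read, modify, and write now happen atomically under the lock
+ ```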

+ **Why it's hard:** Race conditions are non-deterministic. The bug does not manifest in sequential execution. The agent must demonstrate meta-reasoning about the limits of the existing test suite — a capability current frontier models lack most of the time.

+ **Hard task grader breakdown:**
+ - Sequential tests pass: 0.40 (agent submissions only)
+ - 1000-thread concurrent stress test passes: 0.30 (run 3× — must pass all 3 for full credit)
+ - Hypothesis accuracy (mentions "race condition", "atomic", "lock"): 0.20
+ - Efficiency bonus (fixed within 5 attempts): 0.10

+ **Expected GPT-4o baseline:** ~0.18

+ ---

+ ## Security Sandbox

+ Every `submit_fix` action executes agent-generated Python code. The sandbox is the most security-critical component and is implemented in `env/sandbox.py`.

+ ### Multi-Layer Protection

+ **Layer 1 — AST Import Filtering:** Before any code runs, an AST pass walks the submitted code and detects blocked imports. Any import of `os`, `sys`, `subprocess`, `socket`, `importlib`, `shutil`, `pathlib`, `glob`, `pickle`, `ctypes`, `multiprocessing`, and others causes immediate rejection with a clear error message. This uses `ast.parse()` + `ast.walk()` — not string matching, which can be bypassed.
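+
+ The core of that check looks roughly like this (a minimal sketch; `env/sandbox.py` is the authority on the full blocklist and node coverage):
+
+ ```python
+ import ast
+
+ BLOCKED = {"os", "sys", "subprocess", "socket", "importlib", "shutil",
+            "pathlib", "glob", "pickle", "ctypes", "multiprocessing"}
+
+ def check_imports(code: str) -> None:
+     """Walk the AST and reject blocked imports before anything executes."""
+     for node in ast.walk(ast.parse(code)):
+         if isinstance(node, ast.Import):
+             names = [alias.name for alias in node.names]
+         elif isinstance(node, ast.ImportFrom):
+             names = [node.module or ""]
+         else:
+             continue
+         for name in names:
+             if name.split(".")[0] in BLOCKED:
+                 raise ValueError(f"Blocked import: {name}")
+ ```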

+ **Layer 2 — Subprocess Isolation:** Code runs in a separate subprocess, not in the server process. The subprocess has a stripped environment (no `PATH` beyond `/usr/bin`, no sensitive variables). Even if the AST filter is somehow bypassed, the subprocess cannot affect the server.

+ **Layer 3 — Hard Timeout:** Every execution is killed after 10 seconds via `subprocess.run(timeout=10)`. Infinite loops in submitted code return `timed_out: True` and a `-0.10` step reward.

+ **Layer 4 — Memory Limit:** 256MB per execution via environment isolation.

+ **Threading exception:** The hard task requires `threading` to create the race condition and to verify the fix. The sandbox accepts an `allow_threading=True` flag that removes `threading` from the blocked list for that task only. All other tasks have threading blocked.

+ ---

+ ## API Endpoints

+ The environment is served as a FastAPI application on port 8000.

+ | Endpoint | Method | Description |
+ |---|---|---|
+ | `/` | GET | API overview — lists all endpoints and tasks |
+ | `/health` | GET | Health check — always returns HTTP 200 |
+ | `/tasks` | GET | List all tasks with full metadata |
+ | `/reset` | POST | Start a new episode. Body: `{"task_id": "easy"}` |
+ | `/step` | POST | Submit one action. Body: Action JSON |
+ | `/state` | GET | Full internal episode state |

+ Every endpoint always returns HTTP 200 — errors appear in the response body under `info["error"]`, never as HTTP 4xx/5xx. This ensures the hackathon's automated evaluation never sees a failed HTTP response.
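+
+ In FastAPI terms the pattern is roughly this (a sketch, not the exact `env/server.py` code; the `env` singleton and response keys are illustrative):
+
+ ```python
+ from fastapi import FastAPI
+
+ app = FastAPI()
+
+ @app.post("/step")
+ def step_endpoint(action: dict):
+     try:
+         obs, reward, done, info = env.step(action)  # env: the server's environment singleton
+         return {"observation": obs, "reward": reward, "done": done, "info": info}
+     except Exception as exc:
+         # Errors ride in the body with HTTP 200, never as 4xx/5xx
+         return {"observation": None, "reward": None, "done": True,
+                 "info": {"error": str(exc)}}
+ ```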

+ ---

+ ## OpenEnv Compliance

+ ```yaml
+ # openenv.yaml
+ name: agentdebugger-env
+ version: 1.0.0
+ domain: software_engineering
+ observation_type: structured
+ action_type: structured
+ reward_type: dense
+ episode_termination: action_or_step_limit
+ tasks:
+   - id: easy   | difficulty: easy   | max_steps: 8  | max_attempts: 5
+   - id: medium | difficulty: medium | max_steps: 15 | max_attempts: 7
+   - id: hard   | difficulty: hard   | max_steps: 25 | max_attempts: 10
+ ```

+ Validation output:
+ ```
+ ✓ openenv.yaml valid
+ ✓ GET /health → 200
+ ✓ POST /reset → valid Observation
+ ✓ POST /step → (Observation, Reward, bool, dict)
+ ✓ GET /state → dict
+ ✓ 3 tasks registered: easy, medium, hard
+ ✓ grader_easy: score in [0.0, 1.0] — PASS
+ ✓ grader_medium: score in [0.0, 1.0] — PASS
+ ✓ grader_hard: score in [0.0, 1.0] — PASS
+ ✓ inference.py present in root directory
+ openenv validate: PASSED
+ ```

+ ---

+ ## Baseline Results

+ Evaluated using `gpt-4o` with zero-shot chain-of-thought prompting. Each task was run 5 times independently and scores averaged.

+ | Task | Difficulty | Mean Score | Std Dev | Solved % | Avg Attempts | Avg Steps |
+ |---|---|---|---|---|---|---|
+ | Off-by-One Bug | Easy | 0.85 | ±0.04 | 100% | 1.8 | 4.2 |
+ | Red Herring Auth | Medium | 0.50 | ±0.10 | 60% | 4.2 | 10.6 |
+ | Race Condition | Hard | 0.18 | ±0.09 | 20% | 8.7 | 22.1 |
+ | **Overall Mean** | | **0.51** | | **60%** | | |

+ **Key observations:**

+ **Easy task:** GPT-4o reads the error message, identifies the off-by-one in the while condition on the first or second attempt, and fixes it correctly. Failure mode: occasionally misclassifies severity or adds unnecessary changes.

+ **Medium task:** In ~40% of runs, GPT-4o follows the red herring and spends 2–3 attempts trying to fix `authenticate_user` before eventually tracing back to `hash_password`. When it identifies the correct function immediately, it solves efficiently. The hypothesis accuracy score penalizes the red-herring runs significantly.

+ **Hard task:** GPT-4o almost never spontaneously recognizes that a race condition can exist when all sequential tests pass. In the rare runs where it does solve the task (~20%), it correctly identifies that the lock scope must encompass the entire read-modify-write cycle. The 1000-thread concurrent stress test filters out partial fixes where the race window is narrowed but not eliminated.

+ ## Setup & Usage

+ ### Local Development

  ```bash
+ git clone https://github.com/shasshaank/AgentDebuggerEnv
+ cd AgentDebuggerEnv
+ pip install -r requirements.txt

+ # Start the environment server
+ uvicorn env.server:app --reload --port 8000

+ # Verify it's running
+ curl http://localhost:8000/health
+ # {"status": "ok", "environment": "agentdebugger-env", "version": "1.0.0"}

+ # Run baseline inference
  export API_BASE_URL="https://api.openai.com/v1"
  export MODEL_NAME="gpt-4o"
+ export HF_TOKEN="your_openai_api_key"
  export ENV_BASE_URL="http://localhost:8000"
  python inference.py
  ```

+ ### Docker

+ ```bash
+ # Build
+ docker build -t agentdebugger-env .

+ # Run
+ docker run -p 8000:8000 agentdebugger-env

+ # Run with inference against the containerized environment
+ docker run -p 8000:8000 \
+   -e API_BASE_URL="https://api.openai.com/v1" \
+   -e MODEL_NAME="gpt-4o" \
+   -e HF_TOKEN="your_key" \
+   agentdebugger-env
+ ```

+ ### Quick API Test

+ ```bash
+ # Reset the easy task
+ curl -X POST http://localhost:8000/reset \
+   -H "Content-Type: application/json" \
+   -d '{"task_id": "easy"}'

+ # Submit a fix with hypothesis
+ curl -X POST http://localhost:8000/step \
+   -H "Content-Type: application/json" \
+   -d '{
+     "action_type": "submit_fix",
+     "fixed_code": "def binary_search(arr, target):\n    left, right = 0, len(arr) - 1\n    while left <= right:\n        mid = (left + right) // 2\n        if arr[mid] == target:\n            return mid\n        elif arr[mid] < target:\n            left = mid + 1\n        else:\n            right = mid - 1\n    return -1",
+     "hypothesis": "The while loop uses left < right instead of left <= right, causing it to skip the last element."
+   }'
+ ```

+ ---

+ ## Why This Environment Matters for Agent Research

+ Four specific failure modes in LLM agents are measurable and scorable here for the first time:

+ **1. Red herring susceptibility** — Does the agent trust error messages over data-flow analysis? The medium task's `hypothesis_accuracy` score measures this directly. An agent that follows the red herring scores 0.0 on hypothesis accuracy even if it eventually finds the correct fix by trial and error.

+ **2. Stagnation under uncertainty** — Does the agent repeat the same failed fix strategy instead of updating its hypothesis? The `-0.05` stagnation penalty and the `hypothesis_accuracy` score together capture this. An agent that submits the same code twice is penalized twice.

+ **3. Exploration vs. exploitation** — The `query_context` action costs a step but provides information. The first query is free; subsequent ones cost `-0.05`. Agents that query productively before attempting a fix demonstrate better exploration behavior than those that immediately submit wrong fixes.

+ **4. Test suite as sufficient proof** — The hard task is specifically designed to test whether an agent knows when passing tests are not enough. An agent that sees 8/8 tests passing and immediately approves the code — without recognizing the concurrency issue — scores at most 0.40 and fails the most important grader component.

+ All four failure modes produce distinct, interpretable score components in the `breakdown` field of every `Reward` response. This makes AgentDebuggerEnv useful not just as a benchmark but as a diagnostic tool for understanding where a specific model fails in iterative reasoning.

  ---

+ ## Design Decisions

+ **Why require a hypothesis?** The `hypothesis` field is mandatory on every `submit_fix` action. Missing it costs `-0.10` and the fix is not executed. This forces agents to articulate their reasoning, which enables the grader to score `hypothesis_accuracy` separately from `test_pass_ratio`. It also prevents the degenerate strategy of submitting random code until something passes.

+ **Why is `best_tests_passed` calculated from agent attempts only?** The medium and hard buggy codes start with 6/10 and 8/8 tests passing respectively. If the grader used the environment's `best_tests_passed` (which includes the initial buggy code run), a dummy agent that submits nothing would score 0.36 and 0.40 for free. The grader recalculates from the `attempts` list — which contains only what the agent actually submitted — ensuring the score floor is 0.0.

+ **Why run the concurrent stress test 3 times?** Race conditions are non-deterministic. A partial fix that narrows the race window (but doesn't eliminate it) might pass once by luck. Requiring all 3 runs to pass filters out lucky partial fixes. A fix that passes 1 of 3 receives 0.15 — partial credit for progress, but not full credit.
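+
+ For concreteness, each stress run is shaped like this sketch (thread count per the grader description; the harness details are illustrative):
+
+ ```python
+ import threading
+
+ def stress_once(counter, n_threads: int = 1000) -> bool:
+     # Every thread increments once; any lost update means the race survived
+     threads = [threading.Thread(target=counter.increment) for _ in range(n_threads)]
+     for t in threads:
+         t.start()
+     for t in threads:
+         t.join()
+     return counter.count == n_threads
+
+ # Grader policy: full credit only if all 3 independent runs pass
+ passed_all = all(stress_once(ConnectionCounter()) for _ in range(3))
+ ```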

+ **Why not use pytest directly?** Using pytest as the test runner would make output parsing dependent on pytest's output format and version. The environment instead uses a custom lightweight test runner, written as a Python string and executed in the sandbox, producing a consistent `"N passed, M failed"` format that `_parse_tests_passed()` can reliably parse across all platforms.
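+
+ That parser then reduces to a single pattern match — a sketch consistent with the format described above:
+
+ ```python
+ import re
+
+ def _parse_tests_passed(output: str) -> tuple[int, int]:
+     """Extract (passed, total) from the runner's 'N passed, M failed' summary."""
+     m = re.search(r"(\d+) passed, (\d+) failed", output)
+     if m is None:
+         return 0, 0  # no summary line: the run crashed or timed out
+     passed, failed = int(m.group(1)), int(m.group(2))
+     return passed, passed + failed
+ ```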
 

  ---

+ ## Environment Configuration

+ ```bash
+ # Required for inference.py
+ API_BASE_URL   # LLM API endpoint (e.g. https://api.openai.com/v1)
+ MODEL_NAME     # Model identifier (e.g. gpt-4o)
+ HF_TOKEN       # API key / HuggingFace token

+ # Optional — defaults to localhost:8000
+ ENV_BASE_URL   # Environment server URL
+ ```

+ ---

+ ## License & Attribution

+ **License:** MIT — see [LICENSE](LICENSE)

+ **Author:** Shashaank

+ **Submitted to:** Meta + PyTorch + HuggingFace OpenEnv Hackathon

+ **Live Environment:** https://huggingface.co/spaces/shashaank0707/AgentDebugger-env