Shashaank committed · 1c8aca2
Parent(s): e4f09cc

Fix: Revise README for improved clarity and detail

Updated README to enhance clarity and detail about the AgentDebuggerEnv, including its purpose, architecture, tasks, and installation instructions.

README.md changed (@@ -1,98 +1,475 @@)
# AgentDebuggerEnv 🚀

> **A live, iterative debugging environment for benchmarking agentic reasoning in AI systems.**
> Submitted to the **Meta + PyTorch + HuggingFace OpenEnv Hackathon**.

[](https://huggingface.co/spaces/shashaank0707/AgentDebugger-env)
[](https://huggingface.co/spaces/shashaank0707/AgentDebugger-env)
[](LICENSE)
[](https://www.python.org/)
[](https://fastapi.tiangolo.com/)

---
## The Problem with Existing Code Benchmarks

Benchmarks like HumanEval, MBPP, and even SWE-bench share a fundamental limitation: they are **one-shot evaluations**. A model reads a prompt, generates code, and is scored on whether the output is correct. This measures code generation ability, not debugging ability.

Real software engineering is not one-shot. It is **iterative**. A developer:

1. Reads failing tests and error output
2. Forms a hypothesis about the root cause
3. Submits a fix
4. Reads the new error output
5. Updates their hypothesis
6. Repeats, sometimes many times

No existing benchmark measures this loop. **AgentDebuggerEnv does.**

---
## What Makes This Different from SWE-bench

SWE-bench gives an agent a static GitHub issue and measures only the final patch correctness. AgentDebuggerEnv is fundamentally different along several dimensions:

| Dimension | SWE-bench | AgentDebuggerEnv |
|---|---|---|
| Evaluation target | Final patch quality | Full reasoning trajectory |
| Feedback | None (single shot) | Real `stdout/stderr` after every fix attempt |
| Reward signal | Binary (pass/fail) | Dense; every step is scored |
| What's measured | Code generation | Hypothesis formation + iterative reasoning |
| Hard task | Applies existing patch | Must design a test to surface a hidden bug |

The iterative feedback loop is the core mechanic. Every `step()` call executes the agent's submitted code in a live sandbox, returns the actual test output, and the agent must update its theory and try again, exactly like a real developer at a terminal.

---
## Environment Overview

AgentDebuggerEnv is a fully OpenEnv-compliant environment exposing the standard three-method API:

```
reset(task_id) → initial Observation
step(action)   → Observation, Reward, done, info
state()        → current internal state dict
```

The environment is deployed as a containerized FastAPI server on HuggingFace Spaces, passes `openenv validate`, and includes a fully reproducible baseline inference script.

**Live Space:** https://huggingface.co/spaces/shashaank0707/AgentDebugger-env
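A minimal Python client for this three-method API can be sketched as follows. The helper names (`post`, `make_fix_action`) and the exact response shapes are assumptions for illustration; only the `/reset` and `/step` endpoints and the Action fields are documented in this README.

```python
# Hypothetical client sketch for the reset/step HTTP API described above.
import json
import urllib.request

ENV_BASE_URL = "http://localhost:8000"

def make_fix_action(fixed_code: str, hypothesis: str) -> dict:
    # A hypothesis is mandatory on every submit_fix; omitting it costs -0.10.
    return {
        "action_type": "submit_fix",
        "fixed_code": fixed_code,
        "hypothesis": hypothesis,
    }

def post(path: str, payload: dict) -> dict:
    # POST a JSON body and decode the JSON response.
    req = urllib.request.Request(
        ENV_BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Usage (requires a running server):
#   obs = post("/reset", {"task_id": "easy"})
#   result = post("/step", make_fix_action(my_code, "off-by-one in loop bound"))
```
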
---

## Project Structure

```
AgentDebuggerEnv/
├── inference.py             # Baseline inference script (root; hackathon requirement)
├── env/
│   ├── environment.py       # Core OpenEnv class: reset(), step(), state()
│   ├── models.py            # Pydantic v2 Observation, Action, Reward models
│   ├── sandbox.py           # AST-based sandboxed code execution
│   ├── server.py            # FastAPI server: /reset, /step, /state, /health, /tasks
│   ├── tasks/
│   │   ├── registry.py      # Task registry
│   │   ├── task_easy.py     # Off-by-one bug in binary search
│   │   ├── task_medium.py   # Red herring authentication bug
│   │   └── task_hard.py     # Concurrency race condition
│   └── graders/
│       ├── base_grader.py   # Abstract base grader
│       ├── grader_easy.py   # Standard test-pass + efficiency scoring
│       ├── grader_medium.py # Red herring detection + score floor fix
│       └── grader_hard.py   # Sequential + concurrent stress test scoring
├── server/
│   └── app.py               # Entry point alias for openenv validate
├── tests/
│   ├── test_environment.py
│   ├── test_sandbox.py
│   └── test_graders.py
├── openenv.yaml             # OpenEnv spec metadata
├── Dockerfile
├── requirements.txt
├── pyproject.toml
├── uv.lock                  # Reproducible dependency resolution
└── .gitignore
```

---

## Data Models

### Observation

Everything the agent sees at each step. Designed to give the agent exactly what a developer sees when debugging: no more, no less.

```python
class FixAttempt(BaseModel):
    attempt_number: int        # 1-indexed
    code_submitted: str        # Full code the agent submitted
    hypothesis: str            # Agent's stated theory before this attempt
    execution_output: str      # Full stdout + stderr from sandbox
    tests_passed: int
    tests_total: int
    execution_time_ms: int
    timed_out: bool

class Observation(BaseModel):
    # Fixed for the episode
    task_id: str               # "easy" | "medium" | "hard"
    task_description: str
    buggy_code: str            # Original broken code; always visible
    test_suite: str            # Full test file; agent can read requirements
    initial_error_output: str  # Sandbox output on the buggy code at reset()

    # Changes each step
    current_code: str          # Most recent submitted code
    current_error_output: str  # Test output on current_code
    tests_passed: int
    tests_total: int
    previous_attempts: List[FixAttempt]  # Full episode history

    # Budget tracking
    attempts_remaining: int
    max_attempts: int
    step_number: int
    max_steps: int
    done: bool
    score_estimate: float      # Running grader estimate shown to agent
    hint_used: bool
```

### Action

The agent submits exactly one action per step. Three types:

```python
class Action(BaseModel):
    action_type: str                   # "submit_fix" | "query_context" | "give_up"

    # submit_fix: the primary action
    fixed_code: Optional[str] = None   # Complete corrected code file
    hypothesis: Optional[str] = None   # REQUIRED; missing costs -0.10 reward

    # query_context: request more information (first is free)
    query_type: Optional[str] = None   # "function_signature" | "related_code"
                                       # | "error_explanation" | "test_details"
    query_target: Optional[str] = None

    # give_up: explicit surrender, ends the episode cleanly
    final_diagnosis: Optional[str] = None
```

### Reward

Dense signal at every step, not just a binary end-of-episode result.

```python
class Reward(BaseModel):
    step_reward: float           # This step: -1.0 to +1.0
    cumulative_reward: float     # Episode total so far
    grader_score: float          # 0.0 during episode; official score on terminal step
    breakdown: Dict[str, float]  # Itemized components for interpretability
```

---

## Reward Function

The reward function is designed so an RL agent receives meaningful signal at every step, not just when tests pass.

### Step-Level Rewards

| Event | Reward | Reasoning |
|---|---|---|
| Fix increases tests passing | `+0.15 × (Δpassed / total)` | Scaled progress reward |
| Fix decreases tests passing | `-0.10 × (Δfailed / total)` | Regression penalty |
| Fix makes no change | `-0.05` | Stagnation penalty; discourages repetition |
| All tests pass | `+0.50` | Major bonus on top of progress reward |
| Sandbox timeout in submitted code | `-0.10` | Penalizes infinite loops |
| `submit_fix` without hypothesis | `-0.10` | Hypothesis is required |
| Repeated `query_context` calls | `-0.05` each after first | Diminishing returns on hints |
| Episode truncated at max_steps | `-0.20` | Penalizes indecision |

### Episode-Level Grader Score (0.0 to 1.0)

```
grader_score = test_pass_ratio     × 0.60
             + efficiency_bonus    × 0.20
             + hypothesis_accuracy × 0.15
             + early_solve_bonus   × 0.05

where:
  test_pass_ratio     = agent_best_tests_passed / tests_total
                        (from agent submissions only, never the initial buggy code)
  efficiency_bonus    = max(0, (max_attempts - attempts_used) / max_attempts)
  hypothesis_accuracy = fraction of hypotheses correctly identifying the bug location
  early_solve_bonus   = 1 if all tests pass within ceil(max_attempts / 3) attempts, else 0
```

**Score floor design:** `test_pass_ratio` is calculated only from the agent's submitted attempts, never from the initial buggy-code run. This guarantees that a dummy agent that submits nothing scores 0.0, not an inflated baseline.

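The formula can be sketched in Python. This is an illustrative re-implementation, not the repository's grader; `hypothesis_hits` is an assumed name for the count of hypotheses that named the right bug location, and the early-solve term is treated as a 0/1 indicator weighted by 0.05.

```python
import math

def grader_score(best_passed: int, total: int,
                 attempts_used: int, max_attempts: int,
                 hypothesis_hits: int, hypotheses_total: int) -> float:
    # best_passed must come from agent submissions only (score floor design).
    test_pass_ratio = best_passed / total
    efficiency = max(0.0, (max_attempts - attempts_used) / max_attempts)
    hyp_acc = hypothesis_hits / hypotheses_total if hypotheses_total else 0.0
    early = 1.0 if (best_passed == total and
                    attempts_used <= math.ceil(max_attempts / 3)) else 0.0
    return (test_pass_ratio * 0.60 + efficiency * 0.20
            + hyp_acc * 0.15 + early * 0.05)

# A perfect one-attempt solve on the easy task (5 max attempts):
print(round(grader_score(8, 8, 1, 5, 1, 1), 2))  # 0.96
```
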
---

## Tasks

### Task 1 (Easy): Off-by-One Bug

**Difficulty:** 🟢 Easy | **Max attempts:** 5 | **Max steps:** 8 | **Tests:** 8

A binary search implementation with a single-character bug: the while loop uses `left < right` instead of `left <= right`. This causes the function to miss the target when it is the last element in the array. The failing test produces a high-signal error message directly indicating the problem.

**Why it's easy:** The error message names the failing assertion. Reading the while condition reveals the bug. One to two iterations are expected for any competent agent.

**What the grader checks:** Did the agent fix all 8 tests? Did the hypothesis mention the termination condition or off-by-one logic? Was it solved efficiently?

**Expected GPT-4o baseline:** ~0.85

---

### Task 2 (Medium): Red Herring Authentication Bug

**Difficulty:** 🟡 Medium | **Max attempts:** 7 | **Max steps:** 15 | **Tests:** 10 (6 pass, 4 fail on buggy code)

An authentication module with three interdependent functions: `hash_password`, `validate_password`, and `authenticate_user`. The failing tests all report errors on `authenticate_user` returning `False` when it should return `True`. However, `authenticate_user` is completely correct. So is `validate_password`. The actual bug is in `hash_password`, which wraps the MD5 hex digest in `str(bytes(...))`, producing a `"b'...'"` prefix that corrupts the hash string.

The red herring: the error message names `authenticate_user`. Every surface-level reading of the error points to the wrong function. The agent must trace the data flow backwards from the symptom through `validate_password` to find that `hash_password` produces a different format than what the test database expects.

**Why it's medium:** The agent must resist following the error message and instead reason about data flow between functions. GPT-4o follows this red herring approximately 40% of the time.

**Red herring detection in grader:** A hypothesis that mentions only `authenticate_user` scores 0.0 for hypothesis accuracy. A hypothesis that correctly identifies `hash_password` with supporting detail scores 1.0.

**Expected GPT-4o baseline:** ~0.50

---

### Task 3 (Hard): Concurrency Race Condition

**Difficulty:** 🔴 Hard | **Max attempts:** 10 | **Max steps:** 25 | **Tests:** 8 (all 8 pass on buggy code)

A `ConnectionCounter` class used in a web server to track active connections. It uses `threading.Lock` and appears to be correctly implemented. All 8 sequential unit tests pass. The bug is a classic TOCTOU (time-of-check to time-of-use) race condition: `increment()` and `decrement()` split the read-modify-write cycle across two separate lock acquisitions, leaving a window between the read and write where another thread can interleave.

```python
def increment(self):
    with self._lock:
        current = self.count  # read; lock released after this block
    new_val = current + 1     # modify; no lock held
    with self._lock:
        self.count = new_val  # write; race window exploited
```
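For contrast, a correct version holds the lock across the whole read-modify-write cycle. The sketch below is illustrative (the repository's class may differ in detail) and includes a miniature stress test of the kind the grader runs:

```python
import threading

class ConnectionCounter:
    def __init__(self):
        self._lock = threading.Lock()
        self.count = 0

    def increment(self):
        with self._lock:  # one lock scope covers read, modify, and write
            self.count += 1

    def decrement(self):
        with self._lock:
            self.count -= 1

# Miniature stress test: 1000 threads each increment once.
counter = ConnectionCounter()
threads = [threading.Thread(target=counter.increment) for _ in range(1000)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.count)  # 1000 with the atomic fix; the buggy version loses updates
```
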

The agent must: (1) recognize that 8/8 passing tests do not prove correctness for concurrent code, (2) reason about thread interleaving, (3) design a concurrent stress test that surfaces the race, (4) fix the atomicity issue by collapsing the read-modify-write into a single lock scope, and (5) verify the fix passes both the original tests and a 1000-thread concurrent stress test.

**Why it's hard:** Race conditions are non-deterministic. The bug does not manifest in sequential execution. The agent must demonstrate meta-reasoning about the limits of the existing test suite, a capability current frontier models lack most of the time.

**Hard task grader breakdown:**
- Sequential tests pass: 0.40 (agent submissions only)
- 1000-thread concurrent stress test passes: 0.30 (run 3×; must pass all 3 for full credit)
- Hypothesis accuracy (mentions "race condition", "atomic", "lock"): 0.20
- Efficiency bonus (fixed within 5 attempts): 0.10

**Expected GPT-4o baseline:** ~0.18

---
## Security Sandbox

Every `submit_fix` action executes agent-generated Python code. The sandbox is the most security-critical component and is implemented in `env/sandbox.py`.

### Multi-Layer Protection

**Layer 1: AST Import Filtering.** Before any code runs, an AST pass walks the submitted code and detects blocked imports. Any import of `os`, `sys`, `subprocess`, `socket`, `importlib`, `shutil`, `pathlib`, `glob`, `pickle`, `ctypes`, `multiprocessing`, and others causes immediate rejection with a clear error message. This uses `ast.parse()` + `ast.walk()`, not string matching, which can be bypassed.
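The Layer 1 check can be sketched like this. It is a minimal illustration; the real blocklist, error messages, and rejection logic live in `env/sandbox.py`, and the function name here is hypothetical:

```python
import ast

# Subset of the blocklist named above.
BLOCKED = {"os", "sys", "subprocess", "socket", "importlib", "shutil",
           "pathlib", "glob", "pickle", "ctypes", "multiprocessing"}

def find_blocked_imports(source: str) -> list:
    """Return blocked module names imported anywhere in `source`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import os`, `import os.path as p`, ...
            hits += [a.name for a in node.names
                     if a.name.split(".")[0] in BLOCKED]
        elif isinstance(node, ast.ImportFrom) and node.module:
            # `from pathlib import Path`, ...
            if node.module.split(".")[0] in BLOCKED:
                hits.append(node.module)
    return hits

print(find_blocked_imports("import os\nfrom pathlib import Path"))  # ['os', 'pathlib']
```

An AST pass like this cannot catch dynamic escapes such as `__import__("os")`; that is what the subprocess isolation layer below is for.
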

**Layer 2: Subprocess Isolation.** Code runs in a separate subprocess, not in the server process. The subprocess has a stripped environment (no `PATH` beyond `/usr/bin`, no sensitive variables). Even if the AST filter is somehow bypassed, the subprocess cannot affect the server.

**Layer 3: Hard Timeout.** Every execution is killed after 10 seconds via `subprocess.run(timeout=10)`. Infinite loops in submitted code return `timed_out: True` and a `-0.10` step reward.

**Layer 4: Memory Limit.** 256 MB per execution via environment isolation.

**Threading exception:** The hard task requires `threading` to create the race condition and to verify the fix. The sandbox accepts an `allow_threading=True` flag that removes `threading` from the blocked list for that task only. All other tasks have threading blocked.

---
## API Endpoints

The environment is served as a FastAPI application on port 8000.

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | API overview; lists all endpoints and tasks |
| `/health` | GET | Health check; always returns HTTP 200 |
| `/tasks` | GET | List all tasks with full metadata |
| `/reset` | POST | Start a new episode. Body: `{"task_id": "easy"}` |
| `/step` | POST | Submit one action. Body: Action JSON |
| `/state` | GET | Full internal episode state |

All endpoints always return HTTP 200; errors appear in the response body under `info["error"]`, never as HTTP 4xx/5xx. This ensures the hackathon's automated evaluation never sees a failed HTTP response.

---
## OpenEnv Compliance

```yaml
# openenv.yaml
name: agentdebugger-env
version: 1.0.0
domain: software_engineering
observation_type: structured
action_type: structured
reward_type: dense
episode_termination: action_or_step_limit
tasks:
  - id: easy | difficulty: easy | max_steps: 8 | max_attempts: 5
  - id: medium | difficulty: medium | max_steps: 15 | max_attempts: 7
  - id: hard | difficulty: hard | max_steps: 25 | max_attempts: 10
```

Validation output:

```
✓ openenv.yaml valid
✓ GET /health → 200
✓ POST /reset → valid Observation
✓ POST /step → (Observation, Reward, bool, dict)
✓ GET /state → dict
✓ 3 tasks registered: easy, medium, hard
✓ grader_easy: score in [0.0, 1.0] → PASS
✓ grader_medium: score in [0.0, 1.0] → PASS
✓ grader_hard: score in [0.0, 1.0] → PASS
✓ inference.py present in root directory
openenv validate: PASSED
```

---
## Baseline Results

Evaluated using `gpt-4o` with zero-shot chain-of-thought prompting. Each task was run 5 times independently and the scores averaged.

| Task | Difficulty | Mean Score | Std Dev | Solved % | Avg Attempts | Avg Steps |
|---|---|---|---|---|---|---|
| Off-by-One Bug | Easy | 0.85 | ±0.04 | 100% | 1.8 | 4.2 |
| Red Herring Auth | Medium | 0.50 | ±0.10 | 60% | 4.2 | 10.6 |
| Race Condition | Hard | 0.18 | ±0.09 | 20% | 8.7 | 22.1 |
| **Overall Mean** | | **0.51** | | **60%** | | |

**Key observations:**

**Easy task:** GPT-4o reads the error message, identifies the off-by-one in the while condition on the first or second attempt, and fixes it correctly. Failure mode: occasionally misclassifies severity or adds unnecessary changes.

**Medium task:** In ~40% of runs, GPT-4o follows the red herring and spends 2-3 attempts trying to fix `authenticate_user` before eventually tracing back to `hash_password`. When it identifies the correct function immediately, it solves efficiently. The hypothesis accuracy score penalizes the red-herring runs significantly.

**Hard task:** GPT-4o almost never spontaneously recognizes that a race condition can exist when all sequential tests pass. In the rare runs where it does solve the task (~20%), it correctly identifies that the lock scope must encompass the entire read-modify-write cycle. The 1000-thread concurrent stress test filters out partial fixes where the race window is narrowed but not eliminated.

---
## Setup & Usage

### Local Development

```bash
git clone https://github.com/shasshaank/AgentDebuggerEnv
cd AgentDebuggerEnv
pip install -r requirements.txt

# Start the environment server
uvicorn env.server:app --reload --port 8000

# Verify it's running
curl http://localhost:8000/health
# {"status": "ok", "environment": "agentdebugger-env", "version": "1.0.0"}

# Run baseline inference
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o"
export HF_TOKEN="your_openai_api_key"
export ENV_BASE_URL="http://localhost:8000"
python inference.py
```

### Docker

```bash
# Build
docker build -t agentdebugger-env .

# Run
docker run -p 8000:8000 agentdebugger-env

# Run with inference against the containerized environment
docker run -p 8000:8000 \
  -e API_BASE_URL="https://api.openai.com/v1" \
  -e MODEL_NAME="gpt-4o" \
  -e HF_TOKEN="your_key" \
  agentdebugger-env
```

### Quick API Test

```bash
# Reset the easy task
curl -X POST http://localhost:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "easy"}'

# Submit a fix with hypothesis
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{
    "action_type": "submit_fix",
    "fixed_code": "def binary_search(arr, target):\n    left, right = 0, len(arr) - 1\n    while left <= right:\n        mid = (left + right) // 2\n        if arr[mid] == target:\n            return mid\n        elif arr[mid] < target:\n            left = mid + 1\n        else:\n            right = mid - 1\n    return -1",
    "hypothesis": "The while loop uses left < right instead of left <= right, causing it to skip the last element."
  }'
```

---
## Why This Environment Matters for Agent Research

Four specific failure modes in LLM agents are measurable and scorable here for the first time:

**1. Red herring susceptibility.** Does the agent overtrust error messages over data flow analysis? The medium task's `hypothesis_accuracy` score measures this directly. An agent that follows the red herring scores 0.0 on hypothesis accuracy even if it eventually finds the correct fix by trial and error.

**2. Stagnation under uncertainty.** Does the agent repeat the same failed fix strategy instead of updating its hypothesis? The `-0.05` stagnation penalty and `hypothesis_accuracy` score together capture this. An agent that submits the same code twice is scored negatively twice.

**3. Exploration vs. exploitation.** The `query_context` action costs a step but provides information. The first query is free; subsequent ones cost `-0.05`. Agents that query productively before attempting a fix demonstrate better exploration behavior than those that immediately submit wrong fixes.

**4. Test suite as sufficient proof.** The hard task is specifically designed to test whether an agent knows when passing tests are not enough. An agent that sees 8/8 tests passing and immediately approves the code, without recognizing the concurrency issue, scores at most 0.40 and fails the most important grader component.

All four failure modes produce distinct, interpretable score components in the `breakdown` field of every `Reward` response. This makes AgentDebuggerEnv useful not just as a benchmark but as a diagnostic tool for understanding where a specific model fails in iterative reasoning.

---
## Design Decisions

**Why require a hypothesis?** The `hypothesis` field is mandatory on every `submit_fix` action. Missing it costs `-0.10` and the fix is not executed. This forces agents to articulate their reasoning, which lets the grader score `hypothesis_accuracy` separately from `test_pass_ratio`. It also prevents the degenerate strategy of submitting random code until something passes.

**Why is `best_tests_passed` calculated from agent attempts only?** The medium and hard buggy codes start with 6/10 and 8/8 tests passing respectively. If the grader used the environment's `best_tests_passed` (which includes the initial buggy-code run), a dummy agent that submits nothing would score 0.36 and 0.40 for free. The grader recalculates from the `attempts` list, which contains only what the agent actually submitted, ensuring the score floor is 0.0.

**Why run the concurrent stress test 3 times?** Race conditions are non-deterministic. A partial fix that narrows the race window (but doesn't eliminate it) might pass once by luck. Requiring all 3 runs to pass filters out lucky partial fixes. A fix that passes 1 of 3 receives 0.15: partial credit for progress, but not full credit.

**Why not use pytest directly?** Using pytest as the test runner would make output parsing dependent on pytest's output format and version. The environment instead uses a custom lightweight test runner, written as a Python string executed in the sandbox, that produces a consistent `"N passed, M failed"` format which `_parse_tests_passed()` can reliably parse across all platforms.

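The parsing side of that contract can be sketched as follows. This is a hypothetical re-implementation; the actual `_parse_tests_passed` in the repository may differ:

```python
import re

def parse_tests_passed(output: str):
    """Extract (passed, total) from a runner line like '6 passed, 2 failed'."""
    m = re.search(r"(\d+) passed, (\d+) failed", output)
    if m is None:
        return 0, 0  # unparseable output counts as zero progress
    passed, failed = int(m.group(1)), int(m.group(2))
    return passed, passed + failed

print(parse_tests_passed("Ran suite: 6 passed, 2 failed"))  # (6, 8)
```
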
---

## Environment Configuration

```bash
# Required for inference.py
API_BASE_URL    # LLM API endpoint (e.g. https://api.openai.com/v1)
MODEL_NAME      # Model identifier (e.g. gpt-4o)
HF_TOKEN        # API key / HuggingFace token

# Optional; defaults to http://localhost:8000
ENV_BASE_URL    # Environment server URL
```

---
## License & Attribution

**License:** MIT (see [LICENSE](LICENSE))

**Author:** Shashaank

**Submitted to:** Meta + PyTorch + HuggingFace OpenEnv Hackathon

**Live Environment:** https://huggingface.co/spaces/shashaank0707/AgentDebugger-env