Shashaank committed on
Commit 1c8aca2 · 1 Parent(s): e4f09cc

Fix: Revise README for improved clarity and detail

Updated README to enhance clarity and detail about the AgentDebuggerEnv, including its purpose, architecture, tasks, and installation instructions.

Files changed (1): README.md +431 -54

README.md CHANGED
@@ -1,98 +1,475 @@
  # AgentDebuggerEnv 🐛

- > **Benchmarking Agentic Reasoning through the Iterative Hypothesis-Test-Fix Loop.**

- **AgentDebuggerEnv** is an OpenEnv-compliant benchmarking environment designed for the **Meta + PyTorch + HuggingFace OpenEnv Hackathon**. Unlike static code-repair benchmarks that only measure the final output, AgentDebuggerEnv evaluates the *cognitive trajectory* of an agent: how it forms hypotheses, interprets execution failures, and iterates toward a solution in a secure, live sandbox.

  ---

- ## 🚀 The Core Philosophy

- Traditional benchmarks (like HumanEval or MBPP) are "one-shot": the model sees a prompt and writes code. Real-world engineering is **iterative**.

- AgentDebuggerEnv forces agents to operate in a **live feedback loop**:
- 1. **Observe**: Analyze existing buggy code and initial test failures.
- 2. **Hypothesize**: Explicitly state a theory about the root cause (scored for accuracy).
- 3. **Act**: Submit a surgical fix or query the environment for more context.
- 4. **Verify**: Observe real-time `stdout/stderr` from a sandboxed test suite execution.

  ---

- ## 🛠️ Technical Architecture

- ### 1. Robust Security Sandbox
- Every submission is executed in a multi-layered isolated environment:
- * **AST Filtering**: An Abstract Syntax Tree (AST) pass blocks dangerous imports (`os`, `sys`, `subprocess`, etc.) and builtins before execution.
- * **Process Isolation**: Executes in a separate subprocess with hard memory (256MB) and time (10s) limits.
- * **Thread Safety**: A specialized "Concurrency Sandbox" allows multi-threaded tests for identifying race conditions while maintaining host security.

- ### 2. High-Fidelity Feedback
- Instead of binary `Pass/Fail` bits, the environment returns the **raw execution stream**. This allows agents to:
- * Read stack traces.
- * See partial progress (e.g., "6 passed, 2 failed").
- * Detect timeouts and resource exhaustion.

  ---

- ## 📝 Task Suite & Reasoning Challenges

- | Task | Difficulty | Reasoning Challenge | Why it's hard |
- | :--- | :--- | :--- | :--- |
- | **Easy** | 🟢 Easy | **Off-by-One** | Requires basic logic verification. The error message is high-signal. |
- | **Medium** | 🟡 Medium | **Red Herring** | The symptom (MD5 hashing error) manifests far from the root cause. Agent must trace data flow backward. |
- | **Hard** | 🔴 Hard | **Race Condition** | **Invisible to sequential tests.** The agent must reason that passing tests do *not* mean the code is correct, and design a concurrent stress test. |

  ---

- ## 📊 Professional Grading Methodology

- Our graders don't just check if the code works at the end. They score the **process**:

- * **Sequential Correctness (40%)**: Does the fix pass the original unit tests?
- * **Hidden Strength (30%)**: Does the fix survive a high-concurrency (1000-thread) stress test? (Hard task only).
- * **Hypothesis Accuracy (20%)**: Did the agent correctly identify the bug? (NLP-based keyword matching against ground truth).
- * **Efficiency Bonus (10%)**: Did the agent solve it within 5 attempts?

  ---

- ## ⚙️ Installation & Usage

- ### 📦 Local Setup
- ```bash
- git clone https://huggingface.co/spaces/shashaank0707/AgentDebugger-env
- cd AgentDebugger-env
- pip install -e .
  ```

- ### 🚢 Running the Environment
- ```bash
- # Start the FastAPI server
- uvicorn env.server:app --host 0.0.0.0 --port 8000
  ```

- ### 🤖 Running an Agent (OpenEnv Baseline)
  ```bash
  export API_BASE_URL="https://api.openai.com/v1"
  export MODEL_NAME="gpt-4o"
- export HF_TOKEN="your_openai_key"
  export ENV_BASE_URL="http://localhost:8000"
  python inference.py
  ```

  ---

- ## 🔗 OpenEnv API Compliance

- AgentDebuggerEnv implements the full OpenEnv specification:

- * `POST /reset`: Initialize a task (`{"task_id": "medium"}`).
- * `POST /step`: Submit an `Action` (supports `submit_fix`, `query_context`, `give_up`).
- * `GET /state`: Retrieve full episode history and current environment state.
- * `GET /health`: Standard health check for automated uptime monitoring.

  ---

- ## 📜 Metadata & License
- * **License**: [MIT](LICENSE)
- * **Author**: shashaank
- * **Hackathon**: Meta + PyTorch + HuggingFace OpenEnv 2024
  # AgentDebuggerEnv 🐛

+ > **A live, iterative debugging environment for benchmarking agentic reasoning in AI systems.**
+ > Submitted to the **Meta + PyTorch + HuggingFace OpenEnv Hackathon**.

+ [![HuggingFace Space](https://img.shields.io/badge/🤗%20HuggingFace-Space%20Live-yellow)](https://huggingface.co/spaces/shashaank0707/AgentDebugger-env)
+ [![OpenEnv Compliant](https://img.shields.io/badge/OpenEnv-Compliant-blue)](https://huggingface.co/spaces/shashaank0707/AgentDebugger-env)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
+ [![Python 3.10+](https://img.shields.io/badge/Python-3.10%2B-blue)](https://www.python.org/)
+ [![FastAPI](https://img.shields.io/badge/FastAPI-0.110-009688)](https://fastapi.tiangolo.com/)

  ---

+ ## The Problem with Existing Code Benchmarks

+ Benchmarks like HumanEval, MBPP, and even SWE-bench share a fundamental limitation: they are **one-shot evaluations**. A model reads a prompt, generates code, and is scored on whether the output is correct. This measures code generation ability — not debugging ability.

+ Real software engineering is not one-shot. It is **iterative**. A developer:

+ 1. Reads failing tests and error output
+ 2. Forms a hypothesis about the root cause
+ 3. Submits a fix
+ 4. Reads the new error output
+ 5. Updates their hypothesis
+ 6. Repeats — sometimes many times

+ No existing benchmark measures this loop. **AgentDebuggerEnv does.**

  ---

+ ## What Makes This Different from SWE-bench

+ SWE-bench gives an agent a static GitHub issue and measures only the correctness of the final patch. AgentDebuggerEnv differs along five dimensions:

+ | Dimension | SWE-bench | AgentDebuggerEnv |
+ |---|---|---|
+ | Evaluation target | Final patch quality | Full reasoning trajectory |
+ | Feedback | None — single shot | Real `stdout/stderr` after every fix attempt |
+ | Reward signal | Binary (pass/fail) | Dense — every step is scored |
+ | What's measured | Code generation | Hypothesis formation + iterative reasoning |
+ | Hard task | Applies existing patch | Must design a test to surface a hidden bug |

+ The iterative feedback loop is the core mechanic. Every `step()` call executes the agent's submitted code in a live sandbox and returns the actual test output; the agent must then update its theory and try again — exactly like a real developer at a terminal.

+ ---

+ ## Environment Overview

+ AgentDebuggerEnv is a fully OpenEnv-compliant environment exposing the standard three-method API:

+ ```
+ reset(task_id) → initial Observation
+ step(action)   → Observation, Reward, done, info
+ state()        → current internal state dict
+ ```

+ The environment is deployed as a containerized FastAPI server on HuggingFace Spaces, passes `openenv validate`, and includes a fully reproducible baseline inference script.

+ **Live Space:** https://huggingface.co/spaces/shashaank0707/AgentDebugger-env
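+
+ For orientation, a complete episode against this API fits in a few lines. The following is a minimal client sketch using the HTTP endpoints documented under "API Endpoints" below; `choose_fix()` is an illustrative stand-in for your model, and the JSON field names are assumed from the `(Observation, Reward, done, info)` contract above.
+
+ ```python
+ import requests
+
+ BASE = "http://localhost:8000"
+
+ def run_episode(choose_fix, task_id: str = "easy") -> float:
+     # choose_fix(observation) -> (fixed_code, hypothesis); supplied by your agent
+     obs = requests.post(f"{BASE}/reset", json={"task_id": task_id}).json()
+     cumulative, done = 0.0, False
+     while not done:
+         code, hypothesis = choose_fix(obs)
+         action = {"action_type": "submit_fix",
+                   "fixed_code": code,
+                   "hypothesis": hypothesis}  # omitting the hypothesis costs -0.10
+         result = requests.post(f"{BASE}/step", json=action).json()
+         # Response keys assumed from the step() contract documented above
+         obs = result["observation"]
+         done = result["done"]
+         cumulative = result["reward"]["cumulative_reward"]
+     return cumulative
+ ```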
 
 
 
 

  ---

+ ## Project Structure

+ ```
+ AgentDebuggerEnv/
+ ├── inference.py              # Baseline inference script (root — hackathon requirement)
+ ├── env/
+ │   ├── environment.py        # Core OpenEnv class: reset(), step(), state()
+ │   ├── models.py             # Pydantic v2 Observation, Action, Reward models
+ │   ├── sandbox.py            # AST-based sandboxed code execution
+ │   ├── server.py             # FastAPI server: /reset, /step, /state, /health, /tasks
+ │   ├── tasks/
+ │   │   ├── registry.py       # Task registry
+ │   │   ├── task_easy.py      # Off-by-one bug in binary search
+ │   │   ├── task_medium.py    # Red herring authentication bug
+ │   │   └── task_hard.py      # Concurrency race condition
+ │   └── graders/
+ │       ├── base_grader.py    # Abstract base grader
+ │       ├── grader_easy.py    # Standard test-pass + efficiency scoring
+ │       ├── grader_medium.py  # Red herring detection + score floor fix
+ │       └── grader_hard.py    # Sequential + concurrent stress test scoring
+ ├── server/
+ │   └── app.py                # Entry point alias for openenv validate
+ ├── tests/
+ │   ├── test_environment.py
+ │   ├── test_sandbox.py
+ │   └── test_graders.py
+ ├── openenv.yaml              # OpenEnv spec metadata
+ ├── Dockerfile
+ ├── requirements.txt
+ ├── pyproject.toml
+ ├── uv.lock                   # Reproducible dependency resolution
+ └── .gitignore
+ ```

  ---

+ ## Data Models

+ ### Observation

+ Everything the agent sees at each step. Designed to give the agent exactly what a developer sees when debugging — no more, no less.

+ ```python
+ from typing import Dict, List, Optional
+ from pydantic import BaseModel
+
+ class FixAttempt(BaseModel):
+     attempt_number: int        # 1-indexed
+     code_submitted: str        # Full code the agent submitted
+     hypothesis: str            # Agent's stated theory before this attempt
+     execution_output: str      # Full stdout + stderr from sandbox
+     tests_passed: int
+     tests_total: int
+     execution_time_ms: int
+     timed_out: bool
+
+ class Observation(BaseModel):
+     # Fixed for the episode
+     task_id: str               # "easy" | "medium" | "hard"
+     task_description: str
+     buggy_code: str            # Original broken code — always visible
+     test_suite: str            # Full test file — agent can read requirements
+     initial_error_output: str  # Sandbox output on the buggy code at reset()
+
+     # Changes each step
+     current_code: str          # Most recent submitted code
+     current_error_output: str  # Test output on current_code
+     tests_passed: int
+     tests_total: int
+     previous_attempts: List[FixAttempt]  # Full episode history
+
+     # Budget tracking
+     attempts_remaining: int
+     max_attempts: int
+     step_number: int
+     max_steps: int
+     done: bool
+     score_estimate: float      # Running grader estimate shown to agent
+     hint_used: bool
+ ```

+ ### Action

+ The agent submits exactly one action per step. Three types:

+ ```python
+ class Action(BaseModel):
+     action_type: str                    # "submit_fix" | "query_context" | "give_up"
+
+     # submit_fix — primary action
+     fixed_code: Optional[str] = None    # Complete corrected code file
+     hypothesis: Optional[str] = None    # REQUIRED — missing costs -0.10 reward
+
+     # query_context — request more information (first is free)
+     query_type: Optional[str] = None    # "function_signature" | "related_code"
+                                         # | "error_explanation" | "test_details"
+     query_target: Optional[str] = None
+
+     # give_up — explicit surrender, ends episode cleanly
+     final_diagnosis: Optional[str] = None
+ ```

+ ### Reward

+ Dense signal at every step — not just binary end-of-episode.

+ ```python
+ class Reward(BaseModel):
+     step_reward: float           # This step: -1.0 to +1.0
+     cumulative_reward: float     # Episode total so far
+     grader_score: float          # 0.0 during episode; official score on terminal step
+     breakdown: Dict[str, float]  # Itemized components for interpretability
+ ```

  ---

+ ## Reward Function

+ The reward function is designed so an RL agent receives meaningful signal at every step, not just when tests pass.

+ ### Step-Level Rewards

+ | Event | Reward | Reasoning |
+ |---|---|---|
+ | Fix increases tests passing | `+0.15 × (Δpassed / total)` | Scaled progress reward |
+ | Fix decreases tests passing | `-0.10 × (Δfailed / total)` | Regression penalty |
+ | Fix makes no change | `-0.05` | Stagnation penalty — discourages repetition |
+ | All tests pass | `+0.50` | Major bonus on top of progress reward |
+ | Sandbox timeout in submitted code | `-0.10` | Penalizes infinite loops |
+ | `submit_fix` without hypothesis | `-0.10` | Hypothesis is required |
+ | Repeated `query_context` calls | `-0.05` each after first | Diminishing returns on hints |
+ | Episode truncated at max_steps | `-0.20` | Penalizes indecision |
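+
+ Read as code, a `submit_fix` step composes these rules roughly as follows — a sketch of the table above, not the literal implementation (the function name and argument list are illustrative):
+
+ ```python
+ def step_reward_for_fix(prev_passed: int, passed: int, total: int,
+                         timed_out: bool, has_hypothesis: bool) -> float:
+     # Missing hypothesis: flat penalty (the fix is also not executed)
+     if not has_hypothesis:
+         return -0.10
+     delta = passed - prev_passed
+     if delta > 0:
+         reward = 0.15 * (delta / total)    # scaled progress reward
+     elif delta < 0:
+         reward = -0.10 * (-delta / total)  # regression penalty
+     else:
+         reward = -0.05                     # stagnation penalty
+     if timed_out:
+         reward += -0.10                    # infinite-loop penalty
+     if passed == total:
+         reward += 0.50                     # major all-tests-pass bonus
+     return reward
+ ```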

+ ### Episode-Level Grader Score (0.0 → 1.0)

  ```
+ grader_score = test_pass_ratio      × 0.60
+              + efficiency_bonus     × 0.20
+              + hypothesis_accuracy  × 0.15
+              + early_solve_bonus    × 0.05

+ where:
+   test_pass_ratio     = agent_best_tests_passed / tests_total
+                         (from agent submissions only — not initial buggy code)
+   efficiency_bonus    = max(0, (max_attempts - attempts_used) / max_attempts)
+   hypothesis_accuracy = fraction of hypotheses correctly identifying bug location
+   early_solve_bonus   = 1.0 if all tests pass within ceil(max_attempts / 3) attempts, else 0.0
  ```

+ **Score floor design:** `test_pass_ratio` is calculated only from the agent's submitted attempts — never from the initial buggy code run. This guarantees that a dummy agent that submits nothing scores 0.0, not an inflated baseline.
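+
+ A sketch of how a grader realizes this, including the score floor (names are illustrative; the real graders live in `env/graders/`):
+
+ ```python
+ import math
+
+ def grade_episode(attempts, tests_total, max_attempts, hypothesis_accuracy):
+     # Score floor: an agent that never submits anything scores exactly 0.0
+     if not attempts:
+         return 0.0
+     # Best pass count over the agent's own submissions — never the initial buggy run
+     best = max(a.tests_passed for a in attempts)
+     test_pass_ratio = best / tests_total
+     efficiency_bonus = max(0, (max_attempts - len(attempts)) / max_attempts)
+     early_cutoff = math.ceil(max_attempts / 3)
+     early_solve_bonus = 1.0 if any(a.tests_passed == tests_total
+                                    for a in attempts[:early_cutoff]) else 0.0
+     return (test_pass_ratio * 0.60
+             + efficiency_bonus * 0.20
+             + hypothesis_accuracy * 0.15
+             + early_solve_bonus * 0.05)
+ ```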

+ ---

+ ## Tasks

+ ### Task 1 — Easy: Off-by-One Bug

+ **Difficulty:** 🟢 Easy | **Max attempts:** 5 | **Max steps:** 8 | **Tests:** 8

+ A binary search implementation with a single-character bug: the while loop uses `left < right` instead of `left <= right`. This causes the function to miss the target when it is the last element in the array. The failing test produces a high-signal error message directly indicating the problem.
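+
+ A reconstruction of the planted bug, consistent with the description above and with the corrected version shown in the Quick API Test section below:
+
+ ```python
+ def binary_search(arr, target):
+     left, right = 0, len(arr) - 1
+     while left < right:  # BUG: never examines the final candidate; should be left <= right
+         mid = (left + right) // 2
+         if arr[mid] == target:
+             return mid
+         elif arr[mid] < target:
+             left = mid + 1
+         else:
+             right = mid - 1
+     return -1
+
+ # binary_search([1, 3, 5], 5) returns -1 instead of 2: the loop exits
+ # as soon as left == right, without ever checking that last position.
+ ```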

+ **Why it's easy:** The error message names the failing assertion. Reading the while condition reveals the bug. One to two iterations expected for any competent agent.

+ **What the grader checks:** Did the agent fix all 8 tests? Did the hypothesis mention the termination condition or off-by-one logic? Was it solved efficiently?

+ **Expected GPT-4o baseline:** ~0.85

+ ---

+ ### Task 2 — Medium: Red Herring Authentication Bug

+ **Difficulty:** 🟡 Medium | **Max attempts:** 7 | **Max steps:** 15 | **Tests:** 10 (6 pass, 4 fail on buggy code)

+ An authentication module with three interdependent functions: `hash_password`, `validate_password`, and `authenticate_user`. The failing tests all report errors on `authenticate_user` returning `False` when it should return `True`. However, `authenticate_user` is completely correct. So is `validate_password`. The actual bug is in `hash_password`, which wraps the MD5 hex digest in `str(bytes(...))` — producing a `"b'...'"` prefix that corrupts the hash string.
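+
+ A sketch of the buggy function, assuming the simplest shape consistent with that description (the actual task code lives in `env/tasks/task_medium.py`):
+
+ ```python
+ import hashlib
+
+ def hash_password(password: str) -> str:
+     digest = hashlib.md5(password.encode()).hexdigest()
+     # BUG: wraps the hex digest in str(bytes(...)), so the returned value
+     # becomes "b'5f4d...'" instead of the bare "5f4d..." the stored hashes use
+     return str(bytes(digest, "utf-8"))
+ ```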

+ The red herring: the error message names `authenticate_user`. Every surface-level reading of the error points to the wrong function. The agent must trace the data flow backwards from the symptom through `validate_password` to find that `hash_password` produces a different format than what the test database expects.

+ **Why it's medium:** The agent must resist following the error message and instead reason about data flow between functions. GPT-4o follows this red herring approximately 40% of the time.

+ **Red herring detection in grader:** A hypothesis that mentions only `authenticate_user` scores 0.0 for hypothesis accuracy. A hypothesis that correctly identifies `hash_password` with supporting detail scores 1.0.

+ **Expected GPT-4o baseline:** ~0.50

+ ---

+ ### Task 3 — Hard: Concurrency Race Condition

+ **Difficulty:** 🔴 Hard | **Max attempts:** 10 | **Max steps:** 25 | **Tests:** 8 (all 8 pass on buggy code)

+ A `ConnectionCounter` class used in a web server to track active connections. It uses `threading.Lock` and appears to be correctly implemented. All 8 sequential unit tests pass. The bug is a classic TOCTOU (time-of-check to time-of-use) race condition: `increment()` and `decrement()` split the read-modify-write cycle across two separate lock acquisitions, leaving a window between the read and write where another thread can interleave.

+ ```python
+ def increment(self):
+     with self._lock:
+         current = self.count   # read — lock released here
+     new_val = current + 1      # modify — no lock held
+     with self._lock:
+         self.count = new_val   # write — race window exploited
+ ```

+ The agent must: (1) recognize that 8/8 passing tests do not prove correctness for concurrent code, (2) reason about thread interleaving, (3) design a concurrent stress test that surfaces the race, (4) fix the atomicity issue by collapsing the read-modify-write into a single lock scope, as sketched below, and (5) verify the fix passes both the original tests and a 1000-thread concurrent stress test.
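+
+ The shape of a correct fix for step (4) — the entire read-modify-write cycle inside one lock acquisition:
+
+ ```python
+ def increment(self):
+     with self._lock:
+         self.count += 1  # read, modify, and write now happen atomically under the lock
+ ```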

+ **Why it's hard:** Race conditions are non-deterministic. The bug does not manifest in sequential execution. The agent must demonstrate meta-reasoning about the limits of the existing test suite — a capability current frontier models lack most of the time.

+ **Hard task grader breakdown:**
+ - Sequential tests pass: 0.40 (agent submissions only)
+ - 1000-thread concurrent stress test passes: 0.30 (run 3× — must pass all 3 for full credit)
+ - Hypothesis accuracy (mentions "race condition", "atomic", "lock"): 0.20
+ - Efficiency bonus (fixed within 5 attempts): 0.10

+ **Expected GPT-4o baseline:** ~0.18

+ ---

+ ## Security Sandbox

+ Every `submit_fix` action executes agent-generated Python code. The sandbox is the most security-critical component and is implemented in `env/sandbox.py`.

+ ### Multi-Layer Protection

+ **Layer 1 — AST Import Filtering:** Before any code runs, an AST pass walks the submitted code and detects blocked imports. Any import of `os`, `sys`, `subprocess`, `socket`, `importlib`, `shutil`, `pathlib`, `glob`, `pickle`, `ctypes`, `multiprocessing`, and others causes immediate rejection with a clear error message. This uses `ast.parse()` + `ast.walk()` — not string matching, which can be bypassed.
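+
+ The core of that check looks roughly like this (a minimal sketch; `env/sandbox.py` is the authority on the full blocklist and node coverage):
+
+ ```python
+ import ast
+
+ BLOCKED = {"os", "sys", "subprocess", "socket", "importlib", "shutil",
+            "pathlib", "glob", "pickle", "ctypes", "multiprocessing"}
+
+ def check_imports(code: str) -> None:
+     """Walk the AST and reject blocked imports before anything executes."""
+     for node in ast.walk(ast.parse(code)):
+         if isinstance(node, ast.Import):
+             names = [alias.name for alias in node.names]
+         elif isinstance(node, ast.ImportFrom):
+             names = [node.module or ""]
+         else:
+             continue
+         for name in names:
+             if name.split(".")[0] in BLOCKED:
+                 raise ValueError(f"Blocked import: {name}")
+ ```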

+ **Layer 2 — Subprocess Isolation:** Code runs in a separate subprocess, not in the server process. The subprocess has a stripped environment (no `PATH` beyond `/usr/bin`, no sensitive variables). Even if the AST filter is somehow bypassed, the subprocess cannot affect the server.

+ **Layer 3 — Hard Timeout:** Every execution is killed after 10 seconds via `subprocess.run(timeout=10)`. Infinite loops in submitted code return `timed_out: True` and a `-0.10` step reward.

+ **Layer 4 — Memory Limit:** 256MB per execution via environment isolation.

+ **Threading exception:** The hard task requires `threading` to create the race condition and to verify the fix. The sandbox accepts an `allow_threading=True` flag that removes `threading` from the blocked list for that task only. All other tasks have threading blocked.

+ ---

+ ## API Endpoints

+ The environment is served as a FastAPI application on port 8000.

+ | Endpoint | Method | Description |
+ |---|---|---|
+ | `/` | GET | API overview — lists all endpoints and tasks |
+ | `/health` | GET | Health check — always returns HTTP 200 |
+ | `/tasks` | GET | List all tasks with full metadata |
+ | `/reset` | POST | Start a new episode. Body: `{"task_id": "easy"}` |
+ | `/step` | POST | Submit one action. Body: Action JSON |
+ | `/state` | GET | Full internal episode state |

+ Every endpoint always returns HTTP 200 — errors appear in the response body under `info["error"]`, never as HTTP 4xx/5xx. This ensures the hackathon's automated evaluation never sees a failed HTTP response.
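+
+ In FastAPI terms the pattern is roughly this (a sketch, not the exact `env/server.py` code; the `env` singleton and response keys are illustrative):
+
+ ```python
+ from fastapi import FastAPI
+
+ app = FastAPI()
+
+ @app.post("/step")
+ def step_endpoint(action: dict):
+     try:
+         obs, reward, done, info = env.step(action)  # env: the server's environment singleton
+         return {"observation": obs, "reward": reward, "done": done, "info": info}
+     except Exception as exc:
+         # Errors ride in the body with HTTP 200, never as 4xx/5xx
+         return {"observation": None, "reward": None, "done": True,
+                 "info": {"error": str(exc)}}
+ ```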

+ ---

+ ## OpenEnv Compliance

+ ```yaml
+ # openenv.yaml
+ name: agentdebugger-env
+ version: 1.0.0
+ domain: software_engineering
+ observation_type: structured
+ action_type: structured
+ reward_type: dense
+ episode_termination: action_or_step_limit
+ tasks:
+   - id: easy   | difficulty: easy   | max_steps: 8  | max_attempts: 5
+   - id: medium | difficulty: medium | max_steps: 15 | max_attempts: 7
+   - id: hard   | difficulty: hard   | max_steps: 25 | max_attempts: 10
+ ```

+ Validation output:
+ ```
+ ✓ openenv.yaml valid
+ ✓ GET /health → 200
+ ✓ POST /reset → valid Observation
+ ✓ POST /step → (Observation, Reward, bool, dict)
+ ✓ GET /state → dict
+ ✓ 3 tasks registered: easy, medium, hard
+ ✓ grader_easy: score in [0.0, 1.0] — PASS
+ ✓ grader_medium: score in [0.0, 1.0] — PASS
+ ✓ grader_hard: score in [0.0, 1.0] — PASS
+ ✓ inference.py present in root directory
+ openenv validate: PASSED
+ ```

+ ---

+ ## Baseline Results

+ Evaluated using `gpt-4o` with zero-shot chain-of-thought prompting. Each task was run 5 times independently and scores averaged.

+ | Task | Difficulty | Mean Score | Std Dev | Solved % | Avg Attempts | Avg Steps |
+ |---|---|---|---|---|---|---|
+ | Off-by-One Bug | Easy | 0.85 | ±0.04 | 100% | 1.8 | 4.2 |
+ | Red Herring Auth | Medium | 0.50 | ±0.10 | 60% | 4.2 | 10.6 |
+ | Race Condition | Hard | 0.18 | ±0.09 | 20% | 8.7 | 22.1 |
+ | **Overall Mean** | | **0.51** | | **60%** | | |

+ **Key observations:**

+ **Easy task:** GPT-4o reads the error message, identifies the off-by-one in the while condition on the first or second attempt, and fixes it correctly. Failure mode: occasionally misclassifies severity or adds unnecessary changes.

+ **Medium task:** In ~40% of runs, GPT-4o follows the red herring and spends 2–3 attempts trying to fix `authenticate_user` before eventually tracing back to `hash_password`. When it identifies the correct function immediately, it solves efficiently. The hypothesis accuracy score penalizes the red-herring runs significantly.

+ **Hard task:** GPT-4o almost never spontaneously recognizes that a race condition can exist when all sequential tests pass. In the rare runs where it does solve the task (~20%), it correctly identifies that the lock scope must encompass the entire read-modify-write cycle. The 1000-thread concurrent stress test filters out partial fixes where the race window is narrowed but not eliminated.

+ ## Setup & Usage

+ ### Local Development

  ```bash
+ git clone https://github.com/shasshaank/AgentDebuggerEnv
+ cd AgentDebuggerEnv
+ pip install -r requirements.txt

+ # Start the environment server
+ uvicorn env.server:app --reload --port 8000

+ # Verify it's running
+ curl http://localhost:8000/health
+ # {"status": "ok", "environment": "agentdebugger-env", "version": "1.0.0"}

+ # Run baseline inference
  export API_BASE_URL="https://api.openai.com/v1"
  export MODEL_NAME="gpt-4o"
+ export HF_TOKEN="your_openai_api_key"
  export ENV_BASE_URL="http://localhost:8000"
  python inference.py
  ```

+ ### Docker

+ ```bash
+ # Build
+ docker build -t agentdebugger-env .

+ # Run
+ docker run -p 8000:8000 agentdebugger-env

+ # Run with inference against the containerized environment
+ docker run -p 8000:8000 \
+   -e API_BASE_URL="https://api.openai.com/v1" \
+   -e MODEL_NAME="gpt-4o" \
+   -e HF_TOKEN="your_key" \
+   agentdebugger-env
+ ```

+ ### Quick API Test

+ ```bash
+ # Reset the easy task
+ curl -X POST http://localhost:8000/reset \
+   -H "Content-Type: application/json" \
+   -d '{"task_id": "easy"}'

+ # Submit a fix with hypothesis
+ curl -X POST http://localhost:8000/step \
+   -H "Content-Type: application/json" \
+   -d '{
+     "action_type": "submit_fix",
+     "fixed_code": "def binary_search(arr, target):\n    left, right = 0, len(arr) - 1\n    while left <= right:\n        mid = (left + right) // 2\n        if arr[mid] == target:\n            return mid\n        elif arr[mid] < target:\n            left = mid + 1\n        else:\n            right = mid - 1\n    return -1",
+     "hypothesis": "The while loop uses left < right instead of left <= right, causing it to skip the last element."
+   }'
+ ```

+ ---

+ ## Why This Environment Matters for Agent Research

+ Four specific failure modes in LLM agents are measurable and scorable here for the first time:

+ **1. Red herring susceptibility** — Does the agent trust error messages over data-flow analysis? The medium task's `hypothesis_accuracy` score measures this directly. An agent that follows the red herring scores 0.0 on hypothesis accuracy even if it eventually finds the correct fix by trial and error.

+ **2. Stagnation under uncertainty** — Does the agent repeat the same failed fix strategy instead of updating its hypothesis? The `-0.05` stagnation penalty and the `hypothesis_accuracy` score together capture this. An agent that submits the same code twice is penalized twice.

+ **3. Exploration vs. exploitation** — The `query_context` action costs a step but provides information. The first query is free; subsequent ones cost `-0.05`. Agents that query productively before attempting a fix demonstrate better exploration behavior than those that immediately submit wrong fixes.

+ **4. Test suite as sufficient proof** — The hard task is specifically designed to test whether an agent knows when passing tests are not enough. An agent that sees 8/8 tests passing and immediately approves the code — without recognizing the concurrency issue — scores at most 0.40 and fails the most important grader component.

+ All four failure modes produce distinct, interpretable score components in the `breakdown` field of every `Reward` response. This makes AgentDebuggerEnv useful not just as a benchmark but as a diagnostic tool for understanding where a specific model fails in iterative reasoning.

  ---

+ ## Design Decisions

+ **Why require a hypothesis?** The `hypothesis` field is mandatory on every `submit_fix` action. Missing it costs `-0.10` and the fix is not executed. This forces agents to articulate their reasoning, which enables the grader to score `hypothesis_accuracy` separately from `test_pass_ratio`. It also prevents the degenerate strategy of submitting random code until something passes.

+ **Why is `best_tests_passed` calculated from agent attempts only?** The medium and hard buggy codes start with 6/10 and 8/8 tests passing respectively. If the grader used the environment's `best_tests_passed` (which includes the initial buggy code run), a dummy agent that submits nothing would score 0.36 and 0.40 for free. The grader recalculates from the `attempts` list — which contains only what the agent actually submitted — ensuring the score floor is 0.0.

+ **Why run the concurrent stress test 3 times?** Race conditions are non-deterministic. A partial fix that narrows the race window (but doesn't eliminate it) might pass once by luck. Requiring all 3 runs to pass filters out lucky partial fixes. A fix that passes 1 of 3 receives 0.15 — partial credit for progress, but not full credit.
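+
+ For concreteness, each stress run is shaped like this sketch (thread count per the grader description; the harness details are illustrative):
+
+ ```python
+ import threading
+
+ def stress_once(counter, n_threads: int = 1000) -> bool:
+     # Every thread increments once; any lost update means the race survived
+     threads = [threading.Thread(target=counter.increment) for _ in range(n_threads)]
+     for t in threads:
+         t.start()
+     for t in threads:
+         t.join()
+     return counter.count == n_threads
+
+ # Grader policy: full credit only if all 3 independent runs pass
+ passed_all = all(stress_once(ConnectionCounter()) for _ in range(3))
+ ```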

+ **Why not use pytest directly?** Using pytest as the test runner would make output parsing dependent on pytest's output format and version. The environment instead uses a custom lightweight test runner, written as a Python string and executed in the sandbox, producing a consistent `"N passed, M failed"` format that `_parse_tests_passed()` can reliably parse across all platforms.
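+
+ That parser then reduces to a single pattern match — a sketch consistent with the format described above:
+
+ ```python
+ import re
+
+ def _parse_tests_passed(output: str) -> tuple[int, int]:
+     """Extract (passed, total) from the runner's 'N passed, M failed' summary."""
+     m = re.search(r"(\d+) passed, (\d+) failed", output)
+     if m is None:
+         return 0, 0  # no summary line: the run crashed or timed out
+     passed, failed = int(m.group(1)), int(m.group(2))
+     return passed, passed + failed
+ ```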
 

  ---

+ ## Environment Configuration

+ ```bash
+ # Required for inference.py
+ API_BASE_URL   # LLM API endpoint (e.g. https://api.openai.com/v1)
+ MODEL_NAME     # Model identifier (e.g. gpt-4o)
+ HF_TOKEN       # API key / HuggingFace token

+ # Optional — defaults to localhost:8000
+ ENV_BASE_URL   # Environment server URL
+ ```

+ ---

+ ## License & Attribution

+ **License:** MIT — see [LICENSE](LICENSE)

+ **Author:** Shashaank

+ **Submitted to:** Meta + PyTorch + HuggingFace OpenEnv Hackathon

+ **Live Environment:** https://huggingface.co/spaces/shashaank0707/AgentDebugger-env