shank committed
Commit · 5c507c3
1 parent: 3548cd0
Update: Even more Final README.md update

Files changed:
- README.md (+4 -4)
- openenv.yaml (+1 -1)

README.md
CHANGED
@@ -123,7 +123,7 @@ The agent must: recognize that 8/8 passing tests do not prove correctness for co
 
 **Hard task grader breakdown:**
 - Sequential tests pass (agent submissions only): **0.40**
-- 1000-thread concurrent stress test passes (run
+- 1000-thread concurrent stress test passes (run 5×, must pass >=4 for full credit): **0.30**
 - Hypothesis accuracy (mentions "race condition", "atomic", "lock"): **0.20**
 - Efficiency bonus (fixed within 5 attempts): **0.10**
 
@@ -170,9 +170,9 @@ early_solve_bonus = 0.05 if solved within ceil(max_attempts / 3) attempts
 
 Every `submit_fix` action executes agent-generated Python code. All execution routes through `env/sandbox.py` – never via raw `exec()` anywhere in the codebase.
 
-**Layer 1 – AST Import Filtering:** Before execution, an AST walk detects blocked imports
+**Layer 1 – AST Import & Attribute Filtering:** Before execution, an AST walk detects blocked imports and prevents access to any attribute starting with an underscore (`_`). This blocks private member access and dunder escapes (like `__class__`).
 
-**Layer 2 – Subprocess Isolation:** Code runs in a child subprocess with a stripped environment
+**Layer 2 – Subprocess Isolation:** Code runs in a child subprocess with a stripped environment and no network access.
 
 **Layer 3 – Hard Timeout:** Every execution killed after 10 seconds. Infinite loops in submitted code return `timed_out: True` and a `-0.10` step reward.
 
@@ -342,7 +342,7 @@ AgentDebuggerEnv/
 
 **Why recalculate `test_pass_ratio` from the attempts list?** The medium buggy code passes 6/10 tests and the hard buggy code passes 8/8 tests sequentially. If the grader used the environment's `best_tests_passed` (which includes the initial buggy code run at reset), a dummy agent that submits nothing would score 0.36 and 0.40 for free. Recalculating from the `attempts` list guarantees the score floor is 0.0.
 
-**Why run the concurrent stress test
+**Why run the concurrent stress test 5 times?** Race conditions are non-deterministic. A partial fix that narrows the race window may pass once by luck. Requiring 4 of 5 runs to pass provides a robust statistical threshold that filters out lucky partial fixes while allowing for minor runner jitter. Passing 2 of 5 gives 0.15 – partial credit for progress.
 
 **Why not use pytest directly?** Using pytest as the test runner makes output parsing dependent on pytest's version and output format. The environment uses a lightweight custom test runner embedded as a Python string, producing a consistent `"N passed, M failed"` format that `_parse_tests_passed()` can reliably parse across all platforms and environments.
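For reference, the hard-task weights in the updated breakdown above (0.40 + 0.30 + 0.20 + 0.10) sum to 1.0. A minimal sketch of that arithmetic is below. This is illustrative, not the repo's actual grader: the function names are hypothetical, and the stress-test tiering between the two stated points (>=4 of 5 runs for the full 0.30, 2 of 5 for 0.15) is an assumption.

```python
# Sketch of the hard-task grader arithmetic described in the README diff.
# The weights come from the README; function names are hypothetical, and
# the exact tiering between 2 and 4 passing stress runs is an assumption.

def stress_credit(passes: int, runs: int = 5) -> float:
    """Credit for the 1000-thread concurrent stress test, run `runs` times."""
    if passes >= 4:
        return 0.30   # full credit: fix is robust under repetition
    if passes >= 2:
        return 0.15   # partial credit for progress (README states 2/5 -> 0.15)
    return 0.0

def hard_task_score(seq_ratio: float, stress_passes: int,
                    hypothesis_ok: bool, efficient: bool) -> float:
    """Combine the four weighted components from the breakdown."""
    return (0.40 * seq_ratio                      # sequential tests pass
            + stress_credit(stress_passes)        # concurrent stress test
            + (0.20 if hypothesis_ok else 0.0)    # hypothesis accuracy
            + (0.10 if efficient else 0.0))       # efficiency bonus

# Perfect run: all sequential tests, 5/5 stress runs, correct hypothesis,
# fixed within 5 attempts.
print(round(hard_task_score(1.0, 5, True, True), 2))
```

A dummy agent that submits nothing lands at 0.0 on every component, which matches the "score floor is 0.0" rationale in the FAQ above.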
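The Layer 1 check added in this commit (blocked imports plus underscore-prefixed attribute access) can be sketched with the standard `ast` module. This is a sketch, not the repo's actual `env/sandbox.py`: the blocklist contents and function name here are hypothetical.

```python
import ast

# Hypothetical blocklist -- the real env/sandbox.py may block a different set.
BLOCKED_MODULES = {"os", "sys", "socket", "subprocess", "ctypes"}

def check_code(source: str) -> list[str]:
    """Walk the AST before execution and collect violations:
    blocked imports, and any attribute whose name starts with `_`
    (which covers private members and dunder escapes like __class__)."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BLOCKED_MODULES:
                    violations.append(f"blocked import: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BLOCKED_MODULES:
                violations.append(f"blocked import: {node.module}")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("_"):
            violations.append(f"blocked attribute: {node.attr}")
    return violations

print(check_code("import socket"))           # flags the blocked import
print(check_code("x = ().__class__"))        # flags the dunder escape
print(check_code("import math\nprint(1)"))   # no violations
```

Blocking `__class__` and friends matters because dunder attribute chains are the classic way to climb from an innocuous object back to builtins and re-import blocked modules inside a "filtered" interpreter.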
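Layers 2 and 3 (subprocess isolation with a stripped environment, plus the 10-second hard kill) can be sketched with `subprocess.run`. Again a sketch under stated assumptions, not the repo's API: the function name and return shape are illustrative, and truly removing network access requires OS-level controls (namespaces, seccomp, firewall rules) beyond what a stripped environment alone provides.

```python
import subprocess
import sys

def run_sandboxed(code: str, timeout: float = 10.0) -> dict:
    """Run untrusted code in a child interpreter with an emptied
    environment and a hard timeout, mirroring Layers 2 and 3."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True,
            text=True,
            env={},              # Layer 2: stripped environment
            timeout=timeout,     # Layer 3: hard kill after `timeout` seconds
        )
        return {"timed_out": False, "stdout": proc.stdout,
                "returncode": proc.returncode}
    except subprocess.TimeoutExpired:
        # An infinite loop lands here; the caller maps this to
        # timed_out: True and the -0.10 step reward.
        return {"timed_out": True, "stdout": ""}

result = run_sandboxed("print('hello')")
print(result["timed_out"], result["stdout"].strip())
```

The timeout path is what turns an infinite loop in submitted code into the `timed_out: True` observation described above, rather than hanging the environment.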
openenv.yaml
CHANGED

@@ -54,7 +54,7 @@ baseline:
   medium: 0.50
   hard: 0.18
 author: "Shashaank (GitHub: @shasshaank, HF: @shashaank0707)"
-# Submission Integrity: SHA
+# Submission Integrity: SHA e93446da6e57b3f582db65a947dc0abef18e66c6 | Verified 2026-04-09
 license: MIT
 huggingface_space: shashaank0707/AgentDebugger-env
 api_base_url_env_var: API_BASE_URL