shank committed on
Commit 5c507c3 · 1 Parent(s): 3548cd0

Update: Even more Final README.md update

Files changed (2)
  1. README.md +4 -4
  2. openenv.yaml +1 -1
README.md CHANGED
@@ -123,7 +123,7 @@ The agent must: recognize that 8/8 passing tests do not prove correctness for co
 
  **Hard task grader breakdown:**
  - Sequential tests pass (agent submissions only): **0.40**
- - 1000-thread concurrent stress test passes (run 3×, must pass all 3): **0.30**
  - Hypothesis accuracy (mentions "race condition", "atomic", "lock"): **0.20**
  - Efficiency bonus (fixed within 5 attempts): **0.10**
 
@@ -170,9 +170,9 @@ early_solve_bonus = 0.05 if solved within ceil(max_attempts / 3) attempts
 
  Every `submit_fix` action executes agent-generated Python code. All execution routes through `env/sandbox.py`, never via raw `exec()` anywhere in the codebase.
 
- **Layer 1 - AST Import Filtering:** Before execution, an AST walk detects blocked imports (`os`, `sys`, `subprocess`, `socket`, `importlib`, `shutil`, `pathlib`, `pickle`, `ctypes`, `multiprocessing`, and others). It uses `ast.parse()` and `ast.walk()` rather than string matching, which can be bypassed.
 
- **Layer 2 - Subprocess Isolation:** Code runs in a child subprocess with a stripped environment. Even if the AST filter is bypassed, the subprocess cannot affect the server process.
 
  **Layer 3 - Hard Timeout:** Every execution is killed after 10 seconds. Infinite loops in submitted code return `timed_out: True` and a `-0.10` step reward.
 
@@ -342,7 +342,7 @@ AgentDebuggerEnv/
 
  **Why recalculate `test_pass_ratio` from the attempts list?** The medium buggy code passes 6/10 tests and the hard buggy code passes 8/8 tests sequentially. If the grader used the environment's `best_tests_passed` (which includes the initial buggy code run at reset), a dummy agent that submits nothing would score 0.36 and 0.40 for free. Recalculating from the `attempts` list guarantees the score floor is 0.0.
 
- **Why run the concurrent stress test 3 times?** Race conditions are non-deterministic. A partial fix that narrows the race window may pass once by luck. Requiring all 3 runs to pass filters out lucky partial fixes. Passing 1 of 3 gives 0.15: partial credit for progress, not full credit.
 
  **Why not use pytest directly?** Using pytest as the test runner makes output parsing dependent on pytest's version and output format. The environment uses a lightweight custom test runner embedded as a Python string, producing a consistent `"N passed, M failed"` format that `_parse_tests_passed()` can reliably parse across all platforms and environments.
 
 
 
  **Hard task grader breakdown:**
  - Sequential tests pass (agent submissions only): **0.40**
+ - 1000-thread concurrent stress test passes (run 5×, must pass >=4 for full credit): **0.30**
  - Hypothesis accuracy (mentions "race condition", "atomic", "lock"): **0.20**
  - Efficiency bonus (fixed within 5 attempts): **0.10**
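
The updated breakdown can be read as a weighted sum. The sketch below is illustrative only and is not taken from the repository: the function names and the input shapes are hypothetical. The README fixes just two points of the stress-test tier (at least 4 of 5 runs earns the full 0.30, and 2 of 5 earns 0.15); the values used here for 3, 1, and 0 passing runs are assumptions.

```python
# Weights from the README's hard task grader breakdown.
WEIGHTS = {
    "sequential": 0.40,
    "stress": 0.30,
    "hypothesis": 0.20,
    "efficiency": 0.10,
}

STRESS_RUNS = 5  # the stress test is run 5 times per the README


def stress_credit(runs_passed: int) -> float:
    """Fraction of the stress-test weight earned for a number of passing runs.

    Stated in the README: >=4 of 5 earns full credit, 2 of 5 earns half.
    ASSUMPTION: 3 of 5 also earns half, and 0 or 1 earns nothing.
    """
    if not 0 <= runs_passed <= STRESS_RUNS:
        raise ValueError(f"runs_passed must be in 0..{STRESS_RUNS}")
    if runs_passed >= 4:
        return 1.0
    if runs_passed >= 2:
        return 0.5
    return 0.0


def hard_task_score(
    sequential_ratio: float,   # share of sequential tests passed, 0.0 to 1.0
    stress_runs_passed: int,   # passing stress runs out of STRESS_RUNS
    hypothesis_ratio: float,   # hypothetical 0.0 to 1.0 hypothesis accuracy
    fixed_within_5: bool,      # whether the fix landed within 5 attempts
) -> float:
    """Combine the four components into a single score in [0.0, 1.0]."""
    score = (
        WEIGHTS["sequential"] * sequential_ratio
        + WEIGHTS["stress"] * stress_credit(stress_runs_passed)
        + WEIGHTS["hypothesis"] * hypothesis_ratio
        + WEIGHTS["efficiency"] * (1.0 if fixed_within_5 else 0.0)
    )
    return round(score, 4)  # rounding hides float accumulation noise
```

Under these assumptions, an agent that passes every sequential test but only 2 of 5 stress runs, with no hypothesis credit and no efficiency bonus, gets `hard_task_score(1.0, 2, 0.0, False)`, which is 0.40 plus 0.15.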
 
 
 
  Every `submit_fix` action executes agent-generated Python code. All execution routes through `env/sandbox.py`, never via raw `exec()` anywhere in the codebase.
 
+ **Layer 1 - AST Import & Attribute Filtering:** Before execution, an AST walk detects blocked imports and prevents access to any attribute starting with an underscore (`_`). This blocks private member access and dunder escapes (like `__class__`).
 
+ **Layer 2 - Subprocess Isolation:** Code runs in a child subprocess with a stripped environment and no network access.
 
  **Layer 3 - Hard Timeout:** Every execution is killed after 10 seconds. Infinite loops in submitted code return `timed_out: True` and a `-0.10` step reward.
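
A minimal sketch of what the Layer 1 check might look like, using only the standard `ast` module. This is not the code in `env/sandbox.py`: the function name `find_violations` is hypothetical, and the blocked-module set is an assumption that reuses the names listed in the previous README revision (which also mentioned unlisted "others").

```python
import ast

# ASSUMPTION: partial list, copied from the earlier README wording.
BLOCKED_MODULES = {
    "os", "sys", "subprocess", "socket", "importlib",
    "shutil", "pathlib", "pickle", "ctypes", "multiprocessing",
}


def find_violations(source: str) -> list[str]:
    """Return a description of each blocked import or underscore attribute.

    Raises SyntaxError if the submitted source does not parse.
    """
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import a.b.c` is judged by its top-level package `a`.
            for alias in node.names:
                if alias.name.split(".")[0] in BLOCKED_MODULES:
                    violations.append(f"blocked import: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            # `node.module` is None for `from . import x`.
            module = node.module or ""
            if module.split(".")[0] in BLOCKED_MODULES:
                violations.append(f"blocked import: {module}")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("_"):
            # Covers both private members and dunder escapes like __class__.
            violations.append(f"blocked attribute: {node.attr}")
    return violations
```

A caller would presumably reject the submission when the returned list is non-empty, before anything reaches the subprocess layer. A filter like this only inspects syntax, so dynamic lookups such as `getattr(obj, "__class__")` would need separate handling, which is one reason the later layers matter.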
 
 
 
  **Why recalculate `test_pass_ratio` from the attempts list?** The medium buggy code passes 6/10 tests and the hard buggy code passes 8/8 tests sequentially. If the grader used the environment's `best_tests_passed` (which includes the initial buggy code run at reset), a dummy agent that submits nothing would score 0.36 and 0.40 for free. Recalculating from the `attempts` list guarantees the score floor is 0.0.
 
+ **Why run the concurrent stress test 5 times?** Race conditions are non-deterministic. A partial fix that narrows the race window may pass once by luck. Requiring 4 of 5 runs to pass provides a robust statistical threshold that filters out lucky partial fixes while allowing for minor runner jitter. Passing 2 of 5 gives 0.15: partial credit for progress.
 
  **Why not use pytest directly?** Using pytest as the test runner makes output parsing dependent on pytest's version and output format. The environment uses a lightweight custom test runner embedded as a Python string, producing a consistent `"N passed, M failed"` format that `_parse_tests_passed()` can reliably parse across all platforms and environments.
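
The two parsing-related decisions above can be sketched together. This is an illustration under stated assumptions, not the repository's implementation: `parse_result_line` is a hypothetical stand-in for `_parse_tests_passed()` (whose real signature is not shown in this diff), and representing each attempt as the runner's raw output string is also an assumption.

```python
import re

# Matches the runner's fixed summary format, e.g. "6 passed, 4 failed".
_RESULT_RE = re.compile(r"(\d+) passed, (\d+) failed")


def parse_result_line(output: str) -> tuple[int, int]:
    """Return (passed, total) from runner output, or (0, 0) if no summary."""
    match = _RESULT_RE.search(output)
    if match is None:
        return 0, 0
    passed, failed = int(match.group(1)), int(match.group(2))
    return passed, passed + failed


def pass_ratio_from_attempts(attempts: list[str]) -> float:
    """Best pass ratio over the agent's own attempts only.

    The initial buggy-code run is never in `attempts`, so an agent that
    submits nothing gets 0.0 rather than credit for tests that already passed.
    """
    best = 0.0
    for output in attempts:
        passed, total = parse_result_line(output)
        if total:
            best = max(best, passed / total)
    return best
```

With an empty attempts list the ratio is 0.0, which is the score floor the README describes; the 8/8 that the hard task's buggy code passes at reset contributes nothing unless the agent actually submits code.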
 
openenv.yaml CHANGED
@@ -54,7 +54,7 @@ baseline:
  medium: 0.50
  hard: 0.18
  author: "Shashaank (GitHub: @shasshaank, HF: @shashaank0707)"
- # Submission Integrity: SHA 159a5faf82fc1ab3709f9674becf9a3ec55cf562 | Verified 2026-04-08
  license: MIT
  huggingface_space: shashaank0707/AgentDebugger-env
  api_base_url_env_var: API_BASE_URL
 
  medium: 0.50
  hard: 0.18
  author: "Shashaank (GitHub: @shasshaank, HF: @shashaank0707)"
+ # Submission Integrity: SHA e93446da6e57b3f582db65a947dc0abef18e66c6 | Verified 2026-04-09
  license: MIT
  huggingface_space: shashaank0707/AgentDebugger-env
  api_base_url_env_var: API_BASE_URL