Spaces:

agentDebugger
/

AgentDebugger-training-v3

Running

App Files Files Community

shank commited on 27 days ago

Commit

159a5fa

1 Parent(s): 0769caa

Update: Made refinements to the project

Browse files

Files changed (15) hide show

Dockerfile +1 -1
README.md +28 -3
env/__pycache__/environment.cpython-310.pyc +0 -0
env/__pycache__/models.cpython-310.pyc +0 -0
env/__pycache__/sandbox.cpython-310.pyc +0 -0
env/environment.py +18 -1
env/graders/__pycache__/base_grader.cpython-310.pyc +0 -0
env/graders/__pycache__/grader_hard.cpython-310.pyc +0 -0
env/graders/grader_hard.py +13 -114
env/sandbox.py +1 -1
inference.py +5 -4
requirements.txt +1 -0
tests/__pycache__/test_environment.cpython-310-pytest-8.1.0.pyc +0 -0
tests/__pycache__/test_sandbox.cpython-310-pytest-8.1.0.pyc +0 -0
tests/test_integration.py +102 -0

Dockerfile CHANGED Viewed

@@ -16,7 +16,7 @@ COPY . .
 EXPOSE 8000
 # Health check — hackathon automated ping requires this to return 200
-HEALTHCHECK --interval=30s --timeout=10s --start-period=10s --retries=3 \
     CMD curl -f http://localhost:8000/health || exit 1
 # Single worker — environment is 2vCPU, multi-worker causes resource issues

 EXPOSE 8000
 # Health check — hackathon automated ping requires this to return 200
+HEALTHCHECK --interval=30s --timeout=15s --start-period=15s --retries=3 \
     CMD curl -f http://localhost:8000/health || exit 1
 # Single worker — environment is 2vCPU, multi-worker causes resource issues

README.md CHANGED Viewed

@@ -17,6 +17,19 @@ license: mit
 ---
 ## 🚀 The Core Philosophy
 Traditional benchmarks (like HumanEval or MBPP) are "one-shot": the model sees a prompt and writes code. Real-world engineering is **iterative**.
@@ -33,9 +46,9 @@ AgentDebuggerEnv forces agents to operate in a **live feedback loop**:
 ### 1. Robust Security Sandbox
 Every submission is executed in a multi-layered isolated environment:
-*   **AST Filtering**: An Abstract Syntax Tree (AST) pass blocks dangerous imports (`os`, `sys`, `subprocess`, etc.) and builtins before execution.
-*   **Process Isolation**: Executes in a separate subprocess with hard memory (256MB) and time (10s) limits.
-*   **Thread Safety**: A specialized "Concurrency Sandbox" allows multi-threaded tests for identifying race conditions while maintaining host security.
 ### 2. High-Fidelity Feedback
 Instead of binary `Pass/Fail` bits, the environment returns the **raw execution stream**. This allows agents to:
@@ -92,6 +105,18 @@ python inference.py
 ---
 ## 🔗 OpenEnv API Compliance
 AgentDebuggerEnv implements the full OpenEnv specification:

 ---
+## 📊 Baseline Performance
+Tested with **GPT-4o** using the standard `inference.py` script:
+- **Easy (0.85)**: Solved in 1-2 attempts; clear signal from error output.
+- **Medium (0.50)**: Solved in ~4 attempts; agents must resist a red-herring authentication error.
+- **Hard (0.18)**: Rarely solved; agents must proactively design concurrent tests to surface the hidden race condition.
+- **Mean Score: 0.51**
+*Measurements taken over multiple runs to account for LLM variance. See `openenv.yaml` for full metadata.*
+---
 ## 🚀 The Core Philosophy
 Traditional benchmarks (like HumanEval or MBPP) are "one-shot": the model sees a prompt and writes code. Real-world engineering is **iterative**.
 ### 1. Robust Security Sandbox
 Every submission is executed in a multi-layered isolated environment:
+*   **AST Filtering**: A deep Abstract Syntax Tree (AST) pass analyzes submitted code before execution, blocking dangerous imports (`os`, `sys`, `subprocess`, `socket`, etc.) and preventing the override of security-critical builtins.
+*   **Process Isolation**: Executes in a separate subprocess with hard memory (256MB) and time (15s) limits. Any attempt to hang the environment results in immediate termination.
+*   **Thread Safety**: A specialized "Concurrency Sandbox" allows multi-threaded tests (essential for the Hard Task) while maintaining strict host-level security boundaries.
 ### 2. High-Fidelity Feedback
 Instead of binary `Pass/Fail` bits, the environment returns the **raw execution stream**. This allows agents to:
 ---
+### 🔐 Environment Variables
+| Variable | Description | Standard Fallback |
+| :--- | :--- | :--- |
+| `API_BASE_URL` | LLM API endpoint | `https://api.openai.com/v1` |
+| `MODEL_NAME` | Model to evaluate | `gpt-4o` |
+| `HF_TOKEN` | Auth token (or OpenAI key) | — |
+| `OPENAI_API_KEY` | Alternative auth token | — |
+| `ENV_BASE_URL` | Address of the FastAPI server | `http://localhost:8000` |
+---
 ## 🔗 OpenEnv API Compliance
 AgentDebuggerEnv implements the full OpenEnv specification:

env/__pycache__/environment.cpython-310.pyc CHANGED Viewed

Binary files a/env/__pycache__/environment.cpython-310.pyc and b/env/__pycache__/environment.cpython-310.pyc differ

env/__pycache__/models.cpython-310.pyc CHANGED Viewed

Binary files a/env/__pycache__/models.cpython-310.pyc and b/env/__pycache__/models.cpython-310.pyc differ

env/__pycache__/sandbox.cpython-310.pyc CHANGED Viewed

Binary files a/env/__pycache__/sandbox.cpython-310.pyc and b/env/__pycache__/sandbox.cpython-310.pyc differ

env/environment.py CHANGED Viewed

@@ -249,7 +249,7 @@ class DebuggerEnvironment:
     def _handle_query_context(self, action: Action) -> Dict[str, Any]:
         """Handle query_context action."""
-        valid_query_types = ["function_signature", "related_code", "error_explanation", "test_details"]
         if action.query_type not in valid_query_types:
             return self._make_response(
@@ -511,5 +511,22 @@ class DebuggerEnvironment:
                     return f"Test details for '{query_target}':\n" + "\n".join(relevant)
             return f"Full test suite:\n{test_suite}"
         return "No information available for this query."

     def _handle_query_context(self, action: Action) -> Dict[str, Any]:
         """Handle query_context action."""
+        valid_query_types = ["function_signature", "related_code", "error_explanation", "test_details", "test_suggestion"]
         if action.query_type not in valid_query_types:
             return self._make_response(
                     return f"Test details for '{query_target}':\n" + "\n".join(relevant)
             return f"Full test suite:\n{test_suite}"
+        elif query_type == "test_suggestion":
+            # Provide a specific hint for the hard task if they ask
+            if task["task_id"] == "hard":
+                return (
+                    "HINT: The sequential tests pass, but have you considered testing with "
+                    "concurrent threads? There might be a race condition that only appears "
+                    "under load. Try writing a test that uses 'threading' to call methods "
+                    "simultaneously."
+                )
+            elif task["task_id"] == "medium":
+                return (
+                    "HINT: Don't trust the first error message you see. Trace the data flow "
+                    "backwards to see where the invalid input was actually generated."
+                )
+            else:
+                return "HINT: Look closely at the comparison operators and loop boundaries."
         return "No information available for this query."

env/graders/__pycache__/base_grader.cpython-310.pyc CHANGED Viewed

Binary files a/env/graders/__pycache__/base_grader.cpython-310.pyc and b/env/graders/__pycache__/base_grader.cpython-310.pyc differ

env/graders/__pycache__/grader_hard.cpython-310.pyc CHANGED Viewed

Binary files a/env/graders/__pycache__/grader_hard.cpython-310.pyc and b/env/graders/__pycache__/grader_hard.cpython-310.pyc differ

env/graders/grader_hard.py CHANGED Viewed

@@ -1,105 +1,3 @@
-# """
-# Grader Hard — Concurrent stress test scoring.
-# Custom weights:
-#   0.40 — original 8 tests pass
-#   0.30 — concurrent stress test (1000 threads)
-#   0.20 — hypothesis accuracy
-#   0.10 — efficiency bonus (solved within 5 attempts)
-# """
-# import threading
-# from typing import List, Dict, Any
-# from env.graders.base_grader import BaseGrader
-# class HardGrader(BaseGrader):
-#     def _run_concurrent_stress_test(self, code: str) -> bool:
-#         """
-#         Run a 1000-thread concurrent stress test against the submitted code.
-#         Returns True if the counter ends at exactly 1000 after 1000 concurrent increments.
-#         """
-#         try:
-#             # Execute the code in an isolated namespace
-#             namespace = {}
-#             exec(code, namespace)
-#             CounterClass = namespace.get("ConnectionCounter")
-#             if CounterClass is None:
-#                 return False
-#             counter = CounterClass()
-#             num_threads = 1000
-#             threads = [
-#                 threading.Thread(target=counter.increment)
-#                 for _ in range(num_threads)
-#             ]
-#             for t in threads:
-#                 t.start()
-#             for t in threads:
-#                 t.join(timeout=10)
-#             return counter.get_count() == num_threads
-#         except Exception:
-#             return False
-#     def score(
-#         self,
-#         task_config: dict,
-#         attempts: List[Dict[str, Any]],
-#         best_tests_passed: int,
-#         tests_total: int,
-#         attempts_used: int,
-#         max_attempts: int,
-#         hypotheses: List[str],
-#     ) -> float:
-#         ground_truth = task_config["ground_truth"]
-#         keywords = ground_truth["hypothesis_keywords"]
-#         # 1. Original tests pass (weight: 0.40)
-#         test_pass_ratio = (best_tests_passed / tests_total) if tests_total > 0 else 0.0
-#         original_test_score = test_pass_ratio * 0.40
-#         # 2. Concurrent stress test (weight: 0.30)
-#         # Use the best attempt's code (highest tests_passed, then latest)
-#         concurrent_score = 0.0
-#         if attempts:
-#             # Find the best attempt
-#             best_attempt = max(
-#                 attempts,
-#                 key=lambda a: (a.get("tests_passed", 0), a.get("attempt_number", 0))
-#             )
-#             best_code = best_attempt.get("code_submitted", "")
-#             if best_code:
-#                 # Run the stress test 3 times — must pass all 3 for full credit
-#                 passes = sum(
-#                     1 for _ in range(3)
-#                     if self._run_concurrent_stress_test(best_code)
-#                 )
-#                 if passes == 3:
-#                     concurrent_score = 0.30
-#                 elif passes >= 1:
-#                     concurrent_score = 0.15  # Partial — inconsistent fix
-#         # 3. Hypothesis accuracy (weight: 0.20)
-#         if hypotheses:
-#             matches = sum(
-#                 1 for h in hypotheses
-#                 if self._check_hypothesis_keywords(h, keywords, "any")
-#             )
-#             hypothesis_ratio = matches / len(hypotheses)
-#         else:
-#             hypothesis_ratio = 0.0
-#         hypothesis_score = hypothesis_ratio * 0.20
-#         # 4. Efficiency bonus (weight: 0.10)
-#         efficiency_score = 0.10 if attempts_used <= 5 else 0.0
-#         total = original_test_score + concurrent_score + hypothesis_score + efficiency_score
-#         return self._clamp(total)
 """
 Grader Hard — Concurrent stress test scoring.
@@ -141,17 +39,18 @@ result = counter.get_count()
 assert result == num_threads, f"CONCURRENT FAIL: expected {num_threads}, got {result}"
 print(f"CONCURRENT PASS: {result} == {num_threads}")
 """
 class HardGrader(BaseGrader):
     def _run_concurrent_stress_test(self, code: str) -> bool:
         """
         Run the concurrent stress test against agent-submitted code.
         Routes through execute_code() sandbox — never uses raw exec().
-        Returns True only if the counter reaches exactly 1000 after
         1000 concurrent increments.
         """
         output, timed_out, _ = execute_code(
-            code,
             _CONCURRENT_STRESS_TEST,
             allow_threading=True,
         )
@@ -174,8 +73,8 @@ class HardGrader(BaseGrader):
         # ── 1. Sequential test score (weight: 0.40) ──────────────────────────
         # IMPORTANT: Only count agent-submitted attempts, NOT the initial buggy
-        # code. The buggy code passes all 8 sequential tests — if we used
-        # best_tests_passed from environment state, every agent would score
         # 0.40 for free without fixing anything. We recalculate from attempts.
         if attempts:
             agent_best_sequential = max(
@@ -189,8 +88,8 @@ class HardGrader(BaseGrader):
         # ── 2. Concurrent stress test (weight: 0.30) ──────────────────────────
         # Use the best attempt by sequential test count (ties broken by recency).
-        # Run the stress test 3 times — must pass all 3 for full credit,
-        # at least 1 for partial credit. This handles non-determinism fairly.
         concurrent_score = 0.0
         if attempts:
             best_attempt = max(
@@ -201,14 +100,14 @@ class HardGrader(BaseGrader):
             if best_code:
                 passes = sum(
-                    1 for _ in range(3)
                     if self._run_concurrent_stress_test(best_code)
                 )
-                if passes == 3:
-                    concurrent_score = 0.30       # Fully correct fix
-                elif passes >= 1:
-                    concurrent_score = 0.15       # Partially correct — inconsistent
         # ── 3. Hypothesis accuracy (weight: 0.20) ─────────────────────────────
         if hypotheses:
             matches = sum(

 """
 Grader Hard — Concurrent stress test scoring.
 assert result == num_threads, f"CONCURRENT FAIL: expected {num_threads}, got {result}"
 print(f"CONCURRENT PASS: {result} == {num_threads}")
 """
 class HardGrader(BaseGrader):
     def _run_concurrent_stress_test(self, code: str) -> bool:
         """
         Run the concurrent stress test against agent-submitted code.
         Routes through execute_code() sandbox — never uses raw exec().
+        Returns True only if the counter reaches exactly 1000 after
         1000 concurrent increments.
         """
         output, timed_out, _ = execute_code(
+            code,
             _CONCURRENT_STRESS_TEST,
             allow_threading=True,
         )
         # ── 1. Sequential test score (weight: 0.40) ──────────────────────────
         # IMPORTANT: Only count agent-submitted attempts, NOT the initial buggy
+        # code. The buggy code passes all 8 sequential tests — if we used
+        # best_tests_passed from environment state, every agent would score
         # 0.40 for free without fixing anything. We recalculate from attempts.
         if attempts:
             agent_best_sequential = max(
         # ── 2. Concurrent stress test (weight: 0.30) ──────────────────────────
         # Use the best attempt by sequential test count (ties broken by recency).
+        # Run the stress test 5 times — must pass 4/5 for full credit,
+        # at least 2/5 for partial credit. This handles non-determinism robustly.
         concurrent_score = 0.0
         if attempts:
             best_attempt = max(
             if best_code:
                 passes = sum(
+                    1 for _ in range(5)
                     if self._run_concurrent_stress_test(best_code)
                 )
+                if passes >= 4:
+                    concurrent_score = 0.30       # Robustly fixed
+                elif passes >= 2:
+                    concurrent_score = 0.15       # Partially fixed / Flaky
         # ── 3. Hypothesis accuracy (weight: 0.20) ─────────────────────────────
         if hypotheses:
             matches = sum(

env/sandbox.py CHANGED Viewed

@@ -21,7 +21,7 @@ BLOCKED_IMPORTS = [
     "ctypes", "cffi", "resource", "signal", "mmap", "gc"
 ]
-EXECUTION_TIMEOUT_SECONDS = 10
 MEMORY_LIMIT_MB = 256

     "ctypes", "cffi", "resource", "signal", "mmap", "gc"
 ]
+EXECUTION_TIMEOUT_SECONDS = 15
 MEMORY_LIMIT_MB = 256

inference.py CHANGED Viewed

@@ -21,10 +21,10 @@ import requests
 # ── Environment variables (never hardcode these) ──────────────────────────────
 API_BASE_URL = os.environ.get("API_BASE_URL", "https://api.openai.com/v1")
 MODEL_NAME   = os.environ.get("MODEL_NAME", "gpt-4o")
-HF_TOKEN     = os.environ.get("HF_TOKEN", "")
 ENV_BASE_URL = os.environ.get("ENV_BASE_URL", "http://localhost:8000")
-client = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)
 SYSTEM_PROMPT = """You are an expert software debugger. You will be given broken code and a
 failing test suite. Your job is to:
@@ -171,7 +171,8 @@ def run_episode(task_id: str) -> dict:
     obs = reset_resp.json()
     # [START] task=NAME
-    print(f"[START] task={task_id}", flush=True)
     messages = [
         {"role": "system", "content": SYSTEM_PROMPT},
@@ -215,7 +216,7 @@ def run_episode(task_id: str) -> dict:
         last_result = result
         # [STEP] step=N reward=R
-        print(f"[STEP] step={obs['step_number']} reward={reward['step_reward']}", flush=True)
         # Build context for next LLM call
         step_msg = build_step_message(obs, reward, info)

 # ── Environment variables (never hardcode these) ──────────────────────────────
 API_BASE_URL = os.environ.get("API_BASE_URL", "https://api.openai.com/v1")
 MODEL_NAME   = os.environ.get("MODEL_NAME", "gpt-4o")
+HF_TOKEN     = os.environ.get("HF_TOKEN") or os.environ.get("OPENAI_API_KEY", "")
 ENV_BASE_URL = os.environ.get("ENV_BASE_URL", "http://localhost:8000")
+client = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN or "EMPTY")
 SYSTEM_PROMPT = """You are an expert software debugger. You will be given broken code and a
 failing test suite. Your job is to:
     obs = reset_resp.json()
     # [START] task=NAME
+    print(f"\n[START] task={task_id}", flush=True)
+    print(f"  Description: {obs['task_description'][:100]}...", flush=True)
     messages = [
         {"role": "system", "content": SYSTEM_PROMPT},
         last_result = result
         # [STEP] step=N reward=R
+        print(f"  [STEP {obs['step_number']}] Action: {action.get('action_type')} | Tests: {obs['tests_passed']}/{obs['tests_total']} | Reward: {reward['step_reward']:+.3f}", flush=True)
         # Build context for next LLM call
         step_msg = build_step_message(obs, reward, info)

requirements.txt CHANGED Viewed

@@ -7,3 +7,4 @@ python-dotenv==1.0.1
 pytest==8.1.0
 httpx==0.27.0
 RestrictedPython==7.0

 pytest==8.1.0
 httpx==0.27.0
 RestrictedPython==7.0
+openenv-core>=0.2.0

tests/__pycache__/test_environment.cpython-310-pytest-8.1.0.pyc CHANGED Viewed

Binary files a/tests/__pycache__/test_environment.cpython-310-pytest-8.1.0.pyc and b/tests/__pycache__/test_environment.cpython-310-pytest-8.1.0.pyc differ

tests/__pycache__/test_sandbox.cpython-310-pytest-8.1.0.pyc CHANGED Viewed

Binary files a/tests/__pycache__/test_sandbox.cpython-310-pytest-8.1.0.pyc and b/tests/__pycache__/test_sandbox.cpython-310-pytest-8.1.0.pyc differ

tests/test_integration.py ADDED Viewed

	@@ -0,0 +1,102 @@

+"""
+AgentDebuggerEnv — Integration Tests
+====================================
+Verifies the full episode lifecycle: reset -> step -> end.
+Assumes the server is available via the DebuggerEnvironment class directly
+(testing the logic, not the HTTP layer which is just a thin wrapper).
+"""
+import pytest
+from env.environment import DebuggerEnvironment
+from env.models import Action
+def test_full_episode_easy():
+    """Test a full successful episode on the 'easy' task."""
+    env = DebuggerEnvironment()
+    # 1. Reset
+    obs = env.reset("easy")
+    assert obs["task_id"] == "easy"
+    assert obs["done"] is False
+    assert obs["tests_passed"] < obs["tests_total"]
+    # 2. Submit a fix (using known ground truth)
+    # The easy task is binary search with 'left < right' instead of 'left <= right'
+    ground_truth_code = """
+def binary_search(arr, target):
+    left, right = 0, len(arr) - 1
+    while left <= right:
+        mid = (left + right) // 2
+        if arr[mid] == target:
+            return mid
+        elif arr[mid] < target:
+            left = mid + 1
+        else:
+            right = mid - 1
+    return -1
+"""
+    action = Action(
+        action_type="submit_fix",
+        fixed_code=ground_truth_code,
+        hypothesis="Binary search termination condition should be left <= right to include all elements."
+    )
+    result = env.step(action)
+    # 3. Verify results
+    assert result["done"] is True
+    assert result["observation"]["tests_passed"] == result["observation"]["tests_total"]
+    assert result["reward"]["grader_score"] > 0.80
+def test_query_hint_system():
+    """Test the newly added hint system."""
+    env = DebuggerEnvironment()
+    env.reset("hard")
+    action = Action(
+        action_type="query_context",
+        query_type="test_suggestion"
+    )
+    result = env.step(action)
+    assert "concurrent threads" in result["info"]["query_result"]
+    assert result["reward"]["step_reward"] == 0.0  # First query is free
+def test_hard_grader_consensus():
+    """
+    Test that the hard grader runs multiple times.
+    (We mock execute_code to simulate flakiness).
+    """
+    from unittest.mock import patch
+    from env.graders.grader_hard import HardGrader
+    grader = HardGrader()
+    # Mock execute_code to return success 3/5 times
+    # Sequence: PASS, FAIL, PASS, FAIL, PASS
+    with patch("env.graders.grader_hard.execute_code") as mock_exec:
+        mock_exec.side_effect = [
+            ("CONCURRENT PASS", False, 100),
+            ("CONCURRENT FAIL", False, 100),
+            ("CONCURRENT PASS", False, 100),
+            ("CONCURRENT FAIL", False, 100),
+            ("CONCURRENT PASS", False, 100),
+        ]
+        score = grader.score(
+            task_config={"task_id": "hard", "ground_truth": {"hypothesis_keywords": ["race"]}},
+            attempts=[{"tests_passed": 8, "attempt_number": 1, "code_submitted": "..."}],
+            best_tests_passed=8,
+            tests_total=8,
+            attempts_used=1,
+            max_attempts=10,
+            hypotheses=["race condition"]
+        )
+        # 3/5 passes → should get partial credit (0.15) for concurrency
+        # Sequential: 1.0 * 0.40 = 0.40
+        # Concurrency: 0.15
+        # Hypothesis: 1.0 * 0.20 = 0.20
+        # Efficiency: (concurrent_score == 0.30) is False -> 0.0
+        # Total: 0.75
+        assert score == 0.75