databoysu committed · commit 7266968 · parent f469c8e
active graders

Files changed:
- README.md (+11 −1)
- server/graders.py (+115 −32)
README.md
CHANGED

@@ -53,12 +53,14 @@ Every task contains: `name`, `description`, `difficulty`, `bug_type`, `code` (bu
 ## Tech Stack & Project Files
 
 This environment enforces strict typing and uses standard modern tooling:
+
 - **`uv`:** Handles dependency management (see `pyproject.toml`).
 - **FastAPI:** Provides the `server.app` integration layer for OpenEnv compliance.
 - **Pydantic (v2):** Provides strong validation layers for `models.py` (e.g., `CodeAction`, `CodeObservation`).
 - **OpenEnv Config:** See `openenv.yaml` which specifies `tracefix_rl` to run the FastAPI app on port `7860`.
 
 **File Layout:**
+
 - `models.py` / `context.py`: Domain and schema logic.
 - `tasks.py`: Task metadata definitions.
 - `sandbox.py`: Subprocess runtime and output tracking.

@@ -79,6 +81,7 @@ uv run --project . server
 ```
 
 Server endpoints available:
+
 - `POST /reset`
 - `POST /step`
 - `GET /health`

@@ -91,7 +94,7 @@ The current environment intentionally squashes scores into the open interval `[0
 reported with that convention in mind.
 
 | Task | Baseline Score |
-|---
+| --- | --- |
 | `valid_parentheses_wrong_mapping` | Pending first benchmark run |
 | `binary_search_off_by_one` | Pending first benchmark run |
 | `reverse_string_returns_original` | Pending first benchmark run |

@@ -101,12 +104,14 @@ reported with that convention in mind.
 The space runs via Docker. The container is securely configured to run as a non-root `appuser` (UID base `1000`) for Spaces compliance.
 
 ### Testing Locally in Docker
+
 ```bash
 docker build -t tracefix-rl:test -f Dockerfile .
 docker run --rm -p 7860:7860 tracefix-rl:test
 ```
 
 ### Deploy to Hugging Face Spaces
+
 This project uses the OpenEnv CLI for seamless Hugging Face Space deployments.
 
 ```bash

@@ -115,7 +120,9 @@ openenv push
 ```
 
 ### Server Pre-validation
+
 Before committing to training, you can validate your deployed server or local space:
+
 ```bash
 bash ./pre-val.sh https://<your-space>.hf.space .
 ```

@@ -125,15 +132,18 @@ bash ./pre-val.sh https://<your-space>.hf.space .
 The baseline inference runner evaluates agents against the environment using an OpenAI-compatible interface.
 
 **Requirements for Inference:**
+
 - `API_BASE_URL` (Defaults to `https://router.huggingface.co/v1`)
 - `MODEL_NAME` (Defaults to `Qwen/Qwen2.5-72B-Instruct`)
 - `HF_TOKEN`
 
 **Usage Flags:**
+
 - `--easy`, `--medium`, `--hard`: Lock the environment to a specific task bucket.
 - `--thought`: Send `<thought>` token blocks back to the payload to train chain-of-thought capabilities.
 
 Example execution tracking thoughts in medium tasks:
+
 ```bash
 python inference.py --medium --thought
 ```
server/graders.py
CHANGED

@@ -1,8 +1,8 @@
 """Task graders for TraceFix-RL.
 
 The online validator expects importable grader callables for each task entry.
-These graders
-
+These graders execute the real task tests against the final code state so the
+judge can verify actual solution quality instead of a canned lookup.
 """
 
 from __future__ import annotations

@@ -10,17 +10,13 @@ from __future__ import annotations
 from collections.abc import Mapping, Sequence
 from typing import Any, Optional
 
+from core.sandbox import run_code_with_tests
+from tasks.tasks import ALL_TASKS
+
 
 MIN_SCORE = 0.01
 MAX_SCORE = 0.98
 
-_TASK_BASELINES = {
-    "valid_parentheses_wrong_mapping": 0.18,
-    "binary_search_off_by_one": 0.24,
-    "reverse_string_returns_original": 0.12,
-}
-
-
 def _clamp(score: float) -> float:
     return round(min(max(score, MIN_SCORE), MAX_SCORE), 4)

@@ -74,46 +70,133 @@ def _find_score_value(payload: Any) -> Optional[float]:
     return None
 
 
-def
-
+def _find_task(task_name: str) -> Optional[dict[str, Any]]:
+    for task in ALL_TASKS:
+        if task.get("name") == task_name:
+            return task
+    return None
+
+
+def _extract_final_observation(payload: Any) -> Any:
+    if payload is None:
+        return None
 
     mapping = _as_mapping(payload)
-    action_history = None
     if mapping is not None:
+        for key in ("final_observation", "observation", "state", "last_observation"):
+            if key in mapping:
+                candidate = mapping.get(key)
+                if candidate is not None:
+                    nested = _extract_final_observation(candidate)
+                    if nested is not None:
+                        return nested
+        if "trajectory" in mapping:
+            return _extract_final_observation(mapping.get("trajectory"))
+        return payload
+
+    if isinstance(payload, Sequence) and not isinstance(payload, (str, bytes, bytearray)):
+        if not payload:
+            return None
+        last_item = payload[-1]
+        if isinstance(last_item, Sequence) and not isinstance(last_item, (str, bytes, bytearray)) and len(last_item) >= 2:
+            return _extract_final_observation(last_item[1])
+        if isinstance(last_item, Mapping) or hasattr(last_item, "model_dump") or hasattr(last_item, "dict"):
+            return _extract_final_observation(last_item)
+        return last_item
+
+    return payload
-        action_count = sum(1 for _ in payload)
-        baseline += min(0.20, action_count * 0.01)
 
+
+def _observation_to_source(observation: Any) -> Optional[str]:
+    if observation is None:
+        return None
+
+    mapping = _as_mapping(observation)
+    if mapping is not None:
+        source = mapping.get("source")
+        if isinstance(source, str) and source.strip():
+            return source
+
+        code_lines = mapping.get("code_lines") or mapping.get("code")
+        if isinstance(code_lines, Sequence) and not isinstance(code_lines, (str, bytes, bytearray)):
+            lines = [str(line) for line in code_lines]
+            return "\n".join(lines)
+
+        code_dict = mapping.get("code_dict")
+        if isinstance(code_dict, Mapping) and code_dict:
+            ordered_lines: list[tuple[int, str]] = []
+            for key, value in code_dict.items():
+                try:
+                    line_no = int(key)
+                except Exception:
+                    continue
+                ordered_lines.append((line_no, str(value)))
+            if ordered_lines:
+                ordered_lines.sort(key=lambda item: item[0])
+                return "\n".join(line for _, line in ordered_lines)
+
+    for attr in ("source", "code", "code_lines", "code_dict"):
+        if hasattr(observation, attr):
+            value = getattr(observation, attr)
+            if isinstance(value, str) and value.strip():
+                return value
+            if isinstance(value, Sequence) and not isinstance(value, (str, bytes, bytearray)):
+                return "\n".join(str(line) for line in value)
+            if isinstance(value, Mapping) and value:
+                ordered_lines = []
+                for key, line in value.items():
+                    try:
+                        ordered_lines.append((int(key), str(line)))
+                    except Exception:
+                        continue
+                if ordered_lines:
+                    ordered_lines.sort(key=lambda item: item[0])
+                    return "\n".join(line for _, line in ordered_lines)
+
+    return None
+
 
+def _evaluate_task(task_name: str, payload: Any) -> float:
+    task = _find_task(task_name)
+    if task is None:
+        return MIN_SCORE
 
+    final_observation = _extract_final_observation(payload)
+    source = _observation_to_source(final_observation)
+    if not source or not source.strip():
+        return MIN_SCORE
+
+    try:
+        _, results, syntax_err = run_code_with_tests(
+            source=source,
+            test_callables=task["tests"],
+        )
+    except Exception:
+        return MIN_SCORE
+
+    if syntax_err:
+        return MIN_SCORE
+
+    if results and all(test_result.passed for test_result in results):
+        return MAX_SCORE
+
+    return MIN_SCORE
 
 
 def grade(payload: Any = None, *args: Any, task_name: str = "", **kwargs: Any) -> float:
-    """
+    """Execute the task's real tests against the final code state."""
 
     if payload is None and args:
         payload = args[0]
 
-    for candidate in (payload, kwargs):
-        if candidate is None:
-            continue
-        score = _find_score_value(candidate)
-        if score is not None:
-            return _clamp(score)
-
     if not task_name:
         task_name = str(kwargs.get("task_id") or kwargs.get("name") or "")
 
     if task_name:
+        active_payload = payload if payload is not None else kwargs
+        return _evaluate_task(task_name, active_payload)
 
-    return
+    return MIN_SCORE
 
 
 def grade_valid_parentheses_wrong_mapping(*args: Any, **kwargs: Any) -> float:
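
To make the recursive unwrapping in `_extract_final_observation` concrete, here is a simplified, self-contained sketch restricted to plain dicts and lists (the real helper also accepts Pydantic objects via `_as_mapping`). The `rollout` payload below is a hypothetical example, not a real environment trace:

```python
from collections.abc import Mapping, Sequence
from typing import Any

def extract_final_observation(payload: Any) -> Any:
    # Recursively unwrap result/trajectory containers down to the final observation.
    if payload is None:
        return None
    if isinstance(payload, Mapping):
        # Prefer explicit observation keys, then fall back to the trajectory.
        for key in ("final_observation", "observation", "state", "last_observation"):
            if payload.get(key) is not None:
                nested = extract_final_observation(payload[key])
                if nested is not None:
                    return nested
        if "trajectory" in payload:
            return extract_final_observation(payload["trajectory"])
        return payload
    if isinstance(payload, Sequence) and not isinstance(payload, (str, bytes, bytearray)):
        if not payload:
            return None
        last = payload[-1]
        # A trajectory step may be an (action, observation) pair.
        if isinstance(last, Sequence) and not isinstance(last, (str, bytes, bytearray)) and len(last) >= 2:
            return extract_final_observation(last[1])
        if isinstance(last, Mapping):
            return extract_final_observation(last)
        return last
    return payload

# Hypothetical rollout: one (action, observation) step whose observation
# carries the final source under "source".
rollout = {
    "trajectory": [
        ({"line": 3}, {"observation": {"source": "def reverse(s):\n    return s[::-1]"}}),
    ]
}
final = extract_final_observation(rollout)
print(final)  # → {'source': 'def reverse(s):\n    return s[::-1]'}
```

This is the shape of payload the new `grade` path feeds into `_observation_to_source` before running the task's real tests in the sandbox.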