databoysu committed
Commit 7266968 · 1 Parent(s): f469c8e

active graders

Files changed (2):
  1. README.md +11 -1
  2. server/graders.py +115 -32
README.md CHANGED
@@ -53,12 +53,14 @@ Every task contains: `name`, `description`, `difficulty`, `bug_type`, `code` (bu
 ## Tech Stack & Project Files
 
 This environment enforces strict typing and uses standard modern tooling:
+
 - **`uv`:** Handles dependency management (see `pyproject.toml`).
 - **FastAPI:** Provides the `server.app` integration layer for OpenEnv compliance.
 - **Pydantic (v2):** Provides strong validation layers for `models.py` (e.g., `CodeAction`, `CodeObservation`).
 - **OpenEnv Config:** See `openenv.yaml` which specifies `tracefix_rl` to run the FastAPI app on port `7860`.
 
 **File Layout:**
+
 - `models.py` / `context.py`: Domain and schema logic.
 - `tasks.py`: Task metadata definitions.
 - `sandbox.py`: Subprocess runtime and output tracking.
@@ -79,6 +81,7 @@ uv run --project . server
 ```
 
 Server endpoints available:
+
 - `POST /reset`
 - `POST /step`
 - `GET /health`
@@ -91,7 +94,7 @@ The current environment intentionally squashes scores into the open interval `[0
 reported with that convention in mind.
 
 | Task | Baseline Score |
-|------|----------------|
+| --- | --- |
 | `valid_parentheses_wrong_mapping` | Pending first benchmark run |
 | `binary_search_off_by_one` | Pending first benchmark run |
 | `reverse_string_returns_original` | Pending first benchmark run |
@@ -101,12 +104,14 @@ reported with that convention in mind.
 The space runs via Docker. The container is securely configured to run as a non-root `appuser` (UID base `1000`) for Spaces compliance.
 
 ### Testing Locally in Docker
+
 ```bash
 docker build -t tracefix-rl:test -f Dockerfile .
 docker run --rm -p 7860:7860 tracefix-rl:test
 ```
 
 ### Deploy to Hugging Face Spaces
+
 This project uses the OpenEnv CLI for seamless Hugging Face Space deployments.
 
 ```bash
@@ -115,7 +120,9 @@ openenv push
 ```
 
 ### Server Pre-validation
+
 Before committing to training, you can validate your deployed server or local space:
+
 ```bash
 bash ./pre-val.sh https://<your-space>.hf.space .
 ```
@@ -125,15 +132,18 @@ bash ./pre-val.sh https://<your-space>.hf.space .
 The baseline inference runner evaluates agents against the environment using an OpenAI-compatible interface.
 
 **Requirements for Inference:**
+
 - `API_BASE_URL` (Defaults to `https://router.huggingface.co/v1`)
 - `MODEL_NAME` (Defaults to `Qwen/Qwen2.5-72B-Instruct`)
 - `HF_TOKEN`
 
 **Usage Flags:**
+
 - `--easy`, `--medium`, `--hard`: Lock the environment to a specific task bucket.
 - `--thought`: Send `<thought>` token blocks back to the payload to train chain-of-thought capabilities.
 
 Example execution tracking thoughts in medium tasks:
+
 ```bash
 python inference.py --medium --thought
 ```
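The endpoint list and runner documentation above change only in spacing and table formatting. For quick local verification of the documented endpoints, a smoke test along the lines of the sketch below can be run against the container started above; the empty JSON body for `/reset` and the response handling are assumptions here, since the actual request/response schemas are defined in `models.py` (`CodeAction`, `CodeObservation`).

```python
# Hypothetical smoke test against a locally running container
# (port 7860 per openenv.yaml). The empty /reset body is an assumption;
# see models.py for the real request schema.
import requests

BASE_URL = "http://localhost:7860"

# GET /health should answer once the FastAPI app is up.
health = requests.get(f"{BASE_URL}/health", timeout=10)
print("health:", health.status_code, health.text)

# POST /reset starts a fresh episode; inspect the returned observation.
reset = requests.post(f"{BASE_URL}/reset", json={}, timeout=30)
print("reset:", reset.status_code, reset.json())
```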
server/graders.py CHANGED
@@ -1,8 +1,8 @@
 """Task graders for TraceFix-RL.
 
 The online validator expects importable grader callables for each task entry.
-These graders are intentionally flexible: they prefer an explicit final score,
-but they can also recover a score from common env payload shapes.
+These graders execute the real task tests against the final code state so the
+judge can verify actual solution quality instead of a canned lookup.
 """
 
 from __future__ import annotations
@@ -10,17 +10,13 @@ from __future__ import annotations
 from collections.abc import Mapping, Sequence
 from typing import Any, Optional
 
+from core.sandbox import run_code_with_tests
+from tasks.tasks import ALL_TASKS
+
 
 MIN_SCORE = 0.01
 MAX_SCORE = 0.98
 
-_TASK_BASELINES = {
-    "valid_parentheses_wrong_mapping": 0.18,
-    "binary_search_off_by_one": 0.24,
-    "reverse_string_returns_original": 0.12,
-}
-
-
 def _clamp(score: float) -> float:
     return round(min(max(score, MIN_SCORE), MAX_SCORE), 4)
 
@@ -74,46 +70,133 @@ def _find_score_value(payload: Any) -> Optional[float]:
     return None
 
 
-def _fallback_score(task_name: str, payload: Any) -> float:
-    baseline = _TASK_BASELINES.get(task_name, 0.15)
+def _find_task(task_name: str) -> Optional[dict[str, Any]]:
+    for task in ALL_TASKS:
+        if task.get("name") == task_name:
+            return task
+    return None
+
+
+def _extract_final_observation(payload: Any) -> Any:
+    if payload is None:
+        return None
 
     mapping = _as_mapping(payload)
-    action_history = None
     if mapping is not None:
-        action_history = mapping.get("action_history")
-    elif hasattr(payload, "action_history"):
-        action_history = getattr(payload, "action_history")
+        for key in ("final_observation", "observation", "state", "last_observation"):
+            if key in mapping:
+                candidate = mapping.get(key)
+                if candidate is not None:
+                    nested = _extract_final_observation(candidate)
+                    if nested is not None:
+                        return nested
+        if "trajectory" in mapping:
+            return _extract_final_observation(mapping.get("trajectory"))
+        return payload
+
+    if isinstance(payload, Sequence) and not isinstance(payload, (str, bytes, bytearray)):
+        if not payload:
+            return None
+        last_item = payload[-1]
+        if isinstance(last_item, Sequence) and not isinstance(last_item, (str, bytes, bytearray)) and len(last_item) >= 2:
+            return _extract_final_observation(last_item[1])
+        if isinstance(last_item, Mapping) or hasattr(last_item, "model_dump") or hasattr(last_item, "dict"):
+            return _extract_final_observation(last_item)
+        return last_item
+
+    return payload
+
+
+def _observation_to_source(observation: Any) -> Optional[str]:
+    if observation is None:
+        return None
+
+    mapping = _as_mapping(observation)
+    if mapping is not None:
+        source = mapping.get("source")
+        if isinstance(source, str) and source.strip():
+            return source
+
+        code_lines = mapping.get("code_lines") or mapping.get("code")
+        if isinstance(code_lines, Sequence) and not isinstance(code_lines, (str, bytes, bytearray)):
+            lines = [str(line) for line in code_lines]
+            return "\n".join(lines)
+
+        code_dict = mapping.get("code_dict")
+        if isinstance(code_dict, Mapping) and code_dict:
+            ordered_lines: list[tuple[int, str]] = []
+            for key, value in code_dict.items():
+                try:
+                    line_no = int(key)
+                except Exception:
+                    continue
+                ordered_lines.append((line_no, str(value)))
+            if ordered_lines:
+                ordered_lines.sort(key=lambda item: item[0])
+                return "\n".join(line for _, line in ordered_lines)
+
+    for attr in ("source", "code", "code_lines", "code_dict"):
+        if hasattr(observation, attr):
+            value = getattr(observation, attr)
+            if isinstance(value, str) and value.strip():
+                return value
+            if isinstance(value, Sequence) and not isinstance(value, (str, bytes, bytearray)):
+                return "\n".join(str(line) for line in value)
+            if isinstance(value, Mapping) and value:
+                ordered_lines = []
+                for key, line in value.items():
+                    try:
+                        ordered_lines.append((int(key), str(line)))
+                    except Exception:
+                        continue
+                if ordered_lines:
+                    ordered_lines.sort(key=lambda item: item[0])
+                    return "\n".join(line for _, line in ordered_lines)
+
+    return None
+
 
-    if isinstance(action_history, Sequence) and not isinstance(action_history, (str, bytes, bytearray)):
-        action_count = sum(1 for _ in action_history)
-        baseline += min(0.20, action_count * 0.01)
-    elif isinstance(payload, Sequence) and not isinstance(payload, (str, bytes, bytearray)):
-        action_count = sum(1 for _ in payload)
-        baseline += min(0.20, action_count * 0.01)
+def _evaluate_task(task_name: str, payload: Any) -> float:
+    task = _find_task(task_name)
+    if task is None:
+        return MIN_SCORE
 
-    return _clamp(baseline)
+    final_observation = _extract_final_observation(payload)
+    source = _observation_to_source(final_observation)
+    if not source or not source.strip():
+        return MIN_SCORE
+
+    try:
+        _, results, syntax_err = run_code_with_tests(
+            source=source,
+            test_callables=task["tests"],
+        )
+    except Exception:
+        return MIN_SCORE
+
+    if syntax_err:
+        return MIN_SCORE
+
+    if results and all(test_result.passed for test_result in results):
+        return MAX_SCORE
+
+    return MIN_SCORE
 
 
 def grade(payload: Any = None, *args: Any, task_name: str = "", **kwargs: Any) -> float:
-    """Return a normalized score in the project's intended range."""
+    """Execute the task's real tests against the final code state."""
 
     if payload is None and args:
         payload = args[0]
 
-    for candidate in (payload, kwargs):
-        if candidate is None:
-            continue
-        score = _find_score_value(candidate)
-        if score is not None:
-            return _clamp(score)
-
     if not task_name:
         task_name = str(kwargs.get("task_id") or kwargs.get("name") or "")
 
     if task_name:
-        return _fallback_score(task_name, payload or kwargs)
+        active_payload = payload if payload is not None else kwargs
+        return _evaluate_task(task_name, active_payload)
 
-    return _clamp(0.15)
+    return MIN_SCORE
 
 
 def grade_valid_parentheses_wrong_mapping(*args: Any, **kwargs: Any) -> float:
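For orientation on the new grading path: `grade` now unwraps the rollout payload via `_extract_final_observation`, recovers the candidate source via `_observation_to_source`, and executes the task's real tests, collapsing the result to `MAX_SCORE` or `MIN_SCORE`. Below is a usage sketch only; the `server.graders` import path and the example fix are assumptions, and it requires the project's `core.sandbox` and `tasks.tasks` modules to be importable.

```python
# Usage sketch only: assumes the repo root is on PYTHONPATH so that
# server.graders, core.sandbox and tasks.tasks all resolve.
from server.graders import grade

# One payload shape the grader unwraps: a mapping whose "final_observation"
# carries the candidate code under "source".
payload = {
    "final_observation": {
        "source": (
            "def is_valid(s: str) -> bool:\n"
            "    pairs = {')': '(', ']': '[', '}': '{'}\n"
            "    stack = []\n"
            "    for ch in s:\n"
            "        if ch in '([{':\n"
            "            stack.append(ch)\n"
            "        elif not stack or stack.pop() != pairs[ch]:\n"
            "            return False\n"
            "    return not stack\n"
        )
    }
}

score = grade(payload, task_name="valid_parentheses_wrong_mapping")
print(score)  # MAX_SCORE (0.98) only if every task test passes, else MIN_SCORE (0.01)
```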