databoysu committed on
Commit
33ef871
·
1 Parent(s): 1dfb089

local test

.dockerignore ADDED
@@ -0,0 +1,32 @@
1
+ # Local dev / VCS
2
+ .git
3
+ .gitignore
4
+ .hfignore
5
+
6
+ # Python
7
+ __pycache__/
8
+ *.pyc
9
+ *.pyo
10
+ .venv/
11
+ .pytest_cache/
12
+ .mypy_cache/
13
+ *.egg-info/
14
+
15
+ # Node / tooling
16
+ node_modules/
17
+
18
+ # Secrets / local config
19
+ .env
20
+ .env.*
21
+
22
+ # Outputs
23
+ outputs/
24
+ *.log
25
+
26
+ .agent/
27
+ .agents/
28
+ .github/
29
+
30
+ CLAUDE.md
31
+ package.json
32
+ package-lock.json
.gitignore CHANGED
@@ -1,6 +1,12 @@
1
- .venv
2
- .agents
3
  .env
4
  uv.lock
5
  claude.md
6
- __pycache__/
 
1
+ .venv/
2
+ .agents/
3
  .env
4
  uv.lock
5
  claude.md
6
+ __pycache__/
7
+ _pycache_/
8
+ node_modules/
9
+ package.json
10
+ package-lock.json
11
+ .github/
12
+ .agent/
.hfignore ADDED
@@ -0,0 +1,35 @@
1
+ # Python/cache
2
+ __pycache__/
3
+ *.pyc
4
+ *.pyo
5
+ *.pyd
6
+
7
+ # Virtual envs / local tooling
8
+ .venv/
9
+ .agents/
10
+ node_modules/
11
+ .agent/
12
+
13
+ # Local secrets/config
14
+ .env
15
+ .env.*
16
+
17
+ # VCS and editor noise
18
+ .git/
19
+ .gitignore
20
+ .DS_Store
21
+ .idea/
22
+ .vscode/
23
+ CLAUDE.md
24
+
25
+ # Build / generated artifacts
26
+ *.egg-info/
27
+ dist/
28
+ build/
29
+ .pytest_cache/
30
+ .mypy_cache/
31
+
32
+ # Local outputs/logs
33
+ outputs/
34
+ *.log
35
+ .github/
CLAUDE.md CHANGED
@@ -1,356 +1,163 @@
1
- # CLAUDE.md TraceFix-RL
2
 
3
- Codebase knowledge for AI assistants. Read before making changes.
 
4
 
5
- **Phase status:**
6
- - Phase 1 — REPLACE_LINES action space ✅
7
- - Phase 1B — Hackathon compliance (clamper + inference.py) ✅
8
- - Phase 2 — Mini-Git (UNDO_EDIT + RESET_TO_ORIGINAL) ✅
9
- - Phase 3 — Curriculum learning + static task registry ✅
10
- - Phase 4 — Constrained decoding + Chain-of-Thought + "Reasoning Model" support ✅
11
-
12
- ---
13
-
14
- ## File Map
15
-
16
- ```
17
- models.py ← Pydantic v1 schema (action + observation spaces)
18
- tasks.py ← Static task registry (16 hardcoded curated tasks)
19
- sandbox.py ← Isolated multiprocessing executor
20
- environment.py ← RL state machine (reset/step/reward/clamping/curriculum)
21
- context.py ← ±10-line localized view around last edit
22
- server.py ← FastAPI WebSocket + HTTP server
23
- inference.py ← Hackathon baseline agent (OpenAI client)
24
- openenv.yaml ← OpenEnv metadata (validate-submission.sh)
25
- Dockerfile ← 2-stage HuggingFace Spaces build (production)
26
- requirements.txt
27
-
28
- ── Offline tools (NOT in Docker image) ────────────────────────────────────
29
- mutation_engine.py ← Bug injection operators — run locally to generate tasks
30
- dataset_generator.py← Validate + build task dicts from base solutions
31
- ```
32
-
33
- **Critical:** `mutation_engine.py` and `dataset_generator.py` are **not copied into the Docker image**. They are local data-science tools only. `tasks.py` must not import them.
34
-
35
- ---
36
-
37
- ## `models.py`
38
-
39
- - **Pydantic v1** (`pydantic==1.10.17`). Never upgrade — `.dict()`, `.parse_raw()`, `.json()`, `@validator` are v1 APIs used everywhere.
40
- - `ActionType`: exactly **6** strings — `"VIEW_CODE"`, `"RUN_TESTS"`, `"REPLACE_LINES"`, `"UNDO_EDIT"`, `"RESET_TO_ORIGINAL"`, `"SUBMIT"`.
41
- - `CodeAction(extra="forbid")` — any extra JSON key raises `ValidationError`.
42
-
43
- ### CodeAction fields
44
-
45
- | Field | Type | Required for |
46
- |---|---|---|
47
- | `thought` | `Optional[str]` | Always (Chain-of-thought scratchpad) |
48
- | `action_type` | `ActionType` | always |
49
- | `start_line` | `Optional[int]` (ge=1) | `REPLACE_LINES` |
50
- | `end_line` | `Optional[int]` (ge=1) | `REPLACE_LINES` |
51
- | `new_code_block` | `Optional[str]` | `REPLACE_LINES` |
52
-
53
- No extra fields for `UNDO_EDIT` or `RESET_TO_ORIGINAL`.
54
-
55
- ### CodeObservation key fields
56
-
57
- | Field | Notes |
58
- |---|---|
59
- | `code_lines: List[str]` | Complete current source (authoritative) |
60
- | `localized_context: str` | ±10 lines around last edit; empty until first REPLACE_LINES |
61
- | `last_execution_output: str` | Tail of stdout+stderr from last RUN_TESTS/SUBMIT |
62
- | `syntax_error: bool` | `ast.parse()` check, updated every step |
63
- | `test_results: List[TestResult]` | Per-test pass/fail + error_message |
64
- | `step_count / steps_remaining` | Progress vs MAX_STEPS=50 |
65
- | `reward_last_step: float` | Per-step RL signal |
66
- | `done: bool` | Episode ended |
67
- | `info: dict` | `episode_id`, `task_name`, `task_difficulty` |
68
-
69
- ---
70
-
71
- ## `tasks.py` — Static Registry
72
-
73
- **This file is a dumb registry.** It contains only hardcoded dicts — no imports from `mutation_engine` or `dataset_generator`. Zero cold-start cost; fully deterministic for evaluators.
74
-
75
- To add new tasks: run `mutation_engine.py` + `dataset_generator.py` locally, curate the best outputs, paste them in as hardcoded dicts.
76
-
77
- ### Exported symbols
78
-
79
- | Symbol | Type | Description |
80
- |---|---|---|
81
- | `TASKS_BY_DIFFICULTY` | `Dict[str, List[Dict]]` | Tasks grouped by difficulty tier |
82
- | `ALL_TASKS` | `List[Dict]` | Flat list of all tasks (for random sampling) |
83
-
84
- **Current registry size:** `easy=4`, `medium=6`, `hard=6` → 16 tasks total.
85
-
86
- ### Task dict schema
87
-
88
- ```python
89
- {
90
- "name": str, # e.g. "binary_search_off_by_one"
91
- "description": str,
92
- "code": List[str], # buggy version, lines without trailing \n
93
- "solution": List[str], # correct version
94
- "tests": List[Callable],# accept (namespace_dict), raise AssertionError
95
- "difficulty": str, # "easy" | "medium" | "hard"
96
- "bug_type": str, # e.g. "wrong_operator" or "logic_inversion"
97
- }
98
- ```
99
-
100
- ### Task catalogue
101
-
102
- | Name | Bug | Difficulty |
103
- |---|---|---|
104
- | `sum_even_wrong_condition` | `!= 0` instead of `== 0` | easy |
105
- | `sum_even_missing_accumulator` | `-=` instead of `+=` | easy |
106
- | `reverse_string_wrong_step` | `[::-2]` instead of `[::-1]` | easy |
107
- | `reverse_string_returns_original` | `[::1]` instead of `[::-1]` | easy |
108
- | `binary_search_off_by_one` | `right = len(arr)` instead of `len(arr)-1` | medium |
109
- | `binary_search_wrong_mid` | `left + right` instead of `(left + right) // 2` | medium |
110
- | `flatten_missing_recursion` | `append` instead of `extend(flatten(item))` | medium |
111
- | `flatten_inverted_branch` | `not isinstance` inverts the recursive branch | medium |
112
- | `word_count_no_lower` | missing `text = text.lower()` | medium |
113
- | `word_count_no_punct_strip` | missing punctuation stripping | medium |
114
- | `lru_cache_wrong_eviction` | `pop(-1)` instead of `pop(0)` — evicts MRU | hard |
115
- | `lru_cache_no_promotion` | `get()` doesn't move key to most-recently-used | hard |
116
- | `valid_parentheses_wrong_mapping` | all three bracket mappings are wrong | hard |
117
- | `valid_parentheses_no_empty_check` | missing `not stack or` guard before `pop()` | hard |
118
- | `merge_intervals_strict_overlap` | `<` instead of `<=` — touching intervals not merged | hard |
119
- | `merge_intervals_missing_sort` | missing `intervals.sort()` | hard |
120
-
121
- ---
122
-
123
- ## `environment.py`
124
-
125
- ### Interface
126
- ```python
127
- obs, system_prompt = env.reset()
128
- obs, reward, done, info = env.step(action: CodeAction)
129
- ```
130
-
131
- ### Reward constants
132
- ```python
133
- R_STEP_COST = -0.01 # every step (RL signal only)
134
- R_RUN_TESTS = +0.10
135
- R_PER_NEW_PASS = +0.05 # per newly passing test
136
- R_INVALID_LINE = -0.02
137
- R_SYNTAX_ERROR = -0.10 # inside _act_run_tests on syntax failure
138
- R_UNDO_RESET = -0.10 # UNDO_EDIT and RESET_TO_ORIGINAL
139
- MAX_STEPS = 50
140
- ```
141
-
142
- ### Episode state (ALL reset in `reset()`)
143
-
144
- - **System Prompt**: Enforces SOP (Standard Operating Procedure: ORIENT → DIAGNOSE → FIX → VERIFY → REPEAT → SUBMIT) and strictly forbids consecutive `VIEW_CODE` calls.
145
-
146
- | Field | Description |
147
- |---|---|
148
- | `_code_lines` | Working copy of buggy code |
149
- | `_task` | Current task dict |
150
- | `_step_count` | Steps this episode |
151
- | `_prev_pass_count` | Test passes at last RUN_TESTS |
152
- | `_last_test_results` | From last RUN_TESTS/SUBMIT |
153
- | `_last_output` | Text output for observation |
154
- | `_last_edited_line` | 1-indexed anchor for context.py |
155
- | `_episode_id` | 8-char UUID prefix |
156
- | `_done` | Episode finished |
157
- | `_cumulative_reward` | Sum of all step rewards |
158
- | `_accumulated_step_costs` | `count × 0.01` — used by hackathon clamper |
159
- | `_original_code` | Deep copy of episode-start code; never mutated |
160
- | `_edit_history` | Stack of `List[str]` snapshots; one pushed before each REPLACE_LINES |
161
-
162
- `training_step: int = 0` — **not reset by `reset()`**. Persists across episodes. Set externally by trainer.
163
-
164
- ### `_sample_task()` — Evaluation-safe curriculum sampler
165
-
166
- Priority order:
167
-
168
- 1. **`task_override=dict`** → return it directly (eval/test pinning)
169
- 2. **`training_step == 0`** → random pick from `ALL_TASKS` ← **judge-safe default**
170
- - The Meta evaluator calls `reset()` without setting `training_step`, so this must not crash or bias to one bucket
171
- 3. **`training_step > 0`** → curriculum bucketing:
172
- - `< 1000` → easy
173
- - `1000 – 4999` → medium
174
- - `>= 5000` → hard
175
- - Falls back to any non-empty bucket if the target is empty
176
-
177
- ### Action handlers
178
-
179
- | Method | Delta reward | Key behavior |
180
- |---|---|---|
181
- | `_act_view_code()` | 0.0 | Sets `_last_output` with numbered source |
182
- | `_act_run_tests()` | `R_RUN_TESTS` ± syntax ± new passes | Updates `_prev_pass_count` |
183
- | `_act_replace_lines(s, e, block)` | 0.0 or `R_INVALID_LINE` | Snapshots before mutating; slice assign; anchor = end of new block; blocks deletion of >5 lines (`R_DESTRUCTIVE_PENALTY`) |
184
- | `_act_undo_edit()` | `R_UNDO_RESET` (-0.10) | Pops `_edit_history`; sets `_last_edited_line = None` |
185
- | `_act_reset_to_original()` | `R_UNDO_RESET` (-0.10) | Restores `_original_code`; clears `_edit_history`; sets `_last_edited_line = None` |
186
- | `_act_submit()` | clamped [0.0, 1.0] | Hackathon score formula |
187
-
188
- **Action Penalties**:
189
- - **Anti-Loop**: `step()` applies an escalating `-0.05 * n` penalty if the agent chooses the exact same `action_type` repeatedly.
190
- - **Escape Hatch Rule**: The prompt explicitly warns against manual space-fixing on syntax/indent errors, directing the agent to use `UNDO_EDIT` or `RESET_TO_ORIGINAL`.
191
-
192
- ### Hackathon Reward Clamper (`_act_submit` & Timeout)
193
-
194
- ```python
195
- proportion = passes / total # 0.0 on syntax error
196
- raw_score = proportion - self._accumulated_step_costs
197
- final_score = max(0.0, min(1.0, raw_score))
198
- ```
199
 
200
- - **Deterministic Evaluation**: Floor ≥0.0 and <=1.0 guaranteed.
201
- - **Trigger**: Runs on `SUBMIT` **or** when hitting `MAX_STEPS` timeout. Never trusts the LLM to call `SUBMIT`.
202
- - Stored in `info["final_score"]` when `done=True`.
203
 
204
- ---
205
 
206
- ## `context.py`
207
 
208
- `get_localized_context(code_lines, anchor_line, window=10) -> str`
209
- - Returns `""` if `anchor_line is None` or `code_lines` is empty.
210
- - Uses `len(code_lines)` dynamically, so it handles REPLACE_LINES growth/shrink correctly.
211
- - Hard cap: `MAX_CONTEXT_CHARS = 2_000`.
212
 
213
- ---
214
 
215
- ## `sandbox.py`
216
 
217
- `run_code_with_tests(source: str, callables, timeout=5) -> (output_str, List[TestResult], had_syntax_error)`
218
 
219
- - **Always a 3-tuple.** Never access as an object (no `.all_pass`, no `.test_results`).
220
- - `source` must be a `str` — call `"\n".join(code_lines)` before passing.
221
- - Isolation: `multiprocessing.Process`, SIGTERM → SIGKILL on timeout.
222
- - Output tail-truncated to `MAX_OUTPUT_CHARS = 1_000`.
223
 
224
- ---
225
 
226
- ## `server.py`
227
 
228
- FastAPI WebSocket layer. Port: `os.environ.get("PORT", 7860)`.
229
 
230
- | Endpoint | Notes |
231
- |---|---|
232
- | `GET /health` | Liveness probe |
233
- | `GET /info` | Env metadata + `CodeAction.schema()` |
234
- | `POST /reset` | Stateless, new env per request |
235
- | `WS /ws` | Primary RL channel — auto-resets on `done=True`. Append `?difficulty=easy|medium|hard` to set tier. |
236
 
237
- ---
238
 
239
- ## `inference.py`
240
 
241
- Config from `os.getenv`:
242
 
243
- | Variable | Default | Notes |
244
- |---|---|---|
245
- | `API_BASE_URL` | `https://api.openai.com/v1` | OpenEnv compatible proxy URL |
246
- | `MODEL_NAME` | `gpt-4o` | Robust fallback model if missing |
247
- | `HF_TOKEN` | `""` | Optional HuggingFace Token |
248
- | `ENV_WS_URL` | `ws://localhost:7860/ws` | Connecting environment URL |
249
- | `DEBUG_LOG` | `0` | Set to `1` to print raw LLM output |
250
 
251
- **CLI Flags:**
252
- - `python inference.py --easy` (or `--medium`, `--hard`) appends `?difficulty=...` parameter to the WS URL to override `training_step` bucketing.
253
 
254
- ### Decoding & Fallbacks
255
 
256
- - **Structured Output**: Uses `json_schema` protocol with strict `CodeAction` forcing `thought` generation before `action_type`.
257
- - **Reasoning Models**: Directly parses `.model_dump()["reasoning_content"]` if `content` is empty (e.g. DeepSeek-R1 / Nemotron in LM Studio).
258
- - **Mask-Free Parser**: Invalid JSON explicitly returns `PARSE_ERROR` to the server (preventing silent `VIEW_CODE` loops), forcing LLM self-correction.
259
 
260
- **Exact stdout log format (regex-parsed by validation judge):**
261
- ```
262
- [START] task=<task_name> env=TraceFixRL model=<model_name>
263
- [STEP] step=<n> action=<action_type> reward=<r.rr> done=<true|false> error=<msg|null>
264
- [END] success=<true|false> steps=<n> score=<s.sss> rewards=<r1,r2,...,rn>
265
- ```
266
 
267
- - `reward` `:.2f`; `done` lowercase; `error` → `"null"` on success.
268
- - `score` → `:.3f` — pulled from `info["final_score"]` (the clamped [0,1] value).
269
- - `rewards` → comma-separated, no spaces.
270
 
271
- ---
272
 
273
- ## `openenv.yaml`
274
 
275
- Consumed by `openenv validate` step in `validate-submission.sh`.
276
 
277
- Key fields: `reward_range: [0.0, 1.0]`, `inference_script: inference.py`, `websocket_path: /ws`, `port: 7860`.
278
 
279
- ---
280
 
281
- ## `Dockerfile`
282
 
283
- Two-stage build. Runtime COPY (all with `--chown=appuser:appuser`):
284
- ```
285
- models.py environment.py sandbox.py tasks.py
286
- server.py context.py inference.py
287
- ```
288
 
289
- **`mutation_engine.py` and `dataset_generator.py` are NOT copied.** They are offline tools.
290
 
291
- ---
292
 
293
- ## Offline Tools (local only)
294
 
295
- ### `mutation_engine.py`
296
 
297
- `MutationEngine(seed).mutate(code_lines, difficulty, max_attempts=10)`
298
- `(List[str], {"bug_type": str, "num_bugs": int})` or `(None, None)`
299
 
300
- Operator sets:
301
 
302
- | Difficulty | Operators |
303
- |---|---|
304
- | easy | `_var_name_error`, `_wrong_operator` |
305
- | medium | easy + `_off_by_one`, `_logic_inversion`, `_index_error` |
306
- | hard | medium + `_mutable_default`, `_remove_return`, `_wrong_function_call` |
307
 
308
- ### `dataset_generator.py`
309
 
310
- `validate_task(original, mutated, tests)` — original must pass all tests; mutated must fail ≥ 1.
311
- `generate_task(base_task, mutator)` — calls mutate + validate; returns task dict or `None`.
312
 
313
- **Workflow to add new tasks:**
314
- ```bash
315
- python -c "
316
- from mutation_engine import MutationEngine
317
- from dataset_generator import generate_task
318
- # define base_task with solution + tests
319
- # run generate_task, inspect output, paste into tasks.py
320
- "
321
- ```
322
 
323
- ---
324
 
325
- ## Dependencies
326
 
327
- ```
328
- fastapi==0.111.0
329
- uvicorn[standard]==0.30.1
330
- pydantic==1.10.17 ← v1 ONLY
331
- websockets==12.0
332
- openai>=1.30.0 ← inference.py only
333
- ```
334
 
335
- IDE lint warnings for these packages are expected false positives; they live in Docker, not system Python.
336
-
337
- ---
338
-
339
- ## Invariants
340
-
341
- 1. **Pydantic v1 only.** Never upgrade.
342
- 2. **1-indexed lines in public API**; 0-indexed in `_code_lines`.
343
- 3. `reset()` wipes every mutable field including `_accumulated_step_costs`, `_original_code`, `_edit_history`. `training_step` is NOT reset.
344
- 4. Reward delta model — handlers return delta; `R_STEP_COST` applied once per step before routing.
345
- 5. REPLACE_LINES anchor = `min(start + len(new_lines) - 1, file_length)`.
346
- 6. SUBMIT reward clamped `[0.0, 1.0]` — this is the grader score. Floor guaranteed ≥ 0.0.
347
- 7. `_act_run_tests()` updates `_prev_pass_count`; `_act_submit()` does not.
348
- 8. Task `code` strings have no trailing `\n`; `_source()` joins with `\n`.
349
- 9. `context.py` is already fully dynamic — no changes needed for REPLACE_LINES growth/shrink.
350
- 10. Output truncation is **tail-based** (end of traceback = actionable info).
351
- 11. **Mini-Git snapshot timing**: snapshot pushed **before** slice assignment. Rejected edits (OOB, inverted range) produce no snapshot.
352
- 12. **Context desync invariant**: Both rollback handlers set `_last_edited_line = None`. Without this, `context.py` anchors to a ghost line after revert.
353
- 13. **`_original_code` is immutable**: set once in `reset()`, only read in `_act_reset_to_original()`.
354
- 14. **`sandbox.run_code_with_tests` returns a 3-tuple**: `(output_str, List[TestResult], had_syntax_error)`. Never treat as object.
355
- 15. **`tasks.py` must not import `mutation_engine` or `dataset_generator`**: those are offline tools not in the Docker image.
356
- 16. **`training_step == 0` → random from ALL_TASKS**: the judge calls `reset()` with default `training_step=0`, so this path must work correctly and not bias to one difficulty bucket.
 
1
+ # CLAUDE.md - TraceFix-RL (RL_ENV_FINAL)
2
 
3
+ Current, code-backed notes for assistants working in this repository.
4
+ Last updated: 2026-04-08
5
 
6
+ ## Project Status Snapshot
7
 
8
+ - Repo: `code_reasoner_rl_env`
9
+ - Branch: `master`
10
+ - Working tree: dirty
11
+ - Modified: `.gitignore`, `inference.py`, `models.py`, `__pycache__/models.cpython-312.pyc`
12
+ - Untracked: `.hfignore`
13
+ - Last recorded pre-validation command in terminal:
14
+ - `./pre-val.sh https://sus-human-tracefix-rl.hf.space .`
15
+ - Exit code: `1`
16
 
17
+ This file describes the current implementation in `RL_ENV_FINAL` only.
18
 
19
+ ## High-Level Architecture
20
 
21
+ - `environment.py`: core gym-style state machine (`TraceFixRLGym`)
22
+ - `server/tracefix_rl_environment.py`: OpenEnv adapter (`Environment` interface)
23
+ - `server/app.py`: FastAPI app creation and uvicorn entrypoint
24
+ - `models.py`: action/observation schemas (`CodeAction`, `CodeObservation`, `TestResult`)
25
+ - `sandbox.py`: isolated code execution + test running + timeout handling
26
+ - `tasks.py`: static task registry (easy/medium/hard)
27
+ - `context.py`: localized context windowing around last edit
28
+ - `client.py`: typed OpenEnv client (`TraceFixRLEnv` / `MyEnv`)
29
+ - `inference.py`: baseline agent runner with OpenAI-compatible API
30
+ - `openenv.yaml`: OpenEnv runtime metadata (`app: server.app:app`, `port: 7860`)
31
 
32
+ ## Runtime and Entry Points
33
 
34
+ - Local server via project script:
35
+ - `uv run --project . server`
36
+ - Container command in `Dockerfile`:
37
+ - `uvicorn server.app:app --host 0.0.0.0 --port 7860`
38
+ - OpenEnv spec points to:
39
+ - `server.app:app`
40
 
41
+ ## Environment Behavior (`environment.py`)
42
 
43
+ Action space:
44
 
45
+ - `VIEW_CODE`
46
+ - `RUN_TESTS`
47
+ - `REPLACE_LINES`
48
+ - `UNDO_EDIT`
49
+ - `RESET_TO_ORIGINAL`
50
+ - `SUBMIT`
51
 
52
+ Reward constants currently defined:
53
 
54
+ - `R_STEP_COST = -0.01`
55
+ - `R_RUN_TESTS = +0.10`
56
+ - `R_PER_NEW_PASS = +0.05`
57
+ - `R_SYNTAX_ERROR = -0.10`
58
+ - `R_INVALID_LINE = -0.02`
59
+ - `R_DESTRUCTIVE_PENALTY = -0.20`
60
+ - `R_UNDO_RESET = -0.10`
61
+ - `MAX_STEPS = 50`
62
 
63
+ Episode internals include:
64
 
65
+ - code snapshotting (`_original_code`, `_edit_history`)
66
+ - anti-loop penalty for repeated identical `action_type`
67
+ - contextual anchor (`_last_edited_line`) for localized context
68
+ - cumulative step-cost tracking (`_accumulated_step_costs`)
69
 
70
+ Submit scoring model:
71
 
72
+ - `proportion = passing_tests / total_tests` (or `0` on syntax error)
73
+ - `raw_score = proportion - _accumulated_step_costs`
74
+ - `final_score = clamp(raw_score, 0.0, 1.0)`
75
+ - same clamp model used on max-step timeout auto-evaluation
76
 
77
+ Task sampling policy:
78
 
79
+ - `training_step == 0`: random from `ALL_TASKS`
80
+ - `< 1000`: easy
81
+ - `< 5000`: medium
82
+ - `>= 5000`: hard
83
+ - fallback to first non-empty bucket
84
 
85
+ ## Schema Notes (`models.py`)
86
 
87
+ Important: current code uses Pydantic v2-style validation APIs.
88
 
89
+ - `CodeAction` uses `@model_validator(mode="before")`
90
+ - Non-`REPLACE_LINES` actions force `start_line`, `end_line`, `new_code_block` to `None`
91
+ - `REPLACE_LINES` enforces required fields and 1-indexed positive range constraints
92
 
93
+ This is not compatible with Pydantic v1-only assumptions.
94
 
95
+ ## Sandbox Notes (`sandbox.py`)
96
 
97
+ `run_code_with_tests(...)` returns a strict 3-tuple:
98
 
99
+ - `output_str`
100
+ - `List[TestResult]`
101
+ - `had_syntax_error: bool`
102
 
103
+ Execution safeguards:
104
 
105
+ - subprocess isolation via `multiprocessing.Process`
106
+ - timeout terminate/kill path
107
+ - tail truncation (`MAX_OUTPUT_CHARS = 1000`)
108
+ - restricted builtins to block risky operations
109
 
110
+ ## Tasks Registry (`tasks.py`)
111
 
112
+ - Static hardcoded registry grouped by difficulty
113
+ - Exports:
114
+ - `TASKS_BY_DIFFICULTY`
115
+ - `ALL_TASKS`
116
+ - Expected total currently: 16 tasks
117
+ - easy: 4
118
+ - medium: 6
119
+ - hard: 6
120
 
121
+ ## OpenEnv Adapter and Client
122
 
123
+ `server/tracefix_rl_environment.py`:
124
 
125
+ - Maps optional reset difficulty to `training_step` hints
126
+ - Writes `system_prompt` into observation metadata
127
+ - Sets observation reward/done from gym step output
128
 
129
+ `client.py`:
130
 
131
+ - Sends actions using `model_dump(exclude_none=True)`
132
+ - Parses OpenEnv payloads into typed `CodeObservation`
133
 
134
+ ## Inference Runner (`inference.py`)
135
 
136
+ Key defaults:
137
 
138
+ - `API_BASE_URL = https://router.huggingface.co/v1`
139
+ - `MODEL_NAME = Qwen/Qwen2.5-72B-Instruct`
140
+ - `MAX_STEPS = 50`
141
+ - `SUCCESS_SCORE_THRESHOLD = 0.99`
142
+ - `THINKING_TOKEN_LIMIT = 512`
143
 
144
+ Behavior:
145
 
146
+ - Logs in strict sequence: `[START]`, repeated `[STEP]`, then `[END]`
147
+ - Uses JSON extraction fallback path from model text
148
+ - Falls back to `RUN_TESTS` on parse or validation failure
149
+ - Supports `--easy`, `--medium`, `--hard`, `--debug`
150
 
151
+ ## Drift and Risk Notes
152
 
153
+ 1. `requirements.txt` currently pins `pydantic==1.10.17`, but code in `models.py` uses v2 APIs (`model_validator`).
154
+ 2. `pyproject.toml` is the active dependency source for `uv sync`; `requirements.txt` appears stale relative to runtime assumptions.
155
+ 3. `environment.py` defines `R_SUBMIT_ALL_PASS` and `R_SUBMIT_FAIL`, but submit currently uses clamped proportion-minus-step-cost scoring instead of those constants.
156
+ 4. `server/tracefix_rl_environment.py` advertises concurrent sessions support, while `create_app(..., max_concurrent_envs=1)` constrains server-level concurrency.
157
 
158
+ ## Practical Checklist Before Validation
159
 
160
+ 1. Confirm dependency source of truth (`pyproject.toml` vs `requirements.txt`) and align Pydantic version expectations.
161
+ 2. Re-run pre-validation and capture the first failing check/output.
162
+ 3. Remove tracked cache artifacts from version control if unintended (for example `__pycache__/*.pyc`).
163
+ 4. Keep stdout format in `inference.py` unchanged, as validator parsing depends on it.
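
The submit scoring and task sampling rules described in the updated `CLAUDE.md` can be sketched in a few lines of Python. This is a minimal illustration, not the code in `environment.py`: the function names `clamp_submit_score` and `pick_difficulty` are hypothetical, the zero-test guard is an assumption the notes do not cover, and the "fallback to first non-empty bucket" step is left out.

```python
from typing import Optional


def clamp_submit_score(
    passing_tests: int,
    total_tests: int,
    accumulated_step_costs: float,
    had_syntax_error: bool = False,
) -> float:
    """Proportion of passing tests minus step costs, clamped to [0.0, 1.0]."""
    if had_syntax_error or total_tests == 0:
        # Assumption: an empty test list scores 0, the same as a syntax error.
        proportion = 0.0
    else:
        proportion = passing_tests / total_tests
    raw_score = proportion - accumulated_step_costs
    return max(0.0, min(1.0, raw_score))


def pick_difficulty(training_step: int) -> Optional[str]:
    """Return the curriculum bucket, or None for 'random from ALL_TASKS'."""
    if training_step == 0:
        return None
    if training_step < 1000:
        return "easy"
    if training_step < 5000:
        return "medium"
    return "hard"
```

Under these assumptions, an episode that passes 1 of 4 tests after accumulating 0.5 in step costs has a raw score of -0.25, which the clamp raises to 0.0.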
__pycache__/__init__.cpython-312.pyc CHANGED
Binary files a/__pycache__/__init__.cpython-312.pyc and b/__pycache__/__init__.cpython-312.pyc differ
 
__pycache__/client.cpython-312.pyc CHANGED
Binary files a/__pycache__/client.cpython-312.pyc and b/__pycache__/client.cpython-312.pyc differ
 
__pycache__/context.cpython-312.pyc CHANGED
Binary files a/__pycache__/context.cpython-312.pyc and b/__pycache__/context.cpython-312.pyc differ
 
__pycache__/environment.cpython-312.pyc CHANGED
Binary files a/__pycache__/environment.cpython-312.pyc and b/__pycache__/environment.cpython-312.pyc differ
 
__pycache__/models.cpython-312.pyc CHANGED
Binary files a/__pycache__/models.cpython-312.pyc and b/__pycache__/models.cpython-312.pyc differ
 
__pycache__/sandbox.cpython-312.pyc CHANGED
Binary files a/__pycache__/sandbox.cpython-312.pyc and b/__pycache__/sandbox.cpython-312.pyc differ
 
__pycache__/tasks.cpython-312.pyc CHANGED
Binary files a/__pycache__/tasks.cpython-312.pyc and b/__pycache__/tasks.cpython-312.pyc differ
 
inference.py CHANGED
@@ -36,10 +36,10 @@ except Exception:
36
  from models import CodeAction
37
 
38
 
39
- API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
40
- MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
41
- HF_TOKEN = os.getenv("HF_TOKEN", "")
42
- LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME", "")
43
 
44
  ENV_BASE_URL = os.getenv("ENV_BASE_URL", "http://localhost:7860")
45
  TASK_NAME = os.getenv("TASK_NAME", "tracefix_rl")
 
36
  from models import CodeAction
37
 
38
 
39
+ API_BASE_URL = os.getenv("API_BASE_URL", "http://localhost:1234/v1")
40
+ MODEL_NAME = os.getenv("MODEL_NAME", "nvidia/nemotron-3-nano-4b")
41
+ HF_TOKEN = os.getenv("HF_TOKEN") or os.getenv("API_KEY") or "lm-studio"
42
+ LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
43
 
44
  ENV_BASE_URL = os.getenv("ENV_BASE_URL", "http://localhost:7860")
45
  TASK_NAME = os.getenv("TASK_NAME", "tracefix_rl")
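
The `HF_TOKEN` change above replaces a default argument with an `or` chain, and the two forms behave differently when a variable is set but empty. The sketch below mirrors both forms against a plain mapping instead of the real process environment; `resolve_token` and `resolve_token_with_default` are hypothetical helper names used only for illustration.

```python
from typing import Mapping


def resolve_token(env: Mapping[str, str]) -> str:
    """Mirror `HF_TOKEN or API_KEY or "lm-studio"` against a plain mapping."""
    # `or` skips both missing (None) and empty ("") values.
    return env.get("HF_TOKEN") or env.get("API_KEY") or "lm-studio"


def resolve_token_with_default(env: Mapping[str, str]) -> str:
    """Mirror the earlier `os.getenv("HF_TOKEN", "")` form."""
    # A default only applies when the key is absent entirely.
    return env.get("HF_TOKEN", "")
```

With the `or` chain, an exported but empty `HF_TOKEN` falls through to `API_KEY` and then to the `"lm-studio"` placeholder, whereas the default-argument form would have passed the empty string on to the client.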
models.py CHANGED
@@ -5,7 +5,7 @@ from __future__ import annotations
5
  from typing import Any, Dict, List, Literal, Optional
6
 
7
  from openenv.core.env_server.types import Action, Observation
8
- from pydantic import BaseModel, Field, field_validator, model_validator
9
 
10
 
11
  ActionType = Literal[
@@ -33,42 +33,55 @@ class CodeAction(Action):
33
  end_line: Optional[int] = Field(default=None)
34
  new_code_block: Optional[str] = Field(default=None)
35
 
36
- @field_validator("start_line", "end_line", mode="before")
37
  @classmethod
38
- def _coerce_optional_int(cls, value: Any) -> Optional[int]:
39
- if value is None:
40
- return None
41
- if isinstance(value, str):
42
- raw = value.strip()
43
- if raw == "":
44
- return None
45
- try:
46
- return int(raw)
47
- except ValueError:
48
  return None
49
- if isinstance(value, int):
50
- return value
51
- return None
52
-
53
- @model_validator(mode="after")
54
- def validate_replace_fields(self) -> "CodeAction":
55
- if self.action_type == "REPLACE_LINES":
56
- if self.start_line is None:
57
  raise ValueError("REPLACE_LINES requires start_line.")
58
- if self.end_line is None:
59
  raise ValueError("REPLACE_LINES requires end_line.")
60
- if self.new_code_block is None:
61
  raise ValueError("REPLACE_LINES requires new_code_block.")
62
- if self.start_line < 1 or self.end_line < 1:
63
  raise ValueError("REPLACE_LINES requires start_line and end_line >= 1.")
64
- if self.start_line > self.end_line:
65
  raise ValueError("REPLACE_LINES requires start_line <= end_line.")
66
  else:
67
- # Ignore extra form fields for non-edit actions (web UI often sends defaults).
68
- self.start_line = None
69
- self.end_line = None
70
- self.new_code_block = None
71
- return self
72
 
73
 
74
  class TestResult(BaseModel):
 
5
  from typing import Any, Dict, List, Literal, Optional
6
 
7
  from openenv.core.env_server.types import Action, Observation
8
+ from pydantic import BaseModel, Field, model_validator
9
 
10
 
11
  ActionType = Literal[
 
33
  end_line: Optional[int] = Field(default=None)
34
  new_code_block: Optional[str] = Field(default=None)
35
 
36
+ @model_validator(mode="before")
37
  @classmethod
38
+ def validate_and_normalize(cls, data: Any) -> Any:
39
+ if not isinstance(data, dict):
40
+ return data
41
+
42
+ action_type = data.get("action_type")
43
+
44
+ def _coerce_optional_int(value: Any) -> Optional[int]:
45
+ if value is None:
46
  return None
47
+ if isinstance(value, int):
48
+ return value
49
+ if isinstance(value, str):
50
+ raw = value.strip()
51
+ if raw == "":
52
+ return None
53
+ try:
54
+ return int(raw)
55
+ except ValueError:
56
+ return None
57
+ return None
58
+
59
+ data = dict(data)
60
+ data["start_line"] = _coerce_optional_int(data.get("start_line"))
61
+ data["end_line"] = _coerce_optional_int(data.get("end_line"))
62
+
63
+ if action_type == "REPLACE_LINES":
64
+ start_line = data.get("start_line")
65
+ end_line = data.get("end_line")
66
+ new_code_block = data.get("new_code_block")
67
+
68
+ if start_line is None:
69
  raise ValueError("REPLACE_LINES requires start_line.")
70
+ if end_line is None:
71
  raise ValueError("REPLACE_LINES requires end_line.")
72
+ if new_code_block is None:
73
  raise ValueError("REPLACE_LINES requires new_code_block.")
74
+ if start_line < 1 or end_line < 1:
75
  raise ValueError("REPLACE_LINES requires start_line and end_line >= 1.")
76
+ if start_line > end_line:
77
  raise ValueError("REPLACE_LINES requires start_line <= end_line.")
78
  else:
79
+ # Web UI often sends default line fields for non-edit actions.
80
+ data["start_line"] = None
81
+ data["end_line"] = None
82
+ data["new_code_block"] = None
83
+
84
+ return data
85
 
86
 
87
  class TestResult(BaseModel):
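
The normalization that the new `mode="before"` validator performs can be exercised without Pydantic by writing it as a plain function over the raw payload. The sketch below is an assumption-laden restatement of the logic in the hunk above, not the project's code: `normalize_action` and the module-level `coerce_optional_int` are hypothetical names, and only the error messages are copied from the diff.

```python
from typing import Any, Dict, Optional


def coerce_optional_int(value: Any) -> Optional[int]:
    """Turn ints and numeric strings into int; everything else becomes None."""
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        try:
            return int(value.strip())  # "" and "abc" both raise ValueError
        except ValueError:
            return None
    return None


def normalize_action(data: Dict[str, Any]) -> Dict[str, Any]:
    """Coerce line fields, then validate or blank them by action type."""
    data = dict(data)  # copy so the caller's payload is not mutated
    data["start_line"] = coerce_optional_int(data.get("start_line"))
    data["end_line"] = coerce_optional_int(data.get("end_line"))

    if data.get("action_type") == "REPLACE_LINES":
        start, end = data["start_line"], data["end_line"]
        if start is None:
            raise ValueError("REPLACE_LINES requires start_line.")
        if end is None:
            raise ValueError("REPLACE_LINES requires end_line.")
        if data.get("new_code_block") is None:
            raise ValueError("REPLACE_LINES requires new_code_block.")
        if start < 1 or end < 1:
            raise ValueError("REPLACE_LINES requires start_line and end_line >= 1.")
        if start > end:
            raise ValueError("REPLACE_LINES requires start_line <= end_line.")
    else:
        # Non-edit actions drop any line fields a client sent by default.
        data["start_line"] = None
        data["end_line"] = None
        data["new_code_block"] = None
    return data
```

Because the checks run on the raw dictionary before any model is built, a non-numeric `start_line` such as `"abc"` is first coerced to `None` and then rejected with the "requires start_line" message, not a type error.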
server/__pycache__/__init__.cpython-312.pyc CHANGED
Binary files a/server/__pycache__/__init__.cpython-312.pyc and b/server/__pycache__/__init__.cpython-312.pyc differ