CLAUDE.md - TraceFix-RL (RL_ENV_FINAL)
Current, code-backed notes for assistants working in this repository. Last updated: 2026-04-08
Project Status Snapshot
- Repo: code_reasoner_rl_env
- Branch: master
- Working tree: dirty
  - Modified: .gitignore, inference.py, models.py, __pycache__/models.cpython-312.pyc
  - Untracked: .hfignore
- Last recorded pre-validation command in terminal: ./pre-val.sh https://sus-human-tracefix-rl.hf.space .
  - Exit code: 1
This file describes the current implementation in RL_ENV_FINAL only.
High-Level Architecture
- environment.py: core gym-style state machine (TraceFixRLGym)
- server/tracefix_rl_environment.py: OpenEnv adapter (Environment interface)
- server/app.py: FastAPI app creation and uvicorn entrypoint
- models.py: action/observation schemas (CodeAction, CodeObservation, TestResult)
- sandbox.py: isolated code execution, test running, and timeout handling
- tasks.py: static task registry (easy/medium/hard)
- context.py: localized context windowing around the last edit
- client.py: typed OpenEnv client (TraceFixRLEnv / MyEnv)
- inference.py: baseline agent runner using an OpenAI-compatible API
- openenv.yaml: OpenEnv runtime metadata (app: server.app:app, port: 7860)
Runtime and Entry Points
- Local server via project script: uv run --project . server
- Container command in Dockerfile: uvicorn server.app:app --host 0.0.0.0 --port 7860
- OpenEnv spec points to: server.app:app
Environment Behavior (environment.py)
Action space:
- VIEW_CODE
- RUN_TESTS
- REPLACE_LINES
- UNDO_EDIT
- RESET_TO_ORIGINAL
- SUBMIT
Reward constants currently defined:
- R_STEP_COST = -0.01
- R_RUN_TESTS = +0.10
- R_PER_NEW_PASS = +0.05
- R_SYNTAX_ERROR = -0.10
- R_INVALID_LINE = -0.02
- R_DESTRUCTIVE_PENALTY = -0.20
- R_UNDO_RESET = -0.10
- MAX_STEPS = 50
Episode internals include:
- code snapshotting (_original_code, _edit_history)
- anti-loop penalty for repeated identical action_type
- contextual anchor (_last_edited_line) for localized context
- cumulative step-cost tracking (_accumulated_step_costs)
Submit scoring model:
- proportion = passing_tests / total_tests (or 0 on syntax error)
- raw_score = proportion - _accumulated_step_costs
- final_score = clamp(raw_score, 0.0, 1.0)
- the same clamp model is used for auto-evaluation on max-step timeout
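The scoring rule above can be sketched as a small pure function. This is an illustration only, not the code in environment.py: the function names are hypothetical, the zero-test guard is an added safety check, and it assumes the accumulated step cost is stored as a non-negative magnitude so that subtracting it lowers the score.

```python
def clamp(value: float, low: float = 0.0, high: float = 1.0) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def submit_score(
    passing_tests: int,
    total_tests: int,
    accumulated_step_costs: float,
    had_syntax_error: bool = False,
) -> float:
    """Clamped proportion-minus-step-cost score (hypothetical helper).

    Assumption: accumulated_step_costs is a non-negative magnitude,
    e.g. 10 steps at 0.01 each gives 0.10.
    """
    if had_syntax_error or total_tests == 0:
        # A syntax error scores as zero passing tests; the zero-test
        # guard is an addition to avoid dividing by zero.
        proportion = 0.0
    else:
        proportion = passing_tests / total_tests
    return clamp(proportion - accumulated_step_costs)
```

Under these assumptions, 3 of 4 passing tests after 10 steps gives roughly 0.75 - 0.10 = 0.65, and any raw score below zero is reported as 0.0.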
Task sampling policy:
- training_step == 0: random from ALL_TASKS
- < 1000: easy
- < 5000: medium
- >= 5000: hard
- fallback to first non-empty bucket
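A minimal sketch of this curriculum follows. The thresholds come from the list above; the function names, the dict-of-lists registry shape, and the bucket iteration order used for the fallback are assumptions made for illustration.

```python
import random
from typing import Dict, List, Optional


def pick_bucket(training_step: int) -> Optional[str]:
    """Map a training step to a difficulty bucket; None means 'any task'."""
    if training_step == 0:
        return None
    if training_step < 1000:
        return "easy"
    if training_step < 5000:
        return "medium"
    return "hard"


def sample_task(training_step: int, tasks_by_difficulty: Dict[str, List[str]]) -> str:
    """Sample one task, falling back to the first non-empty bucket."""
    bucket = pick_bucket(training_step)
    if bucket is None:
        # Step 0: sample uniformly over every task in the registry.
        candidates = [t for tasks in tasks_by_difficulty.values() for t in tasks]
    else:
        candidates = list(tasks_by_difficulty.get(bucket, []))
    if not candidates:
        # Fallback: first non-empty bucket in dict order (assumed order).
        candidates = next((t for t in tasks_by_difficulty.values() if t), [])
    if not candidates:
        raise ValueError("task registry is empty")
    return random.choice(candidates)
```

For example, with one task per bucket, step 500 yields the easy task and step 5000 yields the hard one.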
Schema Notes (models.py)
Important: current code uses Pydantic v2-style validation APIs.
- CodeAction uses @model_validator(mode="before")
- Non-REPLACE_LINES actions force start_line, end_line, new_code_block to None
- REPLACE_LINES enforces required fields and 1-indexed positive range constraints
This is not compatible with Pydantic v1-only assumptions.
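The normalization rules above can be mirrored as a plain function over the raw payload dict, which is the kind of input a mode="before" validator receives. This sketch deliberately avoids importing Pydantic so it runs anywhere; normalize_action is a hypothetical name, and the end_line >= start_line check is an assumed reading of "positive range constraints".

```python
from typing import Any, Dict

EDIT_FIELDS = ("start_line", "end_line", "new_code_block")


def normalize_action(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the CodeAction field rules to a raw dict (illustrative only)."""
    data = dict(payload)  # do not mutate the caller's dict
    if data.get("action_type") != "REPLACE_LINES":
        # Edit fields are meaningless for other actions, so blank them.
        for field in EDIT_FIELDS:
            data[field] = None
        return data

    missing = [f for f in EDIT_FIELDS if data.get(f) is None]
    if missing:
        raise ValueError(f"REPLACE_LINES requires: {', '.join(missing)}")

    start, end = data["start_line"], data["end_line"]
    # Lines are 1-indexed; end >= start is an assumption of this sketch.
    if start < 1 or end < start:
        raise ValueError("line range must satisfy 1 <= start_line <= end_line")
    return data
```

In the real model, logic like this would sit inside the validator method, with ValueError surfacing as a Pydantic validation error.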
Sandbox Notes (sandbox.py)
run_code_with_tests(...) returns a strict 3-tuple:
- output_str
- List[TestResult]
- had_syntax_error: bool
Execution safeguards:
- subprocess isolation via multiprocessing.Process
- timeout terminate/kill path
- tail truncation (MAX_OUTPUT_CHARS = 1000)
- restricted builtins to block risky operations
Tasks Registry (tasks.py)
- Static hardcoded registry grouped by difficulty
- Exports: TASKS_BY_DIFFICULTY, ALL_TASKS
- Expected total currently: 16 tasks
  - easy: 4
  - medium: 6
  - hard: 6
OpenEnv Adapter and Client
server/tracefix_rl_environment.py:
- Maps optional reset difficulty to training_step hints
- Writes system_prompt into observation metadata
- Sets observation reward/done from gym step output
client.py:
- Sends actions using model_dump(exclude_none=True)
- Parses OpenEnv payloads into typed CodeObservation
Inference Runner (inference.py)
Key defaults:
- API_BASE_URL = https://router.huggingface.co/v1
- MODEL_NAME = Qwen/Qwen2.5-72B-Instruct
- MAX_STEPS = 50
- SUCCESS_SCORE_THRESHOLD = 0.99
- THINKING_TOKEN_LIMIT = 512
Behavior:
- Logs in strict sequence: [START], repeated [STEP], then [END]
- Uses JSON extraction fallback path from model text
- Falls back to RUN_TESTS on parse or validation failure
- Supports --easy, --medium, --hard, --debug
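The JSON extraction and RUN_TESTS fallback might look roughly like the sketch below. It is not the code in inference.py: extract_action and FALLBACK_ACTION are hypothetical names, and schema validation against CodeAction is left out, so only the parse-failure half of the fallback is shown.

```python
import json
from typing import Any, Dict

FALLBACK_ACTION: Dict[str, Any] = {"action_type": "RUN_TESTS"}


def extract_action(model_text: str) -> Dict[str, Any]:
    """Return the first JSON object with an action_type, else RUN_TESTS."""
    decoder = json.JSONDecoder()
    # Try every '{' as a possible start of a JSON object, so prose or
    # malformed braces before the real payload are skipped.
    for idx, char in enumerate(model_text):
        if char != "{":
            continue
        try:
            obj, _end = decoder.raw_decode(model_text, idx)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "action_type" in obj:
            return obj
    # Nothing usable was found: fall back to a safe, informative action.
    return dict(FALLBACK_ACTION)
```

Falling back to RUN_TESTS keeps the episode moving with an action that cannot damage the code under repair.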
Drift and Risk Notes
- requirements.txt currently pins pydantic==1.10.17, but code in models.py uses v2 APIs (model_validator). pyproject.toml is the active dependency source for uv sync; requirements.txt appears stale relative to runtime assumptions.
- environment.py defines R_SUBMIT_ALL_PASS and R_SUBMIT_FAIL, but submit currently uses clamped proportion-minus-step-cost scoring instead of those constants.
- server/tracefix_rl_environment.py advertises concurrent sessions support, while create_app(..., max_concurrent_envs=1) constrains server-level concurrency.
Practical Checklist Before Validation
- Confirm dependency source of truth (pyproject.toml vs requirements.txt) and align Pydantic version expectations.
- Re-run pre-validation and capture the first failing check/output.
- Remove tracked cache artifacts from version control if unintended (for example __pycache__/*.pyc).
- Keep stdout format in inference.py unchanged, as validator parsing depends on it.
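For the first checklist item, a small helper can read the exact pin out of requirements.txt text. This is a hedged sketch: pinned_major is a hypothetical function, it only recognizes exact == pins, and it does not inspect pyproject.toml.

```python
import re
from typing import Optional


def pinned_major(requirements_text: str, package: str) -> Optional[int]:
    """Return the major version of an exact `package==X.Y.Z` pin, or None."""
    pattern = re.compile(
        rf"^\s*{re.escape(package)}\s*==\s*(\d+)",
        re.IGNORECASE | re.MULTILINE,
    )
    match = pattern.search(requirements_text)
    return int(match.group(1)) if match else None
```

A result of 1 for pydantic would confirm the drift noted above, since model_validator exists only in Pydantic v2.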