CLAUDE.md - TraceFix-RL (RL_ENV_FINAL)

Current, code-backed notes for assistants working in this repository. Last updated: 2026-04-08

Project Status Snapshot

  • Repo: code_reasoner_rl_env
  • Branch: master
  • Working tree: dirty
    • Modified: .gitignore, inference.py, models.py, __pycache__/models.cpython-312.pyc
    • Untracked: .hfignore
  • Last recorded pre-validation command in terminal:
    • ./pre-val.sh https://sus-human-tracefix-rl.hf.space .
    • Exit code: 1

This file describes the current implementation in RL_ENV_FINAL only.

High-Level Architecture

  • environment.py: core gym-style state machine (TraceFixRLGym)
  • server/tracefix_rl_environment.py: OpenEnv adapter (Environment interface)
  • server/app.py: FastAPI app creation and uvicorn entrypoint
  • models.py: action/observation schemas (CodeAction, CodeObservation, TestResult)
  • sandbox.py: isolated code execution + test running + timeout handling
  • tasks.py: static task registry (easy/medium/hard)
  • context.py: localized context windowing around last edit
  • client.py: typed OpenEnv client (TraceFixRLEnv / MyEnv)
  • inference.py: baseline agent runner with OpenAI-compatible API
  • openenv.yaml: OpenEnv runtime metadata (app: server.app:app, port: 7860)

Runtime and Entry Points

  • Local server via project script:
    • uv run --project . server
  • Container command in Dockerfile:
    • uvicorn server.app:app --host 0.0.0.0 --port 7860
  • OpenEnv spec points to:
    • server.app:app
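
Given the fields listed for openenv.yaml, the relevant fragment of the spec presumably looks like the following (any keys beyond `app` and `port` are omitted here):

```yaml
app: server.app:app
port: 7860
```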

Environment Behavior (environment.py)

Action space:

  • VIEW_CODE
  • RUN_TESTS
  • REPLACE_LINES
  • UNDO_EDIT
  • RESET_TO_ORIGINAL
  • SUBMIT

Reward constants currently defined:

  • R_STEP_COST = -0.01
  • R_RUN_TESTS = +0.10
  • R_PER_NEW_PASS = +0.05
  • R_SYNTAX_ERROR = -0.10
  • R_INVALID_LINE = -0.02
  • R_DESTRUCTIVE_PENALTY = -0.20
  • R_UNDO_RESET = -0.10
  • MAX_STEPS = 50

Episode internals include:

  • code snapshotting (_original_code, _edit_history)
  • anti-loop penalty for repeated identical action_type
  • contextual anchor (_last_edited_line) for localized context
  • cumulative step-cost tracking (_accumulated_step_costs)
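
The episode bookkeeping above can be sketched as follows. This is an illustrative approximation, not the repo's actual class: the attribute names mirror the ones listed, but `ANTI_LOOP_PENALTY` is a hypothetical value (the notes do not record the real constant).

```python
ANTI_LOOP_PENALTY = -0.05  # hypothetical value; not a documented constant

class EpisodeState:
    """Sketch of the per-episode internals described above."""

    def __init__(self, original_code: str):
        self._original_code = original_code  # snapshot for RESET_TO_ORIGINAL
        self._edit_history = []              # stack consumed by UNDO_EDIT
        self._last_edited_line = None        # anchor for localized context
        self._accumulated_step_costs = 0.0   # summed into submit scoring
        self._last_action_type = None

    def anti_loop_penalty(self, action_type: str) -> float:
        """Extra penalty when the same action_type repeats back-to-back."""
        penalty = ANTI_LOOP_PENALTY if action_type == self._last_action_type else 0.0
        self._last_action_type = action_type
        return penalty
```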

Submit scoring model:

  • proportion = passing_tests / total_tests (or 0 on syntax error)
  • raw_score = proportion - _accumulated_step_costs
  • final_score = clamp(raw_score, 0.0, 1.0)
  • same clamp model used on max-step timeout auto-evaluation
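
The scoring model above reduces to a small function. Function and argument names here are assumptions; the arithmetic follows the documented formulas exactly (step costs are tracked as a positive sum and subtracted):

```python
def score_submission(passing: int, total: int,
                     accumulated_step_costs: float,
                     had_syntax_error: bool) -> float:
    """Clamped proportion-minus-step-cost scoring, per the notes above."""
    proportion = 0.0 if had_syntax_error or total == 0 else passing / total
    raw_score = proportion - accumulated_step_costs
    return max(0.0, min(1.0, raw_score))  # final_score = clamp(raw_score, 0, 1)
```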

Task sampling policy:

  • training_step == 0: random from ALL_TASKS
  • < 1000: easy
  • < 5000: medium
  • >= 5000: hard
  • fallback to first non-empty bucket
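
The schedule above can be sketched as a single function. The thresholds come from the notes; the registry shape (difficulty name mapped to a task list) is an assumption consistent with `TASKS_BY_DIFFICULTY`:

```python
import random

def sample_task(training_step: int, tasks_by_difficulty: dict, all_tasks: list):
    """Curriculum sampling: random at step 0, then easy -> medium -> hard."""
    if training_step == 0:
        return random.choice(all_tasks)
    if training_step < 1000:
        bucket = tasks_by_difficulty.get("easy", [])
    elif training_step < 5000:
        bucket = tasks_by_difficulty.get("medium", [])
    else:
        bucket = tasks_by_difficulty.get("hard", [])
    if not bucket:
        # Fallback: first non-empty bucket in registry order.
        bucket = next(b for b in tasks_by_difficulty.values() if b)
    return random.choice(bucket)
```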

Schema Notes (models.py)

Important: current code uses Pydantic v2-style validation APIs.

  • CodeAction uses @model_validator(mode="before")
  • Non-REPLACE_LINES actions force start_line, end_line, new_code_block to None
  • REPLACE_LINES enforces required fields and 1-indexed positive range constraints

This is not compatible with Pydantic v1-only assumptions.
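
The validator's effect can be sketched without Pydantic as a plain pre-validation function. Field names mirror the schema above; the exact checks in models.py may differ, so treat this as an approximation of the normalization logic, not the real validator:

```python
def normalize_action(data: dict) -> dict:
    """Approximate the @model_validator(mode='before') behavior on CodeAction."""
    data = dict(data)
    if data.get("action_type") != "REPLACE_LINES":
        # Non-edit actions must not carry edit fields.
        for field in ("start_line", "end_line", "new_code_block"):
            data[field] = None
        return data
    start, end = data.get("start_line"), data.get("end_line")
    if start is None or end is None or data.get("new_code_block") is None:
        raise ValueError("REPLACE_LINES requires start_line, end_line, new_code_block")
    if start < 1 or end < start:
        raise ValueError("line range must be 1-indexed with end_line >= start_line")
    return data
```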

Sandbox Notes (sandbox.py)

run_code_with_tests(...) returns a strict 3-tuple:

  • output_str
  • test_results: List[TestResult]
  • had_syntax_error: bool

Execution safeguards:

  • subprocess isolation via multiprocessing.Process
  • timeout terminate/kill path
  • tail truncation (MAX_OUTPUT_CHARS = 1000)
  • restricted builtins to block risky operations
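
Two of these safeguards are easy to sketch in isolation. The allowlist below is illustrative; sandbox.py's actual builtin set may differ, and only `MAX_OUTPUT_CHARS = 1000` is taken from the notes:

```python
MAX_OUTPUT_CHARS = 1000  # from the notes above

def truncate_tail(output: str) -> str:
    """Keep only the last MAX_OUTPUT_CHARS characters (tail truncation)."""
    return output[-MAX_OUTPUT_CHARS:]

# Hypothetical allowlist: anything not listed (open, __import__, ...) is gone.
SAFE_BUILTINS = {"len": len, "range": range, "print": print, "min": min, "max": max}

def run_restricted(source: str) -> dict:
    """Execute code with restricted builtins, returning the result namespace."""
    env = {"__builtins__": SAFE_BUILTINS}
    exec(source, env)  # e.g. open() raises NameError inside this namespace
    return env
```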

Tasks Registry (tasks.py)

  • Static hardcoded registry grouped by difficulty
  • Exports:
    • TASKS_BY_DIFFICULTY
    • ALL_TASKS
  • Expected total currently: 16 tasks
    • easy: 4
    • medium: 6
    • hard: 6
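
The registry shape implied above can be sketched with placeholder payloads (the real tasks carry code and tests; only the bucket counts come from the notes):

```python
# Placeholder tasks; real entries hold buggy code, tests, and metadata.
TASKS_BY_DIFFICULTY = {
    "easy":   [{"id": f"easy-{i}"} for i in range(4)],
    "medium": [{"id": f"medium-{i}"} for i in range(6)],
    "hard":   [{"id": f"hard-{i}"} for i in range(6)],
}

# Flattened view across all difficulty buckets.
ALL_TASKS = [t for bucket in TASKS_BY_DIFFICULTY.values() for t in bucket]
```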

OpenEnv Adapter and Client

server/tracefix_rl_environment.py:

  • Maps optional reset difficulty to training_step hints
  • Writes system_prompt into observation metadata
  • Sets observation reward/done from gym step output

client.py:

  • Sends actions using model_dump(exclude_none=True)
  • Parses OpenEnv payloads into typed CodeObservation

Inference Runner (inference.py)

Key defaults:

  • API_BASE_URL = https://router.huggingface.co/v1
  • MODEL_NAME = Qwen/Qwen2.5-72B-Instruct
  • MAX_STEPS = 50
  • SUCCESS_SCORE_THRESHOLD = 0.99
  • THINKING_TOKEN_LIMIT = 512

Behavior:

  • Logs in strict sequence: [START], repeated [STEP], then [END]
  • Uses JSON extraction fallback path from model text
  • Falls back to RUN_TESTS on parse or validation failure
  • Supports --easy, --medium, --hard, --debug
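
The extraction-with-fallback path can be sketched as below. The regex and helper name are assumptions; only the behavior (parse JSON out of free-form model text, fall back to RUN_TESTS on failure) comes from the notes:

```python
import json
import re

def extract_action(model_text: str) -> dict:
    """Pull a JSON action from model output, falling back to RUN_TESTS."""
    match = re.search(r"\{.*\}", model_text, re.DOTALL)
    if match:
        try:
            action = json.loads(match.group(0))
            if "action_type" in action:
                return action
        except json.JSONDecodeError:
            pass
    return {"action_type": "RUN_TESTS"}  # safe fallback on parse/validation failure
```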

Drift and Risk Notes

  1. requirements.txt currently pins pydantic==1.10.17, but code in models.py uses v2 APIs (model_validator).
  2. pyproject.toml is the active dependency source for uv sync; requirements.txt appears stale relative to runtime assumptions.
  3. environment.py defines R_SUBMIT_ALL_PASS and R_SUBMIT_FAIL, but submit currently uses clamped proportion-minus-step-cost scoring instead of those constants.
  4. server/tracefix_rl_environment.py advertises concurrent sessions support, while create_app(..., max_concurrent_envs=1) constrains server-level concurrency.

Practical Checklist Before Validation

  1. Confirm dependency source of truth (pyproject.toml vs requirements.txt) and align Pydantic version expectations.
  2. Re-run pre-validation and capture the first failing check/output.
  3. Remove tracked cache artifacts from version control if unintended (for example __pycache__/*.pyc).
  4. Keep stdout format in inference.py unchanged, as validator parsing depends on it.