---
title: TraceFix-RL
emoji: 🧑‍💻
colorFrom: blue
sdk: docker
pinned: false
app_port: 7860
base_path: /web
tags:
  - openenv
  - reinforcement-learning
  - software-engineering
---
# TraceFix-RL
TraceFix-RL is an OpenEnv-compatible environment designed to teach agent behavior that looks like real software engineering work. Instead of one-shot answers, the agent must inspect code, form a hypothesis, run tests, patch the code, verify outcomes, and only then submit. The loop rewards disciplined debugging and penalizes random edits, forcing the model to learn an engineering workflow.
## Core Design

- Action space: `VIEW_CODE`, `RUN_TESTS`, `REPLACE_LINES`, `UNDO_EDIT`, `RESET_TO_ORIGINAL`, `SUBMIT`
- Observations: the full code snapshot, localized edit context, execution output, syntax status, and per-test outcomes (see the schema sketch after this list).
- Dense Rewards: a `RUN_TESTS` bonus, a per-test progress bonus, a step-cost penalty, invalid-edit penalties, and a final clamped score bounded within `[0.01, 0.98]`.
- Curriculum-ready Tasks: Easy, Medium, and Hard buckets that the OpenEnv trainer can sequence, plus a random fallback for evaluators.
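As a rough illustration of these payloads, a minimal Pydantic sketch might look like the following. The field names are assumptions for illustration only; the authoritative `CodeAction` and `CodeObservation` schemas live in `models.py`.

```python
# Illustrative sketch only -- field names are assumptions, not the
# project's actual definitions (see models.py for those).
from enum import Enum
from pydantic import BaseModel


class ActionType(str, Enum):
    VIEW_CODE = "VIEW_CODE"
    RUN_TESTS = "RUN_TESTS"
    REPLACE_LINES = "REPLACE_LINES"
    UNDO_EDIT = "UNDO_EDIT"
    RESET_TO_ORIGINAL = "RESET_TO_ORIGINAL"
    SUBMIT = "SUBMIT"


class CodeAction(BaseModel):
    action_type: ActionType
    # Hypothetical edit fields, meaningful only for REPLACE_LINES:
    start_line: int | None = None
    end_line: int | None = None
    new_code: str | None = None


class CodeObservation(BaseModel):
    code: str                      # full code snapshot
    edit_context: str              # localized region around the last edit
    execution_output: str          # stdout/stderr from the last run
    syntax_ok: bool                # syntax status after the edit
    test_results: dict[str, bool]  # per-test pass/fail outcomes
```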
## State Machine Training Pattern

The environment prompt in `environment.py` encodes a strict operating pattern the agent is expected to follow:

- ORIENT: Inspect code (`VIEW_CODE`)
- DIAGNOSE: Run tests and read failures (`RUN_TESTS`)
- FIX: Patch one localized region (`REPLACE_LINES`)
- VERIFY: Rerun tests (`RUN_TESTS`)
- REPEAT: Continue until all failures are resolved
- SUBMIT: Finalize only after tests pass
This sequence naturally guides reinforcement learning toward robust planning, controlled editing, and verification behavior.
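A minimal driver that follows this pattern could look like the sketch below. Here `env`, the action dictionaries, and `propose_patch` are illustrative assumptions rather than the project's actual interfaces; the real loop lives in `environment.py` and `inference.py`.

```python
# Sketch of the ORIENT -> DIAGNOSE -> FIX -> VERIFY -> SUBMIT loop.
# `env`, `propose_patch`, and the observation keys are assumptions.

def debug_episode(env, propose_patch, max_steps: int = 20) -> None:
    env.reset()
    env.step({"action_type": "VIEW_CODE"})                # ORIENT
    obs = env.step({"action_type": "RUN_TESTS"})          # DIAGNOSE

    for _ in range(max_steps):
        results = obs["test_results"]
        if results and all(results.values()):
            env.step({"action_type": "SUBMIT"})           # SUBMIT only once green
            return
        patch = propose_patch(obs)                        # FIX: one localized edit
        env.step({"action_type": "REPLACE_LINES", **patch})
        obs = env.step({"action_type": "RUN_TESTS"})      # VERIFY, then REPEAT
```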
## Task Tiers And Test Structure

The registry in `tasks.py` is a static, curated set of coding challenges (16 tasks total):

- Easy (4 tasks): basic operators, indexing, and simple string/array logic.
- Medium (6 tasks): recursive behavior, branching correctness, and text-normalization edge cases.
- Hard (6 tasks): data-structure invariants, bracket mapping, interval merging, and eviction logic.

Every task contains: `name`, `description`, `difficulty`, `bug_type`, `code` (the buggy implementation), `solution`, and executable `tests`. All tests run safely inside isolated sandboxes via `sandbox.py` using `multiprocessing`.
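For a concrete picture, one registry entry might be shaped roughly like this. The field names come from the list above and `binary_search_off_by_one` appears in the baseline table below; all of the values are hypothetical placeholders.

```python
# Hypothetical shape of one entry in the tasks.py registry.
TASK = {
    "name": "binary_search_off_by_one",
    "description": "Binary search returns the wrong index near boundaries.",
    "difficulty": "easy",
    "bug_type": "off_by_one",
    "code": "def binary_search(xs, target): ...",      # buggy implementation
    "solution": "def binary_search(xs, target): ...",  # reference fix
    "tests": [
        "assert binary_search([1, 2, 3], 3) == 2",
        "assert binary_search([1, 2, 3], 0) == -1",
    ],
}
```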
## Tech Stack & Project Files

This environment enforces strict typing and uses standard modern tooling:

- uv: handles dependency management (see `pyproject.toml`).
- FastAPI: provides the `server.app` integration layer for OpenEnv compliance.
- Pydantic (v2): provides strong validation layers for `models.py` (e.g., `CodeAction`, `CodeObservation`).
- OpenEnv config: see `openenv.yaml`, which specifies `tracefix_rl` to run the FastAPI app on port `7860`.
File Layout:

- `models.py` / `context.py`: domain and schema logic.
- `tasks.py`: task metadata definitions.
- `sandbox.py`: subprocess runtime and output tracking.
- `environment.py`: the core RL reset/step/reward loop (`TraceFixRLGym`).
- `server/tracefix_rl_environment.py` / `server/app.py`: map the OpenEnv network interface onto the core environment.
- `inference.py`: baseline OpenAI-client inference script for evaluating agents.
## Local Development

You must install `uv` on your system.

```bash
# Sync dependencies
uv sync

# Run the OpenEnv server on port 7860
uv run --project . server
```
The server exposes the following endpoints (a smoke-test sketch follows the list):

- `POST /reset`
- `POST /step`
- `GET /health`
- `WS /ws`
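Something like the following should exercise a locally running server; the request and response payload shapes are assumptions here, so consult `server/app.py` and the OpenEnv spec for the authoritative wire format.

```python
# Hedged smoke test for a local server on port 7860.
# The action envelope is an illustrative assumption, not the verified schema.
import requests

BASE = "http://localhost:7860"

assert requests.get(f"{BASE}/health", timeout=10).ok

obs = requests.post(f"{BASE}/reset", json={}, timeout=30).json()
print("initial observation:", obs)

result = requests.post(
    f"{BASE}/step",
    json={"action": {"action_type": "RUN_TESTS"}},  # assumed payload shape
    timeout=60,
).json()
print("step result:", result)
```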
## Baseline Scores

Baseline scores are intended to be recorded from the bundled `inference.py` runner against the three validator tasks. The current environment intentionally clamps scores into the interval `[0.01, 0.98]`, so benchmark output should be reported with that convention in mind.
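Conceptually, the final clamp amounts to something like this sketch (the actual reward shaping lives in `environment.py`):

```python
# Sketch of the final score clamp into [0.01, 0.98].
def clamp_score(raw: float) -> float:
    return max(0.01, min(0.98, raw))
```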
| Task | Baseline Score |
|---|---|
| `valid_parentheses_wrong_mapping` | Pending first benchmark run |
| `binary_search_off_by_one` | Pending first benchmark run |
| `reverse_string_returns_original` | Pending first benchmark run |
## Docker + Hugging Face Spaces Deployment

The Space runs via Docker. The container is configured to run as a non-root `appuser` (UID 1000) for Spaces compliance.
### Testing Locally in Docker

```bash
docker build -t tracefix-rl:test -f Dockerfile .
docker run --rm -p 7860:7860 tracefix-rl:test
```
### Deploy to Hugging Face Spaces

This project uses the OpenEnv CLI for seamless Hugging Face Space deployments.

```bash
# Push directly to your specified HF Space
openenv push
```
### Server Pre-validation

Before committing to training, you can validate your deployed server or local Space:

```bash
bash ./pre-val.sh https://<your-space>.hf.space .
```
## Inference & Evaluation (`inference.py`)

The baseline inference runner evaluates agents against the environment using an OpenAI-compatible interface.

Requirements for inference:

- `API_BASE_URL` (defaults to `https://router.huggingface.co/v1`)
- `MODEL_NAME` (defaults to `Qwen/Qwen2.5-72B-Instruct`)
- `HF_TOKEN`
Usage flags:

- `--easy`, `--medium`, `--hard`: lock the environment to a specific task bucket.
- `--thought`: send `<thought>` token blocks back in the payload to train chain-of-thought capabilities.
Example run on medium tasks with thought tracking:

```bash
python inference.py --medium --thought
```
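Under the hood, the runner boils down to querying an OpenAI-compatible endpoint and feeding the chosen actions back into the environment. A heavily simplified sketch, where the prompt content and action parsing are assumptions:

```python
# Hedged sketch of the baseline runner's core call. The prompt content and
# how an action is parsed from the reply are assumptions; see inference.py
# for the actual loop.
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("API_BASE_URL", "https://router.huggingface.co/v1"),
    api_key=os.environ["HF_TOKEN"],
)

response = client.chat.completions.create(
    model=os.environ.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct"),
    messages=[
        {"role": "system", "content": "You are debugging code in TraceFix-RL."},
        {"role": "user", "content": "<rendered observation goes here>"},
    ],
)
print(response.choices[0].message.content)  # an action is parsed from this text
```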