---
title: TraceFix-RL
emoji: 🧑‍💻
colorFrom: blue
sdk: docker
pinned: false
app_port: 7860
base_path: /web
tags:
  - openenv
  - reinforcement-learning
  - software-engineering
---
# TraceFix-RL
TraceFix-RL is an OpenEnv-compatible environment designed to teach agent behavior that looks like real software engineering work. Instead of one-shot answers, the agent must inspect code, form a hypothesis, run tests, patch the code, verify outcomes, and only then submit. The loop rewards disciplined debugging and penalizes random edits, forcing the model to learn an engineering workflow.
## Core Design

- Action space: `VIEW_CODE`, `RUN_TESTS`, `REPLACE_LINES`, `UNDO_EDIT`, `RESET_TO_ORIGINAL`, `SUBMIT`
- Observations: the full code snapshot, localized edit context, execution output, syntax status, and per-test outcomes (see the schema sketch after this list).
- Dense Rewards: a `RUN_TESTS` bonus, a per-test progress bonus, a step-cost penalty, invalid-edit penalties, and a final clamped score bounded within `[0.01, 0.98]`.
- Curriculum-ready Tasks: Easy, Medium, and Hard buckets that the OpenEnv trainer can sequence, plus a random fallback for evaluators.
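As a rough illustration of these payloads, a minimal Pydantic sketch might look like the following. The field names are assumptions for illustration only; the authoritative `CodeAction` and `CodeObservation` schemas live in `models.py`.

```python
# Illustrative sketch only -- field names are assumptions, not the
# project's actual definitions (see models.py for those).
from enum import Enum
from pydantic import BaseModel


class ActionType(str, Enum):
    VIEW_CODE = "VIEW_CODE"
    RUN_TESTS = "RUN_TESTS"
    REPLACE_LINES = "REPLACE_LINES"
    UNDO_EDIT = "UNDO_EDIT"
    RESET_TO_ORIGINAL = "RESET_TO_ORIGINAL"
    SUBMIT = "SUBMIT"


class CodeAction(BaseModel):
    action_type: ActionType
    # Hypothetical edit fields, meaningful only for REPLACE_LINES:
    start_line: int | None = None
    end_line: int | None = None
    new_code: str | None = None


class CodeObservation(BaseModel):
    code: str                      # full code snapshot
    edit_context: str              # localized region around the last edit
    execution_output: str          # stdout/stderr from the last run
    syntax_ok: bool                # syntax status after the edit
    test_results: dict[str, bool]  # per-test pass/fail outcomes
```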
## State Machine Training Pattern

The environment prompt in `environment.py` encodes a strict operating pattern the agent is expected to follow:

- ORIENT: Inspect code (`VIEW_CODE`)
- DIAGNOSE: Run tests and read failures (`RUN_TESTS`)
- FIX: Patch one localized region (`REPLACE_LINES`)
- VERIFY: Rerun tests (`RUN_TESTS`)
- REPEAT: Continue until all failures are resolved
- SUBMIT: Finalize only after tests pass
This sequence naturally guides reinforcement learning toward robust planning, controlled editing, and verification behavior.
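A minimal driver that follows this pattern could look like the sketch below. Here `env`, the action dictionaries, and `propose_patch` are illustrative assumptions rather than the project's actual interfaces; the real loop lives in `environment.py` and `inference.py`.

```python
# Sketch of the ORIENT -> DIAGNOSE -> FIX -> VERIFY -> SUBMIT loop.
# `env`, `propose_patch`, and the observation keys are assumptions.

def debug_episode(env, propose_patch, max_steps: int = 20) -> None:
    env.reset()
    env.step({"action_type": "VIEW_CODE"})                # ORIENT
    obs = env.step({"action_type": "RUN_TESTS"})          # DIAGNOSE

    for _ in range(max_steps):
        results = obs["test_results"]
        if results and all(results.values()):
            env.step({"action_type": "SUBMIT"})           # SUBMIT only once green
            return
        patch = propose_patch(obs)                        # FIX: one localized edit
        env.step({"action_type": "REPLACE_LINES", **patch})
        obs = env.step({"action_type": "RUN_TESTS"})      # VERIFY, then REPEAT
```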
## Task Tiers And Test Structure

The registry in `tasks.py` is a static, curated set of coding challenges (16 tasks total):

- Easy (4 tasks): basic operators, indexing, and simple string/array logic.
- Medium (6 tasks): recursive behavior, branching correctness, and text-normalization edge cases.
- Hard (6 tasks): data-structure invariants, bracket mapping, interval merging, and eviction logic.

Every task contains: `name`, `description`, `difficulty`, `bug_type`, `code` (the buggy implementation), `solution`, and executable `tests`. All tests run safely inside isolated sandboxes via `sandbox.py` using `multiprocessing`.
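For a concrete picture, one registry entry might be shaped roughly like this. The field names come from the list above and `binary_search_off_by_one` appears in the baseline table below; all of the values are hypothetical placeholders.

```python
# Hypothetical shape of one entry in the tasks.py registry.
TASK = {
    "name": "binary_search_off_by_one",
    "description": "Binary search returns the wrong index near boundaries.",
    "difficulty": "easy",
    "bug_type": "off_by_one",
    "code": "def binary_search(xs, target): ...",      # buggy implementation
    "solution": "def binary_search(xs, target): ...",  # reference fix
    "tests": [
        "assert binary_search([1, 2, 3], 3) == 2",
        "assert binary_search([1, 2, 3], 0) == -1",
    ],
}
```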
## Tech Stack & Project Files

This environment enforces strict typing and uses standard modern tooling:

- uv: handles dependency management (see `pyproject.toml`).
- FastAPI: provides the `server.app` integration layer for OpenEnv compliance.
- Pydantic (v2): provides strong validation layers for `models.py` (e.g., `CodeAction`, `CodeObservation`).
- OpenEnv config: see `openenv.yaml`, which specifies `tracefix_rl` to run the FastAPI app on port `7860`.
File Layout:

- `models.py` / `context.py`: domain and schema logic.
- `tasks.py`: task metadata definitions.
- `sandbox.py`: subprocess runtime and output tracking.
- `environment.py`: the core RL reset/step/reward loop (`TraceFixRLGym`).
- `server/tracefix_rl_environment.py` / `server/app.py`: map the OpenEnv network interface onto the core environment.
- `inference.py`: baseline OpenAI-client inference script for evaluating agents.
## Local Development

You must install `uv` on your system.

```bash
# Sync dependencies
uv sync

# Run the OpenEnv server on port 7860
uv run --project . server
```
The server exposes the following endpoints (a smoke-test sketch follows the list):

- `POST /reset`
- `POST /step`
- `GET /health`
- `WS /ws`
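Something like the following should exercise a locally running server; the request and response payload shapes are assumptions here, so consult `server/app.py` and the OpenEnv spec for the authoritative wire format.

```python
# Hedged smoke test for a local server on port 7860.
# The action envelope is an illustrative assumption, not the verified schema.
import requests

BASE = "http://localhost:7860"

assert requests.get(f"{BASE}/health", timeout=10).ok

obs = requests.post(f"{BASE}/reset", json={}, timeout=30).json()
print("initial observation:", obs)

result = requests.post(
    f"{BASE}/step",
    json={"action": {"action_type": "RUN_TESTS"}},  # assumed payload shape
    timeout=60,
).json()
print("step result:", result)
```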
## Baseline Scores

Baseline scores are intended to be recorded from the bundled `inference.py` runner against the three validator tasks. The current environment intentionally clamps scores into the interval `[0.01, 0.98]`, so benchmark output should be reported with that convention in mind.
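Conceptually, the final clamp amounts to something like this sketch (the actual reward shaping lives in `environment.py`):

```python
# Sketch of the final score clamp into [0.01, 0.98].
def clamp_score(raw: float) -> float:
    return max(0.01, min(0.98, raw))
```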
| Task | Baseline Score |
|---|---|
| `valid_parentheses_wrong_mapping` | Pending first benchmark run |
| `binary_search_off_by_one` | Pending first benchmark run |
| `reverse_string_returns_original` | Pending first benchmark run |
## Docker + Hugging Face Spaces Deployment

The Space runs via Docker. The container is configured to run as a non-root `appuser` (UID 1000) for Spaces compliance.
### Testing Locally in Docker

```bash
docker build -t tracefix-rl:test -f Dockerfile .
docker run --rm -p 7860:7860 tracefix-rl:test
```
### Deploy to Hugging Face Spaces

This project uses the OpenEnv CLI for seamless Hugging Face Space deployments.

```bash
# Push directly to your specified HF Space
openenv push
```
### Server Pre-validation

Before committing to training, you can validate your deployed server or local Space:

```bash
bash ./pre-val.sh https://<your-space>.hf.space .
```
## Inference & Evaluation (`inference.py`)

The baseline inference runner evaluates agents against the environment using an OpenAI-compatible interface.

Requirements for inference:

- `API_BASE_URL` (defaults to `https://router.huggingface.co/v1`)
- `MODEL_NAME` (defaults to `Qwen/Qwen2.5-72B-Instruct`)
- `HF_TOKEN`
Usage flags:

- `--easy`, `--medium`, `--hard`: lock the environment to a specific task bucket.
- `--thought`: send `<thought>` token blocks back in the payload to train chain-of-thought capabilities.
Example run on medium tasks with thought tracking:

```bash
python inference.py --medium --thought
```
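Under the hood, the runner boils down to querying an OpenAI-compatible endpoint and feeding the chosen actions back into the environment. A heavily simplified sketch, where the prompt content and action parsing are assumptions:

```python
# Hedged sketch of the baseline runner's core call. The prompt content and
# how an action is parsed from the reply are assumptions; see inference.py
# for the actual loop.
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("API_BASE_URL", "https://router.huggingface.co/v1"),
    api_key=os.environ["HF_TOKEN"],
)

response = client.chat.completions.create(
    model=os.environ.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct"),
    messages=[
        {"role": "system", "content": "You are debugging code in TraceFix-RL."},
        {"role": "user", "content": "<rendered observation goes here>"},
    ],
)
print(response.choices[0].message.content)  # an action is parsed from this text
```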