---
title: Code Review Agent Environment
emoji: 🤖
colorFrom: green
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---
# Code Review Agent Environment
This repository provides an OpenEnv-compatible environment for evaluating AI code-review agents.
## Judge Summary

- OpenEnv validation: pass
- Tests: pass
- Docker build: pass
- Baseline reproduction: pass
- Live Space health/reset: pass
Evidence:
## Why This Environment
Code review is a strong RL task because success and failure are measurable: line-level issues can be deterministically graded, rewards can be shaped across review phases, and tasks can scale from easy to hard while staying realistic.
This project is designed for both evaluation and lightweight policy training loops, not only one-off scripted inference.
The agent receives a code diff and surrounding file context, then performs a multi-step review:
- Add issue comments with line numbers.
- Suggest code fixes.
- Make a final decision (`approved` or `changes_requested`).
The environment scores the review quality using deterministic graders.
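Conceptually, an episode is a short sequence of typed actions followed by deterministic grading. The sketch below illustrates that shape with plain dicts; the field names (`type`, `line`, `message`, `patch`, `decision`) are assumptions for illustration only, and the real typed models live in `environment/models.py`:

```python
# Illustrative action records for one review episode. Field names are
# assumptions for this sketch, not the environment's actual schema.
review_actions = [
    {"type": "add_comment", "line": 42, "message": "Possible None dereference."},
    {"type": "suggest_fix", "line": 42, "patch": "if obj is not None:\n    obj.run()"},
    {"type": "decide", "decision": "changes_requested"},
]

def final_decision(actions):
    """Return the decision from the last 'decide' action, if any."""
    decisions = [a["decision"] for a in actions if a["type"] == "decide"]
    return decisions[-1] if decisions else None

print(final_decision(review_actions))  # changes_requested
```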
## What This Project Does
- Simulates pull-request review tasks across easy/medium/hard difficulty.
- Exposes OpenEnv-style lifecycle methods (`reset`, `step`, `state`).
- Exposes integration endpoints (`tasks`, `score`, `health`) for tooling and dashboard checks.
- Grades issue detection, fix suggestions, and final decision quality.
- Supports local LLM providers via an OpenAI-compatible API (including Ollama).
- Includes a policy-training scaffold (`train.py`, `train_env.py`) and logged training metrics.
## Project Structure

- `environment/`: environment implementation, task definitions, models, and grading logic.
- `inference.py`: baseline review agent loop.
- `train.py`, `train_env.py`: lightweight PPO-style policy training loop over the environment.
- `ppo_logs/`: training metrics and summaries.
- `openenv.yaml`: task registry and environment metadata.
- `tests/`: environment tests.
- `explore_env.ipynb`: interactive environment walkthrough.
- `docker-compose.yml` / `Dockerfile`: containerized execution options.
## Prerequisites
- Python 3.10+
- macOS/Linux shell or PowerShell equivalent
- Optional: Docker Desktop
- Optional: Ollama for local model inference
## Local Setup

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
## Required Environment Variables

The baseline uses OpenAI-compatible endpoints.

- `API_BASE_URL` (required)
- `MODEL_NAME` (required)
- `HF_TOKEN` (preferred auth var)

Supported auth aliases:

- `OPENAI_API_KEY`
- `API_KEY`
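A minimal resolution sketch for the auth aliases, assuming `HF_TOKEN` takes precedence (the exact precedence order is an assumption based on "preferred auth var"; check `inference.py` for the real behavior):

```python
import os

def resolve_api_key(env=os.environ):
    """Return the first configured auth variable, preferring HF_TOKEN.
    The precedence order below is an assumption for this sketch."""
    for var in ("HF_TOKEN", "OPENAI_API_KEY", "API_KEY"):
        value = env.get(var)
        if value:
            return value
    raise RuntimeError("Set HF_TOKEN (or OPENAI_API_KEY / API_KEY).")

print(resolve_api_key({"OPENAI_API_KEY": "sk-test"}))  # sk-test
```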
## Run Methods

### 1) Run Unit Tests

```bash
source .venv/bin/activate
pytest tests/test_env.py -q
```
### 2) Validate OpenEnv Package

```bash
source .venv/bin/activate
openenv validate
```
### 3) Run Baseline Agent (Single Task)

```bash
source .venv/bin/activate
export API_BASE_URL=http://localhost:11434/v1
export MODEL_NAME=qwen3.5:latest
export HF_TOKEN=not-needed
export TEMPERATURE=0.0
export REQUEST_TIMEOUT=180

python inference.py \
  --task-id bug_detection_easy_1 \
  --max-steps 10 \
  --output baseline_results.json
```
### 4) Run All Tasks (Local Sweep)

```bash
source .venv/bin/activate
export API_BASE_URL=http://localhost:11434/v1
export MODEL_NAME=qwen3.5:latest
export HF_TOKEN=not-needed
export TEMPERATURE=0.0
export REQUEST_TIMEOUT=180

for task in \
  bug_detection_easy_1 \
  bug_detection_easy_2 \
  approve_easy_3 \
  memory_leak_medium_1 \
  performance_medium_2 \
  approve_medium_3 \
  security_hard_1 \
  race_condition_hard_2 \
  approve_hard_3
do
  python inference.py --task-id "$task" --max-steps 10 --output "baseline_${task}.json"
done
```
### 5) Docker Build and Run

```bash
docker build -t code-review-env .

docker run --rm \
  -e API_BASE_URL=http://host.docker.internal:11434/v1 \
  -e MODEL_NAME=qwen3.5:latest \
  -e HF_TOKEN=not-needed \
  -e TEMPERATURE=0.0 \
  -e REQUEST_TIMEOUT=180 \
  code-review-env \
  --task-id bug_detection_easy_1
```
### 6) Docker Compose Services

```bash
docker compose run --rm openai-agent
docker compose run --rm gemini-agent
docker compose run --rm groq-agent
docker compose run --rm local-agent
```

Note: on macOS, `network_mode: host` can be unreliable. If `local-agent` cannot reach Ollama, use `host.docker.internal` in the service environment.
## Available Task IDs

- `bug_detection_easy_1`
- `bug_detection_easy_2`
- `approve_easy_3`
- `memory_leak_medium_1`
- `performance_medium_2`
- `approve_medium_3`
- `type_safety_medium_4`
- `javascript_medium_5`
- `security_hard_1`
- `race_condition_hard_2`
- `approve_hard_3`
- `adversarial_hard_4`
- `concurrency_hard_5`
- `dependency_injection_hard_6`
## HTTP Endpoints

- `GET /`
- `GET /health`
- `GET /tasks`
- `GET|POST /reset`
- `POST /step`
- `GET /state`
- `GET /score`
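As a sketch of driving the lifecycle over HTTP, requests to `/reset` and `/step` might be built as below. The base URL matches the Space's `app_port`, but the payload field names are assumptions for illustration; consult the server for the actual schema:

```python
import json

BASE_URL = "http://localhost:7860"  # matches app_port in the Space config

def build_request(path, payload):
    """Build (url, body, headers) for a JSON POST to the environment.
    Payload field names here are illustrative assumptions."""
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return BASE_URL + path, body, headers

url, body, headers = build_request("/reset", {"task_id": "bug_detection_easy_1"})
print(url)  # http://localhost:7860/reset

# Against a running server, send it with urllib.request:
# req = urllib.request.Request(url, data=body, headers=headers)
# obs = json.load(urllib.request.urlopen(req))
```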
## Output Format

Each inference run writes JSON like:

```json
{
  "task_id": "bug_detection_easy_1",
  "total_reward": 0.78,
  "task_score": 1.0,
  "steps": 3,
  "max_steps": 10,
  "provider": "openai-client",
  "model": "qwen3.5:latest",
  "api_base_url": "http://localhost:11434/v1"
}
```
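A small helper for aggregating several such outputs (e.g. after the local sweep above) could use the `task_score` and `total_reward` fields directly; the sample dicts below stand in for `json.load`-ed result files:

```python
def summarize(results):
    """Average task_score and total_reward across inference output dicts."""
    n = len(results)
    avg_score = sum(r["task_score"] for r in results) / n
    avg_reward = sum(r["total_reward"] for r in results) / n
    return {
        "runs": n,
        "avg_task_score": round(avg_score, 4),
        "avg_total_reward": round(avg_reward, 4),
    }

# In practice these come from json.load(open("baseline_<task>.json")).
sample = [
    {"task_id": "bug_detection_easy_1", "task_score": 1.0, "total_reward": 0.78},
    {"task_id": "approve_easy_3", "task_score": 0.5, "total_reward": 0.40},
]
print(summarize(sample))  # {'runs': 2, 'avg_task_score': 0.75, 'avg_total_reward': 0.59}
```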
## Notes On Baseline Stability

- Local models can time out on long prompts.
- The baseline enforces phased review behavior and falls back to deterministic actions when the model is temporarily unavailable.
- For reproducible runs, keep `TEMPERATURE=0.0`.
## Fast Start (3 Commands)

```bash
source .venv/bin/activate
pytest -q
python submit.py --skip-docker --max-steps 10
```
## Judge Map (Criterion -> Evidence)
| Criterion | Evidence | File |
|---|---|---|
| OpenEnv lifecycle compliance | `reset`/`step`/`state` implemented and served over HTTP | `environment/env.py`, `server/app.py` |
| Typed models | Pydantic action/state/observation models | `environment/models.py` |
| Task difficulty progression | easy/medium/hard tasks + calibration approve tasks | `environment/tasks.py` |
| Grading quality | detection/suggestion/decision + partial credit + FP penalty + efficiency bonus | `environment/graders.py` |
| Baseline reproducibility | deterministic seed support in reset + inference output metadata | `environment/env.py`, `inference.py` |
| Submission validation | Python preflight + bash validator script | `submit.py`, `scripts/validate-submission.sh` |
## Grader Rubric (Summary)
| Component | Weight / Effect | Notes |
|---|---|---|
| Detection score | 0.4 | Partial credit for near-line matches |
| Suggestion score | 0.3 | Line-proximity matching for fixes |
| Decision score | 0.3 | Approve for no-issue tasks, request_changes otherwise |
| False positive penalty | up to -0.4 | Strong penalty for issue spam |
| Efficiency bonus | up to +0.1 | Bonus for completing in fewer steps |
| Final score clamp | [0,1] | Safety clamp in grader |
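The rubric above composes roughly as follows. This is a simplified sketch of the weighted sum and clamp described in the table; the real grader in `environment/graders.py` is authoritative:

```python
def composite_score(detection, suggestion, decision,
                    fp_penalty=0.0, efficiency_bonus=0.0):
    """Weighted sum per the rubric table, clamped to [0, 1].
    Component scores are in [0, 1]; fp_penalty up to 0.4, bonus up to 0.1."""
    raw = (0.4 * detection + 0.3 * suggestion + 0.3 * decision
           - fp_penalty + efficiency_bonus)
    return max(0.0, min(1.0, raw))

print(composite_score(1.0, 1.0, 1.0, efficiency_bonus=0.1))  # clamps to 1.0
print(composite_score(0.5, 0.0, 1.0, fp_penalty=0.4))        # penalized review
```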
## Benchmark Snapshot (3-Task Local Run)
| Task | Task Score | Total Reward | Model |
|---|---|---|---|
| bug_detection_easy_1 | 1.000 | 1.410 | meta/llama-3.3-70b-instruct |
| memory_leak_medium_1 | 0.875 | 1.285 | meta/llama-3.3-70b-instruct |
| security_hard_1 | 1.000 | 1.410 | meta/llama-3.3-70b-instruct |
Note: `task_score` is normalized to [0,1]. `total_reward` is cumulative step reward and can exceed 1.0 by design.
## Training Results (PPO-style Loop)

Run training:

```bash
source .venv/bin/activate
python train.py --episodes 120 --max-steps 5
```

Generated artifacts:

- `ppo_logs/train_metrics.csv`
- `ppo_logs/summary.txt`
Recent run summary:

- Episodes: 120
- Average reward (first 10): 0.0100
- Average reward (last 10): 0.5100
- Improvement: +0.5000
This demonstrates measurable policy improvement under the training setup provided in this repository.
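The improvement figure can be recomputed from `ppo_logs/train_metrics.csv` as a first-versus-last window average. The column name `reward` is an assumption in this sketch; check the CSV header in your run:

```python
import csv
import io

def improvement(csv_text, window=10, column="reward"):
    """First-vs-last window average of a per-episode reward column."""
    rewards = [float(row[column]) for row in csv.DictReader(io.StringIO(csv_text))]
    first = sum(rewards[:window]) / window
    last = sum(rewards[-window:]) / window
    return round(last - first, 4)

# Tiny synthetic log standing in for ppo_logs/train_metrics.csv:
log = "reward\n" + "\n".join(["0.0"] * 10 + ["0.5"] * 10)
print(improvement(log))  # 0.5
```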
## One-Command Benchmark Table

Generate per-task JSON outputs plus a markdown table for judge submission:

```bash
source .venv/bin/activate
python scripts/run_benchmark.py --max-steps 10
```

Artifacts:

- `outputs/benchmark_<task_id>.json`
- `outputs/benchmark_table.md`
## Failure Analysis Template

### `javascript_medium_5` (Undefined access)

- Observation: task score reached `1.0`, but diagnostics show `precision=0.5`, `recall=1.0`, `f1=0.6667`, `false_positive_count=1`.
- Why: the model used Python-centric heuristics and produced one extra issue comment on a JS snippet.
- Action: added a JavaScript task category and retained false-positive penalties to expose over-flagging.

### `memory_leak_medium_1` (historical baseline run)

- Observation: an earlier run dropped below a perfect score due to a noisy comment strategy.
- Why: over-commenting triggered false-positive penalties despite finding the core issue.
- Action: added an anti-loop repeated-comment penalty plus adversarial no-issue tasks to discourage spam.

### `adversarial_hard_4` (Safe SQL task)

- Observation: the correct behavior is to approve; naive SQL keyword matching causes false alarms.
- Why: keyword-only review policies confuse parameterized SQL with vulnerable string interpolation.
- Action: included an explicit no-issue adversarial task in the hard set and calibration tests to reward restraint.