---
title: Code Review Agent Environment
emoji: 🤖
colorFrom: green
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---
# Code Review Agent Environment
This repository provides an OpenEnv-compatible environment for evaluating AI code-review agents.
## Judge Summary

- OpenEnv validation: pass
- Tests: pass
- Docker build: pass
- Baseline reproduction: pass
- Live Space health/reset: pass
Evidence:
## Why This Environment
Code review is a strong RL task because success and failure are measurable: line-level issues can be deterministically graded, rewards can be shaped across review phases, and tasks can scale from easy to hard while staying realistic.
This project is designed for both evaluation and lightweight policy training loops, not only one-off scripted inference.
The agent receives a code diff and surrounding file context, then performs a multi-step review:
- Add issue comments with line numbers.
- Suggest code fixes.
- Make a final decision (`approved` or `changes_requested`).
The environment scores the review quality using deterministic graders.
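Conceptually, an episode is a short sequence of typed actions followed by deterministic grading. The sketch below illustrates that shape with plain dicts; the field names (`type`, `line`, `message`, `patch`, `decision`) are assumptions for illustration only, and the real typed models live in `environment/models.py`:

```python
# Illustrative action records for one review episode. Field names are
# assumptions for this sketch, not the environment's actual schema.
review_actions = [
    {"type": "add_comment", "line": 42, "message": "Possible None dereference."},
    {"type": "suggest_fix", "line": 42, "patch": "if obj is not None:\n    obj.run()"},
    {"type": "decide", "decision": "changes_requested"},
]

def final_decision(actions):
    """Return the decision from the last 'decide' action, if any."""
    decisions = [a["decision"] for a in actions if a["type"] == "decide"]
    return decisions[-1] if decisions else None

print(final_decision(review_actions))  # changes_requested
```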
## What This Project Does
- Simulates pull-request review tasks across easy/medium/hard difficulty.
- Exposes OpenEnv-style lifecycle methods (`reset`, `step`, `state`).
- Exposes integration endpoints (`tasks`, `score`, `health`) for tooling and dashboard checks.
- Grades issue detection, fix suggestions, and final decision quality.
- Supports local LLM providers via an OpenAI-compatible API (including Ollama).
- Includes a policy-training scaffold (`train.py`, `train_env.py`) and logged training metrics.
## Project Structure

- `environment/`: environment implementation, task definitions, models, and grading logic.
- `inference.py`: baseline review agent loop.
- `train.py`, `train_env.py`: lightweight PPO-style policy training loop over the environment.
- `ppo_logs/`: training metrics and summaries.
- `openenv.yaml`: task registry and environment metadata.
- `tests/`: environment tests.
- `explore_env.ipynb`: interactive environment walkthrough.
- `docker-compose.yml` / `Dockerfile`: containerized execution options.
## Prerequisites
- Python 3.10+
- macOS/Linux shell or PowerShell equivalent
- Optional: Docker Desktop
- Optional: Ollama for local model inference
## Local Setup

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
## Required Environment Variables

The baseline uses OpenAI-compatible endpoints.

- `API_BASE_URL` (required)
- `MODEL_NAME` (required)
- `HF_TOKEN` (preferred auth var)

Supported auth aliases:

- `OPENAI_API_KEY`
- `API_KEY`
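A minimal resolution sketch for the auth aliases, assuming `HF_TOKEN` takes precedence (the exact precedence order is an assumption based on "preferred auth var"; check `inference.py` for the real behavior):

```python
import os

def resolve_api_key(env=os.environ):
    """Return the first configured auth variable, preferring HF_TOKEN.
    The precedence order below is an assumption for this sketch."""
    for var in ("HF_TOKEN", "OPENAI_API_KEY", "API_KEY"):
        value = env.get(var)
        if value:
            return value
    raise RuntimeError("Set HF_TOKEN (or OPENAI_API_KEY / API_KEY).")

print(resolve_api_key({"OPENAI_API_KEY": "sk-test"}))  # sk-test
```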
## Run Methods

### 1) Run Unit Tests

```bash
source .venv/bin/activate
pytest tests/test_env.py -q
```
### 2) Validate OpenEnv Package

```bash
source .venv/bin/activate
openenv validate
```
### 3) Run Baseline Agent (Single Task)

```bash
source .venv/bin/activate
export API_BASE_URL=http://localhost:11434/v1
export MODEL_NAME=qwen3.5:latest
export HF_TOKEN=not-needed
export TEMPERATURE=0.0
export REQUEST_TIMEOUT=180

python inference.py \
  --task-id bug_detection_easy_1 \
  --max-steps 10 \
  --output baseline_results.json
```
### 4) Run All Tasks (Local Sweep)

```bash
source .venv/bin/activate
export API_BASE_URL=http://localhost:11434/v1
export MODEL_NAME=qwen3.5:latest
export HF_TOKEN=not-needed
export TEMPERATURE=0.0
export REQUEST_TIMEOUT=180

for task in \
  bug_detection_easy_1 \
  bug_detection_easy_2 \
  approve_easy_3 \
  memory_leak_medium_1 \
  performance_medium_2 \
  approve_medium_3 \
  security_hard_1 \
  race_condition_hard_2 \
  approve_hard_3
do
  python inference.py --task-id "$task" --max-steps 10 --output "baseline_${task}.json"
done
```
### 5) Docker Build and Run

```bash
docker build -t code-review-env .

docker run --rm \
  -e API_BASE_URL=http://host.docker.internal:11434/v1 \
  -e MODEL_NAME=qwen3.5:latest \
  -e HF_TOKEN=not-needed \
  -e TEMPERATURE=0.0 \
  -e REQUEST_TIMEOUT=180 \
  code-review-env \
  --task-id bug_detection_easy_1
```
### 6) Docker Compose Services

```bash
docker compose run --rm openai-agent
docker compose run --rm gemini-agent
docker compose run --rm groq-agent
docker compose run --rm local-agent
```

Note: on macOS, `network_mode: host` can be unreliable. If `local-agent` cannot reach Ollama, use `host.docker.internal` in the service environment.
## Available Task IDs

- `bug_detection_easy_1`
- `bug_detection_easy_2`
- `approve_easy_3`
- `memory_leak_medium_1`
- `performance_medium_2`
- `approve_medium_3`
- `type_safety_medium_4`
- `javascript_medium_5`
- `security_hard_1`
- `race_condition_hard_2`
- `approve_hard_3`
- `adversarial_hard_4`
- `concurrency_hard_5`
- `dependency_injection_hard_6`
## HTTP Endpoints

- `GET /`
- `GET /health`
- `GET /tasks`
- `GET|POST /reset`
- `POST /step`
- `GET /state`
- `GET /score`
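As a sketch of driving the lifecycle over HTTP, requests to `/reset` and `/step` might be built as below. The base URL matches the Space's `app_port`, but the payload field names are assumptions for illustration; consult the server for the actual schema:

```python
import json

BASE_URL = "http://localhost:7860"  # matches app_port in the Space config

def build_request(path, payload):
    """Build (url, body, headers) for a JSON POST to the environment.
    Payload field names here are illustrative assumptions."""
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return BASE_URL + path, body, headers

url, body, headers = build_request("/reset", {"task_id": "bug_detection_easy_1"})
print(url)  # http://localhost:7860/reset

# Against a running server, send it with urllib.request:
# req = urllib.request.Request(url, data=body, headers=headers)
# obs = json.load(urllib.request.urlopen(req))
```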
## Output Format

Each inference run writes JSON like:

```json
{
  "task_id": "bug_detection_easy_1",
  "total_reward": 0.78,
  "task_score": 1.0,
  "steps": 3,
  "max_steps": 10,
  "provider": "openai-client",
  "model": "qwen3.5:latest",
  "api_base_url": "http://localhost:11434/v1"
}
```
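A small helper for aggregating several such outputs (e.g. after the local sweep above) could use the `task_score` and `total_reward` fields directly; the sample dicts below stand in for `json.load`-ed result files:

```python
def summarize(results):
    """Average task_score and total_reward across inference output dicts."""
    n = len(results)
    avg_score = sum(r["task_score"] for r in results) / n
    avg_reward = sum(r["total_reward"] for r in results) / n
    return {
        "runs": n,
        "avg_task_score": round(avg_score, 4),
        "avg_total_reward": round(avg_reward, 4),
    }

# In practice these come from json.load(open("baseline_<task>.json")).
sample = [
    {"task_id": "bug_detection_easy_1", "task_score": 1.0, "total_reward": 0.78},
    {"task_id": "approve_easy_3", "task_score": 0.5, "total_reward": 0.40},
]
print(summarize(sample))  # {'runs': 2, 'avg_task_score': 0.75, 'avg_total_reward': 0.59}
```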
## Notes On Baseline Stability

- Local models can time out on long prompts.
- The baseline enforces phased review behavior and falls back to deterministic actions when the model is temporarily unavailable.
- For reproducible runs, keep `TEMPERATURE=0.0`.
## Fast Start (3 Commands)

```bash
source .venv/bin/activate
pytest -q
python submit.py --skip-docker --max-steps 10
```
## Judge Map (Criterion -> Evidence)
| Criterion | Evidence | File |
|---|---|---|
| OpenEnv lifecycle compliance | `reset`/`step`/`state` implemented and served over HTTP | `environment/env.py`, `server/app.py` |
| Typed models | Pydantic action/state/observation models | `environment/models.py` |
| Task difficulty progression | easy/medium/hard tasks + calibration approve tasks | `environment/tasks.py` |
| Grading quality | detection/suggestion/decision + partial credit + FP penalty + efficiency bonus | `environment/graders.py` |
| Baseline reproducibility | deterministic seed support in reset + inference output metadata | `environment/env.py`, `inference.py` |
| Submission validation | Python preflight + bash validator script | `submit.py`, `scripts/validate-submission.sh` |
## Grader Rubric (Summary)
| Component | Weight / Effect | Notes |
|---|---|---|
| Detection score | 0.4 | Partial credit for near-line matches |
| Suggestion score | 0.3 | Line-proximity matching for fixes |
| Decision score | 0.3 | Approve for no-issue tasks, request_changes otherwise |
| False positive penalty | up to -0.4 | Strong penalty for issue spam |
| Efficiency bonus | up to +0.1 | Bonus for completing in fewer steps |
| Final score clamp | [0,1] | Safety clamp in grader |
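The rubric above composes roughly as follows. This is a simplified sketch of the weighted sum and clamp described in the table; the real grader in `environment/graders.py` is authoritative:

```python
def composite_score(detection, suggestion, decision,
                    fp_penalty=0.0, efficiency_bonus=0.0):
    """Weighted sum per the rubric table, clamped to [0, 1].
    Component scores are in [0, 1]; fp_penalty up to 0.4, bonus up to 0.1."""
    raw = (0.4 * detection + 0.3 * suggestion + 0.3 * decision
           - fp_penalty + efficiency_bonus)
    return max(0.0, min(1.0, raw))

print(composite_score(1.0, 1.0, 1.0, efficiency_bonus=0.1))  # clamps to 1.0
print(composite_score(0.5, 0.0, 1.0, fp_penalty=0.4))        # penalized review
```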
## Benchmark Snapshot (3-Task Local Run)
| Task | Task Score | Total Reward | Model |
|---|---|---|---|
| bug_detection_easy_1 | 1.000 | 1.410 | meta/llama-3.3-70b-instruct |
| memory_leak_medium_1 | 0.875 | 1.285 | meta/llama-3.3-70b-instruct |
| security_hard_1 | 1.000 | 1.410 | meta/llama-3.3-70b-instruct |
Note: `task_score` is normalized to [0,1]. `total_reward` is cumulative step reward and can exceed 1.0 by design.
## Training Results (PPO-style Loop)

Run training:

```bash
source .venv/bin/activate
python train.py --episodes 120 --max-steps 5
```

Generated artifacts:

- `ppo_logs/train_metrics.csv`
- `ppo_logs/summary.txt`
Recent run summary:

- Episodes: 120
- Average reward (first 10): 0.0100
- Average reward (last 10): 0.5100
- Improvement: +0.5000
This demonstrates measurable policy improvement under the training setup provided in this repository.
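The improvement figure can be recomputed from `ppo_logs/train_metrics.csv` as a first-versus-last window average. The column name `reward` is an assumption in this sketch; check the CSV header in your run:

```python
import csv
import io

def improvement(csv_text, window=10, column="reward"):
    """First-vs-last window average of a per-episode reward column."""
    rewards = [float(row[column]) for row in csv.DictReader(io.StringIO(csv_text))]
    first = sum(rewards[:window]) / window
    last = sum(rewards[-window:]) / window
    return round(last - first, 4)

# Tiny synthetic log standing in for ppo_logs/train_metrics.csv:
log = "reward\n" + "\n".join(["0.0"] * 10 + ["0.5"] * 10)
print(improvement(log))  # 0.5
```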
## One-Command Benchmark Table

Generate per-task JSON outputs plus a markdown table for judge submission:

```bash
source .venv/bin/activate
python scripts/run_benchmark.py --max-steps 10
```

Artifacts:

- `outputs/benchmark_<task_id>.json`
- `outputs/benchmark_table.md`
## Failure Analysis Template

### `javascript_medium_5` (Undefined access)

- Observation: task score reached `1.0`, but diagnostics show `precision=0.5`, `recall=1.0`, `f1=0.6667`, `false_positive_count=1`.
- Why: the model used Python-centric heuristics and produced one extra issue comment on a JS snippet.
- Action: added a JavaScript task category and retained false-positive penalties to expose over-flagging.

### `memory_leak_medium_1` (historical baseline run)

- Observation: an earlier run dropped below a perfect score due to a noisy comment strategy.
- Why: over-commenting triggered false-positive penalties despite finding the core issue.
- Action: added an anti-loop repeated-comment penalty plus adversarial no-issue tasks to discourage spam.

### `adversarial_hard_4` (Safe SQL task)

- Observation: the correct behavior is to approve; naive SQL keyword matching causes false alarms.
- Why: keyword-only review policies confuse parameterized SQL with vulnerable string interpolation.
- Action: included an explicit no-issue adversarial task in the hard set and calibration tests to reward restraint.