---
title: Code Review Agent Environment
emoji: 🤖
colorFrom: green
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---

Code Review Agent Environment


This repository provides an OpenEnv-compatible environment for evaluating AI code-review agents.

Judge Summary

  • OpenEnv validation: pass
  • Tests: pass
  • Docker build: pass
  • Baseline reproduction: pass
  • Live Space health/reset: pass


Why This Environment

Code review is a strong RL task because success and failure are measurable: line-level issues can be deterministically graded, rewards can be shaped across review phases, and tasks can scale from easy to hard while staying realistic.

This project is designed for both evaluation and lightweight policy training loops, not only one-off scripted inference.

The agent receives a code diff and surrounding file context, then performs a multi-step review:

  1. Add issue comments with line numbers.
  2. Suggest code fixes.
  3. Make a final decision (approved or changes_requested).

The environment scores the review quality using deterministic graders.
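
For illustration, a single review action might look roughly like the sketch below; the field names (comments, suggestions, decision) are assumptions made for this example, not the authoritative Pydantic schema in environment/models.py.

```python
# Hypothetical shape of one review action; field names are assumptions,
# not the typed models defined in environment/models.py.
review_action = {
    "comments": [                      # phase 1: line-level issue comments
        {"line": 42, "message": "Possible null dereference when `user` is None."}
    ],
    "suggestions": [                   # phase 2: concrete fix suggestions
        {"line": 42, "fix": "if user is None: return None"}
    ],
    "decision": "changes_requested",   # phase 3: final verdict
}
```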

What This Project Does

  • Simulates pull-request review tasks across easy/medium/hard difficulty.
  • Exposes OpenEnv-style lifecycle methods (reset, step, state).
  • Exposes integration endpoints (tasks, score, health) for tooling and dashboard checks.
  • Grades issue detection, fix suggestions, and final decision quality.
  • Supports local LLM providers via an OpenAI-compatible API (including Ollama).
  • Includes a policy-training scaffold (train.py, train_env.py) and logged training metrics.

Project Structure

  • environment/: environment implementation, task definitions, models, and grading logic.
  • inference.py: baseline review agent loop.
  • train.py, train_env.py: lightweight PPO-style policy training loop over the environment.
  • ppo_logs/: training metrics and summaries.
  • openenv.yaml: task registry and environment metadata.
  • tests/: environment tests.
  • explore_env.ipynb: interactive environment walkthrough.
  • docker-compose.yml / Dockerfile: containerized execution options.

Prerequisites

  • Python 3.10+
  • macOS/Linux shell or PowerShell equivalent
  • Optional: Docker Desktop
  • Optional: Ollama for local model inference

Local Setup

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Required Environment Variables

The baseline uses OpenAI-compatible endpoints.

  • API_BASE_URL (required)
  • MODEL_NAME (required)
  • HF_TOKEN (preferred auth var)

Supported auth aliases:

  • OPENAI_API_KEY
  • API_KEY
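
A minimal sketch of how a client might resolve these variables, assuming HF_TOKEN takes precedence over the aliases; the baseline's actual resolution order lives in inference.py.

```python
import os

# Assumed precedence: HF_TOKEN is preferred, the aliases act as fallbacks.
api_key = (
    os.environ.get("HF_TOKEN")
    or os.environ.get("OPENAI_API_KEY")
    or os.environ.get("API_KEY")
)
api_base_url = os.environ["API_BASE_URL"]   # required
model_name = os.environ["MODEL_NAME"]       # required
```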

Run Methods

1) Run Unit Tests

source .venv/bin/activate
pytest tests/test_env.py -q

2) Validate OpenEnv Package

source .venv/bin/activate
openenv validate

3) Run Baseline Agent (Single Task)

source .venv/bin/activate
export API_BASE_URL=http://localhost:11434/v1
export MODEL_NAME=qwen3.5:latest
export HF_TOKEN=not-needed
export TEMPERATURE=0.0
export REQUEST_TIMEOUT=180

python inference.py \
    --task-id bug_detection_easy_1 \
    --max-steps 10 \
    --output baseline_results.json

4) Run All Tasks (Local Sweep)

source .venv/bin/activate
export API_BASE_URL=http://localhost:11434/v1
export MODEL_NAME=qwen3.5:latest
export HF_TOKEN=not-needed
export TEMPERATURE=0.0
export REQUEST_TIMEOUT=180

for task in \
    bug_detection_easy_1 \
    bug_detection_easy_2 \
    approve_easy_3 \
    memory_leak_medium_1 \
    performance_medium_2 \
    approve_medium_3 \
    security_hard_1 \
    race_condition_hard_2 \
    approve_hard_3
do
    python inference.py --task-id "$task" --max-steps 10 --output "baseline_${task}.json"
done

5) Docker Build and Run

docker build -t code-review-env .

docker run --rm \
    -e API_BASE_URL=http://host.docker.internal:11434/v1 \
    -e MODEL_NAME=qwen3.5:latest \
    -e HF_TOKEN=not-needed \
    -e TEMPERATURE=0.0 \
    -e REQUEST_TIMEOUT=180 \
    code-review-env \
    --task-id bug_detection_easy_1

6) Docker Compose Services

docker compose run --rm openai-agent
docker compose run --rm gemini-agent
docker compose run --rm groq-agent
docker compose run --rm local-agent

Note: on macOS, network_mode: host can be unreliable. If local-agent cannot reach Ollama, use host.docker.internal in the service environment.

Available Task IDs

  • bug_detection_easy_1
  • bug_detection_easy_2
  • approve_easy_3
  • memory_leak_medium_1
  • performance_medium_2
  • approve_medium_3
  • type_safety_medium_4
  • javascript_medium_5
  • security_hard_1
  • race_condition_hard_2
  • approve_hard_3
  • adversarial_hard_4
  • concurrency_hard_5
  • dependency_injection_hard_6

HTTP Endpoints

  • GET /
  • GET /health
  • GET /tasks
  • GET|POST /reset
  • POST /step
  • GET /state
  • GET /score
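
A minimal client sketch against a locally running server, assuming the default app_port of 7860 from the Space metadata; the request payload field names ("task_id", "action", "decision") are assumptions rather than the real schema, so consult environment/models.py for the typed models.

```python
import requests

BASE = "http://localhost:7860"  # default app_port from the Space metadata

print(requests.get(f"{BASE}/health").json())   # liveness check
print(requests.get(f"{BASE}/tasks").json())    # list available task IDs

# Payload field names below are assumptions, not the authoritative schema.
obs = requests.post(f"{BASE}/reset", json={"task_id": "bug_detection_easy_1"}).json()
print(obs)
step = requests.post(f"{BASE}/step", json={"action": {"decision": "approved"}}).json()
print(step)
print(requests.get(f"{BASE}/score").json())    # deterministic grading summary
```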

Output Format

Each inference run writes JSON like:

{
    "task_id": "bug_detection_easy_1",
    "total_reward": 0.78,
    "task_score": 1.0,
    "steps": 3,
    "max_steps": 10,
    "provider": "openai-client",
    "model": "qwen3.5:latest",
    "api_base_url": "http://localhost:11434/v1"
}
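
After a local sweep, the per-task JSON files can be summarized with a short script such as the sketch below, which relies only on the keys shown above.

```python
import glob
import json

# Summarize all per-task baseline outputs produced by the local sweep.
for path in sorted(glob.glob("baseline_*.json")):
    with open(path) as fh:
        run = json.load(fh)
    print(f"{run['task_id']:<28} task_score={run['task_score']:.3f} "
          f"total_reward={run['total_reward']:.3f} steps={run['steps']}")
```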

Notes On Baseline Stability

  • Local models can time out on long prompts.
  • The baseline now enforces phased review behavior and falls back to deterministic actions when the model is temporarily unavailable.
  • For reproducible runs, keep TEMPERATURE=0.0.

Fast Start (3 Commands)

source .venv/bin/activate
pytest -q
python submit.py --skip-docker --max-steps 10

Judge Map (Criterion -> Evidence)

| Criterion | Evidence | File |
| --- | --- | --- |
| OpenEnv lifecycle compliance | reset/step/state implemented and served over HTTP | environment/env.py, server/app.py |
| Typed models | Pydantic action/state/observation models | environment/models.py |
| Task difficulty progression | easy/medium/hard tasks + calibration approve tasks | environment/tasks.py |
| Grading quality | detection/suggestion/decision + partial credit + FP penalty + efficiency bonus | environment/graders.py |
| Baseline reproducibility | deterministic seed support in reset + inference output metadata | environment/env.py, inference.py |
| Submission validation | Python preflight + bash validator script | submit.py, scripts/validate-submission.sh |

Grader Rubric (Summary)

| Component | Weight / Effect | Notes |
| --- | --- | --- |
| Detection score | 0.4 | Partial credit for near-line matches |
| Suggestion score | 0.3 | Line-proximity matching for fixes |
| Decision score | 0.3 | Approve for no-issue tasks, request_changes otherwise |
| False positive penalty | up to -0.4 | Strong penalty for issue spam |
| Efficiency bonus | up to +0.1 | Bonus for completing in fewer steps |
| Final score clamp | [0, 1] | Safety clamp in grader |
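
For illustration, the sketch below shows one way the components in the table could combine; it is only a restatement of the rubric, not the authoritative implementation in environment/graders.py.

```python
def combine_scores(detection: float, suggestion: float, decision: float,
                   fp_penalty: float, efficiency_bonus: float) -> float:
    """Illustrative weighted combination of the rubric components.

    detection/suggestion/decision are assumed to be in [0, 1]; fp_penalty and
    efficiency_bonus are assumed to be pre-scaled to their caps (0.4 and 0.1).
    """
    raw = 0.4 * detection + 0.3 * suggestion + 0.3 * decision
    raw -= fp_penalty          # up to 0.4, applied for issue spam
    raw += efficiency_bonus    # up to 0.1, for finishing in fewer steps
    return max(0.0, min(1.0, raw))   # final safety clamp to [0, 1]
```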

Benchmark Snapshot (3-Task Local Run)

| Task | Task Score | Total Reward | Model |
| --- | --- | --- | --- |
| bug_detection_easy_1 | 1.000 | 1.410 | meta/llama-3.3-70b-instruct |
| memory_leak_medium_1 | 0.875 | 1.285 | meta/llama-3.3-70b-instruct |
| security_hard_1 | 1.000 | 1.410 | meta/llama-3.3-70b-instruct |

Note: task_score is normalized to [0,1]. total_reward is cumulative step reward and can exceed 1.0 by design.

Training Results (PPO-style Loop)

Run training:

source .venv/bin/activate
python train.py --episodes 120 --max-steps 5

Generated artifacts:

  • ppo_logs/train_metrics.csv
  • ppo_logs/summary.txt

Recent run summary:

  • Episodes: 120
  • Average reward (first 10): 0.0100
  • Average reward (last 10): 0.5100
  • Improvement: +0.5000

This demonstrates measurable policy improvement under the training setup provided in this repository.
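
The summary above can be recomputed from the CSV; the sketch below assumes a per-episode reward column named reward in ppo_logs/train_metrics.csv, which is an assumption about the log format.

```python
import csv

# Recompute the first-10 vs last-10 average reward from the training log.
# The column name "reward" is an assumption about ppo_logs/train_metrics.csv.
with open("ppo_logs/train_metrics.csv") as fh:
    rewards = [float(row["reward"]) for row in csv.DictReader(fh)]

first, last = rewards[:10], rewards[-10:]
print(f"avg reward (first 10): {sum(first) / len(first):.4f}")
print(f"avg reward (last 10):  {sum(last) / len(last):.4f}")
```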

One-Command Benchmark Table

Generate per-task JSON outputs plus a markdown table for judge submission:

source .venv/bin/activate
python scripts/run_benchmark.py --max-steps 10

Artifacts:

  • outputs/benchmark_<task_id>.json
  • outputs/benchmark_table.md

Failure Analysis Template

  1. javascript_medium_5 (Undefined access)
     • Observation: task score reached 1.0, but diagnostics show precision=0.5, recall=1.0, f1=0.6667, false_positive_count=1.
     • Why: model used Python-centric heuristics and produced one extra issue comment on a JS snippet.
     • Action: added JavaScript task category and retained false-positive penalties to expose over-flagging.
  2. memory_leak_medium_1 (historical baseline run)
     • Observation: earlier run dropped below perfect score due to noisy comment strategy.
     • Why: over-commenting triggered false-positive penalties despite finding the core issue.
     • Action: anti-loop repeated-comment penalty + adversarial no-issue tasks to discourage spam.
  3. adversarial_hard_4 (Safe SQL task)
     • Observation: correct behavior is approve; naive SQL keyword matching causes false alarms.
     • Why: keyword-only review policies confuse parameterized SQL with vulnerable string interpolation.
     • Action: included explicit no-issue adversarial task in hard set and calibration tests to reward restraint.