SWE-Review-8B

An agentic code review model fine-tuned from Qwen3-8B on 8,914 review trajectories (SWE-Review-Traj). The model explores a repository via tool calls, independently traces the root cause of an issue, and produces a structured review decision with diagnostic feedback for revision.

About SWE-Review

SWE-Review is a framework for closing the issue-resolution loop with agentic code review. A reviewer agent independently explores the repository, traces the root cause, and compares its own diagnosis against the submitted PR — turning one-shot patch generation into an iterative generate-review-revise loop that raises resolve rates by up to +29.4 percentage points on SWE-bench Verified.

Quick Start

Serve with vLLM

python -m vllm.entrypoints.openai.api_server \
    --model SWE-Lego/SWE-Review-8B \
    --served-model-name SWE-Review-8B \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 131072 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --chat-template-content-format string \
    --api-key dummy-key

Run a Review

The model is designed to be used as an agentic reviewer with tool-calling (file reading, code search, etc.). See the SWE-Review code repository for the full agentic review pipeline, including Harbor-based orchestration and benchmark evaluation scripts.

For a plug-and-play experience in Claude Code, install the cc-swe-review plugin.

Evaluation Results

Performance on SWE-Review-Bench (1,384 instances across 3 quality tiers):

Split	CR (%)	DA (%)	RRR (%)	Δ RR
GLM-5 (high, baseline 72.2%)	84.2	68.7	71.6	-0.6
Coder-30B (medium, baseline 50.9%)	81.4	66.9	52.8	+1.9
Qwen3-30B (low, baseline 27.5%)	71.1	71.6	35.1	+7.6

Improvement over Base Model (Qwen3-8B without SFT)

Split	DA (base → SFT)	RRR (base → SFT)
GLM-5 (high)	49.0 → 68.7 (+19.7)	72.2 → 71.6 (-0.6)
Coder-30B (medium)	49.1 → 66.9 (+17.8)	50.9 → 52.8 (+1.9)
Qwen3-30B (low)	50.8 → 71.6 (+20.8)	27.5 → 35.1 (+7.6)

SFT dramatically improves DA across all splits. On harder PRs (medium/low tiers), review-guided revision further raises resolve rate.

Test-Time Scaling

When used as the reviewer in iterative review-revision loops, the model enables 3.7× the test-time scaling gain at 6.7× the efficiency of independent resampling, reaching 38.4% resolve rate with only 2.44 samples on average.

Training Details

Property	Value
Base Model	Qwen/Qwen3-8B
Training Data	SWE-Review-Traj (8,914 trajectories)
Framework	LLaMA-Factory + DeepSpeed ZeRO-3
Epochs	4
Learning Rate	1e-4 (cosine scheduler)
Max Context Length	131,072 tokens (YaRN rope scaling)
Optimizations	Liger kernel + Flash Attention v2

Important Notes

Use --tool-call-parser hermes (not qwen3_coder)
Set OPENHANDS_LLM_NATIVE_TOOL_CALLING=true when using with OpenHands
Do not use --reasoning-parser with this model
For Harbor agent execution, use http://172.17.0.1:<port>/v1 (Docker bridge gateway), not localhost

Citation

@article{wang2026swereview,
  title={SWE-Review: Closing the Loop on Issue Resolution with Agentic Code Review},
  author={Wang, Ruoyu and Chen, Jierun and Wang, Shaowei and Tao, Chaofan and Yang, Sidi and Jiang, Yuxin and Yap, Kim-Hui and Shang, Lifeng and Li, Xiaohui and Bai, Haoli},
  journal={arXiv preprint},
  year={2026}
}