SWE-Review-8B
An agentic code review model fine-tuned from Qwen3-8B on 8,914 review trajectories (SWE-Review-Traj). The model explores a repository via tool calls, independently traces the root cause of an issue, and produces a structured review decision with diagnostic feedback for revision.
Project Page | Paper | Code | Benchmark | Training Data | Claude Code Plugin
About SWE-Review
SWE-Review is a framework for closing the issue-resolution loop with agentic code review. A reviewer agent independently explores the repository, traces the root cause, and compares its own diagnosis against the submitted PR — turning one-shot patch generation into an iterative generate-review-revise loop that raises resolve rates by up to +29.4 percentage points on SWE-bench Verified.
Quick Start
Serve with vLLM
python -m vllm.entrypoints.openai.api_server \
--model SWE-Lego/SWE-Review-8B \
--served-model-name SWE-Review-8B \
--host 0.0.0.0 --port 8000 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--max-model-len 131072 \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--chat-template-content-format string \
--api-key dummy-key
Run a Review
The model is designed to be used as an agentic reviewer with tool-calling (file reading, code search, etc.). See the SWE-Review code repository for the full agentic review pipeline, including Harbor-based orchestration and benchmark evaluation scripts.
For a plug-and-play experience in Claude Code, install the cc-swe-review plugin.
Evaluation Results
Performance on SWE-Review-Bench (1,384 instances across 3 quality tiers):
| Split | CR (%) | DA (%) | RRR (%) | Δ RR |
|---|---|---|---|---|
| GLM-5 (high, baseline 72.2%) | 84.2 | 68.7 | 71.6 | -0.6 |
| Coder-30B (medium, baseline 50.9%) | 81.4 | 66.9 | 52.8 | +1.9 |
| Qwen3-30B (low, baseline 27.5%) | 71.1 | 71.6 | 35.1 | +7.6 |
Improvement over Base Model (Qwen3-8B without SFT)
| Split | DA (base → SFT) | RRR (base → SFT) |
|---|---|---|
| GLM-5 (high) | 49.0 → 68.7 (+19.7) | 72.2 → 71.6 (-0.6) |
| Coder-30B (medium) | 49.1 → 66.9 (+17.8) | 50.9 → 52.8 (+1.9) |
| Qwen3-30B (low) | 50.8 → 71.6 (+20.8) | 27.5 → 35.1 (+7.6) |
SFT dramatically improves DA across all splits. On harder PRs (medium/low tiers), review-guided revision further raises resolve rate.
Test-Time Scaling
When used as the reviewer in iterative review-revision loops, the model enables 3.7× the test-time scaling gain at 6.7× the efficiency of independent resampling, reaching 38.4% resolve rate with only 2.44 samples on average.
Training Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-8B |
| Training Data | SWE-Review-Traj (8,914 trajectories) |
| Framework | LLaMA-Factory + DeepSpeed ZeRO-3 |
| Epochs | 4 |
| Learning Rate | 1e-4 (cosine scheduler) |
| Max Context Length | 131,072 tokens (YaRN rope scaling) |
| Optimizations | Liger kernel + Flash Attention v2 |
Important Notes
- Use
--tool-call-parser hermes(notqwen3_coder) - Set
OPENHANDS_LLM_NATIVE_TOOL_CALLING=truewhen using with OpenHands - Do not use
--reasoning-parserwith this model - For Harbor agent execution, use
http://172.17.0.1:<port>/v1(Docker bridge gateway), notlocalhost
Citation
@article{wang2026swereview,
title={SWE-Review: Closing the Loop on Issue Resolution with Agentic Code Review},
author={Wang, Ruoyu and Chen, Jierun and Wang, Shaowei and Tao, Chaofan and Yang, Sidi and Jiang, Yuxin and Yap, Kim-Hui and Shang, Lifeng and Li, Xiaohui and Bai, Haoli},
journal={arXiv preprint},
year={2026}
}
- Downloads last month
- 26