SWE-Review-30B-A3B
An agentic code review model fine-tuned from Qwen3-30B-A3B (MoE, 3B active parameters) on 8,914 review trajectories (SWE-Review-Traj). The model explores a repository via tool calls, independently traces the root cause of an issue, and produces a structured review decision with diagnostic feedback for revision.
Project Page | Paper | Code | Benchmark | Training Data | Claude Code Plugin
About SWE-Review
SWE-Review is a framework for closing the issue-resolution loop with agentic code review. A reviewer agent independently explores the repository, traces the root cause, and compares its own diagnosis against the submitted PR — turning one-shot patch generation into an iterative generate-review-revise loop that raises resolve rates by up to +29.4 percentage points on SWE-bench Verified.
Quick Start
Serve with vLLM
python -m vllm.entrypoints.openai.api_server \
--model SWE-Lego/SWE-Review-30B-A3B \
--served-model-name SWE-Review-30B-A3B \
--host 0.0.0.0 --port 8000 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--max-model-len 131072 \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--chat-template-content-format string \
--api-key dummy-key
Run a Review
The model is designed to be used as an agentic reviewer with tool-calling (file reading, code search, etc.). See the SWE-Review code repository for the full agentic review pipeline, including Harbor-based orchestration and benchmark evaluation scripts.
For a plug-and-play experience in Claude Code, install the cc-swe-review plugin.
Evaluation Results
Performance on SWE-Review-Bench (1,384 instances across 3 quality tiers):
| Split | CR (%) | DA (%) | RRR (%) | Δ RR |
|---|---|---|---|---|
| GLM-5 (high, baseline 72.2%) | 82.0 | 69.0 | 72.6 | +0.4 |
| Coder-30B (medium, baseline 50.9%) | 83.1 | 70.5 | 53.7 | +2.8 |
| Qwen3-30B (low, baseline 27.5%) | 87.2 | 76.5 | 35.8 | +8.3 |
Improvement over Base Model (Qwen3-30B-A3B without SFT)
| Split | DA (base → SFT) | RRR (base → SFT) |
|---|---|---|
| GLM-5 (high) | 63.2 → 69.0 (+5.8) | 64.9 → 72.6 (+7.7) |
| Coder-30B (medium) | 54.5 → 70.5 (+16.0) | 49.1 → 53.7 (+4.6) |
| Qwen3-30B (low) | 45.9 → 76.5 (+30.6) | 28.2 → 35.8 (+7.6) |
SFT yields substantial improvements across all splits, with the most dramatic DA gain on the hardest split (+30.6pp). The 30B-A3B model consistently outperforms SWE-Review-8B while maintaining MoE inference efficiency (3B active parameters).
Test-Time Scaling
When used as the reviewer in iterative review-revision loops, the model enables 3.7× the test-time scaling gain at 6.7× the efficiency of independent resampling, reaching 38.4% resolve rate with only 2.44 samples on average.
Training Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-30B-A3B (MoE, 30B total / 3B active) |
| Training Data | SWE-Review-Traj (8,914 trajectories) |
| Framework | LLaMA-Factory + DeepSpeed ZeRO-3 |
| Epochs | 4 |
| Learning Rate | 1e-4 (cosine scheduler) |
| Max Context Length | 131,072 tokens (YaRN rope scaling) |
| Optimizations | Liger kernel + Flash Attention v2 |
Important Notes
- Use
--tool-call-parser hermesfor this SFT model (notqwen3_coder, which is for the base Qwen3-Coder model) - Set
OPENHANDS_LLM_NATIVE_TOOL_CALLING=truewhen using with OpenHands - Do not use
--reasoning-parserwith this model (causes tool calls to be hidden in<think>blocks) - TP=4 recommended for inference
- For Harbor agent execution, use
http://172.17.0.1:<port>/v1(Docker bridge gateway), notlocalhost
Citation
@article{wang2026swereview,
title={SWE-Review: Closing the Loop on Issue Resolution with Agentic Code Review},
author={Wang, Ruoyu and Chen, Jierun and Wang, Shaowei and Tao, Chaofan and Yang, Sidi and Jiang, Yuxin and Yap, Kim-Hui and Shang, Lifeng and Li, Xiaohui and Bai, Haoli},
journal={arXiv preprint},
year={2026}
}
- Downloads last month
- 13