SWE-Review-30B-A3B

An agentic code review model fine-tuned from Qwen3-30B-A3B (MoE, 3B active parameters) on 8,914 review trajectories (SWE-Review-Traj). The model explores a repository via tool calls, independently traces the root cause of an issue, and produces a structured review decision with diagnostic feedback for revision.

About SWE-Review

SWE-Review is a framework for closing the issue-resolution loop with agentic code review. A reviewer agent independently explores the repository, traces the root cause, and compares its own diagnosis against the submitted PR — turning one-shot patch generation into an iterative generate-review-revise loop that raises resolve rates by up to +29.4 percentage points on SWE-bench Verified.

Quick Start

Serve with vLLM

python -m vllm.entrypoints.openai.api_server \
    --model SWE-Lego/SWE-Review-30B-A3B \
    --served-model-name SWE-Review-30B-A3B \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 131072 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --chat-template-content-format string \
    --api-key dummy-key

Run a Review

The model is designed to be used as an agentic reviewer with tool-calling (file reading, code search, etc.). See the SWE-Review code repository for the full agentic review pipeline, including Harbor-based orchestration and benchmark evaluation scripts.

For a plug-and-play experience in Claude Code, install the cc-swe-review plugin.

Evaluation Results

Performance on SWE-Review-Bench (1,384 instances across 3 quality tiers):

Split	CR (%)	DA (%)	RRR (%)	Δ RR
GLM-5 (high, baseline 72.2%)	82.0	69.0	72.6	+0.4
Coder-30B (medium, baseline 50.9%)	83.1	70.5	53.7	+2.8
Qwen3-30B (low, baseline 27.5%)	87.2	76.5	35.8	+8.3

Improvement over Base Model (Qwen3-30B-A3B without SFT)

Split	DA (base → SFT)	RRR (base → SFT)
GLM-5 (high)	63.2 → 69.0 (+5.8)	64.9 → 72.6 (+7.7)
Coder-30B (medium)	54.5 → 70.5 (+16.0)	49.1 → 53.7 (+4.6)
Qwen3-30B (low)	45.9 → 76.5 (+30.6)	28.2 → 35.8 (+7.6)

SFT yields substantial improvements across all splits, with the most dramatic DA gain on the hardest split (+30.6pp). The 30B-A3B model consistently outperforms SWE-Review-8B while maintaining MoE inference efficiency (3B active parameters).

Test-Time Scaling

When used as the reviewer in iterative review-revision loops, the model enables 3.7× the test-time scaling gain at 6.7× the efficiency of independent resampling, reaching 38.4% resolve rate with only 2.44 samples on average.

Training Details

Property	Value
Base Model	Qwen/Qwen3-30B-A3B (MoE, 30B total / 3B active)
Training Data	SWE-Review-Traj (8,914 trajectories)
Framework	LLaMA-Factory + DeepSpeed ZeRO-3
Epochs	4
Learning Rate	1e-4 (cosine scheduler)
Max Context Length	131,072 tokens (YaRN rope scaling)
Optimizations	Liger kernel + Flash Attention v2

Important Notes

Use --tool-call-parser hermes for this SFT model (not qwen3_coder, which is for the base Qwen3-Coder model)
Set OPENHANDS_LLM_NATIVE_TOOL_CALLING=true when using with OpenHands
Do not use --reasoning-parser with this model (causes tool calls to be hidden in <think> blocks)
TP=4 recommended for inference
For Harbor agent execution, use http://172.17.0.1:<port>/v1 (Docker bridge gateway), not localhost

Citation

@article{wang2026swereview,
  title={SWE-Review: Closing the Loop on Issue Resolution with Agentic Code Review},
  author={Wang, Ruoyu and Chen, Jierun and Wang, Shaowei and Tao, Chaofan and Yang, Sidi and Jiang, Yuxin and Yap, Kim-Hui and Shang, Lifeng and Li, Xiaohui and Bai, Haoli},
  journal={arXiv preprint},
  year={2026}
}