SWE-Review-8B

An agentic code review model fine-tuned from Qwen3-8B on 8,914 review trajectories (SWE-Review-Traj). The model explores a repository via tool calls, independently traces the root cause of an issue, and produces a structured review decision with diagnostic feedback for revision.

Project Page | Paper | Code | Benchmark | Training Data | Claude Code Plugin

About SWE-Review

SWE-Review is a framework for closing the issue-resolution loop with agentic code review. A reviewer agent independently explores the repository, traces the root cause, and compares its own diagnosis against the submitted PR — turning one-shot patch generation into an iterative generate-review-revise loop that raises resolve rates by up to +29.4 percentage points on SWE-bench Verified.

Quick Start

Serve with vLLM

python -m vllm.entrypoints.openai.api_server \
    --model SWE-Lego/SWE-Review-8B \
    --served-model-name SWE-Review-8B \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 131072 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --chat-template-content-format string \
    --api-key dummy-key

Run a Review

The model is designed to be used as an agentic reviewer with tool-calling (file reading, code search, etc.). See the SWE-Review code repository for the full agentic review pipeline, including Harbor-based orchestration and benchmark evaluation scripts.

For a plug-and-play experience in Claude Code, install the cc-swe-review plugin.

Evaluation Results

Performance on SWE-Review-Bench (1,384 instances across 3 quality tiers):

Split CR (%) DA (%) RRR (%) Δ RR
GLM-5 (high, baseline 72.2%) 84.2 68.7 71.6 -0.6
Coder-30B (medium, baseline 50.9%) 81.4 66.9 52.8 +1.9
Qwen3-30B (low, baseline 27.5%) 71.1 71.6 35.1 +7.6

Improvement over Base Model (Qwen3-8B without SFT)

Split DA (base → SFT) RRR (base → SFT)
GLM-5 (high) 49.0 → 68.7 (+19.7) 72.2 → 71.6 (-0.6)
Coder-30B (medium) 49.1 → 66.9 (+17.8) 50.9 → 52.8 (+1.9)
Qwen3-30B (low) 50.8 → 71.6 (+20.8) 27.5 → 35.1 (+7.6)

SFT dramatically improves DA across all splits. On harder PRs (medium/low tiers), review-guided revision further raises resolve rate.

Test-Time Scaling

When used as the reviewer in iterative review-revision loops, the model enables 3.7× the test-time scaling gain at 6.7× the efficiency of independent resampling, reaching 38.4% resolve rate with only 2.44 samples on average.

Training Details

Property Value
Base Model Qwen/Qwen3-8B
Training Data SWE-Review-Traj (8,914 trajectories)
Framework LLaMA-Factory + DeepSpeed ZeRO-3
Epochs 4
Learning Rate 1e-4 (cosine scheduler)
Max Context Length 131,072 tokens (YaRN rope scaling)
Optimizations Liger kernel + Flash Attention v2

Important Notes

  • Use --tool-call-parser hermes (not qwen3_coder)
  • Set OPENHANDS_LLM_NATIVE_TOOL_CALLING=true when using with OpenHands
  • Do not use --reasoning-parser with this model
  • For Harbor agent execution, use http://172.17.0.1:<port>/v1 (Docker bridge gateway), not localhost

Citation

@article{wang2026swereview,
  title={SWE-Review: Closing the Loop on Issue Resolution with Agentic Code Review},
  author={Wang, Ruoyu and Chen, Jierun and Wang, Shaowei and Tao, Chaofan and Yang, Sidi and Jiang, Yuxin and Yap, Kim-Hui and Shang, Lifeng and Li, Xiaohui and Bai, Haoli},
  journal={arXiv preprint},
  year={2026}
}
Downloads last month
26
Safetensors
Model size
308k params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for SWE-Lego/SWE-Review-8B

Finetuned
Qwen/Qwen3-8B
Finetuned
(1686)
this model
Quantizations
2 models

Collection including SWE-Lego/SWE-Review-8B