Qwen3.5-9B General-Agent SFT LoRA Adapter

v0.0.1 is the completed LoRA SFT release for the Prime general-agent reproduction using the full, non-prequantized Qwen/Qwen3.5-9B base checkpoint. It does not include the base weights; load this adapter on top of Qwen/Qwen3.5-9B.

This release intentionally excludes the next QLoRA run that is currently in progress. That run will be documented and tagged separately after its training and eval finish.

Release Summary

Field	Value
Release tag	`v0.0.1`
Adapter repo	`benchflow/benchflow-qwen35-9b`
Base checkpoint	`Qwen/Qwen3.5-9B`
Base checkpoint form	Full, non-quantized source checkpoint; frozen during LoRA SFT
Adapter type	LoRA / PEFT
Source completed run	`general-agent-qwen35-9b-sft-seq2048-fresh-20260624T131847Z`
W&B project	`general-agent-qwen35-9b-sft-seq2048-fresh-20260624T131847Z`
HF training artifacts	`benchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-qwen35-9b-sft-seq2048-fresh-20260624T131847Z`
Published at	`2026-06-24 22:27:07 UTC`

Research Reproduction Scope

The goal of this adapter is to reproduce the SFT-stage lift from Prime Intellect's general-agent work as closely as possible while using a smaller student model that can train on one H100. The stack keeps the Prime-style task and verifier path:

Source tasks: open-source PrimeIntellect-ai/research-environments/environments/general_agent task corpus.
Teacher trace generation: general-agent-solver-rlm + Azure GPT-5.4-mini through native Verifiers / vf-eval --save-results artifacts.
SFT trainer: Prime-RL SFT.
Student: full, non-quantized Qwen/Qwen3.5-9B loaded in BF16 with LoRA adapters.
Eval: general-agent-solver-local through native vf-eval --save-results on the same held-in task sets before and after SFT.

Data Recipe

Field	Value
Dataset	`benchflow/general-agent-qwen35-9b-azure-gpt54mini-sft`
Dataset rows	`4414`
Original source task count	`4417`
Teacher model	Azure GPT-5.4-mini
Teacher harness	Prime/Verifiers `general-agent-solver-rlm`
Artifact format	Native `vf-eval --save-results` trajectories converted to Prime-RL `messages` + `tool_defs` SFT rows
Excluded source tasks	`dog_breeding_t1`, `skydiving_center_t1`, `skydiving_center_t2`
Exclusion reason	Stable Azure content-filter blocks during teacher trace generation
Full teacher sweep artifact	`benchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-daytona-teacher-full4417-tunnel8-20260624T015706Z`
Data validation	Prime SFT JSONL validator rejected non-leading system messages and leakage fields before training
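
The validation row above can be illustrated with a minimal sketch. It assumes each JSONL row is an object with a `messages` list and a `tool_defs` list, as the artifact format row describes. The leakage field names (`reward`, `answer`, `info`) and the function names are hypothetical placeholders; the actual Prime SFT validator and its field list are not shown in this card.

```python
import json

# Hypothetical names for fields that would leak verifier state into training rows.
LEAKAGE_FIELDS = {"reward", "answer", "info"}
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def validate_row(row: dict) -> list[str]:
    """Return the problems found in one SFT row (empty list if the row is clean)."""
    messages = row.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages'"]
    problems = []
    if not isinstance(row.get("tool_defs"), list):
        problems.append("missing 'tool_defs' list")
    for index, message in enumerate(messages):
        role = message.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {index}: unknown role {role!r}")
        # A system message is only acceptable as the first message.
        if role == "system" and index > 0:
            problems.append(f"message {index}: non-leading system message")
    leaked = LEAKAGE_FIELDS.intersection(row)
    if leaked:
        problems.append(f"leakage fields present: {sorted(leaked)}")
    return problems


def validate_jsonl(lines) -> dict[int, list[str]]:
    """Map 1-based line numbers to problems for every rejected row."""
    rejected = {}
    for number, line in enumerate(lines, start=1):
        problems = validate_row(json.loads(line))
        if problems:
            rejected[number] = problems
    return rejected
```

Passing an open file handle to `validate_jsonl` would check a whole dataset file line by line; an empty result means no row was rejected under these assumed rules.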

Training Parameters

Field	Value
Trainer	Prime-RL SFT
Model loaded for SFT	`Qwen/Qwen3.5-9B` full BF16 base weights
Quantization	None for the completed `v0.0.1` LoRA run
Adapter	LoRA
LoRA rank	`16`
LoRA alpha	`32`
LoRA dropout	`0.0`
Target modules	`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
Trainable params	about `29.1M`
Adapted base params	about `5.30B`
Total base params loaded	about `9.44B`
Sequence length	`2048`
Global batch size	`8`
Micro batch size	`1`
Pack function	`cat`
Shuffle	`true`
Seed	`0`
Optimizer	`AdamW`
Learning rate	`5e-5`
Weight decay	`0.01`
Betas	`0.9`, `0.999`
Grad norm clip	`1.0`
Scheduler	Linear
Warmup steps	`20`
Decay steps	`180`
Minimum LR	`0.0`
Max steps	`200`
Checkpoint interval	`20`
Keep last	`3`
Keep interval	`100`
Save format	`safetensors`
Loss mask	Assistant messages only; system, user, and tool messages are context-only
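
The scheduler rows above (peak `5e-5`, 20 warmup steps, 180 decay steps, minimum `0.0`) imply a simple piecewise-linear learning-rate curve. The sketch below is one plausible reading, assuming warmup ramps linearly from zero and steps are 0-based; Prime-RL's exact step indexing may differ by one, and `lr_at_step` is a hypothetical helper name.

```python
PEAK_LR = 5e-5
MIN_LR = 0.0
WARMUP_STEPS = 20
DECAY_STEPS = 180


def lr_at_step(step: int) -> float:
    """Learning rate at a 0-based optimizer step: linear warmup, then linear decay."""
    if step < WARMUP_STEPS:
        # Ramp from 0 up to the peak across the warmup window (assumed start at 0).
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay window already consumed, clamped to [0, 1].
    progress = min((step - WARMUP_STEPS) / DECAY_STEPS, 1.0)
    return PEAK_LR + (MIN_LR - PEAK_LR) * progress


for step in (0, 10, 20, 110, 200):
    print(step, f"{lr_at_step(step):.2e}")
```

Under this reading, warmup plus decay covers exactly the 200 max steps, so the rate reaches the minimum at the final step.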

Training Result

Metric	Value
Completed step	`200`
Final loss	`0.11897`
`loss/nan_count`	`0`
Peak GPU memory	about `40.8 GiB`
Final adapter	`adapter_model.safetensors` in this repo

The initial Prime-RL BF16 LoRA attempt with data.seq_len=8192 ran out of memory (OOM) on one H100. The completed v0.0.1 run used data.seq_len=2048, system CUDA 12.8 nvcc/ptxas, and g++-12 for the required FLA/TileLang kernels.
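
For a rough sense of scale, the figures in the tables above can be combined into a back-of-the-envelope budget. This is a sketch under stated assumptions: a single data-parallel rank (one H100), every packed sequence filled to the full sequence length (so the token counts are upper bounds), and the rounded parameter counts from the table.

```python
SEQ_LEN = 2048
GLOBAL_BATCH = 8
MICRO_BATCH = 1
MAX_STEPS = 200
TRAINABLE_PARAMS = 29.1e6   # rounded value from the table
TOTAL_BASE_PARAMS = 9.44e9  # rounded value from the table

# Assumes one data-parallel rank, so accumulation covers the whole global batch.
grad_accum_steps = GLOBAL_BATCH // MICRO_BATCH
# Upper bounds: packing may leave some sequences shorter than SEQ_LEN.
tokens_per_step = GLOBAL_BATCH * SEQ_LEN
total_tokens = tokens_per_step * MAX_STEPS
trainable_fraction = TRAINABLE_PARAMS / TOTAL_BASE_PARAMS

print(f"gradient accumulation steps: {grad_accum_steps}")
print(f"tokens per optimizer step (max): {tokens_per_step:,}")
print(f"tokens over {MAX_STEPS} steps (max): {total_tokens:,}")
print(f"trainable fraction: {trainable_fraction:.3%}")
# The failed attempt used sequences this many times longer per micro batch.
print(f"seq_len ratio 8192/2048: {8192 // SEQ_LEN}x")
```

On these assumptions the run sees at most about 3.28M tokens, and the adapter trains roughly 0.31% of the loaded parameters.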

Evaluation Results

All evaluations below use native Verifiers vf-eval --save-results, general-agent-solver-local, serving context length 4096, --enable-auto-tool-choice, and --tool-call-parser qwen3_xml. Dynamic vLLM LoRA loading was not reliable for this stack, so eval served a merged local checkpoint built from this adapter plus Qwen/Qwen3.5-9B.
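
A merged checkpoint like the one described above can be produced with PEFT's `merge_and_unload()`. The following is a minimal sketch, not the script used for this release: the function name and the output directory are hypothetical, and the loading arguments simply mirror the Loading section of this card.

```python
def merge_adapter(
    base_id: str = "Qwen/Qwen3.5-9B",
    adapter_id: str = "benchflow/benchflow-qwen35-9b",
    output_dir: str = "qwen35-9b-general-agent-merged",  # hypothetical path
) -> str:
    """Fold the LoRA weights into the base model and save a standalone checkpoint."""
    # Imported lazily so that defining this function does not require the libraries.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(
        base_id, torch_dtype="auto", trust_remote_code=True
    )
    model = PeftModel.from_pretrained(base, adapter_id)
    # merge_and_unload() adds the scaled low-rank update to each adapted weight
    # and returns the underlying base model without PEFT wrappers.
    merged = model.merge_and_unload()
    merged.save_pretrained(output_dir, safe_serialization=True)
    # Save the base tokenizer alongside so the directory is self-contained.
    AutoTokenizer.from_pretrained(base_id, trust_remote_code=True).save_pretrained(output_dir)
    return output_dir
```

The returned directory holds plain model weights, so a serving stack can load it without dynamic LoRA support.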

Task set	Base pass rate	LoRA SFT pass rate	Delta	Notes
Held-in 5 smoke	`1/5 = 20.00%`	`2/5 = 40.00%`	`+20.00 pp`	First serving/eval smoke
Held-in 20	`11/20 = 55.00%`	`13/20 = 65.00%`	`+10.00 pp`	Recovered `3d_print_shop_t1`, `accounting_firm_t1`
Held-in 36	`20/36 = 55.56%`	`23/36 = 63.89%`	`+8.33 pp`	No regressions; recovered `3d_print_shop_t1`, `accounting_firm_t1`, `allergy_clinic_t0`
Held-in 50 assembled	`27/50 = 54.00%`	`30/50 = 60.00%`	`+6.00 pp`	Latest wider held-in result; final 14-task slice had no net delta
Held-in 50 final 14-task slice	`7/14 = 50.00%`	`7/14 = 50.00%`	`+0.00 pp`	Recovered `animation_studio_t0`; regressed `antiquarian_bookshop_t0`
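
The deltas in the table are differences between pass rates, expressed in percentage points (pp) rather than relative percent. The short sketch below recomputes them from the pass counts and checks that the 50-task row equals the 36-task run plus the final 14-task slice, which is what the counts suggest; `delta_pp` is a hypothetical helper name.

```python
# (passed, total) pairs copied from the table: base first, then LoRA SFT.
RESULTS = {
    "Held-in 5 smoke": ((1, 5), (2, 5)),
    "Held-in 20": ((11, 20), (13, 20)),
    "Held-in 36": ((20, 36), (23, 36)),
    "Held-in 50 assembled": ((27, 50), (30, 50)),
    "Held-in 50 final 14-task slice": ((7, 14), (7, 14)),
}


def delta_pp(base: tuple[int, int], tuned: tuple[int, int]) -> float:
    """Difference between two pass rates, in percentage points."""
    return 100.0 * (tuned[0] / tuned[1] - base[0] / base[1])


for name, (base, tuned) in RESULTS.items():
    print(f"{name}: {delta_pp(base, tuned):+.2f} pp")

# Consistency check: 36-task counts plus 14-task counts give the 50-task counts.
for index in (0, 1):
    held36 = RESULTS["Held-in 36"][index]
    slice14 = RESULTS["Held-in 50 final 14-task slice"][index]
    combined = (held36[0] + slice14[0], held36[1] + slice14[1])
    assert combined == RESULTS["Held-in 50 assembled"][index]
```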

Evaluation artifact prefixes:

Held-in 5 smoke: benchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-qwen35-eval-smoke4096-20260624T152150Z
Held-in 20 comparison: benchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-qwen35-eval-heldin20-compare-20260624
Held-in 36 comparison: benchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-qwen35-eval-heldin36-compare-20260624
Held-in 50 final 14-task run: benchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-qwen35-eval-heldin50-gap-20260624T190517Z

Loading

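from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the full base checkpoint; the adapter repo does not contain base weights.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-9B",
    torch_dtype="auto",
    trust_remote_code=True,
)
# Attach the LoRA adapter from this repo on top of the base model.
model = PeftModel.from_pretrained(base, "benchflow/benchflow-qwen35-9b")
# The tokenizer is loaded from the base checkpoint.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-9B", trust_remote_code=True)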

Caveats

This is an SFT-stage reproduction artifact, not the full Prime paper recipe with the original teacher and student model stack.
The trainable dataset has 4414 rows rather than 4417 because three Azure teacher prompts were blocked by content filtering.
The latest Held-in 50 assembled lift is positive but modest at +6.00 pp; the gains come from a small number of recovered tasks rather than a broad improvement across tasks.
The next QLoRA seq8192 experiment is excluded from v0.0.1 and should receive its own update/tag only after it completes.
