Qwen3.5-9B General-Agent SFT LoRA Adapter

v0.0.1 is the completed LoRA SFT release for the Prime general-agent reproduction using the full, non-prequantized Qwen/Qwen3.5-9B base checkpoint. It does not include the base weights; load this adapter on top of Qwen/Qwen3.5-9B.

This release intentionally excludes the next QLoRA run that is currently in progress. That run will be documented and tagged separately after its training and eval finish.

Release Summary

| Field | Value |
| --- | --- |
| Release tag | v0.0.1 |
| Adapter repo | benchflow/benchflow-qwen35-9b |
| Base checkpoint | Qwen/Qwen3.5-9B |
| Base checkpoint form | Full, non-quantized source checkpoint; frozen during LoRA SFT |
| Adapter type | LoRA / PEFT |
| Source completed run | general-agent-qwen35-9b-sft-seq2048-fresh-20260624T131847Z |
| W&B project | general-agent-qwen35-9b-sft-seq2048-fresh-20260624T131847Z |
| HF training artifacts | benchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-qwen35-9b-sft-seq2048-fresh-20260624T131847Z |
| Published at | 2026-06-24 22:27:07 UTC |

Research Reproduction Scope

The goal of this adapter is to reproduce the SFT-stage lift from Prime Intellect's general-agent work as closely as possible while using a smaller student model that can train on one H100. The stack keeps the Prime-style task and verifier path:

  • Source tasks: open-source PrimeIntellect-ai/research-environments/environments/general_agent task corpus.
  • Teacher trace generation: general-agent-solver-rlm + Azure GPT-5.4-mini through native Verifiers / vf-eval --save-results artifacts.
  • SFT trainer: Prime-RL SFT.
  • Student: full, non-quantized Qwen/Qwen3.5-9B loaded in BF16 with LoRA adapters.
  • Eval: general-agent-solver-local through native vf-eval --save-results on the same held-in task sets before and after SFT.

Data Recipe

| Field | Value |
| --- | --- |
| Dataset | benchflow/general-agent-qwen35-9b-azure-gpt54mini-sft |
| Dataset rows | 4414 |
| Original source task count | 4417 |
| Teacher model | Azure GPT-5.4-mini |
| Teacher harness | Prime/Verifiers general-agent-solver-rlm |
| Artifact format | Native vf-eval --save-results trajectories converted to Prime-RL messages + tool_defs SFT rows |
| Excluded source tasks | dog_breeding_t1, skydiving_center_t1, skydiving_center_t2 |
| Exclusion reason | Stable Azure content-filter blocks during teacher trace generation |
| Full teacher sweep artifact | benchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-daytona-teacher-full4417-tunnel8-20260624T015706Z |
| Data validation | Prime SFT JSONL validator rejected non-leading system messages and leakage fields before training |
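
The two checks named in the data validation row can be illustrated with a short sketch. This is not the Prime SFT validator itself: the row layout (a `messages` list per JSONL row) follows the artifact format listed above, while `LEAKAGE_FIELDS`, `validate_row`, and `validate_jsonl` are hypothetical names chosen for illustration, since the card does not list the real leakage fields.

```python
import json

# Hypothetical field names that would leak verifier or answer data into a
# training row; the real validator's list is not given in this card.
LEAKAGE_FIELDS = {"answer", "reward", "info"}


def validate_row(row: dict) -> list[str]:
    """Return the problems found in one SFT row (an empty list means valid)."""
    messages = row.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages'"]
    problems = []
    for index, message in enumerate(messages):
        # A system message is only allowed as the very first message.
        if message.get("role") == "system" and index != 0:
            problems.append(f"non-leading system message at index {index}")
    leaked = sorted(LEAKAGE_FIELDS.intersection(row))
    if leaked:
        problems.append(f"leakage fields present: {leaked}")
    return problems


def validate_jsonl(path: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to problems for every invalid row in a JSONL file."""
    bad = {}
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # skip blank lines
            problems = validate_row(json.loads(line))
            if problems:
                bad[line_number] = problems
    return bad
```

Running `validate_jsonl` on a training file before launching SFT and refusing to start when the returned mapping is non-empty mirrors the "rejected before training" behavior described in the table.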

Training Parameters

| Field | Value |
| --- | --- |
| Trainer | Prime-RL SFT |
| Model loaded for SFT | Qwen/Qwen3.5-9B full BF16 base weights |
| Quantization | None for the completed v0.0.1 LoRA run |
| Adapter | LoRA |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable params | about 29.1M |
| Adapted base params | about 5.30B |
| Total base params loaded | about 9.44B |
| Sequence length | 2048 |
| Global batch size | 8 |
| Micro batch size | 1 |
| Pack function | cat |
| Shuffle | true |
| Seed | 0 |
| Optimizer | AdamW |
| Learning rate | 5e-5 |
| Weight decay | 0.01 |
| Betas | 0.9, 0.999 |
| Grad norm clip | 1.0 |
| Scheduler | Linear |
| Warmup steps | 20 |
| Decay steps | 180 |
| Minimum LR | 0.0 |
| Max steps | 200 |
| Checkpoint interval | 20 |
| Keep last | 3 |
| Keep interval | 100 |
| Save format | safetensors |
| Loss mask | Assistant messages only; system, user, and tool messages are context-only |
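
The scheduler and batch rows above determine the learning-rate curve and the per-step token budget. The sketch below works through that arithmetic under two assumptions that are not confirmed by this card: the linear scheduler ramps from 0 to the peak over the warmup steps and then decays linearly to the minimum LR, and the run used a single data-parallel rank, so the global batch is reached through gradient accumulation. Prime-RL's exact step indexing may differ by one step.

```python
PEAK_LR = 5e-5
MIN_LR = 0.0
WARMUP_STEPS = 20
DECAY_STEPS = 180
MAX_STEPS = 200
GLOBAL_BATCH = 8
MICRO_BATCH = 1
SEQ_LEN = 2048


def learning_rate(step: int) -> float:
    """Learning rate after `step` completed optimizer steps (assumed schedule)."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 up to PEAK_LR.
        return PEAK_LR * step / WARMUP_STEPS
    # Linear decay from PEAK_LR to MIN_LR over DECAY_STEPS steps, then flat.
    progress = min((step - WARMUP_STEPS) / DECAY_STEPS, 1.0)
    return PEAK_LR + (MIN_LR - PEAK_LR) * progress


# Assumption: one data-parallel rank, so 8 micro batches are accumulated per step.
grad_accum_steps = GLOBAL_BATCH // MICRO_BATCH

# Upper bound on training tokens, assuming every packed sample fills SEQ_LEN.
max_tokens_per_step = GLOBAL_BATCH * SEQ_LEN
max_tokens_total = max_tokens_per_step * MAX_STEPS

if __name__ == "__main__":
    for step in (0, 10, 20, 110, 200):
        print(f"step {step:>3}: lr = {learning_rate(step):.2e}")
    print("gradient accumulation steps:", grad_accum_steps)
    print("max tokens per step:", max_tokens_per_step)
    print("max tokens over the run:", max_tokens_total)
```

Note that warmup plus decay (20 + 180) equals the 200 max steps, so under this reading the learning rate reaches the minimum exactly at the end of the run.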

Training Result

| Metric | Value |
| --- | --- |
| Completed step | 200 |
| Final loss | 0.11897 |
| loss/nan_count | 0 |
| Peak GPU memory | about 40.8 GiB |
| Final adapter | adapter_model.safetensors in this repo |

The initial Prime-RL BF16 LoRA attempt with data.seq_len=8192 ran out of GPU memory (OOM) on one H100. The completed v0.0.1 run used data.seq_len=2048, system CUDA 12.8 nvcc/ptxas, and g++-12 for the required FLA/TileLang kernels.

Evaluation Results

All evaluations below use native Verifiers vf-eval --save-results, general-agent-solver-local, serving context length 4096, --enable-auto-tool-choice, and --tool-call-parser qwen3_xml. Dynamic vLLM LoRA loading was not reliable for this stack, so eval served a merged local checkpoint built from this adapter plus Qwen/Qwen3.5-9B.

| Task set | Base pass rate | LoRA SFT pass rate | Delta | Notes |
| --- | --- | --- | --- | --- |
| Held-in 5 smoke | 1/5 = 20.00% | 2/5 = 40.00% | +20.00 pp | First serving/eval smoke |
| Held-in 20 | 11/20 = 55.00% | 13/20 = 65.00% | +10.00 pp | Recovered 3d_print_shop_t1, accounting_firm_t1 |
| Held-in 36 | 20/36 = 55.56% | 23/36 = 63.89% | +8.33 pp | No regressions; recovered 3d_print_shop_t1, accounting_firm_t1, allergy_clinic_t0 |
| Held-in 50 assembled | 27/50 = 54.00% | 30/50 = 60.00% | +6.00 pp | Latest wider held-in result; final 14-task slice had no net delta |
| Held-in 50 final 14-task slice | 7/14 = 50.00% | 7/14 = 50.00% | +0.00 pp | Recovered animation_studio_t0; regressed antiquarian_bookshop_t0 |
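
The deltas in the table above are absolute differences between pass rates, measured in percentage points. As a quick check, the following sketch recomputes them from the pass counts in the table; `RESULTS` and `delta_pp` are illustrative names, not part of the evaluation tooling.

```python
from fractions import Fraction

# (task set, base passes, LoRA SFT passes, task count) copied from the table.
RESULTS = [
    ("Held-in 5 smoke", 1, 2, 5),
    ("Held-in 20", 11, 13, 20),
    ("Held-in 36", 20, 23, 36),
    ("Held-in 50 assembled", 27, 30, 50),
    ("Held-in 50 final 14-task slice", 7, 7, 14),
]


def delta_pp(base_passes: int, sft_passes: int, total: int) -> float:
    """Absolute change in pass rate, in percentage points."""
    return float(Fraction(sft_passes - base_passes, total) * 100)


for name, base, sft, total in RESULTS:
    print(
        f"{name}: {100 * base / total:.2f}% -> {100 * sft / total:.2f}% "
        f"({delta_pp(base, sft, total):+.2f} pp)"
    )

# The assembled 50-task row is the 36-task row plus the final 14-task slice.
assert (20 + 7, 23 + 7, 36 + 14) == (27, 30, 50)
```

Percentage points differ from relative change: on the 36-task set, 20 to 23 passes is +8.33 pp but a 15% relative increase in passes (3/20).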

Evaluation artifact prefixes:

  • Held-in 5 smoke: benchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-qwen35-eval-smoke4096-20260624T152150Z
  • Held-in 20 comparison: benchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-qwen35-eval-heldin20-compare-20260624
  • Held-in 36 comparison: benchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-qwen35-eval-heldin36-compare-20260624
  • Held-in 50 final 14-task run: benchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-qwen35-eval-heldin50-gap-20260624T190517Z

Loading

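from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the full, non-quantized base checkpoint; this repo does not include base weights.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-9B",
    torch_dtype="auto",
    trust_remote_code=True,
)
# Attach the LoRA adapter from this repo on top of the base model.
model = PeftModel.from_pretrained(base, "benchflow/benchflow-qwen35-9b")
# The tokenizer is loaded from the base checkpoint.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-9B", trust_remote_code=True)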
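
The evaluation section notes that eval served a merged local checkpoint instead of loading the adapter dynamically. A minimal sketch of one way to build such a checkpoint with PEFT's `merge_and_unload` is shown below. It is not the script used for the reported evaluations; the function name `merge_adapter` and the `output_dir` argument are hypothetical, and loading in BF16 is an assumption based on the training setup.

```python
def merge_adapter(
    output_dir: str,
    base_id: str = "Qwen/Qwen3.5-9B",
    adapter_id: str = "benchflow/benchflow-qwen35-9b",
) -> None:
    """Fold the LoRA weights into the base model and save a standalone checkpoint."""
    # Imported here so the function can be defined without the heavy dependencies.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(
        base_id,
        torch_dtype=torch.bfloat16,  # assumption: match the BF16 training dtype
        trust_remote_code=True,
    )
    model = PeftModel.from_pretrained(base, adapter_id)
    # merge_and_unload() returns the base model with the LoRA deltas applied.
    merged = model.merge_and_unload()
    merged.save_pretrained(output_dir, safe_serialization=True)
    # Save the tokenizer alongside so the directory can be served on its own.
    AutoTokenizer.from_pretrained(base_id, trust_remote_code=True).save_pretrained(output_dir)
```

Calling `merge_adapter("merged-checkpoint")` (a placeholder path) downloads both repos and writes a safetensors checkpoint that an inference server can load as a regular model, which avoids runtime LoRA loading.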

Caveats

  • This is an SFT-stage reproduction artifact, not the full Prime paper recipe with the original teacher and student model stack.
  • The trainable dataset has 4414 rows rather than 4417 because three Azure teacher prompts were blocked by content filtering.
  • The latest held-in 50 assembled lift is positive but modest at +6.00 pp (27/50 to 30/50); the gains come from a small number of recovered tasks rather than a broad improvement across the task set.
  • The next QLoRA seq8192 experiment is excluded from v0.0.1 and should receive its own update/tag only after it completes.