Upload README.md with huggingface_hub

09ef04a verified 3 days ago

8.2 kB

license: apache-2.0
base_model: Qwen/Qwen3.5-9B
tags:
  - qwen3.5
  - code
  - tool-calling
  - lora
  - sft
  - dpo
  - unsloth
  - reasoning
  - chain-of-thought
datasets:
  - nohurry/Opus-4.6-Reasoning-3000x-filtered
  - Roman1111111/claude-opus-4.6-10000x
  - TeichAI/claude-4.5-opus-high-reasoning-250x
  - Jackrong/Qwen3.5-reasoning-700x
  - togethercomputer/CoderForge-Preview
  - TIGER-Lab/AceCode-V2-122K
language:
  - en
pipeline_tag: text-generation

Qwen3.5-DeltaCoder-9B

Reliable tool-calling for agentic coding — LoRA fine-tune of Qwen3.5-9B v1.1-DPO released — DPO alignment improves code correctness and self-verification. If you downloaded before March 28, 2026, please re-pull to get v1.1-DPO.

Small language models can reason about code, but they struggle to call tools reliably. DeltaCoder takes a strong reasoning base and teaches it to produce correctly-formatted JSON tool calls — the kind that coding agents like OpenCode, Pi, and Cline depend on.

v1.1-DPO adds Direct Preference Optimization to further improve code correctness — the model now self-corrects its own bugs rather than submitting wrong answers.

Downloads

Format	Link	Size
GGUF Q4_K_M (recommended)	HuggingFace	~5.5 GB
GGUF Q5_K_M	HuggingFace	~6.5 GB
GGUF BF16	HuggingFace	~17.9 GB
DPO LoRA adapter	HuggingFace	~700 MB

The Problem

Jackrong's Qwen3.5-9B reasoning distill scores 53.7% on HumanEval — best-in-class at 9B. But when used as a coding agent, it frequently produces malformed JSON tool calls:

tool=edit, error=JSON Parse error: Property name must be a string literal
tool=bash, error=JSON Parse error: Expected '}'

DeltaCoder fixes this, and v1.1-DPO further improves code correctness through preference learning.

What's New in v1.1-DPO

Self-correcting behavior — detects and fixes its own bugs during agentic tasks
Improved code correctness — trained on 4,519 preference pairs from AceCode-V2-122K
Two-stage merge — v1 SFT tool-calling improvements + DPO code quality improvements combined
13 GGUF quants — from Q2_K to BF16, covering all VRAM configurations

Training Details

v1 — SFT (Tool-Call Reliability)

Parameter	Value
Base model	Qwen3.5-9B (hybrid GDN architecture)
Method	LoRA (r=64, alpha=32)
Dataset	CoderForge-Preview `filtered_reward1` (50K subset)
Sequence length	4096
Effective batch size	16
Learning rate	1e-4 (cosine)
Epochs	1
Hardware	NVIDIA H200 140GB (Vast.ai)
Training time	~10 hours
Final loss	~0.94

v1.1 — DPO (Code Correctness)

Parameter	Value
Method	DPO (Direct Preference Optimization)
Dataset	AceCode-V2-122K — 4,519 preference pairs
Pair generation	10K problems × 8 samples, keep if ≥1 pass AND ≥1 fail (45% keep rate)
Beta	0.1
Loss type	sigmoid
Learning rate	5e-6 (cosine)
Effective batch size	16
Hardware	NVIDIA H100 80GB (Vast.ai)
Training time	~3.7 hours
Final loss	0.538
Rewards/margins (final)	~1.0
Rewards/accuracies (final)	~80%

LoRA Target Modules

All major weight matrices adapted across the hybrid architecture:

Full Attention (8/32 layers): q_proj, k_proj, v_proj, o_proj
Gated Delta Net (24/32 layers): in_proj_qkv, in_proj_z, in_proj_b, in_proj_a, out_proj
MLP (all 32 layers): gate_proj, up_proj, down_proj

Usage

Ollama

ollama create deltacoder -f Modelfile

llama.cpp / ik_llama.cpp

./llama-server -m DeltaCoder-9B-v1.1-DPO-Q5_K_M.gguf -ngl 999 -c 131072 -ctk f16 -ctv q4_0 -fa 1 --jinja

With PEFT (Python)

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base = AutoModelForCausalLM.from_pretrained(
    "Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, "danielcherubini/Qwen3.5-DeltaCoder-9B")
tokenizer = AutoTokenizer.from_pretrained("danielcherubini/Qwen3.5-DeltaCoder-9B")

Benchmarks

Model	HumanEval	HumanEval+	Terminal-Bench Easy
Jackrong Qwen3.5-9B-v2 (base)	53.7%	—	—
DeltaCoder-9B v1 (temp=0.6)	50.6%	49.4%	2/4 (50%)
DeltaCoder-9B v1.1-DPO (temp=0.6)	TBD	TBD	2/4 (50%)*

*v1.1-DPO timed out on 2 tasks that v1 answered incorrectly — behavioral improvement confirmed, re-evaluating with extended timeout.

Recommended Sampling Settings

Parameter	Value
temperature	0.6
top_k	20
top_p	0.95
min_p	0.0
presence_penalty	0.0
repeat_penalty	1.0

Do not use temperature below 0.5 — low temperatures cause deterministic looping in multi-turn agentic use.

KV Cache Quantization

Context Length	KV Cache	VRAM (Q4_K_M)	Generation Speed
102,400	f16/q4_0	~8.5 GB	~111 tok/s
131,072	f16/q4_0	~9.1 GB	~110 tok/s

Key Findings

Qwen3.5 is a VLM — Unsloth treats it as a vision model. For text-only DPO training, use standard HuggingFace + PEFT + TRL directly (no Unsloth DPOTrainer).

Do not use flash_attention_2 with sample packing on Qwen3.5 — training loss goes to 0. Use attn_implementation="eager" instead.

Qwen3.5 uses Gated Delta Networks — include in_proj_qkv, in_proj_z, in_proj_b, in_proj_a, out_proj in LoRA target modules or 75% of attention layers are untrained
DPO pairs generated on-policy using Qwen/Qwen3.5-9B base with vLLM async inference (32 concurrent requests)
Keep rate of 45.2% from 10K AceCode problems (4,519 pairs used for training)

Project Structure

scripts/
  train_unsloth.py          # v1 SFT training
  train_dpo.py              # v1.1 DPO training (HF + PEFT + TRL)
  generate_dpo_pairs.py     # Async on-policy pair generation
  merge_and_export_dpo.py   # Two-stage merge + GGUF export

Status

v1 SFT fine-tune (CoderForge, H200, ~10hrs)
GGUF export (all quants Q2_K → BF16)
HumanEval benchmarking (50.6% / 49.4%)
Terminal-Bench evaluation (2/4 easy tasks)
DPO pair generation (4,519 pairs from AceCode-V2-122K)
v1.1-DPO training (H100, ~3.7hrs)
v1.1-DPO GGUF export + HuggingFace release
v1.1-DPO HumanEval benchmarking
v1.1-DPO Terminal-Bench extended timeout evaluation

Acknowledgements

Unsloth for Qwen3.5 SFT training support
Together AI for the CoderForge dataset
TIGER Lab for AceCode-V2-122K
Jackrong for the reasoning distillation
Qwen for the base model