danielcherubini's picture
Upload README.md with huggingface_hub
09ef04a verified
metadata
license: apache-2.0
base_model: Qwen/Qwen3.5-9B
tags:
  - qwen3.5
  - code
  - tool-calling
  - lora
  - sft
  - dpo
  - unsloth
  - reasoning
  - chain-of-thought
datasets:
  - nohurry/Opus-4.6-Reasoning-3000x-filtered
  - Roman1111111/claude-opus-4.6-10000x
  - TeichAI/claude-4.5-opus-high-reasoning-250x
  - Jackrong/Qwen3.5-reasoning-700x
  - togethercomputer/CoderForge-Preview
  - TIGER-Lab/AceCode-V2-122K
language:
  - en
pipeline_tag: text-generation

Qwen3.5-DeltaCoder-9B

Reliable tool-calling for agentic coding β€” LoRA fine-tune of Qwen3.5-9B v1.1-DPO released β€” DPO alignment improves code correctness and self-verification. If you downloaded before March 28, 2026, please re-pull to get v1.1-DPO.

License: Apache 2.0 Base Model HuggingFace LoRA

Small language models can reason about code, but they struggle to call tools reliably. DeltaCoder takes a strong reasoning base and teaches it to produce correctly-formatted JSON tool calls β€” the kind that coding agents like OpenCode, Pi, and Cline depend on.

v1.1-DPO adds Direct Preference Optimization to further improve code correctness β€” the model now self-corrects its own bugs rather than submitting wrong answers.

Downloads

Format Link Size
GGUF Q4_K_M (recommended) HuggingFace ~5.5 GB
GGUF Q5_K_M HuggingFace ~6.5 GB
GGUF BF16 HuggingFace ~17.9 GB
DPO LoRA adapter HuggingFace ~700 MB

The Problem

Jackrong's Qwen3.5-9B reasoning distill scores 53.7% on HumanEval β€” best-in-class at 9B. But when used as a coding agent, it frequently produces malformed JSON tool calls:

tool=edit, error=JSON Parse error: Property name must be a string literal
tool=bash, error=JSON Parse error: Expected '}'

DeltaCoder fixes this, and v1.1-DPO further improves code correctness through preference learning.

What's New in v1.1-DPO

  • Self-correcting behavior β€” detects and fixes its own bugs during agentic tasks
  • Improved code correctness β€” trained on 4,519 preference pairs from AceCode-V2-122K
  • Two-stage merge β€” v1 SFT tool-calling improvements + DPO code quality improvements combined
  • 13 GGUF quants β€” from Q2_K to BF16, covering all VRAM configurations

Training Details

v1 β€” SFT (Tool-Call Reliability)

Parameter Value
Base model Qwen3.5-9B (hybrid GDN architecture)
Method LoRA (r=64, alpha=32)
Dataset CoderForge-Preview filtered_reward1 (50K subset)
Sequence length 4096
Effective batch size 16
Learning rate 1e-4 (cosine)
Epochs 1
Hardware NVIDIA H200 140GB (Vast.ai)
Training time ~10 hours
Final loss ~0.94

v1.1 β€” DPO (Code Correctness)

Parameter Value
Method DPO (Direct Preference Optimization)
Dataset AceCode-V2-122K β€” 4,519 preference pairs
Pair generation 10K problems Γ— 8 samples, keep if β‰₯1 pass AND β‰₯1 fail (45% keep rate)
Beta 0.1
Loss type sigmoid
Learning rate 5e-6 (cosine)
Effective batch size 16
Hardware NVIDIA H100 80GB (Vast.ai)
Training time ~3.7 hours
Final loss 0.538
Rewards/margins (final) ~1.0
Rewards/accuracies (final) ~80%

LoRA Target Modules

All major weight matrices adapted across the hybrid architecture:

  • Full Attention (8/32 layers): q_proj, k_proj, v_proj, o_proj
  • Gated Delta Net (24/32 layers): in_proj_qkv, in_proj_z, in_proj_b, in_proj_a, out_proj
  • MLP (all 32 layers): gate_proj, up_proj, down_proj

Usage

Ollama

ollama create deltacoder -f Modelfile

llama.cpp / ik_llama.cpp

./llama-server -m DeltaCoder-9B-v1.1-DPO-Q5_K_M.gguf -ngl 999 -c 131072 -ctk f16 -ctv q4_0 -fa 1 --jinja

With PEFT (Python)

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base = AutoModelForCausalLM.from_pretrained(
    "Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, "danielcherubini/Qwen3.5-DeltaCoder-9B")
tokenizer = AutoTokenizer.from_pretrained("danielcherubini/Qwen3.5-DeltaCoder-9B")

Benchmarks

Model HumanEval HumanEval+ Terminal-Bench Easy
Jackrong Qwen3.5-9B-v2 (base) 53.7% β€” β€”
DeltaCoder-9B v1 (temp=0.6) 50.6% 49.4% 2/4 (50%)
DeltaCoder-9B v1.1-DPO (temp=0.6) TBD TBD 2/4 (50%)*

*v1.1-DPO timed out on 2 tasks that v1 answered incorrectly β€” behavioral improvement confirmed, re-evaluating with extended timeout.

Recommended Sampling Settings

Parameter Value
temperature 0.6
top_k 20
top_p 0.95
min_p 0.0
presence_penalty 0.0
repeat_penalty 1.0

Do not use temperature below 0.5 β€” low temperatures cause deterministic looping in multi-turn agentic use.

KV Cache Quantization

Context Length KV Cache VRAM (Q4_K_M) Generation Speed
102,400 f16/q4_0 ~8.5 GB ~111 tok/s
131,072 f16/q4_0 ~9.1 GB ~110 tok/s

Key Findings

Qwen3.5 is a VLM β€” Unsloth treats it as a vision model. For text-only DPO training, use standard HuggingFace + PEFT + TRL directly (no Unsloth DPOTrainer).

Do not use flash_attention_2 with sample packing on Qwen3.5 β€” training loss goes to 0. Use attn_implementation="eager" instead.

  • Qwen3.5 uses Gated Delta Networks β€” include in_proj_qkv, in_proj_z, in_proj_b, in_proj_a, out_proj in LoRA target modules or 75% of attention layers are untrained
  • DPO pairs generated on-policy using Qwen/Qwen3.5-9B base with vLLM async inference (32 concurrent requests)
  • Keep rate of 45.2% from 10K AceCode problems (4,519 pairs used for training)

Project Structure

scripts/
  train_unsloth.py          # v1 SFT training
  train_dpo.py              # v1.1 DPO training (HF + PEFT + TRL)
  generate_dpo_pairs.py     # Async on-policy pair generation
  merge_and_export_dpo.py   # Two-stage merge + GGUF export

Status

  • v1 SFT fine-tune (CoderForge, H200, ~10hrs)
  • GGUF export (all quants Q2_K β†’ BF16)
  • HumanEval benchmarking (50.6% / 49.4%)
  • Terminal-Bench evaluation (2/4 easy tasks)
  • DPO pair generation (4,519 pairs from AceCode-V2-122K)
  • v1.1-DPO training (H100, ~3.7hrs)
  • v1.1-DPO GGUF export + HuggingFace release
  • v1.1-DPO HumanEval benchmarking
  • v1.1-DPO Terminal-Bench extended timeout evaluation

Acknowledgements