Qwen3.5-9B SDFT

This is a merged Qwen3.5-9B model fine-tuned with Self-Distillation Fine-Tuning (SDFT) on agentic coding and tool-use traces from Claude Fable 5.

The training method follows the paper "Self-Distillation Enables Continual Learning" by Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal.

What SDFT Does

SDFT uses one model in two roles:

  • Student: the trainable model, prompted only with the conversation so far.
  • Teacher: the same base model with training adapters disabled, prompted with the conversation plus an in-context expert reference response.

The student samples its own response first. The teacher then scores that same sampled response token by token, but from the stronger prompt that includes the expert demonstration. Training minimizes the reverse KL divergence between the student's next-token distribution and the demonstration-conditioned teacher distribution at each sampled position.

                    expert response c
                            |
                            v
conversation x ----> teacher prompt: x + c ----> frozen base model
       |                                             |
       |                                             v
       +---------> student prompt: x             teacher logits over y
                         |                           |
                         v                           |
                 trainable student                   |
                         |                           |
                         v                           |
                sampled response y                   |
                         |                           |
                         v                           v
          reverse KL(student logits || teacher logits)

In one update:

1. Sample y from the current student:
      y ~ pi_theta(. | conversation)

2. Score each sampled token with two distributions:
      student: pi_theta(. | conversation, y_<t)
      teacher: pi_0(. | conversation, expert_reference, y_<t)

3. Train the student toward the teacher on the sampled trajectory:
      loss = KL(student_t || teacher_t), aggregated over the rollout tokens t,
      where student_t and teacher_t are the two distributions from step 2
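
The three steps above can be sketched on toy numbers. This is a minimal illustration, not the training script: plain Python lists stand in for logit tensors, the helper names (`softmax`, `reverse_kl`, `sdft_loss`) are made up for this example, and averaging over positions is an assumption, since the card only says the loss is taken over the rollout tokens.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one rollout position."""
    p = softmax(student_logits)  # pi_theta(. | conversation, y_<t)
    q = softmax(teacher_logits)  # pi_0(. | conversation, expert_reference, y_<t)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0.0)

def sdft_loss(student_per_token, teacher_per_token):
    """Mean reverse KL over rollout positions (the mean is an assumption)."""
    kls = [reverse_kl(s, t) for s, t in zip(student_per_token, teacher_per_token)]
    return sum(kls) / len(kls)

# Toy rollout: two sampled tokens, vocabulary of three entries.
student = [[2.0, 0.5, 0.1], [0.3, 0.3, 0.3]]
teacher = [[2.0, 0.5, 0.1], [1.5, 0.2, 0.1]]
loss = sdft_loss(student, teacher)
```

At the first position the two distributions agree, so that token contributes nothing; only the second position, where the teacher is more confident than the student, produces a gradient signal.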

SDFT vs. SFT

Supervised fine-tuning (SFT) trains on fixed expert-written tokens. That is off-policy: the gradient is computed on a sequence the current model may not have produced itself.

SFT:
  conversation x + expert tokens y*
          |
          v
  cross entropy: -log pi_theta(y* | x)
          |
          v
  off-policy learning on fixed demonstrations
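
For contrast, the SFT objective can be written in a few lines. This is an illustrative sketch only: the probabilities are invented values for what the current model might assign to each expert token, and `sft_cross_entropy` is a name chosen for this example.

```python
import math

def sft_cross_entropy(token_probs):
    """Sum of -log pi_theta(y*_t | x, y*_<t) over the expert tokens."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical probabilities the current model gives each expert token y*_t.
# The 0.02 entry marks a token the model would rarely have produced itself.
expert_token_probs = [0.9, 0.6, 0.02, 0.8]

loss = sft_cross_entropy(expert_token_probs)
per_token = [-math.log(p) for p in expert_token_probs]
```

The token the model finds unlikely contributes the largest share of the loss, which is the off-policy effect described above: the update is driven by positions on a trajectory the model did not choose.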

SDFT trains on the model's own sampled tokens. That is on-policy: the update is attached to the current model's actual trajectory, while the teacher prompt uses the expert demonstration to shape the target distribution.

SDFT:
  conversation x ---> current model samples y
          |                    |
          |                    v
          +---- expert c ---> teacher scores y
                               |
                               v
          on-policy distillation on the student's own rollout

This run uses lambda_on_policy = 1.0, so all training examples are on-policy. There is no plain next-token cross-entropy SFT objective in this run.
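
One way to read `lambda_on_policy` is as the per-example probability of using the on-policy objective. That interpretation is an assumption made for this sketch, as is the `pick_objective` helper; the card only states the value and its effect.

```python
import random

def pick_objective(lambda_on_policy, rng):
    """Return "sdft" with probability lambda_on_policy, otherwise "sft"."""
    # rng.random() is in [0.0, 1.0), so 1.0 always selects "sdft".
    return "sdft" if rng.random() < lambda_on_policy else "sft"

rng = random.Random(0)
choices = [pick_objective(1.0, rng) for _ in range(1000)]
```

Under this reading, a value of 1.0 never selects the cross-entropy branch, which matches the statement that the run has no plain SFT objective.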

Model Details

  • Base model: unsloth/Qwen3.5-9B
  • Final artifact: merged bf16 model, not a standalone PEFT adapter
  • Task shape: long-context assistant responses for coding-agent and tool-use traces
  • Training method: Self-Distillation Fine-Tuning with reverse KL
  • Context target: 65,536 tokens
  • Prompt cap: 57,344 tokens
  • Rollout cap: 8,192 new tokens
  • Training data: 2,693 filtered SDFT examples derived from armand0e/claude-fable-5-claude-code
  • Reasoning traces: private/internal reasoning fields are not included in the teacher reference
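
The three token limits above are consistent with each other: the prompt cap plus the rollout cap equals the context target. A small sketch of that budget follows; the constant and function names are illustrative, and raising on an over-long prompt is an assumption, since the card does not say whether such examples are truncated or filtered out.

```python
CONTEXT_TARGET = 65_536  # total tokens per training sequence
PROMPT_CAP = 57_344      # maximum prompt tokens
ROLLOUT_CAP = 8_192      # maximum newly sampled tokens

# The two caps exactly fill the context target.
assert PROMPT_CAP + ROLLOUT_CAP == CONTEXT_TARGET

def rollout_budget(prompt_tokens):
    """Return how many new tokens may be sampled for a prompt of this length."""
    if prompt_tokens > PROMPT_CAP:
        # Assumption: over-long prompts are rejected rather than truncated.
        raise ValueError(f"prompt has {prompt_tokens} tokens, cap is {PROMPT_CAP}")
    return min(ROLLOUT_CAP, CONTEXT_TARGET - prompt_tokens)
```

Because the prompt is capped first, the rollout cap is always the binding limit: even a maximum-length prompt leaves room for the full 8,192 new tokens.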

Training Data

The examples are per-assistant-turn records from agentic coding traces. Each record contains:

  • the conversation context before an assistant turn
  • the matching expert assistant turn
  • optional tool schemas used to render tool calls through the chat template

During SDFT, the expert turn is injected into the teacher prompt inside an <expert_reference> block. The student does not see that block when it samples its response.
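
The difference between the two prompts can be sketched as follows. The record field names (`context`, `expert_turn`), the message role used for the reference, and the closing `</expert_reference>` tag are all assumptions for illustration; the card only states that the expert turn is placed inside an `<expert_reference>` block in the teacher prompt.

```python
def build_student_messages(record):
    """Student prompt: the conversation so far, nothing else."""
    return list(record["context"])

def build_teacher_messages(record):
    """Teacher prompt: the same conversation plus the expert reference block."""
    reference = (
        "<expert_reference>\n"
        + record["expert_turn"]
        + "\n</expert_reference>"
    )
    # Assumption: the block is appended as one extra message.
    return list(record["context"]) + [{"role": "user", "content": reference}]

# Hypothetical per-assistant-turn record.
record = {
    "context": [{"role": "user", "content": "Fix the failing test in utils.py."}],
    "expert_turn": "I'll read utils.py first, then run the test suite.",
}

student_messages = build_student_messages(record)
teacher_messages = build_teacher_messages(record)
```

Both lists share the same conversation prefix, so the only information gap between student and teacher is the expert turn itself.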

Training Procedure

The Colab training profile used:

Setting                      Value
Base checkpoint              unsloth/Qwen3.5-9B
Max sequence length          65536
Max teacher prompt tokens    57344
Max rollout tokens           8192
Optimizer steps              600
Batch size                   1
Learning rate                1.0e-5
Warmup steps                 20
Weight decay                 0.0
LoRA rank                    64
LoRA alpha                   128
LoRA dropout                 0.0
Distillation loss            reverse KL
KL temperature               1.0
Rollout temperature          0.8
Rollout top-p                0.95
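
For reference, the same settings can be held in a small config object. This is a sketch, not the training script's actual config: the class and field names are invented here, and the derived `lora_scale` assumes standard LoRA scaling (alpha divided by rank) rather than a rank-stabilized variant.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SDFTProfile:
    """Values copied from the table above; names are illustrative."""
    base_checkpoint: str = "unsloth/Qwen3.5-9B"
    max_seq_length: int = 65_536
    max_teacher_prompt_tokens: int = 57_344
    max_rollout_tokens: int = 8_192
    optimizer_steps: int = 600
    batch_size: int = 1
    learning_rate: float = 1.0e-5
    warmup_steps: int = 20
    weight_decay: float = 0.0
    lora_rank: int = 64
    lora_alpha: int = 128
    lora_dropout: float = 0.0
    kl_temperature: float = 1.0
    rollout_temperature: float = 0.8
    rollout_top_p: float = 0.95

    @property
    def lora_scale(self):
        """Adapter scaling factor under standard LoRA: alpha / rank."""
        return self.lora_alpha / self.lora_rank

    @property
    def warmup_fraction(self):
        """Share of optimizer steps spent in warmup."""
        return self.warmup_steps / self.optimizer_steps

profile = SDFTProfile()
```

With these values the adapter update is scaled by 2.0, and warmup covers 20 of the 600 optimizer steps.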

LoRA targets only language-trunk modules:

q_proj, k_proj, v_proj, o_proj,
gate_proj, up_proj, down_proj,
in_proj_qkv, in_proj_z, out_proj

Vision modules are not LoRA targets in the training script, so the visual tower is not adapted by this text-only run.
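
A name-based filter is one way such targeting could be expressed. The sketch below is an assumption about mechanism, not a copy of the training script: the module paths and the vision prefixes are hypothetical, and only the ten suffixes come from the list above.

```python
LORA_TARGETS = {
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
    "in_proj_qkv", "in_proj_z", "out_proj",
}

def is_lora_target(module_name, vision_prefixes=("visual.", "model.visual.")):
    """True if the module's last path component is a target and it is not a vision module."""
    if module_name.startswith(vision_prefixes):
        return False  # hypothetical prefixes for the visual tower
    return module_name.rsplit(".", 1)[-1] in LORA_TARGETS

# Hypothetical module paths, for illustration only.
names = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.mlp.down_proj",
    "model.language_model.layers.1.linear_attn.in_proj_qkv",
    "model.visual.blocks.0.attn.out_proj",
    "model.language_model.embed_tokens",
]
selected = [n for n in names if is_lora_target(n)]
```

The prefix check matters because a generic suffix could in principle also occur inside a vision block; matching on suffix alone would then adapt modules the run intends to leave untouched.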

How to Use

import torch
from transformers import AutoTokenizer

try:
    from transformers import AutoModelForMultimodalLM as AutoModel
except ImportError:
    from transformers import AutoModelForCausalLM as AutoModel

# Placeholder repo id: replace with the published model id or a local path.
model_id = "your-name/qwen35-9b-64k-sdft"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Write a small Python function that validates an email address."}
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.95,
        do_sample=True,
    )

print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

Limitations

  • The model is trained from recorded traces, so it can inherit errors, assumptions, and style from those traces.
  • SDFT is on-policy per assistant turn, but the surrounding environment feedback still comes from the recorded expert trajectory. Training does not re-execute tools or sandboxes, so the model never sees the consequences of its own tool calls.
  • Tool calls are generated as model outputs. Downstream systems should validate tool names, arguments, permissions, and side effects before execution.
  • The run is text/tool-call focused. Multimodal behavior should be validated separately before relying on it.
  • This is not a safety-tuned or policy-aligned model. Do not use it for high-stakes decisions without additional evaluation and safeguards.
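
The tool-call validation recommended above might look like the following. This is a minimal sketch under stated assumptions: the tool names, the allowlist layout, and the JSON shape with `name` and `arguments` keys are all hypothetical, and a real deployment would also need permission and side-effect checks that are out of scope here.

```python
import json

# Hypothetical allowlist: tool name -> required and optional argument names.
ALLOWED_TOOLS = {
    "read_file": {"required": {"path"}, "optional": set()},
    "run_tests": {"required": set(), "optional": {"pattern"}},
}

def validate_tool_call(raw_call):
    """Return (ok, reason) for a JSON-encoded tool call produced by the model."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(call, dict):
        return False, "call is not a JSON object"
    name = call.get("name")
    args = call.get("arguments")
    if not isinstance(name, str) or name not in ALLOWED_TOOLS:
        return False, f"unknown tool: {name!r}"
    if not isinstance(args, dict):
        return False, "arguments must be an object"
    spec = ALLOWED_TOOLS[name]
    missing = spec["required"] - set(args)
    extra = set(args) - spec["required"] - spec["optional"]
    if missing:
        return False, f"missing arguments: {sorted(missing)}"
    if extra:
        return False, f"unexpected arguments: {sorted(extra)}"
    return True, "ok"

ok_result = validate_tool_call('{"name": "read_file", "arguments": {"path": "utils.py"}}')
bad_result = validate_tool_call('{"name": "delete_repo", "arguments": {}}')
```

Rejecting by default and returning a reason string keeps the executor simple: anything that is not explicitly allowed is never run, and the reason can be fed back to the model as an error message.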

Citation

If you use or discuss the training method, cite the SDFT paper:

@misc{shenfeld2026selfdistillationenablescontinuallearning,
  title = {Self-Distillation Enables Continual Learning},
  author = {Shenfeld, Idan and Damani, Mehul and H{\"u}botter, Jonas and Agrawal, Pulkit},
  year = {2026},
  eprint = {2601.19897},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG},
  url = {https://arxiv.org/abs/2601.19897}
}