The snippets below show how to run armand0e/Qwen3.5-9B-Fable-5-SDFT with Transformers, vLLM, SGLang, and Docker Model Runner.

Libraries

How to use armand0e/Qwen3.5-9B-Fable-5-SDFT with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

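# The messages below contain an image, so the pipeline task must accept image + text chat input.
pipe = pipeline("image-text-to-text", model="armand0e/Qwen3.5-9B-Fable-5-SDFT")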
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("armand0e/Qwen3.5-9B-Fable-5-SDFT")
model = AutoModelForMultimodalLM.from_pretrained("armand0e/Qwen3.5-9B-Fable-5-SDFT")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use armand0e/Qwen3.5-9B-Fable-5-SDFT with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "armand0e/Qwen3.5-9B-Fable-5-SDFT"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "armand0e/Qwen3.5-9B-Fable-5-SDFT",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
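
The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the server started above is listening on `localhost:8000`. The helper names `build_chat_request` and `ask` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "armand0e/Qwen3.5-9B-Fable-5-SDFT"


def build_chat_request(prompt, base_url="http://localhost:8000", model=MODEL_ID):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the reply text (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call, commented out because it needs the server to be up:
# print(ask("What is the capital of France?"))
```

Passing a different `base_url` points the same helper at another OpenAI-compatible server on a different port.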

Use Docker

docker model run hf.co/armand0e/Qwen3.5-9B-Fable-5-SDFT

SGLang

How to use armand0e/Qwen3.5-9B-Fable-5-SDFT with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "armand0e/Qwen3.5-9B-Fable-5-SDFT" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "armand0e/Qwen3.5-9B-Fable-5-SDFT",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "armand0e/Qwen3.5-9B-Fable-5-SDFT" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "armand0e/Qwen3.5-9B-Fable-5-SDFT",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use armand0e/Qwen3.5-9B-Fable-5-SDFT with Docker Model Runner:
```
docker model run hf.co/armand0e/Qwen3.5-9B-Fable-5-SDFT
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

Qwen3.5-9B SDFT

This is a merged Qwen3.5-9B model fine-tuned with Self-Distillation Fine-Tuning (SDFT) on agentic coding and tool-use traces from Claude Fable 5.

The training method follows the paper Self-Distillation Enables Continual Learning by Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal.

What SDFT Does

SDFT uses one model in two roles:

Student: the trainable model, prompted only with the conversation so far.
Teacher: the same base model with training adapters disabled, prompted with the conversation plus an in-context expert reference response.

The student samples its own response first. The teacher then scores that same sampled response token by token, but from the stronger prompt that includes the expert demonstration. Training minimizes the reverse KL divergence between the student distribution and the demonstration-conditioned teacher distribution at each sampled token.

                    expert response c
                            |
                            v
conversation x ----> teacher prompt: x + c ----> frozen base model
       |                                             |
       |                                             v
       +---------> student prompt: x ----------> teacher logits over y
                         |
                         v
                 trainable student
                         |
                         v
                sampled response y
                         |
                         v
        reverse KL(student logits || teacher logits)

In one update:

1. Sample y from the current student:
      y ~ pi_theta(. | conversation)

2. Score each sampled token with two distributions:
      student: pi_theta(. | conversation, y_<t)
      teacher: pi_0(. | conversation, expert_reference, y_<t)

3. Train the student toward the teacher on the sampled trajectory:
      loss = KL(pi_theta || pi_0) over the rollout tokens
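
The three steps above can be made concrete on toy numbers. This is a minimal sketch in plain Python, not the training code: the logits are invented, and averaging the per-token KL over the rollout is an assumption, since the card only says the loss is taken "over the rollout tokens".

```python
import math


def log_softmax(logits):
    """Return log-probabilities for a list of logits (numerically stable)."""
    peak = max(logits)
    log_total = math.log(sum(math.exp(v - peak) for v in logits))
    return [v - peak - log_total for v in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position."""
    log_p = log_softmax(student_logits)  # pi_theta(. | conversation, y_<t)
    log_q = log_softmax(teacher_logits)  # pi_0(. | conversation, expert_reference, y_<t)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def rollout_loss(student_steps, teacher_steps):
    """Average the per-token reverse KL over the sampled rollout (assumed reduction)."""
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)


# Toy rollout: two sampled positions over a three-token vocabulary.
student = [[2.0, 0.5, 0.1], [0.3, 0.3, 0.3]]
teacher = [[2.0, 0.5, 0.1], [1.5, 0.2, 0.1]]
print(rollout_loss(student, teacher))
```

At the first position the two distributions agree, so only the second position contributes to the loss. Because the expectation is taken under the student, the penalty falls on probability mass the student places where the teacher places little.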

SDFT vs. SFT

Supervised fine-tuning (SFT) trains on fixed expert-written tokens. That is off-policy: the gradient is computed on a sequence the current model may not have produced itself.

SFT:
  conversation x + expert tokens y*
          |
          v
  cross entropy: -log pi_theta(y* | x)
          |
          v
  off-policy learning on fixed demonstrations

SDFT trains on the model's own sampled tokens. That is on-policy: the update is attached to the current model's actual trajectory, while the teacher prompt uses the expert demonstration to shape the target distribution.

SDFT:
  conversation x ---> current model samples y
          |                    |
          |                    v
          +---- expert c ---> teacher scores y
                               |
                               v
          on-policy distillation on the student's own rollout

This run uses lambda_on_policy = 1.0, so all training examples are on-policy. There is no plain next-token cross-entropy SFT objective in this run.
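
Written side by side, the two objectives differ in which tokens they are evaluated on and what the target is. The following is a sketch in the notation used above (x for the conversation, c for the expert response, y* for the fixed expert tokens, y for the student's own rollout, pi_0 for the frozen base model). Normalization over tokens and the treatment of gradients through sampling are not specified in this card and are left out.

```latex
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\sum_{t} \log \pi_\theta\!\left(y^*_t \mid x,\, y^*_{<t}\right)

\mathcal{L}_{\mathrm{SDFT}}(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \sum_{t} \mathrm{KL}\!\left(
      \pi_\theta(\cdot \mid x,\, y_{<t})
      \,\Vert\,
      \pi_0(\cdot \mid x,\, c,\, y_{<t})
    \right)
```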

Model Details

Base model: unsloth/Qwen3.5-9B
Final artifact: merged bf16 model, not a standalone PEFT adapter
Task shape: long-context assistant responses for coding-agent and tool-use traces
Training method: Self-Distillation Fine-Tuning with reverse KL
Context target: 65,536 tokens
Prompt cap: 57,344 tokens
Rollout cap: 8,192 new tokens
Training data: 2,693 filtered SDFT examples derived from armand0e/claude-fable-5-claude-code
Reasoning traces: private/internal reasoning fields are not included in the teacher reference
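
The prompt cap and rollout cap together account for the whole context target. A small sketch of that arithmetic follows; the helper `fits_budget` is hypothetical and only illustrates how an example could be checked against the two caps.

```python
CONTEXT_TARGET = 65_536  # total tokens per training sequence
PROMPT_CAP = 57_344      # maximum prompt tokens
ROLLOUT_CAP = 8_192      # maximum newly sampled tokens

# The two caps partition the context target exactly.
assert PROMPT_CAP + ROLLOUT_CAP == CONTEXT_TARGET


def fits_budget(prompt_tokens, rollout_tokens):
    """Return True when an example respects both caps (hypothetical helper)."""
    return prompt_tokens <= PROMPT_CAP and rollout_tokens <= ROLLOUT_CAP
```

Any example that passes both checks is guaranteed to fit within the 65,536-token context.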

Training Data

The examples are per-assistant-turn records from agentic coding traces. Each record contains:

the conversation context before an assistant turn
the matching expert assistant turn
optional tool schemas used to render tool calls through the chat template

During SDFT, the expert turn is injected into the teacher prompt inside an <expert_reference> block. The student does not see that block when it samples its response.
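
The difference between the two prompts can be sketched as follows. The record keys (`context`, `expert_turn`), the closing `</expert_reference>` tag, and the choice to append the block as an extra user message are all assumptions made for illustration; the actual training script may lay out the teacher prompt differently.

```python
def build_prompts(record):
    """Return (student_messages, teacher_messages) for one training record.

    `record` uses hypothetical keys: "context" (chat messages before the
    assistant turn) and "expert_turn" (text of the matching expert turn).
    """
    # The student is prompted with the conversation context only.
    student_messages = list(record["context"])

    reference = (
        "<expert_reference>\n"
        f"{record['expert_turn']}\n"
        "</expert_reference>"
    )
    # Assumption: the reference block is appended as one extra message.
    teacher_messages = student_messages + [{"role": "user", "content": reference}]
    return student_messages, teacher_messages


record = {
    "context": [{"role": "user", "content": "Rename foo() to bar() in utils.py."}],
    "expert_turn": "I'll open utils.py and update the definition and its callers.",
}
student_messages, teacher_messages = build_prompts(record)
```

Both prompts share the same context, so the only information advantage the teacher has is the expert turn itself.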

Training Procedure

The Colab training profile used:

| Setting | Value |
| --- | --- |
| Base checkpoint | `unsloth/Qwen3.5-9B` |
| Max sequence length | `65536` |
| Max teacher prompt tokens | `57344` |
| Max rollout tokens | `8192` |
| Optimizer steps | `600` |
| Batch size | `1` |
| Learning rate | `1.0e-5` |
| Warmup steps | `20` |
| Weight decay | `0.0` |
| LoRA rank | `64` |
| LoRA alpha | `128` |
| LoRA dropout | `0.0` |
| Distillation loss | reverse KL |
| KL temperature | `1.0` |
| Rollout temperature | `0.8` |
| Rollout top-p | `0.95` |
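
To illustrate what the rollout temperature and top-p settings control, here is a minimal sketch of temperature scaling followed by nucleus (top-p) filtering for a single token, in plain Python. It follows the usual definition of these parameters; how the training script's generation backend applies them internally is not documented here, and `sample_token` is an illustrative name.

```python
import math
import random


def sample_token(logits, temperature=0.8, top_p=0.95, rng=random):
    """Sample one token id with temperature scaling and nucleus filtering."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    weights = [math.exp(v - peak) for v in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the smallest set of most-likely tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for token_id in ranked:
        kept.append(token_id)
        mass += probs[token_id]
        if mass >= top_p:
            break

    # choices() renormalizes the kept weights before drawing.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

A temperature below 1.0 sharpens the distribution, and the top-p cut removes the low-probability tail, so rollouts stay close to what the student considers likely while still varying between samples.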

LoRA targets only language-trunk modules:

q_proj, k_proj, v_proj, o_proj,
gate_proj, up_proj, down_proj,
in_proj_qkv, in_proj_z, out_proj

Vision modules are not LoRA targets in the training script, so the visual tower is not adapted by this text-only run.
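
The selection rule described above can be sketched as a name filter. The target names come from the list above; the `model.visual.` prefix and the example module paths are hypothetical, so inspect `model.named_modules()` on the real checkpoint before relying on them.

```python
LORA_TARGETS = {
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
    "in_proj_qkv", "in_proj_z", "out_proj",
}


def is_lora_target(module_name, vision_prefix="model.visual."):
    """Decide whether a dotted module path would receive a LoRA adapter.

    `vision_prefix` is a hypothetical name for the visual tower.
    """
    if module_name.startswith(vision_prefix):
        return False  # visual tower modules are never adapted
    # Match on the last path component, e.g. "...self_attn.q_proj" -> "q_proj".
    return module_name.rsplit(".", 1)[-1] in LORA_TARGETS
```

Excluding the vision prefix explicitly matters because a suffix match alone could pick up similarly named projections outside the language trunk.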

How to Use

import torch
from transformers import AutoTokenizer

try:
    from transformers import AutoModelForMultimodalLM as AutoModel
except ImportError:
    from transformers import AutoModelForCausalLM as AutoModel

# Repository id of this model; change it if you load a local copy or a fork.
model_id = "armand0e/Qwen3.5-9B-Fable-5-SDFT"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Write a small Python function that validates an email address."}
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.95,
        do_sample=True,
    )

print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

Limitations

The model is trained from recorded traces, so it can inherit errors, assumptions, and style from those traces.
SDFT is on-policy per assistant turn, but the surrounding environment feedback is still the recorded expert trajectory. It does not replay tools or sandboxes during training.
Tool calls are generated as model outputs. Downstream systems should validate tool names, arguments, permissions, and side effects before execution.
The run is text/tool-call focused. Multimodal behavior should be validated separately before relying on it.
This is not a safety-tuned or policy-aligned model. Do not use it for high-stakes decisions without additional evaluation and safeguards.
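
As one way to act on the tool-call limitation above, a downstream system can check every generated call against an allowlist before executing it. The sketch below assumes the call arrives as a JSON object with `name` and `arguments` fields; that wire format, the tool names, and `validate_tool_call` are all hypothetical.

```python
import json

# Hypothetical allowlist: tool name -> set of permitted argument names.
ALLOWED_TOOLS = {
    "read_file": {"path"},
    "run_tests": {"target"},
}


def validate_tool_call(raw_call):
    """Parse a model-emitted tool call and reject anything outside the allowlist.

    Returns (name, arguments) on success and raises ValueError otherwise.
    """
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError as exc:
        raise ValueError(f"tool call is not valid JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")

    name = call.get("name")
    arguments = call.get("arguments", {})
    if not isinstance(name, str) or name not in ALLOWED_TOOLS:
        raise ValueError(f"unknown or forbidden tool: {name!r}")
    if not isinstance(arguments, dict):
        raise ValueError("arguments must be a JSON object")

    unexpected = set(arguments) - ALLOWED_TOOLS[name]
    if unexpected:
        raise ValueError(f"unexpected arguments for {name}: {sorted(unexpected)}")
    return name, arguments
```

This covers only names and argument keys. Permission checks and side-effect controls (for example, restricting `path` to a sandbox directory) would sit on top of it.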

Citation

If you use or discuss the training method, cite the SDFT paper:

@misc{shenfeld2026selfdistillationenablescontinuallearning,
  title = {Self-Distillation Enables Continual Learning},
  author = {Shenfeld, Idan and Damani, Mehul and H{\"u}botter, Jonas and Agrawal, Pulkit},
  year = {2026},
  eprint = {2601.19897},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG},
  url = {https://arxiv.org/abs/2601.19897}
}