Qwable-9B-Claude-Fable-5

Developed by Empero

Qwable-9B-Claude-Fable-5 is a full-parameter supervised fine-tune of Qwen/Qwen3.5-9B on a curated mix of agentic coding and reasoning traces. It is a distillation-style fine-tune: the training targets are outputs from other assistants (Claude Fable 5 and a GPT-5.5 terminal agent), teaching the model to imitate their reasoning and tool-use style on long, multi-turn coding and agent tasks.

Early release. Qwable-9B-Claude-Fable-5 brings strong coding and agentic behavior out of the box. A full suite of quantitative benchmarks (coding, agentic, and safety) is underway and will be added to this card; training quality is already backed by held-out validation results (see Evaluation). See Provenance & licensing for licensing notes.

Model details

  • Developed by: Empero
  • Base model: Qwen3.5-9B — a dense, natively multimodal model with a hybrid attention stack (3:1 Gated DeltaNet linear-attention to Gated full-attention), ~152k vocabulary, long native context.
  • Fine-tune type: full parameter (all text-backbone weights trained). The vision tower was frozen — training was text-only, so vision behavior is inherited from the base and was not tuned or tested.
  • Objective: supervised fine-tuning, assistant-only loss (the model is scored only on the assistant/completion tokens; prompts are masked out).
  • Languages: primarily English.
  • License: apache-2.0, inherited from the base weights — but see the data-provenance caveat below.

Training data

| Source | Role | Approx. examples (after holdout) |
|---|---|---|
| Glint-Research/Fable-5-traces | Claude Fable 5 reasoning + coding traces (context → completion) | ~4,585 |
| Roman1111111/gpt5.5-terminal | GPT-5.5 terminal/agent task solutions (system + prompt → solution) | ~111 |

Both sources were normalized to a single chat format (user/assistant, with an optional system turn for the terminal tasks) and concatenated. The natural mix is heavily skewed toward Fable traces (~97%); no re-weighting was applied to the training set.
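
As a rough illustration of that normalization step, the sketch below maps each source onto a list of chat messages. The field names (`context`, `completion`, `system`, `prompt`, `solution`) are assumptions inferred from the role descriptions in the table, the helper names are hypothetical, and a multi-turn Fable context is treated here as a single user turn for brevity; the actual preprocessing may differ.

```python
from typing import Dict, List

Message = Dict[str, str]


def fable_to_chat(row: Dict[str, str]) -> List[Message]:
    """Map a context -> completion trace onto a user/assistant pair."""
    return [
        {"role": "user", "content": row["context"]},
        {"role": "assistant", "content": row["completion"]},
    ]


def terminal_to_chat(row: Dict[str, str]) -> List[Message]:
    """Map a system + prompt -> solution task onto chat turns.

    The system turn is optional and is only emitted when present.
    """
    messages: List[Message] = []
    if row.get("system"):
        messages.append({"role": "system", "content": row["system"]})
    messages.append({"role": "user", "content": row["prompt"]})
    messages.append({"role": "assistant", "content": row["solution"]})
    return messages


# Tiny placeholder rows (not taken from the real datasets).
fable_rows = [{"context": "Fix the failing test.", "completion": "Here is the patch."}]
terminal_rows = [
    {"system": "You are a terminal agent.", "prompt": "List open ports.", "solution": "ss -tulpn"}
]

# Concatenate both sources into one dataset, with no re-weighting.
dataset = [fable_to_chat(r) for r in fable_rows] + [terminal_to_chat(r) for r in terminal_rows]
print([[m["role"] for m in example] for example in dataset])
```

Keeping the system turn optional lets both sources share one schema without inserting empty system messages into the Fable traces.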

Held-out eval split: 100 examples were withheld from training — deliberately composed 80% Fable / 20% terminal so the held-out loss carries signal on both task types rather than being dominated by Fable.
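
A fixed-composition holdout like this can be sketched as sampling a set number of examples per source rather than sampling uniformly from the combined pool. The pool sizes below are back-computed from the approximate counts above (about 4,585 + 80 and 111 + 20), and the seed and function name are hypothetical; this illustrates the idea and is not the script that was used.

```python
import random
from typing import List, Sequence, Tuple


def split_holdout(examples: Sequence, n_holdout: int, seed: int) -> Tuple[List, List]:
    """Shuffle deterministically and withhold the first n_holdout examples."""
    rng = random.Random(seed)
    order = list(range(len(examples)))
    rng.shuffle(order)
    held = [examples[i] for i in order[:n_holdout]]
    train = [examples[i] for i in order[n_holdout:]]
    return train, held


# Placeholder pools standing in for the two normalized sources.
fable = [("fable", i) for i in range(4665)]
terminal = [("terminal", i) for i in range(131)]

# Withhold per source so the eval set is 80% Fable / 20% terminal by construction.
fable_train, fable_eval = split_holdout(fable, 80, seed=0)
term_train, term_eval = split_holdout(terminal, 20, seed=0)

train_set = fable_train + term_train
eval_set = fable_eval + term_eval
print(len(train_set), len(eval_set))
print(f"Fable share of train: {len(fable_train) / len(train_set):.1%}")
```

A uniform random holdout of 100 examples would contain only two or three terminal tasks on average, which is why the composition is fixed per source instead.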

Training procedure

Full-parameter supervised fine-tuning with TRL, using:

  • Full-length traces, zero truncation (max_length = 76,800) — even the longest multi-turn traces (~74k tokens) are trained in full.
  • Assistant-only loss — the model is scored only on assistant/completion tokens; prompt tokens are masked.
  • Chunked cross-entropy for memory-efficient long-context training.
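
To make the last two bullets concrete, here is a minimal pure-Python sketch of an assistant-only negative log-likelihood accumulated chunk by chunk. It assumes the common convention of marking prompt positions with the label -100 and assumes labels are already aligned with the logits; `chunked_masked_nll` is a hypothetical name, not a TRL function. In a real trainer the benefit comes from materializing only one chunk of logits at a time, which plain Python lists cannot show; the point here is that the chunked sum matches the unchunked loss.

```python
import math
from typing import Sequence

IGNORE_INDEX = -100  # conventional "do not score this position" label


def token_nll(logits: Sequence[float], target: int) -> float:
    """Negative log-likelihood of `target` under softmax(logits), computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def chunked_masked_nll(
    logits: Sequence[Sequence[float]], labels: Sequence[int], chunk_size: int
) -> float:
    """Mean NLL over scored (assistant) positions, accumulated one chunk at a time."""
    total, count = 0.0, 0
    for start in range(0, len(labels), chunk_size):
        chunk_logits = logits[start:start + chunk_size]
        chunk_labels = labels[start:start + chunk_size]
        for row, label in zip(chunk_logits, chunk_labels):
            if label == IGNORE_INDEX:
                continue  # prompt token: contributes nothing to the loss
            total += token_nll(row, label)
            count += 1
    return total / count if count else 0.0


# Four positions over a toy 3-token vocabulary; the first two are prompt tokens.
logits = [[0.1, 0.2, 0.3], [1.0, 0.0, -1.0], [0.0, 0.0, 2.0], [1.5, 0.5, 0.0]]
labels = [IGNORE_INDEX, IGNORE_INDEX, 2, 0]

full = chunked_masked_nll(logits, labels, chunk_size=len(labels))
chunked = chunked_masked_nll(logits, labels, chunk_size=1)
print(full, chunked)
```

Because the loss is a sum over independent positions divided by the number of scored tokens, the chunk size changes peak memory but not the value being optimized.
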

| Hyperparameter | Value |
|---|---|
| Epochs | 2 |
| Effective batch size | 16 |
| Max sequence length | 76,800 (no truncation) |
| Learning rate | 1e-5 (cosine, 3% warmup) |
| Optimizer | AdamW (8-bit) |
| Precision | bf16 |
| Loss | chunked NLL, assistant-only |
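
For a sense of scale, the sketch below works out the approximate number of optimizer steps implied by these settings and evaluates a cosine schedule with 3% linear warmup. The schedule shape (linear warmup from zero, cosine decay to zero) and the rounding choices are assumptions, and `lr_at` is a hypothetical helper; the trainer's exact step count can differ slightly.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 1e-5, warmup_ratio: float = 0.03) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to zero (assumed shape)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


train_examples = 4585 + 111                        # approximate post-holdout counts
steps_per_epoch = math.ceil(train_examples / 16)   # effective batch size 16
total_steps = 2 * steps_per_epoch                  # 2 epochs

print(steps_per_epoch, total_steps)
for step in (0, 18, 294, total_steps):
    print(step, f"{lr_at(step, total_steps):.2e}")
```

Under these assumptions an epoch is roughly 294 steps, which lines up with the evaluation checkpoints being logged every 100 steps and step 300 being labeled as approximately one epoch.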

Evaluation

Training quality was tracked via held-out validation loss and token-accuracy on a 100-example split and supplemented with a qualitative generation review (below). A full suite of coding, agentic, and safety benchmarks is in progress and will be published here. Validation was run periodically during training:

| Step | Eval loss | Eval token-acc |
|---|---|---|
| 100 | 0.743 | 0.784 |
| 200 | 0.722 | 0.789 |
| 300 (≈ epoch 1) | 0.714 | 0.791 |
| 400 | 0.7135 | 0.791 |
| 500 | 0.713 | 0.791 |

No overfitting observed. Held-out loss decreased monotonically and then plateaued (~0.71) through the second epoch — it never rose, even as train loss fell to ~0.64. Epoch-1 and final (epoch-2) checkpoints generalize equivalently on held-out data.

Note: token-accuracy is teacher-forced, per-token next-token accuracy over completion tokens only. It is not end-to-end correctness and tends to read high on consistent-style distillation data.
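
The metric described in the note can be sketched in a few lines. The example assumes prompt positions carry the label -100 and that `predicted_ids` are the argmax predictions made with the reference prefix fed in at every step (teacher forcing); the function name and the toy token ids are hypothetical.

```python
from typing import Sequence

IGNORE_INDEX = -100  # prompt positions excluded from scoring


def token_accuracy(predicted_ids: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of completion tokens whose top-1 prediction equals the reference token."""
    scored = [(p, y) for p, y in zip(predicted_ids, labels) if y != IGNORE_INDEX]
    if not scored:
        return 0.0
    return sum(p == y for p, y in scored) / len(scored)


# Two prompt positions (ignored) followed by four completion tokens, one of them wrong.
labels = [IGNORE_INDEX, IGNORE_INDEX, 11, 12, 13, 14]
predicted = [7, 8, 11, 12, 99, 14]

acc = token_accuracy(predicted, labels)
print(acc)  # 3 of 4 scored tokens match -> 0.75
```

Since every position is predicted from the reference prefix, a single wrong token does not derail the tokens after it, which is one reason this number reads higher than end-to-end task success.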

Qualitative generation review

34 prompts spanning coding, terminal/agentic tasks, reasoning, explanation, instruction-following, and honesty/calibration probes were run against the final checkpoint using Qwen3.5's recommended sampling settings. Full unedited transcripts are in sample_generations.md.

Strengths. Coding and terminal/agentic prompts were the strongest — correct, idiomatic solutions using current tooling (e.g. ss over netstat, git-filter-repo, Argon2id) with security-aware judgment (rotating a leaked key first, constant-time comparison, generic auth errors). Reasoning, instruction/format following, and calibration probes were handled well. Roughly 27 of 34 responses were clean and correct.

The model is a reasoning model: every answer begins with a <think> block followed by the final response — downstream consumers should parse out and strip the <think>...</think> span. See Limitations for usage tips.
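
One way a downstream consumer might separate the reasoning span from the final answer is sketched below. `split_reasoning` is a hypothetical helper, and the handling of a missing opening tag or an unterminated block is a defensive assumption, not documented behavior of this model.

```python
from typing import Tuple


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer) from a raw generation.

    Handles a complete <think>...</think> span, a closing tag with no opening
    tag, an unterminated <think> block, and plain text with no tags at all.
    """
    head, sep, tail = text.partition("</think>")
    if not sep:
        if "<think>" in text:
            # Reasoning never closed (e.g. max_new_tokens was hit): no answer yet.
            return text.split("<think>", 1)[1].strip(), ""
        return "", text.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


raw = "<think>\nTwo pointers, append the remainder.\n</think>\n\ndef merge(a, b): ..."
reasoning, answer = split_reasoning(raw)
print(answer)  # def merge(a, b): ...
```

Returning an empty answer for an unterminated block lets the caller detect truncated generations and retry with a larger token budget, without ever showing raw reasoning to end users.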

How to use

The base is a multimodal (image-text-to-text) architecture, so even for text-only use the model is loaded with AutoModelForImageTextToText. Render the chat template to a string with tokenize=False, then tokenize that string (the recommended path for this tokenizer):

import torch
from transformers import AutoModelForImageTextToText, AutoTokenizer

model_id = "empero-ai/Qwable-9B-Claude-Fable-5"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype="bfloat16", device_map="auto"
)

messages = [{"role": "user", "content": "Write a Python function that merges two sorted lists."}]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs, max_new_tokens=2048, do_sample=True,
    temperature=0.7, top_p=0.95, top_k=20, repetition_penalty=1.05,
)
# Output begins with a <think>...</think> reasoning block, then the final answer.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

repetition_penalty=1.05 is a small deviation from Qwen's default (1.0) that prevents rare non-terminating reasoning loops; allow generous max_new_tokens since the model reasons before answering.
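
For intuition about what a penalty of 1.05 does, the sketch below applies the multiplicative rule that, to my understanding, generation libraries commonly use: logits of tokens already present in the sequence are divided by the penalty when positive and multiplied by it when negative. It is a standalone illustration with a hypothetical function name, not the library's own code.

```python
from typing import Iterable, List


def apply_repetition_penalty(
    logits: List[float], seen_token_ids: Iterable[int], penalty: float = 1.05
) -> List[float]:
    """Make already-generated tokens slightly less likely; leave the rest untouched."""
    out = list(logits)
    for tok in set(seen_token_ids):
        # Positive logits shrink and negative logits move further from zero,
        # so in both cases the token's probability goes down.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


logits = [2.1, -1.0, 0.5, 3.0]
penalized = apply_repetition_penalty(logits, seen_token_ids=[0, 1, 0])
print(penalized)  # tokens 0 and 1 are penalized; tokens 2 and 3 are unchanged
```

At 1.05 the adjustment per token is small, which fits the description above: enough to break a degenerate loop without noticeably distorting code that legitimately repeats identifiers.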

Requirements: a recent transformers (Qwen3.5 support) plus the Gated DeltaNet kernels (flash-linear-attention and a CUDA-matched causal_conv1d build) — without them the linear-attention layers fall back to slow, memory-hungry PyTorch ops.
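
A quick pre-flight check for those optional kernels might look like the following. The import names are assumptions (flash-linear-attention is assumed to import as `fla` and the convolution kernel as `causal_conv1d`), and the check only confirms that the packages are importable, not that they were built against the right CUDA version.

```python
import importlib.util
from typing import Dict, Mapping

# Assumed mapping from package name to importable module name.
FAST_PATH_MODULES = {
    "flash-linear-attention": "fla",
    "causal_conv1d": "causal_conv1d",
}


def check_fast_path(modules: Mapping[str, str] = FAST_PATH_MODULES) -> Dict[str, bool]:
    """Report which optional kernel packages can be found, without importing them."""
    status: Dict[str, bool] = {}
    for package, module in modules.items():
        try:
            status[package] = importlib.util.find_spec(module) is not None
        except (ImportError, ValueError):
            status[package] = False
    return status


status = check_fast_path()
for package, found in status.items():
    print(f"{package}: {'found' if found else 'missing (slow PyTorch fallback likely)'}")
```

Running this before loading the model gives an early warning about the slow fallback path, which is cheaper than discovering it through unexpectedly high memory use during generation.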

Limitations

Qwable-9B-Claude-Fable-5 is a focused 9B model that shines on the coding, agentic, and reasoning tasks it was trained for. A few characteristics are worth knowing to get the best out of it:

  • It's a reasoning model. Each response opens with a <think> block before the final answer, so parse and strip the <think>...</think> span for end users. On open-ended or creative prompts it may reason at length — allow generous max_new_tokens and use repetition_penalty≈1.05 (as in the snippet above) for consistently crisp completions.
  • Strongest within its domain. Capability is concentrated in coding and agentic/tool-use tasks. For general-knowledge or long-form factual questions, treat specifics as you would any 9B model's — verify before relying on them, and don't expect knowledge of events outside the base model's training.
  • Reflects its base and teachers. As a distillation fine-tune of Qwen3.5-9B on Claude Fable 5 and GPT-5.5 traces, it carries the style and limits of those sources and received no extra safety tuning beyond the base model's. Add your own review/safety layer for production use.
  • Text-only fine-tune. The base is multimodal, but only the text path was trained (vision left untouched and not evaluated here).

These are normal considerations for a compact, domain-focused model rather than blockers — used within its wheelhouse with the sampling settings above, it's a capable and dependable coding/agentic assistant.

Provenance & licensing

The model weights are released under Apache-2.0, inherited from the Qwen3.5-9B base. The fine-tuning data comes from generated traces of Claude Fable 5 and GPT-5.5 (via the linked public datasets). Because those traces originate from third-party assistants, the providers' terms may apply to downstream training and distillation — so if you plan to build on this model commercially, it's worth confirming your use aligns with those terms. Shared with the community for research and experimentation, as-is.
