Instructions for using empero-ai/Qwable-9B-Claude-Fable-5 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use empero-ai/Qwable-9B-Claude-Fable-5 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

# The messages below include an image, so use the multimodal pipeline task.
pipe = pipeline("image-text-to-text", model="empero-ai/Qwable-9B-Claude-Fable-5")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("empero-ai/Qwable-9B-Claude-Fable-5")
model = AutoModelForMultimodalLM.from_pretrained("empero-ai/Qwable-9B-Claude-Fable-5")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use empero-ai/Qwable-9B-Claude-Fable-5 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "empero-ai/Qwable-9B-Claude-Fable-5"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "empero-ai/Qwable-9B-Claude-Fable-5",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
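
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call above; `build_chat_request` and `chat` are hypothetical helper names, and the default `base_url` assumes the vLLM server started above is listening on port 8000.

```python
import json
import urllib.request


def build_chat_request(prompt,
                       model="empero-ai/Qwable-9B-Claude-Fable-5",
                       base_url="http://localhost:8000"):
    """Build (without sending) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server):
#   print(chat("What is the capital of France?"))
```

Passing `base_url="http://localhost:30000"` should target an SGLang server launched as described further down, since both servers expose the same OpenAI-compatible route.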

Use Docker

docker model run hf.co/empero-ai/Qwable-9B-Claude-Fable-5

SGLang

How to use empero-ai/Qwable-9B-Claude-Fable-5 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "empero-ai/Qwable-9B-Claude-Fable-5" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "empero-ai/Qwable-9B-Claude-Fable-5",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "empero-ai/Qwable-9B-Claude-Fable-5" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "empero-ai/Qwable-9B-Claude-Fable-5",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use empero-ai/Qwable-9B-Claude-Fable-5 with Docker Model Runner:
```
docker model run hf.co/empero-ai/Qwable-9B-Claude-Fable-5
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

Qwable-9B-Claude-Fable-5

Developed by Empero

Qwable-9B-Claude-Fable-5 is a full-parameter supervised fine-tune of Qwen/Qwen3.5-9B on a curated mix of agentic coding and reasoning traces. It is a distillation-style fine-tune: the training targets are outputs from other assistants (Claude Fable 5 and a GPT-5.5 terminal agent), teaching the model to imitate their reasoning and tool-use style on long, multi-turn coding and agent tasks.

Early release. Qwable-9B-Claude-Fable-5 brings strong coding and agentic behavior out of the box. A full suite of quantitative benchmarks (coding, agentic, and safety) is underway and will be added to this card; training quality is already backed by held-out validation results (see Evaluation). See Provenance & licensing for licensing notes.

Model details

Developed by: Empero
Base model: Qwen3.5-9B — a dense, natively multimodal model with a hybrid attention stack (3:1 Gated DeltaNet linear-attention to Gated full-attention), ~152k vocabulary, long native context.
Fine-tune type: full parameter (all text-backbone weights trained). The vision tower was frozen — training was text-only, so vision behavior is inherited from the base and was not tuned or tested.
Objective: supervised fine-tuning, assistant-only loss (the model is scored only on the assistant/completion tokens; prompts are masked out).
Languages: primarily English.
License: apache-2.0, inherited from the base weights — but see the data-provenance caveat below.

Training data

| Source | Role | Approx. examples (after holdout) |
| --- | --- | --- |
| `Glint-Research/Fable-5-traces` | Claude Fable 5 reasoning + coding traces (`context` → `completion`) | ~4,585 |
| `Roman1111111/gpt5.5-terminal` | GPT-5.5 terminal/agent task solutions (`system` + `prompt` → `solution`) | ~111 |

Both sources were normalized to a single chat format (user/assistant, with an optional system turn for the terminal tasks) and concatenated. The natural mix is heavily skewed toward Fable traces (~97%); no re-weighting was applied to the training set.
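
As a rough illustration of that normalization, the sketch below maps each source's columns onto chat turns. The column names come from the table above; the function names, the `messages` wrapper key, and the choice to treat `context` as a single user turn are assumptions, not the actual preprocessing code.

```python
def fable_to_chat(row):
    """Map a Fable trace (`context` -> `completion`) to a user/assistant pair."""
    return [
        {"role": "user", "content": row["context"]},
        {"role": "assistant", "content": row["completion"]},
    ]


def terminal_to_chat(row):
    """Map a terminal task (`system` + `prompt` -> `solution`) to chat turns."""
    messages = []
    if row.get("system"):  # the system turn is optional
        messages.append({"role": "system", "content": row["system"]})
    messages.append({"role": "user", "content": row["prompt"]})
    messages.append({"role": "assistant", "content": row["solution"]})
    return messages


def build_training_set(fable_rows, terminal_rows):
    """Concatenate both sources as-is, with no re-weighting."""
    return (
        [{"messages": fable_to_chat(r)} for r in fable_rows]
        + [{"messages": terminal_to_chat(r)} for r in terminal_rows]
    )
```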

Held-out eval split: 100 examples were withheld from training — deliberately composed 80% Fable / 20% terminal so the held-out loss carries signal on both task types rather than being dominated by Fable.
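
A fixed-composition holdout like this can be drawn by sampling from each source separately. The following is a minimal sketch under assumed names (`split_holdout`, `seed`); the card does not say how the 100 examples were actually selected.

```python
import random


def split_holdout(fable, terminal, n_holdout=100, fable_share=0.8, seed=0):
    """Return (train, holdout) with a fixed source mix in the holdout."""
    rng = random.Random(seed)
    n_fable = round(n_holdout * fable_share)  # 80 when n_holdout=100
    n_terminal = n_holdout - n_fable          # 20 when n_holdout=100

    # Sample indices per source so the mix is exact rather than approximate.
    held_f = set(rng.sample(range(len(fable)), n_fable))
    held_t = set(rng.sample(range(len(terminal)), n_terminal))

    holdout = ([x for i, x in enumerate(fable) if i in held_f]
               + [x for i, x in enumerate(terminal) if i in held_t])
    train = ([x for i, x in enumerate(fable) if i not in held_f]
             + [x for i, x in enumerate(terminal) if i not in held_t])
    return train, holdout
```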

Training procedure

Full-parameter supervised fine-tuning with TRL, using:

Full-length traces, zero truncation (max_length = 76,800) — even the longest multi-turn traces (~74k tokens) are trained in full.
Assistant-only loss — the model is scored only on assistant/completion tokens; prompt tokens are masked.
Chunked cross-entropy for memory-efficient long-context training.
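
To make the last two items concrete, here is a toy pure-Python sketch of an assistant-only, chunked negative log-likelihood. It is not TRL's implementation: the helper names, the `-100` ignore label, and the assumption that position `i` of `hidden` is already aligned with `labels[i]` are all illustrative. Logits are computed one chunk at a time, so the full sequence-by-vocabulary logit matrix never has to exist at once.

```python
import math

IGNORE = -100  # assumed "ignore" label for masked (prompt) positions


def mask_prompt(token_ids, assistant_mask):
    """Keep labels only where assistant_mask is truthy; mask everything else."""
    return [t if m else IGNORE for t, m in zip(token_ids, assistant_mask)]


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return [x - log_z for x in logits]


def chunked_nll(hidden, lm_head, labels, chunk_size=2):
    """Mean NLL over unmasked positions, computed chunk by chunk.

    hidden:  [seq_len][dim] hidden states
    lm_head: [vocab][dim] output projection
    labels:  [seq_len] target ids, IGNORE where masked
    """
    total, count = 0.0, 0
    for start in range(0, len(labels), chunk_size):
        stop = start + chunk_size
        # Only this chunk's logits ([chunk, vocab]) are materialized.
        chunk_logits = [
            [sum(h * w for h, w in zip(vec, row)) for row in lm_head]
            for vec in hidden[start:stop]
        ]
        for logits, label in zip(chunk_logits, labels[start:stop]):
            if label == IGNORE:
                continue  # prompt token: contributes nothing to the loss
            total -= log_softmax(logits)[label]
            count += 1
    return total / max(count, 1)
```

The result does not depend on `chunk_size`; only peak memory does, which is what makes the trick useful at sequence lengths near 76,800.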

| Hyperparameter | Value |
| --- | --- |
| Epochs | 2 |
| Effective batch size | 16 |
| Max sequence length | 76,800 (no truncation) |
| Learning rate | 1e-5 (cosine, 3% warmup) |
| Optimizer | AdamW (8-bit) |
| Precision | bf16 |
| Loss | chunked NLL, assistant-only |
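
These settings imply a step count and a learning-rate curve that can be estimated with a few lines of arithmetic. The sketch below is an approximation only: it uses the rounded example counts from the training-data table (about 4,585 + 111 = 4,696), and it assumes linear warmup followed by cosine decay to zero, which is a common convention rather than something this card states.

```python
import math


def steps_per_epoch(n_examples, effective_batch):
    """Optimizer steps needed to see every example once."""
    return math.ceil(n_examples / effective_batch)


def lr_at(step, total_steps, peak_lr=1e-5, warmup_ratio=0.03):
    """Learning rate at a given step: linear warmup, then cosine decay to 0."""
    warmup = max(1, round(total_steps * warmup_ratio))
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


per_epoch = steps_per_epoch(4585 + 111, 16)  # 294 steps per epoch
total = 2 * per_epoch                        # 588 steps over 2 epochs
```

An estimate of 294 steps per epoch is consistent with the evaluation table's "300 (≈ epoch 1)" annotation. Actual step counts depend on details such as how the last partial batch is handled.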

Evaluation

Training quality was tracked via held-out validation loss and token-accuracy on a 100-example split and supplemented with a qualitative generation review (below). A full suite of coding, agentic, and safety benchmarks is in progress and will be published here. Validation was run periodically during training:

| Step | Eval loss | Eval token accuracy |
| --- | --- | --- |
| 100 | 0.743 | 0.784 |
| 200 | 0.722 | 0.789 |
| 300 (≈ epoch 1) | 0.714 | 0.791 |
| 400 | 0.7135 | 0.791 |
| 500 | 0.713 | 0.791 |

No overfitting observed. Held-out loss decreased monotonically and then plateaued (~0.71) through the second epoch — it never rose, even as train loss fell to ~0.64. Epoch-1 and final (epoch-2) checkpoints generalize equivalently on held-out data.

Note: token-accuracy is teacher-forced, per-token next-token accuracy over completion tokens only. It is not end-to-end correctness and tends to read high on consistent-style distillation data.
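
For readers unfamiliar with the metric, a minimal sketch of it follows. It assumes `predicted_ids` are the argmax predictions from a teacher-forced forward pass, already aligned with `label_ids`, and that masked prompt positions carry the label `-100`; the function name is hypothetical.

```python
def completion_token_accuracy(predicted_ids, label_ids, ignore_index=-100):
    """Fraction of unmasked (completion) positions where prediction == label."""
    correct = 0
    total = 0
    for pred, label in zip(predicted_ids, label_ids):
        if label == ignore_index:
            continue  # prompt token: excluded from the metric
        total += 1
        correct += int(pred == label)
    return correct / total if total else 0.0
```

Because every position is conditioned on the reference prefix rather than on the model's own earlier output, a single early mistake does not compound, which is one reason the number reads higher than end-to-end correctness would.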

Qualitative generation review

34 prompts spanning coding, terminal/agentic tasks, reasoning, explanation, instruction-following, and honesty/calibration probes were run against the final checkpoint using Qwen3.5's recommended sampling settings. Full unedited transcripts are in sample_generations.md.

Strengths. Coding and terminal/agentic prompts were the strongest — correct, idiomatic solutions using current tooling (e.g. ss over netstat, git-filter-repo, Argon2id) with security-aware judgment (rotating a leaked key first, constant-time comparison, generic auth errors). Reasoning, instruction/format following, and calibration probes were handled well. Roughly 27 of 34 responses were clean and correct.

The model is a reasoning model: every answer begins with a <think> block followed by the final response — downstream consumers should parse out and strip the <think>...</think> span. See Limitations for usage tips.
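
One way to separate the reasoning from the final answer is sketched below. `split_reasoning` is a hypothetical helper, not part of any library. Beyond the normal case, it handles two edge cases that are assumptions about possible outputs: a response cut off before `</think>`, and a response where the opening `<think>` tag is absent because it was emitted as part of the prompt.

```python
def split_reasoning(text):
    """Return (reasoning, answer) from a raw model response."""
    head, closing, tail = text.partition("</think>")
    if not closing:
        # No closing tag: either generation stopped mid-reasoning,
        # or there is no reasoning block at all.
        if "<think>" in text:
            return text.split("<think>", 1)[1].strip(), ""
        return "", text.strip()
    # Drop the opening tag if present; keep whatever precedes </think>.
    reasoning = head.split("<think>", 1)[-1].strip()
    return reasoning, tail.strip()
```

Only the second element should normally be shown to end users; the first can be logged for debugging.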

How to use

The base is a multimodal (image-text-to-text) architecture; for text-only use, load it with AutoModelForImageTextToText. Build the prompt with tokenize=False, then tokenize the resulting string (the recommended path for this tokenizer):

import torch
from transformers import AutoModelForImageTextToText, AutoTokenizer

model_id = "empero-ai/Qwable-9B-Claude-Fable-5"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Write a Python function that merges two sorted lists."}]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs, max_new_tokens=2048, do_sample=True,
    temperature=0.7, top_p=0.95, top_k=20, repetition_penalty=1.05,
)
# Output begins with a <think>...</think> reasoning block, then the final answer.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

repetition_penalty=1.05 is a small deviation from Qwen's default (1.0) that prevents rare non-terminating reasoning loops; allow generous max_new_tokens since the model reasons before answering.

Requirements: a recent transformers (Qwen3.5 support) plus the Gated DeltaNet kernels (flash-linear-attention and a CUDA-matched causal_conv1d build) — without them the linear-attention layers fall back to slow, memory-hungry PyTorch ops.
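
A quick way to check whether these packages are visible to the current Python environment is sketched below. The import names `fla` (for flash-linear-attention) and `causal_conv1d` are assumptions about how those packages are exposed; adjust them if your installation differs. The check only confirms that the modules can be found, not that their CUDA kernels match your setup.

```python
import importlib.util


def missing_modules(names=("transformers", "fla", "causal_conv1d")):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Usage:
#   missing = missing_modules()
#   if missing:
#       print("Not installed:", ", ".join(missing))
```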

Limitations

Qwable-9B-Claude-Fable-5 is a focused 9B model that shines on the coding, agentic, and reasoning tasks it was trained for. A few characteristics are worth knowing to get the best out of it:

It's a reasoning model. Each response opens with a <think> block before the final answer, so parse and strip the <think>...</think> span for end users. On open-ended or creative prompts it may reason at length — allow generous max_new_tokens and use repetition_penalty≈1.05 (as in the snippet above) for consistently crisp completions.
Strongest within its domain. Capability is concentrated in coding and agentic/tool-use tasks. For general-knowledge or long-form factual questions, treat specifics as you would any 9B model's — verify before relying on them, and don't expect knowledge of events outside the base model's training.
Reflects its base and teachers. As a distillation fine-tune of Qwen3.5-9B on Claude Fable 5 and GPT-5.5 traces, it carries the style and limits of those sources and received no extra safety tuning beyond the base model's. Add your own review/safety layer for production use.
Text-only fine-tune. The base is multimodal, but only the text path was trained (vision left untouched and not evaluated here).

These are normal considerations for a compact, domain-focused model rather than blockers — used within its wheelhouse with the sampling settings above, it's a capable and dependable coding/agentic assistant.

Provenance & licensing

The model weights are released under Apache-2.0, inherited from the Qwen3.5-9B base. The fine-tuning data comes from generated traces of Claude Fable 5 and GPT-5.5 (via the linked public datasets). Because those traces originate from third-party assistants, the providers' terms may apply to downstream training and distillation — so if you plan to build on this model commercially, it's worth confirming your use aligns with those terms. Shared with the community for research and experimentation, as-is.