Qwable-9B-Claude-Fable-5
Developed by Empero
Qwable-9B-Claude-Fable-5 is a full-parameter supervised fine-tune of Qwen/Qwen3.5-9B on a curated mix of agentic coding and reasoning traces. It is a distillation-style fine-tune: the training targets are outputs from other assistants (Claude Fable 5 and a GPT-5.5 terminal agent), teaching the model to imitate their reasoning and tool-use style on long, multi-turn coding and agent tasks.
Early release. Qwable-9B-Claude-Fable-5 brings strong coding and agentic behavior out of the box. A full suite of quantitative benchmarks (coding, agentic, and safety) is underway and will be added to this card; training quality is already backed by held-out validation results (see Evaluation). See Provenance & licensing for licensing notes.
Model details
- Developed by: Empero
- Base model: Qwen3.5-9B, a dense, natively multimodal model with a hybrid attention stack (Gated DeltaNet linear-attention and gated full-attention layers in a 3:1 ratio), a ~152k-token vocabulary, and long native context.
- Fine-tune type: full parameter (all text-backbone weights trained). The vision tower was frozen — training was text-only, so vision behavior is inherited from the base and was not tuned or tested.
- Objective: supervised fine-tuning, assistant-only loss (the model is scored only on the assistant/completion tokens; prompts are masked out).
- Languages: primarily English.
- License: Apache-2.0, inherited from the base weights; see the data-provenance caveat under Provenance & licensing.
Training data
| Source | Role | Approx. examples (after holdout) |
|---|---|---|
| Glint-Research/Fable-5-traces | Claude Fable 5 reasoning + coding traces (context → completion) | ~4,585 |
| Roman1111111/gpt5.5-terminal | GPT-5.5 terminal/agent task solutions (system + prompt → solution) | ~111 |
Both sources were normalized to a single chat format (user/assistant, with an optional system turn for
the terminal tasks) and concatenated. The natural mix is heavily skewed toward Fable traces (~97%); no
re-weighting was applied to the training set.
Held-out eval split: 100 examples were withheld from training — deliberately composed 80% Fable / 20% terminal so the held-out loss carries signal on both task types rather than being dominated by Fable.
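The normalization and held-out split described above can be sketched in plain Python. This is a minimal illustration, not Empero's pipeline: the record field names (context, completion, system, prompt, solution) are hypothetical stand-ins for the two datasets' real schemas, and the helper names and seed are invented for the example.

```python
import random

# Hypothetical field names; the real dataset schemas may differ.
def normalize_fable(record):
    """Fable trace (context -> completion) becomes one user/assistant exchange."""
    return {
        "source": "fable",
        "messages": [
            {"role": "user", "content": record["context"]},
            {"role": "assistant", "content": record["completion"]},
        ],
    }

def normalize_terminal(record):
    """Terminal task: optional system turn, then prompt -> solution."""
    messages = []
    if record.get("system"):
        messages.append({"role": "system", "content": record["system"]})
    messages.append({"role": "user", "content": record["prompt"]})
    messages.append({"role": "assistant", "content": record["solution"]})
    return {"source": "terminal", "messages": messages}

def stratified_holdout(examples, quotas, seed=0):
    """Withhold a fixed number of examples per source; return (train, heldout)."""
    rng = random.Random(seed)
    by_source = {}
    for ex in examples:
        by_source.setdefault(ex["source"], []).append(ex)
    train, heldout = [], []
    for source, items in by_source.items():
        items = items[:]  # copy so the caller's list is not shuffled in place
        rng.shuffle(items)
        k = quotas.get(source, 0)
        heldout.extend(items[:k])
        train.extend(items[k:])  # remaining examples are concatenated, no re-weighting
    return train, heldout

# Toy data: 400 Fable-style and 50 terminal-style records.
fable = [normalize_fable({"context": f"ctx {i}", "completion": f"ans {i}"}) for i in range(400)]
terminal = [
    normalize_terminal({"system": "You are a terminal agent.", "prompt": f"task {i}", "solution": f"sol {i}"})
    for i in range(50)
]
train, heldout = stratified_holdout(fable + terminal, {"fable": 80, "terminal": 20})
print(len(train), len(heldout))
```

Fixing per-source quotas, rather than sampling 100 examples at random from a ~97% Fable mix, is what guarantees the terminal tasks are represented in the held-out loss.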
Training procedure
Full-parameter supervised fine-tuning with TRL, using:
- Full-length traces, zero truncation (max_length = 76,800): even the longest multi-turn traces (~74k tokens) are trained in full.
- Assistant-only loss: the model is scored only on assistant/completion tokens; prompt tokens are masked.
- Chunked cross-entropy for memory-efficient long-context training.
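As a rough sketch of how assistant-only loss and chunked cross-entropy fit together, the plain-Python example below masks non-assistant positions with the -100 ignore index (the convention used by PyTorch's cross_entropy and by Transformers) and accumulates per-chunk loss sums. It is an illustration under stated assumptions, not TRL's implementation: in a real trainer the memory saving comes from projecting hidden states to the ~152k-way vocabulary one chunk at a time, whereas this sketch takes precomputed logits and only shows that summing per-chunk losses and dividing by the global count of unmasked tokens reproduces the unchunked mean. All helper names are hypothetical.

```python
import math

IGNORE_INDEX = -100  # labels with this value contribute nothing to the loss

def assistant_only_labels(token_ids, is_assistant):
    """Keep assistant token ids as labels; mask prompt/system/user positions."""
    return [t if keep else IGNORE_INDEX for t, keep in zip(token_ids, is_assistant)]

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def chunk_loss_sum(logits_chunk, labels_chunk):
    """Summed NLL and count of unmasked positions within one chunk."""
    total, count = 0.0, 0
    for row, label in zip(logits_chunk, labels_chunk):
        if label == IGNORE_INDEX:
            continue
        total -= log_softmax(row)[label]
        count += 1
    return total, count

def chunked_nll(logits, labels, chunk_size):
    """Mean NLL over unmasked positions, accumulated chunk by chunk.

    Assumes logits[i] already scores labels[i] (the causal shift is done).
    Dividing by the global count, not averaging per-chunk means, keeps
    the result identical to the unchunked loss.
    """
    total, count = 0.0, 0
    for start in range(0, len(labels), chunk_size):
        s, n = chunk_loss_sum(logits[start:start + chunk_size],
                              labels[start:start + chunk_size])
        total += s
        count += n
    return total / count if count else 0.0

# Toy example: the first two tokens are prompt, the rest are the assistant's reply.
token_ids = [5, 1, 2, 0, 3]
is_assistant = [False, False, True, True, True]
labels = assistant_only_labels(token_ids, is_assistant)
logits = [[0.1 * ((i * j) % 7) for j in range(6)] for i in range(len(labels))]
print(labels)
print(chunked_nll(logits, labels, chunk_size=2), chunked_nll(logits, labels, chunk_size=len(labels)))
```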
| Hyperparameter | Value |
|---|---|
| Epochs | 2 |
| Effective batch size | 16 |
| Max sequence length | 76,800 (no truncation) |
| Learning rate | 1e-5 (cosine, 3% warmup) |
| Optimizer | AdamW (8-bit) |
| Precision | bf16 |
| Loss | chunked NLL, assistant-only |
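The hyperparameters imply a short schedule. The back-of-envelope sketch below assumes the approximate example counts from the training-data table, no sequence packing, and one optimizer step per effective batch of 16; the trainer's exact rounding may differ by a step or so. The lr_at helper is hypothetical and follows the linear-warmup-then-cosine shape that Transformers' get_cosine_schedule_with_warmup produces with its default half cycle.

```python
import math

# Approximate counts from the training-data table; exact values may differ.
TRAIN_EXAMPLES = 4585 + 111
EFFECTIVE_BATCH = 16
EPOCHS = 2
PEAK_LR = 1e-5
WARMUP_RATIO = 0.03

steps_per_epoch = math.ceil(TRAIN_EXAMPLES / EFFECTIVE_BATCH)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(WARMUP_RATIO * total_steps)

def lr_at(step):
    """Linear warmup from 0 to PEAK_LR, then cosine decay to 0 at total_steps."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(steps_per_epoch, total_steps, warmup_steps)  # 294 588 18
```

Roughly 294 optimizer steps per epoch is consistent with the evaluation table marking step 300 as approximately the end of epoch 1.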
Evaluation
Training quality was tracked via held-out validation loss and token-accuracy on a 100-example split and supplemented with a qualitative generation review (below). A full suite of coding, agentic, and safety benchmarks is in progress and will be published here. Validation was run periodically during training:
| Step | eval loss | eval token-acc |
|---|---|---|
| 100 | 0.743 | 0.784 |
| 200 | 0.722 | 0.789 |
| 300 (≈ epoch 1) | 0.714 | 0.791 |
| 400 | 0.7135 | 0.791 |
| 500 | 0.713 | 0.791 |
No overfitting observed. Held-out loss decreased monotonically and then plateaued (~0.71) through the second epoch — it never rose, even as train loss fell to ~0.64. Epoch-1 and final (epoch-2) checkpoints generalize equivalently on held-out data.
Note: token-accuracy is teacher-forced, per-token next-token accuracy over completion tokens only. It does not measure end-to-end correctness and tends to read high on distillation data with a consistent style.
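To make the two metrics concrete: token accuracy compares the argmax prediction at each unmasked position against the gold next token in a single teacher-forced pass, and a mean per-token loss converts to perplexity via exp. The sketch below is illustrative; the helper names are hypothetical, and the perplexity reading assumes the reported eval loss is mean NLL in nats, as PyTorch's cross_entropy computes it.

```python
import math

IGNORE_INDEX = -100  # masked (prompt) positions are not scored

def token_accuracy(predicted_ids, label_ids):
    """Fraction of unmasked positions where the argmax prediction equals the label."""
    scored = [(p, y) for p, y in zip(predicted_ids, label_ids) if y != IGNORE_INDEX]
    if not scored:
        return 0.0
    return sum(p == y for p, y in scored) / len(scored)

def perplexity(mean_nll):
    """Per-token perplexity implied by a mean negative log-likelihood in nats."""
    return math.exp(mean_nll)

# Predictions come from one forward pass over the reference text, so every
# position is conditioned on the gold prefix, never on the model's own outputs.
preds = [7, 3, 3, 9, 1]
labels = [IGNORE_INDEX, 3, 4, 9, 1]
print(token_accuracy(preds, labels))
print(round(perplexity(0.713), 2))
```

Under that assumption, the plateau loss of ~0.713 corresponds to a per-token perplexity of about 2.04.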
Qualitative generation review
34 prompts spanning coding, terminal/agentic tasks, reasoning, explanation, instruction-following, and
honesty/calibration probes were run against the final checkpoint using Qwen3.5's recommended sampling
settings. Full unedited transcripts are in sample_generations.md.
Strengths. Coding and terminal/agentic prompts were the strongest — correct, idiomatic solutions using
current tooling (e.g. ss over netstat, git-filter-repo, Argon2id) with security-aware judgment
(rotating a leaked key first, constant-time comparison, generic auth errors). Reasoning, instruction/format
following, and calibration probes were handled well. Roughly 27 of 34 responses were clean and correct.
The model is a reasoning model: every answer begins with a <think> block followed by the final
response — downstream consumers should parse out and strip the <think>...</think> span. See
Limitations for usage tips.
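A minimal parsing sketch, assuming the decoded text contains the literal </think> closing tag; the helper name split_reasoning is hypothetical. Splitting on the last closing tag keeps the answer intact even if a chat template places the opening <think> in the prompt rather than in the generated text, and a completion that opens with <think> but never closes it is treated as truncated reasoning with no answer.

```python
def split_reasoning(text, open_tag="<think>", close_tag="</think>"):
    """Split a decoded completion into (reasoning, answer).

    Splits on the LAST closing tag, so the answer survives even when the
    opening tag came from the prompt rather than the generated text.
    """
    head, sep, tail = text.rpartition(close_tag)
    if sep:
        reasoning = head.replace(open_tag, "", 1).strip()
        return reasoning, tail.strip()
    stripped = text.strip()
    if stripped.startswith(open_tag):
        # Reasoning never closed (e.g. max_new_tokens reached): no final answer yet.
        return stripped[len(open_tag):].strip(), ""
    # No tags at all: treat the whole completion as the answer.
    return "", stripped

raw = "<think>\nMerge by walking both lists with two indices.\n</think>\n\ndef merge(a, b): ..."
reasoning, answer = split_reasoning(raw)
print(answer)
```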
How to use
The base is a multimodal (image-text-to-text) architecture; for text-only use, load it with
AutoModelForImageTextToText. Build the prompt with tokenize=False and then tokenize the resulting string
(the recommended path for this tokenizer):
```python
import torch
from transformers import AutoModelForImageTextToText, AutoTokenizer

model_id = "empero-ai/Qwable-9B-Claude-Fable-5"
tok = AutoTokenizer.from_pretrained(model_id)
# Loads the full multimodal checkpoint; only the text path is used here.
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Write a Python function that merges two sorted lists."}]
# Render the chat template to a string first, then tokenize that string.
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to(model.device)

# Generous token budget: the model reasons before it answers.
out = model.generate(
    **inputs, max_new_tokens=2048, do_sample=True,
    temperature=0.7, top_p=0.95, top_k=20, repetition_penalty=1.05,
)
# Output begins with a <think>...</think> reasoning block, then the final answer.
# Slice off the prompt tokens so only newly generated text is decoded.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
repetition_penalty=1.05 is a small deviation from Qwen's default (1.0) that prevents rare
non-terminating reasoning loops; allow generous max_new_tokens since the model reasons before answering.
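For intuition about what repetition_penalty=1.05 does, here is a plain-Python sketch of the CTRL-style rule that Transformers' RepetitionPenaltyLogitsProcessor applies: for every token already present in the sequence (prompt included), a positive logit is divided by the penalty and a negative one is multiplied by it, so a seen token always becomes less likely. The function names and toy logits are illustrative.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.05):
    """Return a copy of logits where every already-seen token is penalized."""
    out = list(logits)
    for tok in set(seen_token_ids):
        # Dividing a positive logit or multiplying a negative one both lower it.
        out[tok] = out[tok] * penalty if out[tok] < 0 else out[tok] / penalty
    return out

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

logits = [2.2, 0.5, -1.0, 2.1]  # token 0 narrowly leads token 3
seen = [0, 2]                   # tokens 0 and 2 already appear in the sequence
penalized = apply_repetition_penalty(logits, seen)
print(argmax(logits), argmax(penalized))
```

In the toy example, token 0's narrow lead over token 3 disappears once token 0 has already been seen, which is the mechanism that can nudge generation out of a repeating loop.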
Requirements: a recent transformers (Qwen3.5 support) plus the Gated DeltaNet kernels
(flash-linear-attention and a CUDA-matched causal_conv1d build) — without them the linear-attention
layers fall back to slow, memory-hungry PyTorch ops.
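A quick way to check whether the fast-path kernels are importable before loading the model. This is a sketch under stated assumptions: it assumes the flash-linear-attention and causal-conv1d packages expose the import names fla and causal_conv1d, and it only checks that the modules can be found, not that the causal_conv1d build matches your CUDA version. The helper name missing_kernels is hypothetical.

```python
import importlib.util

# Assumed mapping from pip package name to import name.
REQUIRED_KERNELS = {
    "flash-linear-attention": "fla",
    "causal-conv1d": "causal_conv1d",
}

def missing_kernels(required=REQUIRED_KERNELS):
    """Return the pip names of kernels whose top-level module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if importlib.util.find_spec(module) is None]

missing = missing_kernels()
if missing:
    print("Missing fast-path kernels (slow PyTorch fallback likely):", ", ".join(missing))
else:
    print("Gated DeltaNet kernels found.")
```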
Limitations
Qwable-9B-Claude-Fable-5 is a focused 9B model that shines on the coding, agentic, and reasoning tasks it was trained for. A few characteristics are worth knowing to get the best out of it:
- It's a reasoning model. Each response opens with a <think> block before the final answer, so parse and strip the <think>...</think> span for end users. On open-ended or creative prompts it may reason at length; allow generous max_new_tokens and use repetition_penalty ≈ 1.05 (as in the snippet above) for consistently crisp completions.
- Strongest within its domain. Capability is concentrated in coding and agentic/tool-use tasks. For general-knowledge or long-form factual questions, treat specifics as you would any 9B model's output: verify before relying on them, and don't expect knowledge of events outside the base model's training.
- Reflects its base and teachers. As a distillation fine-tune of Qwen3.5-9B on Claude Fable 5 and GPT-5.5 traces, it carries the style and limits of those sources and received no extra safety tuning beyond the base model's. Add your own review/safety layer for production use.
- Text-only fine-tune. The base is multimodal, but only the text path was trained (vision left untouched and not evaluated here).
These are normal considerations for a compact, domain-focused model rather than blockers — used within its wheelhouse with the sampling settings above, it's a capable and dependable coding/agentic assistant.
Provenance & licensing
The model weights are released under Apache-2.0, inherited from the Qwen3.5-9B base. The fine-tuning data comes from generated traces of Claude Fable 5 and GPT-5.5 (via the linked public datasets). Because those traces originate from third-party assistants, the providers' terms may apply to downstream training and distillation — so if you plan to build on this model commercially, it's worth confirming your use aligns with those terms. Shared with the community for research and experimentation, as-is.
Acknowledgements
- Developed and released by Empero
- Base model: Qwen3.5-9B (Alibaba Qwen team)
- Datasets: Glint-Research/Fable-5-traces, Roman1111111/gpt5.5-terminal
- Training: TRL + Transformers