Instructions to use interpolators/gemma-4-12B-coder-fable5-composer2.5-v1-bf16 with libraries and local apps.

Libraries

How to use interpolators/gemma-4-12B-coder-fable5-composer2.5-v1-bf16 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

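# The messages below include an image, so this uses the multimodal chat task
# ("text-generation" does not take image inputs or the text= keyword).
pipe = pipeline("image-text-to-text", model="interpolators/gemma-4-12B-coder-fable5-composer2.5-v1-bf16")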
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("interpolators/gemma-4-12B-coder-fable5-composer2.5-v1-bf16")
model = AutoModelForMultimodalLM.from_pretrained("interpolators/gemma-4-12B-coder-fable5-composer2.5-v1-bf16")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use interpolators/gemma-4-12B-coder-fable5-composer2.5-v1-bf16 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "interpolators/gemma-4-12B-coder-fable5-composer2.5-v1-bf16"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "interpolators/gemma-4-12B-coder-fable5-composer2.5-v1-bf16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/interpolators/gemma-4-12B-coder-fable5-composer2.5-v1-bf16

SGLang

How to use interpolators/gemma-4-12B-coder-fable5-composer2.5-v1-bf16 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "interpolators/gemma-4-12B-coder-fable5-composer2.5-v1-bf16" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "interpolators/gemma-4-12B-coder-fable5-composer2.5-v1-bf16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "interpolators/gemma-4-12B-coder-fable5-composer2.5-v1-bf16" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "interpolators/gemma-4-12B-coder-fable5-composer2.5-v1-bf16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use interpolators/gemma-4-12B-coder-fable5-composer2.5-v1-bf16 with Docker Model Runner:
```
docker model run hf.co/interpolators/gemma-4-12B-coder-fable5-composer2.5-v1-bf16
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

BF16-Safetensors converted version of yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF

We converted the Q8_0 quant (the highest-quality quant available) back to BF16 safetensors for easier fine-tuning, LLM surgery, and merging!

Original Model Card:

💻 Gemma4-12B-Coder (GGUF) — Composer 2.5 × Fable 5 ✨

🐣 Tiny footprint, big brain — a local coding model for everyone

No matter your GPU. No matter your RAM. The smallest quant (Q2_K) is a 4.5 GB file, so with a little more than that in free VRAM or unified memory you can run your own private, offline coding assistant right now. 🚀 This is the v1 / code edition, distilled from real chain-of-thought so it thinks through a problem before writing the solution. 🧠💻 All local, all yours, no API, no cloud.

🎯 What it is

A focused fine-tune of Gemma 4 12B on verifiable Python coding data — every training example's reasoning leads to code that actually passed its tests. The result reasons in the open (edge cases, complexity, approach) and then emits a clean, runnable solution. 💚

📣 Context length fixed: now 256K (262,144 tokens), up from 131K (131,072 tokens). Thanks, community! 💚

A community member spotted that this model was reporting only a 131K context window. That turned out to be the well-known upstream Gemma 4 metadata bug: Google's initial config.json shipped with max_position_embeddings: 131072 instead of the real 262144 (256K), and that value got baked into a lot of downstream finetunes and quants (including this one) before it was fixed upstream.

The weights were always fine — it was purely a metadata field. All GGUF quants have been re-patched to the full 256K context (gemma4.context_length = 262144). Just re-download if you grabbed an earlier copy. 🙏
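If you keep a local safetensors copy with the old value, the same metadata fix can be sketched in a few lines of Python. This is a minimal sketch, not the script used for the GGUF re-patch: the helper name `patch_context_length`, the example `model_type`, and the nesting under `text_config` are assumptions for illustration; only the field name and the two values come from the note above.

```python
import json

OLD_CONTEXT = 131072  # value shipped in the early config.json (per the note above)
NEW_CONTEXT = 262144  # the real 256K window


def patch_context_length(config):
    """Recursively replace max_position_embeddings == OLD_CONTEXT.

    Returns the number of fields that were changed. Nested dicts and lists
    are walked because multimodal configs often nest the text settings.
    """
    changed = 0
    if isinstance(config, dict):
        for key, value in config.items():
            if key == "max_position_embeddings" and value == OLD_CONTEXT:
                config[key] = NEW_CONTEXT
                changed += 1
            else:
                changed += patch_context_length(value)
    elif isinstance(config, list):
        for item in config:
            changed += patch_context_length(item)
    return changed


# Hypothetical config fragment; the "model_type" value and the "text_config"
# nesting are assumptions, not copied from the real file.
example = {"model_type": "gemma4", "text_config": {"max_position_embeddings": 131072}}
patched = patch_context_length(example)
print(patched, json.dumps(example))
```

To apply it to a real file, load `config.json` with `json.load`, call the helper, and write the result back only if it reports a change.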

📚 Training data (the interesting part 🍳)

This is a distillation of two complementary chain-of-thought sources, both over verifiable Python coding tasks (algorithmic / function-level problems that come with deterministic tests):

🥇 Main set: Composer 2.5 real CoT. Genuine, model-authored reasoning traces. The teacher solved each problem, its code was run against the task's tests, and only the passing solutions were kept. So the reasoning the model learns from leads to code that actually works.
🥈 Aux set: Fable 5 (released today! 🎉). A clever twist: we took the problems where Composer 2.5 got it wrong and handed them to Fable 5 to redo, re-deriving a fresh, self-consistent chain-of-thought and a correct solution, again gated on passing the tests. This recovers the hard cases the main teacher missed. These traces are synthetic (rationalized CoT), and are tagged separately so the two sources stay distinguishable.

The recipe: real CoT for the bulk of solid coverage, plus synthetic "second-attempt" CoT to patch the failures — both verified by execution before anything entered training. ✅
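The "verified by execution" gate can be illustrated roughly like this. It is a sketch under stated assumptions, not the actual training pipeline: the sample fields (`source`, `solution`, `tests`) and both function names are hypothetical, and a real setup would run untrusted code in a sandbox rather than a plain subprocess.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(solution_code, test_code, timeout=10.0):
    """Run solution + tests in a fresh interpreter; True only on exit code 0."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution_code + "\n\n" + test_code + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # hanging solutions count as failures
        return result.returncode == 0


def split_by_verification(samples):
    """Keep passing samples; collect failures for a second-attempt teacher."""
    kept, retry = [], []
    for sample in samples:
        if passes_tests(sample["solution"], sample["tests"]):
            kept.append(sample)
        else:
            retry.append(sample)
    return kept, retry


# Two toy samples (hypothetical): one correct, one with a bug.
samples = [
    {"source": "main", "solution": "def add(a, b):\n    return a + b", "tests": "assert add(2, 3) == 5"},
    {"source": "main", "solution": "def add(a, b):\n    return a - b", "tests": "assert add(2, 3) == 5"},
]
kept, retry = split_by_verification(samples)
print(len(kept), "kept,", len(retry), "sent for a second attempt")
```

In this picture, the `retry` list is what would be handed to the second teacher, and its answers would go through `passes_tests` again before being tagged as the aux source.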

🗺️ Roadmap — v2 (if there's interest! 💚)

This is v1. If the likes / downloads add up, I'll ship a v2 that pushes for the benchmarks 🏁.

📢 Update on v2 & the Fable 5 situation (2026-06-14)

Quick heads-up for everyone waiting on v2:

Fable 5 access has been pulled. The Fable 5 CoT data I managed to save beforehand is honestly a pretty small set — not enough on its own to act as the primary signal for v2 without risking overfitting. So the plan is shifting:

v2 will lean more heavily on Composer 2.5 verifiable CoT as the backbone (the main, execution-verified source), and use the limited Fable 5 data carefully as a supplement rather than the core.
If Fable 5 access doesn't come back within ~a week, I'm considering bringing in GLM-5.2 as an additional teacher. I just went through the benchmarks: per BridgeMind's eval posted on X, GLM-5.2 actually edges out Fable 5 on both the BS and reasoning leaderboards. I haven't tested it hands-on myself yet — my gut says it'll land slightly below Fable 5 in practice, but likely very close.

Bottom line: v2 is still coming. I'd just rather take a little longer and ship something that generalizes than rush out an overfit model. Thanks for the patience and support 💚

⭐ Like & download if you'd like to see v2 — that's the signal I'm watching!

📦 Pick your size (GGUF quants)

Quant	Size	Vibe
🟢 Q2_K	4.5 GB	tiniest — runs almost anywhere
🟡 Q3_K_M	5.7 GB	great for 8 GB VRAM — much better than Q2
🔵 Q4_K_M	6.87 GB	the sweet spot 👌 (recommended)
🟣 Q6_K	9.11 GB	near-lossless
⚪ Q8_0	11.8 GB	basically full quality

🧮 "Will it fit?" — context length cheat-sheet

Rough estimates 🤓 (assumes q8_0 KV cache + ~1.5 GB overhead; use q4_0 KV cache for ≈2× more context!). Max context is 256K. "—" = won't fit, pick a smaller quant. ✂️

Your VRAM / unified memory	🟢 Q2_K (4.5 GB)	🟡 Q3_K_M (5.7 GB)	🔵 Q4_K_M (6.87 GB)	🟣 Q6_K (9.11 GB)	⚪ Q8_0 (11.8 GB)
8 GB	~16K ctx	~10K	tight (~2–4K)	—	—
12 GB	~48K	~38K	~30K	~12K	—
16 GB	~80K	~72K	~64K	~44K	~22K
24 GB	~200K	~160K	~128K	~110K	~88K
32 GB	256K (max) 🎉	256K	256K	~230K	~190K

💡 Apple Silicon / integrated GPUs with unified memory count too — same numbers, just slower than a dGPU. 💡 Low on room? Drop a quant or switch KV cache to q4_0 and your context roughly doubles.
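For sizes not in the table, here is a rough estimator in the same spirit. It is only a sketch: the ~8K tokens per free GB figure is an assumption back-derived from the 8 to 16 GB rows above (q8_0 KV cache), not a measured number, and the table's 24 GB row comes out more generous than this linear rule. The function name `estimate_context` is made up for the example.

```python
MAX_CONTEXT = 262144      # 256K, the model's maximum context
OVERHEAD_GB = 1.5         # runtime overhead assumed by the cheat-sheet
TOKENS_PER_GB_Q8 = 8192   # ASSUMPTION: back-derived from the table, q8_0 KV cache


def estimate_context(vram_gb, model_gb, kv_cache="q8_0"):
    """Rough context estimate in tokens; 0 means the quant will not fit."""
    free_gb = vram_gb - model_gb - OVERHEAD_GB
    if free_gb <= 0:
        return 0
    # The card says a q4_0 KV cache roughly doubles the available context.
    per_gb = TOKENS_PER_GB_Q8 * (2 if kv_cache == "q4_0" else 1)
    return min(MAX_CONTEXT, int(free_gb * per_gb))


for vram in (8, 12, 16):
    print(vram, "GB with Q2_K:", estimate_context(vram, 4.5), "tokens")
```

Treat the result as a starting value for --ctx-size and adjust down if loading fails.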

🚀 How to run it (super easy)

Option A — llama.cpp (recommended) 🦙

Grab a quant above (e.g. …-Q4_K_M.gguf) and llama-server from llama.cpp.

⚠️ Needs a recent llama.cpp (this is the gemma4_unified architecture — older builds won't load it).
Run a server (Windows .bat shown — tweak --port, --ctx-size to taste):

@echo off
cd /d C:\llama.cpp
llama-server.exe ^
  -m C:\models\gemma4-coding-Q4_K_M.gguf ^
  --ctx-size 16384 ^
  --n-gpu-layers 99 ^
  --no-mmap ^
  -fa on ^
  --cache-type-k q8_0 --cache-type-v q8_0 ^
  --temp 1.0 --top-p 0.95 --top-k 64 ^
  --host 0.0.0.0 --port 18080
pause

Open http://localhost:18080 and chat. 🎉 (Tip: bump --ctx-size per the table; use q4_0 KV for more.)
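Prefer scripting over the web UI? A minimal Python client might look like the sketch below. It assumes the server from the .bat above is listening on port 18080 and exposes the OpenAI-compatible /v1/chat/completions route; `top_k` is not part of the OpenAI schema and is assumed to be accepted by llama-server as an extension. The helper names and the `max_tokens` default are made up for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:18080"  # host and port from the .bat above


def build_chat_request(prompt, temperature=1.0, top_p=0.95, top_k=64, max_tokens=1024):
    """Build the JSON payload using the card's recommended sampling values."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,  # assumed llama-server extension, not in the OpenAI schema
        "max_tokens": max_tokens,
    }


def ask(prompt, base_url=BASE_URL):
    """Send one chat turn to a running server and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=600) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]


# Show the payload; call ask("...") yourself once the server is running.
print(json.dumps(build_chat_request("Reverse a linked list in Python."), indent=2))
```

Only the standard library is used, so nothing extra needs installing.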

Option B — one-click apps 🖱️

Works in LM Studio, Jan, Ollama, etc. — just import the GGUF, pick your quant, go. 🐾

🧠 Thinking mode

This model thinks in Gemma's native thought channel before answering — exactly how it was trained. Keep enable_thinking=true (the default chat template handles it). Recommended sampling: temp 1.0, top_p 0.95, top_k 64. For coding you can also go greedy (temp 0) for more deterministic solutions.
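For the Transformers route, the two sampling recipes above could be captured in a small helper. This is a sketch: the keyword names are standard `generate` arguments, but the helper name and the `max_new_tokens` default are assumptions chosen for illustration (thinking traces need far more room than a short answer).

```python
def sampling_kwargs(greedy=False, max_new_tokens=1024):
    """Return generate() keyword arguments for the card's two recipes."""
    if greedy:
        # Deterministic decoding, the "temp 0" option for coding.
        return {"do_sample": False, "max_new_tokens": max_new_tokens}
    # Recommended sampling: temp 1.0, top_p 0.95, top_k 64.
    return {
        "do_sample": True,
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 64,
        "max_new_tokens": max_new_tokens,
    }


print(sampling_kwargs())
print(sampling_kwargs(greedy=True))
```

With a loaded model this would be used as model.generate(**inputs, **sampling_kwargs()).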

⚠️ Good to know

Reduced refusals: the training data is task-focused with no safety hedging, so this refuses less than the base model. It is not safety-aligned — add your own guardrails for production. Use responsibly. 🙏
Specialized for Python / algorithmic coding. Reasoning quality is strongest in that domain; general-knowledge facts/numbers should still be double-checked.
English-centric.

📚 Base & License

Base model: google/gemma-4-12B-it. Subject to the Gemma Terms of Use (derivatives must comply).
Personal/hobby project — shared as-is, no warranty. Have fun, and happy hacking! 🐾✨

Downloads last month: 60

Safetensors
Model size: 12B params
Tensor type: BF16

Model tree for interpolators/gemma-4-12B-coder-fable5-composer2.5-v1-bf16
Base model: google/gemma-4-12B
Finetuned: google/gemma-4-12B-it
Quantized: yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF
Finetuned (1): this model
Quantizations: 2 models