Instructions to use sjakek/gemma4-12b-mtp-assistant with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use sjakek/gemma4-12b-mtp-assistant with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="sjakek/gemma4-12b-mtp-assistant",
	filename="gemma-4-12B-it-assistant-BF16.gguf",
)

output = llm(
	"Once upon a time,",
	max_tokens=512,
	echo=True
)
print(output)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use sjakek/gemma4-12b-mtp-assistant with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf sjakek/gemma4-12b-mtp-assistant:BF16
# Run inference directly in the terminal:
llama-cli -hf sjakek/gemma4-12b-mtp-assistant:BF16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf sjakek/gemma4-12b-mtp-assistant:BF16
# Run inference directly in the terminal:
llama-cli -hf sjakek/gemma4-12b-mtp-assistant:BF16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf sjakek/gemma4-12b-mtp-assistant:BF16
# Run inference directly in the terminal:
./llama-cli -hf sjakek/gemma4-12b-mtp-assistant:BF16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf sjakek/gemma4-12b-mtp-assistant:BF16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf sjakek/gemma4-12b-mtp-assistant:BF16

Use Docker

docker model run hf.co/sjakek/gemma4-12b-mtp-assistant:BF16

LM Studio
Jan

vLLM

How to use sjakek/gemma4-12b-mtp-assistant with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "sjakek/gemma4-12b-mtp-assistant"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sjakek/gemma4-12b-mtp-assistant",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/sjakek/gemma4-12b-mtp-assistant:BF16

Ollama
How to use sjakek/gemma4-12b-mtp-assistant with Ollama:
```
ollama run hf.co/sjakek/gemma4-12b-mtp-assistant:BF16
```

Unsloth Studio

How to use sjakek/gemma4-12b-mtp-assistant with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for sjakek/gemma4-12b-mtp-assistant to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for sjakek/gemma4-12b-mtp-assistant to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for sjakek/gemma4-12b-mtp-assistant to start chatting

Docker Model Runner
How to use sjakek/gemma4-12b-mtp-assistant with Docker Model Runner:
```
docker model run hf.co/sjakek/gemma4-12b-mtp-assistant:BF16
```

Lemonade

How to use sjakek/gemma4-12b-mtp-assistant with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull sjakek/gemma4-12b-mtp-assistant:BF16

Run and chat with the model

lemonade run user.gemma4-12b-mtp-assistant-BF16

List all available models

lemonade list

Gemma 4 12B IT MTP Assistant GGUF

GGUF conversion of Google's google/gemma-4-12B-it-assistant draft / MTP assistant model for pairing with google/gemma-4-12B-it-compatible Gemma 4 12B instruction checkpoints.

These files are generated for the current Gemma 4 MTP llama.cpp work, not stock mainline builds that reject the assistant architecture. Local runtime validation used am17an/llama.cpp branch gemma4-mtp at commit b8e703e, with a converter compatibility patch so the official Gemma4UnifiedAssistantForCausalLM config registers as the Gemma 4 assistant converter.

Requested repo was atx/gemma4-12b-mtp-assistant; the active token is for sjakek and has no atx org write rights, so this package was uploaded under sjakek/gemma4-12b-mtp-assistant.

Files

gemma-4-12B-it-assistant-BF16.gguf: BF16 GGUF, 861,520,128 bytes.
gemma-4-12B-it-assistant-Q8_0.gguf: Q8_0 GGUF from branch llama-quantize, 465,109,248 bytes.
logs/: conversion, quantization, load, generation, and long-context validation logs.

Source

Source model: google/gemma-4-12B-it-assistant
Source revision: 723932f88886ab714522468b94f9c7ee48d8c9a8
Source config architecture: Gemma4UnifiedAssistantForCausalLM
Source model type: gemma4_unified_assistant
GGUF architecture emitted here: gemma4-assistant

GGUF Metadata

Structural audit passed on both files:

general.architecture = gemma4-assistant
gemma4-assistant.context_length = 131072
gemma4-assistant.embedding_length = 1024
gemma4-assistant.embedding_length_out = 3840
gemma4-assistant.block_count = 4
gemma4-assistant.feed_forward_length = 8192
gemma4-assistant.attention.head_count = 16
gemma4-assistant.attention.head_count_kv = [8, 8, 8, 1]
gemma4-assistant.attention.shared_kv_layers = 4
gemma4-assistant.nextn_predict_layers = 4
gemma4-assistant.attention.sliding_window_pattern = [true, true, true, false]
gemma4-assistant.rope.dimension_count = 512
gemma4-assistant.rope.dimension_count_swa = 256
tokenizer.ggml.bos_token_id = 2
tokenizer.ggml.eos_token_id = 1
Tensor count: 49
Required tensors present: rope_freqs.weight, nextn.pre_projection.weight, nextn.post_projection.weight

Runtime Validation

Validated locally against the target GGUF:

gemma-4-12b-it-UD-Q6_K_XL.gguf

Test results:

BF16 assistant load and generation: pass.
Q8_0 assistant load and generation: pass.
Speculative mode: --spec-type draft-mtp.
Draft KV lane for Q8_0 test: --spec-draft-type-k q8_0 --spec-draft-type-v q8_0.
Serving shape: --ctx-size 131072 --batch-size 4096 --ubatch-size 512 --flash-attn on.
Target-only vs MTP deterministic check at temperature=0: decoded output matched.
BF16 acceptance probe: 6 accepted / 6 generated draft tokens.
Q8_0 acceptance probe: 6 accepted / 6 generated draft tokens.
Long-context smoke: 126,009 prompt tokens plus 8 generated tokens, truncated=false, no OOM or context error.
OpenAI-compatible chat endpoint: pass with draft activity observed.

Relevant retained logs:

logs/convert-branch-bf16.log
logs/quantize-branch-q8_0.log
logs/completion-target-baseline.json
logs/completion-bf16-mtp.json
logs/completion-q8-mtp.json
logs/completion-q8-mtp-long-context-summary.json
logs/server-final-q8-mtp-canonical.log
logs/openai-chat-final-q8-canonical.json

Benchmarks

Additional local benchmark reports are included in this repository:

benchmarks/gemma4_q4xl_mtp_2k_64k: single-stream Q4_K_XL target-only vs Q8 MTP, draft max 3, temperature 0.
benchmarks/gemma4_q4xl_mtp_2k_64k_draft2_temp06_topk1: single-stream Q4_K_XL target-only vs Q8 MTP, draft max 2, drafter top-k 1, accepter temperature 0.6.
benchmarks/gemma4_q4xl_mtp_2k_64k_draft2_temp06_topk1_parallel3: three concurrent 2K-in/2K-out streams, --parallel 3, draft max 2, drafter top-k 1, accepter temperature 0.6. This was the first local run where MTP beat target-only on aggregate generation throughput:
- coding: 47.14 tok/s target-only vs 50.57 tok/s MTP
- general: 48.42 tok/s target-only vs 50.60 tok/s MTP

Checksums

be2ff6cf6dc9f4d753be846efb990606a5fec1b9c758c7f200112d2431f5e248  gemma-4-12B-it-assistant-BF16.gguf
cb9b46d9ff820b2b9b0d53cc911a2bc27eb2faf84700284047244d8f28883794  gemma-4-12B-it-assistant-Q8_0.gguf

llama.cpp Example

Use a Gemma 4 MTP-capable llama.cpp branch/build:

llama-server \
  -m gemma-4-12b-it-UD-Q6_K_XL.gguf \
  --model-draft gemma-4-12B-it-assistant-Q8_0.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  --spec-draft-type-k q8_0 \
  --spec-draft-type-v q8_0 \
  --ctx-size 131072 \
  --batch-size 4096 \
  --ubatch-size 512 \
  --flash-attn on \
  --n-gpu-layers 999 \
  --n-gpu-layers-draft 999 \
  --fit off \
  --jinja

Stock llama.cpp builds that do not include Gemma 4 assistant support may fail with unknown model architecture: 'gemma4-assistant'.

Downloads last month: 465

GGUF

Model size

0.4B params

Architecture

gemma4-assistant

Hardware compatibility

8-bit

16-bit

Model tree for sjakek/gemma4-12b-mtp-assistant

Base model

google/gemma-4-12B

Finetuned

google/gemma-4-12B-it

Quantized

(61)

this model