Gemma 4 12B-it Assistant: Core AI (MTP draft model)

INT4 Core AI (.aimodel) conversion of google/gemma-4-12B-it-assistant, the multi-token-prediction (MTP) draft model for Gemma 4 12B-it. It is the companion to warshanks/gemma-4-12B-it-coreai and is used for speculative decoding in Wyvern Chat's on-device Core AI provider (macOS 27+, Apple silicon).

Contents

The bundle exposes a single inference function, draft:

kind    name               shape                   dtype
input   input_ids          [1, 1]                  int32
input   backbone_hidden    [1, 1, 3840]            bf16
input   position_ids       [1, S]                  int32
state   k_cache / v_cache  [48, 1, 8, ctx, 256]    bf16
output  next_token         [1, 1]                  int32
output  hidden             [1, 1, 3840]            bf16
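
To give a feel for the state tensors, here is a minimal sketch that computes the memory footprint of k_cache and v_cache from the shape listed above. The parameter names (layers, kv_heads, head_dim) are an assumed reading of the dimensions, and the context length of 8192 is a hypothetical value chosen for illustration, not one stated by this bundle.

```python
def kv_cache_bytes(ctx, layers=48, batch=1, kv_heads=8, head_dim=256, bytes_per_elem=2):
    """Bytes for ONE cache tensor of shape [layers, batch, kv_heads, ctx, head_dim].

    bytes_per_elem=2 corresponds to bf16. The dimension names are assumptions;
    the byte count only depends on the product of the shape.
    """
    return layers * batch * kv_heads * ctx * head_dim * bytes_per_elem


ctx = 8192  # hypothetical context length, for illustration only
per_tensor = kv_cache_bytes(ctx)
total = 2 * per_tensor  # k_cache plus v_cache

print(f"per tensor: {per_tensor / 2**30:.2f} GiB")  # per tensor: 1.50 GiB
print(f"k + v:      {total / 2**30:.2f} GiB")       # k + v:      3.00 GiB
```

Because the draft reads the main model's cache rather than keeping its own, this memory is shared with the main bundle, not added on top of it.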

The draft cross-attends to the main model's KV cache (layers 46/47), so pass the same Metal buffers used for the main bundle, zero-copy. backbone_hidden is the main model's post-final-norm hidden state (the hidden output of the main bundle's main/prefill_multimodal functions). The bundle embeds an INT4 copy of the main model's embedding table, so each draft step needs only the previous token id, not its embedding.
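
The data flow described above can be sketched as a small drafting loop. This is a sketch under stated assumptions, not the Wyvern Chat implementation: draft is modeled as a plain Python callable, the hidden state is a list of floats standing in for the [1, 1, 3840] bf16 tensor, the position is simplified to a single integer, and feeding each step's hidden output back in as the next step's backbone_hidden is an assumption about how the steps chain. The names draft_tokens and toy_draft are hypothetical.

```python
from typing import Callable, List, Tuple

Hidden = List[float]  # stand-in for the [1, 1, 3840] bf16 tensor
# (token_id, hidden, position) -> (next_token, hidden); hypothetical signature
DraftFn = Callable[[int, Hidden, int], Tuple[int, Hidden]]


def draft_tokens(draft: DraftFn, last_token: int, backbone_hidden: Hidden,
                 position: int, k: int) -> List[int]:
    """Propose k tokens by chaining draft steps.

    The first step consumes the main model's hidden state; each later step
    is assumed to consume the previous step's hidden output.
    """
    proposed: List[int] = []
    token, hidden = last_token, backbone_hidden
    for i in range(k):
        token, hidden = draft(token, hidden, position + i)
        proposed.append(token)
    return proposed


def toy_draft(token: int, hidden: Hidden, position: int) -> Tuple[int, Hidden]:
    """Toy stand-in for the real function: returns token + 1 and bumps the hidden state."""
    return token + 1, [h + 1.0 for h in hidden]


print(draft_tokens(toy_draft, last_token=10, backbone_hidden=[0.0], position=5, k=3))
# [11, 12, 13]
```

In a real speculative-decoding loop the main model would then verify the proposed tokens in one pass and keep the accepted prefix; that step is omitted here.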

Tokens are drafted greedily: an in-graph argmax over token ids [0, 255999), so special/multimodal tokens are never proposed. A draft step takes ~3 ms on an M4 Max, versus ~21 ms for the 12B main model.
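
As an illustration of the restricted argmax, the sketch below picks the highest-scoring token among ids below a limit and ignores everything at or above it. It runs on a plain Python list; in the bundle this selection happens inside the graph. The function name restricted_argmax and the toy logits are hypothetical.

```python
def restricted_argmax(logits, limit):
    """Return the index of the largest logit among indices 0 <= i < limit.

    Indices >= limit (for example special or multimodal token ids) can never
    be returned. On ties the lowest index wins.
    """
    if not 0 < limit <= len(logits):
        raise ValueError("limit must be in (0, len(logits)]")
    best = 0
    for i in range(1, limit):
        if logits[i] > logits[best]:
            best = i
    return best


toy_logits = [0.1, 2.0, -1.0, 9.0]       # index 3 plays the role of a special token
print(restricted_argmax(toy_logits, 3))  # 1
print(restricted_argmax(toy_logits, 4))  # 3
```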

Conversion

Exported with Apple's coreai-torch / coreai-models toolchain (INT4 block-32 weight-only, symmetric with clipping). Numerics verified against the HF reference implementation (logits max-abs-err 3e-4 at S=1500).
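
For readers unfamiliar with the quantization scheme, here is a minimal pure-Python sketch of symmetric, weight-only INT4 quantization with one scale per block of 32 values. It is not the coreai-torch implementation. The clip_ratio parameter is a hypothetical way to model "with clipping" (shrinking the scale so that outliers saturate), and the [-8, 7] code range is an assumption.

```python
BLOCK = 32          # values per quantization block
QMIN, QMAX = -8, 7  # signed 4-bit range (assumed)


def quantize_block(values, clip_ratio=1.0):
    """Symmetric quantization of one block: a single scale, no zero point.

    clip_ratio < 1 shrinks the scale so the largest magnitudes saturate
    (a hypothetical stand-in for the toolchain's clipping).
    """
    max_abs = max(abs(v) for v in values)
    scale = clip_ratio * max_abs / QMAX if max_abs > 0 else 1.0
    codes = [max(QMIN, min(QMAX, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate weights from a scale and its 4-bit codes."""
    return [scale * c for c in codes]


def quantize(weights, clip_ratio=1.0):
    """Split a flat weight list into blocks of 32 and quantize each one."""
    if len(weights) % BLOCK != 0:
        raise ValueError("weight count must be a multiple of the block size")
    return [quantize_block(weights[i:i + BLOCK], clip_ratio)
            for i in range(0, len(weights), BLOCK)]


weights = [(i - 16) / 16 for i in range(32)]  # toy block spanning -1.0 .. 0.9375
scale, codes = quantize(weights)[0]
restored = dequantize_block(scale, codes)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(max_err <= scale / 2 + 1e-12)  # True: without clipping, error is at most half a step
```

Without clipping, the round-trip error per weight is bounded by half the block's scale; with clip_ratio below 1, small weights are represented more finely at the cost of saturating the largest ones.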

Modifications from the original weights: INT4 quantization; the unused centroid masked-embedder path is dropped (use_ordered_embeddings: false); the main model's embedding table is bundled in.

License

Apache 2.0, same as the base model. Original copyright Google DeepMind. See LICENSE and NOTICE.
