Gemma 4 12B-it Assistant - Core AI (MTP draft model)
INT4 Core AI (.aimodel) conversion of google/gemma-4-12B-it-assistant, the multi-token-prediction (MTP) draft model for Gemma 4 12B-it. Companion to warshanks/gemma-4-12B-it-coreai; used for speculative decoding in Wyvern Chat's on-device Core AI provider (macOS 27+, Apple silicon).
Contents
The bundle exposes a single inference function, draft:
| kind | name | shape | dtype |
|---|---|---|---|
| input | input_ids | [1, 1] | int32 |
| input | backbone_hidden | [1, 1, 3840] | bf16 |
| input | position_ids | [1, S] | int32 |
| state | k_cache / v_cache | [48, 1, 8, ctx, 256] | bf16 |
| output | next_token | [1, 1] | int32 |
| output | hidden | [1, 1, 3840] | bf16 |
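To get a feel for the state shape, the arithmetic below estimates how much memory a cache of shape [48, 1, 8, ctx, 256] occupies. It is only a back-of-the-envelope sketch: it assumes bf16 takes 2 bytes per element, and ctx = 4096 is an arbitrary example value, not something this bundle prescribes.

```python
from math import prod

BYTES_PER_BF16 = 2  # assumption: bf16 is stored as 2 bytes per element


def cache_bytes(ctx: int) -> int:
    """Bytes for ONE cache tensor (k or v) of shape [48, 1, 8, ctx, 256]."""
    shape = (48, 1, 8, ctx, 256)
    return prod(shape) * BYTES_PER_BF16


ctx = 4096  # example context length only
one = cache_bytes(ctx)
both = 2 * one  # k_cache + v_cache

print(f"per token (k+v): {(both // ctx) / 1024:.0f} KiB")  # 384 KiB
print(f"ctx={ctx}: {both / 2**30:.2f} GiB for k+v")        # 1.50 GiB
```

Because the draft reuses the main bundle's buffers rather than allocating its own, this figure describes the shared cache, not extra memory on top of the main model.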
The draft cross-attends the main model's KV cache (layers 46/47); pass
the same Metal buffers used for the main bundle, zero-copy. backbone_hidden
is the main model's post-final-norm hidden state (the hidden output of the
main bundle's main/prefill_multimodal functions). The bundle embeds an
INT4 copy of the main model's embedding table, so each draft step needs only
the previous token id, not its embedding.
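One way a caller might chain draft steps is sketched below. Everything here is hypothetical: the `DraftFn` signature is a stand-in for however the runtime invokes the draft function, position_ids and the shared KV cache are left out entirely, and feeding each step's hidden output back in as the next step's backbone_hidden is an assumption based only on the matching [1, 1, 3840] shapes in the table.

```python
from typing import Callable, List, Tuple

Hidden = List[float]  # stand-in for a [1, 1, 3840] bf16 tensor

# Hypothetical wrapper signature: (token_id, hidden) -> (next_token, hidden).
# position_ids and the KV-cache state are omitted from this sketch.
DraftFn = Callable[[int, Hidden], Tuple[int, Hidden]]


def draft_tokens(draft: DraftFn, last_token: int,
                 backbone_hidden: Hidden, k: int) -> List[int]:
    """Propose k tokens by chaining draft steps."""
    proposed: List[int] = []
    token, hidden = last_token, backbone_hidden
    for _ in range(k):
        # Only the token id is passed in; the bundle looks up the embedding.
        # Assumption: the returned hidden feeds the next draft step.
        token, hidden = draft(token, hidden)
        proposed.append(token)
    return proposed


def fake_draft(token: int, hidden: Hidden) -> Tuple[int, Hidden]:
    """Toy stand-in so the sketch runs without the model."""
    return (token + 1) % 255999, hidden


print(draft_tokens(fake_draft, 10, [0.0] * 3840, 4))  # [11, 12, 13, 14]
```

The main model would then verify the proposed tokens; that step is outside the scope of this bundle and is not shown.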
Tokens are drafted greedily (in-graph argmax over [0, 255999); special/multimodal tokens are never proposed). A draft step takes ~3 ms on an M4 Max, versus ~21 ms for the 12B main model.
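The restricted argmax can be illustrated outside the graph with a few lines of plain Python. This is a sketch of the idea only, not the exported graph: `restricted_argmax` is a hypothetical helper, and the toy example uses a six-entry "vocabulary" in which the last two ids stand in for special tokens.

```python
from typing import Sequence

DRAFT_VOCAB_END = 255999  # exclusive upper bound, as stated in the text


def restricted_argmax(logits: Sequence[float],
                      end: int = DRAFT_VOCAB_END) -> int:
    """Index of the largest logit among token ids in [0, end)."""
    limit = min(end, len(logits))
    if limit <= 0:
        raise ValueError("no candidate token ids")
    best = 0
    for i in range(1, limit):
        if logits[i] > logits[best]:
            best = i
    return best


# Toy logits: ids 4 and 5 play the role of special tokens.
logits = [0.1, 2.0, -1.0, 0.5, 9.0, 7.0]
print(restricted_argmax(logits, end=4))  # 1 (id 4 has a larger logit but is excluded)
```

Restricting the range before the argmax means an excluded id can never win, no matter how large its logit is.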
Conversion
Exported with Apple's coreai-torch / coreai-models toolchain (INT4 block-32 weight-only, symmetric with clipping). Numerics verified against the HF reference implementation (logits max-abs-err 3e-4 at S=1500).
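For readers unfamiliar with the scheme, here is a minimal sketch of symmetric block-32 INT4 weight quantization. It is not the toolchain's implementation: the signed range [-8, 7], the scale rule (clipped block maximum divided by 7), and the `clip` parameter are assumptions chosen for illustration, and the error helper is applied to toy weights here rather than to logits.

```python
from typing import List, Tuple

BLOCK = 32           # weights per quantization block
QMIN, QMAX = -8, 7   # assumed signed 4-bit range


def quantize_block(w: List[float], clip: float = 1.0) -> Tuple[List[int], float]:
    """Symmetric quantization of one block: w ~= q * scale, no zero point."""
    amax = max(abs(x) for x in w) * clip       # clip < 1 trims outliers
    scale = amax / QMAX if amax > 0 else 1.0
    q = [max(QMIN, min(QMAX, round(x / scale))) for x in w]
    return q, scale


def quantize(weights: List[float], clip: float = 1.0):
    """Split a flat weight list into blocks and quantize each one."""
    assert len(weights) % BLOCK == 0, "length must be a multiple of BLOCK"
    return [quantize_block(weights[i:i + BLOCK], clip)
            for i in range(0, len(weights), BLOCK)]


def dequantize(blocks) -> List[float]:
    """Reconstruct approximate weights from (codes, scale) pairs."""
    return [qi * scale for q, scale in blocks for qi in q]


def max_abs_err(a: List[float], b: List[float]) -> float:
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


# Deterministic toy weights in [-0.5, 0.5]; two blocks of 32.
weights = [((i * 37) % 101 - 50) / 100 for i in range(64)]
blocks = quantize(weights)
err = max_abs_err(weights, dequantize(blocks))
print(f"max-abs-err on toy weights: {err:.4f}")
```

With `clip=1.0` no value is clipped, so the per-weight error stays within half a quantization step of its block; a smaller `clip` trades larger error on outliers for finer resolution on the rest.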
Modifications from the original weights: INT4 quantization; the unused
centroid masked-embedder path is dropped (use_ordered_embeddings: false);
the main model's embedding table is bundled in.
License
Apache 2.0, same as the base model. Original copyright Google DeepMind. See LICENSE and NOTICE.