Gemma 4 31B (dense) - Core AI

Apple Core AI (.aimodel) conversion of Google's Gemma 4 31B dense text decoder, ported directly from the QAT release google/gemma-4-31B-it-qat-q4_0-unquantized. Decode-only, runs on the stock pipelined engine on Apple Silicon (Mac-class, ~16 GB).

Frontier dense, unblocked by a custom Metal kernel. Gemma 4 31B's full (global) attention layers have a 32-head × 512 Q tensor that overflows MPSGraph's GPU decode scratch heap, so the stock SDPA crashes at the first token (apple/coreai-models#27, the same bug as the 12B). This bundle ships a custom flash-decode SDPA kernel for the full layers (block-GQA over the 31B's 4 global KV heads) that replaces the offending op, so the model runs.
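
To make the flash-decode idea concrete, here is a minimal pure-Python sketch of single-token attention that splits the KV scan into slices and merges the partial results. It is an illustration under stated assumptions, not the Metal kernel shipped in this bundle: the function names are hypothetical, the 1/sqrt(head_dim) scaling is the textbook choice rather than something confirmed for Gemma 4, and keys and values are passed as separate lists.

```python
import math


def partial_attention(q, keys, values, scale):
    """Attend over one KV slice; return (max_score, sum_exp, unnormalized_out)."""
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    dim = len(values[0])
    acc = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return top, sum(weights), acc


def merge_partials(partials):
    """Combine per-slice results by rescaling each one to the global max score."""
    top = max(p[0] for p in partials)
    total = 0.0
    out = [0.0] * len(partials[0][2])
    for p_max, p_sum, p_acc in partials:
        c = math.exp(p_max - top)
        total += c * p_sum
        out = [o + c * a for o, a in zip(out, p_acc)]
    return [o / total for o in out]


def flash_decode_attention(q, keys, values, splits=8):
    """Split-KV decode attention; assumes at least one cached key."""
    scale = 1.0 / math.sqrt(len(q))  # assumption: textbook scaling
    step = max(1, math.ceil(len(keys) / splits))
    partials = [
        partial_attention(q, keys[i:i + step], values[i:i + step], scale)
        for i in range(0, len(keys), step)
    ]
    return merge_partials(partials)


def naive_decode_attention(q, keys, values):
    """Reference path: one softmax over the whole KV cache."""
    return merge_partials([partial_attention(q, keys, values, 1.0 / math.sqrt(len(q)))])


def gqa_decode(q_heads, k_cache, v_cache, attend):
    """Grouped-query attention: consecutive query heads share one KV head."""
    group = len(q_heads) // len(k_cache)
    return [attend(q, k_cache[h // group], v_cache[h // group])
            for h, q in enumerate(q_heads)]
```

Calling `gqa_decode(q_heads, k_cache, v_cache, flash_decode_attention)` with 32 query heads and 4 KV heads mirrors the grouping described above (8 query heads per KV head). Each slice keeps only its own scores alive, so no slice ever needs a score buffer covering the whole context.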

Bundle (gpu-pipelined/)

| bundle | quant | size | decode (M4 Max) |
|---|---|---|---|
| gemma4_31b_qat_decode_int4linsym_msdpa_g8 | int4 (q4_0-aligned absmax) | 19 GB | 17.2 tok/s (prefill 22.1) |
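
As a rough illustration of what "int4 (q4_0-aligned absmax)" can mean, the sketch below quantizes a row in blocks with one symmetric scale per block. It is an assumption-laden example, not this bundle's converter: the block size of 32 and the rule that maps the largest-magnitude element to code -8 are borrowed from how ggml's q4_0 format is commonly described, and all function names are hypothetical.

```python
def quantize_block(values):
    """Return (scale, codes) with codes in [-8, 7] so that x ~= scale * code."""
    anchor = max(values, key=abs)   # signed element with the largest magnitude
    scale = anchor / -8.0           # the anchor itself maps to code -8
    if scale == 0.0:                # all-zero block
        return 0.0, [0] * len(values)
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from one scale and its int4 codes."""
    return [scale * c for c in codes]


def quantize_row(row, block_size=32):
    """Quantize a row block by block (block_size=32 is an assumption)."""
    return [quantize_block(row[i:i + block_size])
            for i in range(0, len(row), block_size)]
```

A checkpoint trained with quantization-aware training on this grid should lose little when rounded this way, since its weights were already optimized against that kind of rounding.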

The weights are int4, taken from Google's QAT checkpoint (q4_0 grid). A frontier 31B at int4 is bandwidth-bound, so decode is in the MLX-parity range; the value is "Core AI runs a frontier dense model the stock engine cannot." Mac-only (exceeds the iPhone memory budget). The _g8 suffix marks the higher-occupancy flash-decode kernel: 8 SIMD-groups per head split the global layers' KV scan, with the same numerics.
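
The bandwidth-bound claim can be checked with back-of-the-envelope arithmetic: if every decoded token has to stream all weights once, the decode rate cannot exceed memory bandwidth divided by weight size. The sketch below uses the 19 GB size and 17.2 tok/s from the table above; the 546 GB/s figure is an assumption (the peak of the top M4 Max configuration, since the card does not say which M4 Max was used), and KV-cache traffic is ignored.

```python
BUNDLE_GB = 19.0                 # bundle size from the table
MEASURED_TOK_S = 17.2            # decode rate from the table
ASSUMED_BANDWIDTH_GB_S = 546.0   # assumption: top M4 Max configuration


def decode_ceiling_tok_s(weight_gb, bandwidth_gb_s):
    """Upper bound on decode rate when each token reads all weights once."""
    return bandwidth_gb_s / weight_gb


def bandwidth_utilization(measured_tok_s, weight_gb, bandwidth_gb_s):
    """Fraction of the bandwidth ceiling that the measured rate reaches."""
    return measured_tok_s / decode_ceiling_tok_s(weight_gb, bandwidth_gb_s)


ceiling = decode_ceiling_tok_s(BUNDLE_GB, ASSUMED_BANDWIDTH_GB_S)
utilization = bandwidth_utilization(MEASURED_TOK_S, BUNDLE_GB, ASSUMED_BANDWIDTH_GB_S)
```

Under these assumptions the ceiling is roughly 29 tok/s and the measured rate sits at about 60% of it. On a lower-bandwidth M4 Max the same measurement would be closer to the ceiling.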

Architecture

Clean dense gemma4 text decoder: no PLE / AltUp / Laurel / MoE / KV-sharing. 60 layers, hidden 5376, 32 heads, vocab 262144, softcap 30, tied embeddings. 5:1 sliding:full; dual head_dim (sliding 256 / full global_head_dim 512); full layers use num_global_key_value_heads 4 with attention_k_eq_v (value = raw k_proj). Both attention shapes ride one growing KV pair, so the bundle loads on the stock CoreAIPipelinedEngine (2 states, no engine patch); the full layers' SDPA runs as a custom Metal flash-decode kernel.
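
The numbers above imply a simple layer schedule and a few derived sizes, sketched here. The constants come from this section; the placement of the full layer as the last of each group of six is an assumption (the card only states the 5:1 ratio), and the names are hypothetical.

```python
NUM_LAYERS = 60
NUM_HEADS = 32
SLIDING_HEAD_DIM = 256
GLOBAL_HEAD_DIM = 512
NUM_GLOBAL_KV_HEADS = 4
PATTERN_PERIOD = 6  # 5 sliding layers followed by 1 full layer


def layer_kinds(num_layers=NUM_LAYERS, period=PATTERN_PERIOD):
    """Label each layer; assumes the full layer closes each group of six."""
    return ["full" if (i + 1) % period == 0 else "sliding"
            for i in range(num_layers)]


kinds = layer_kinds()
full_layers = kinds.count("full")
sliding_layers = kinds.count("sliding")

# Per-token query sizes (elements) for the two attention shapes.
q_elements_sliding = NUM_HEADS * SLIDING_HEAD_DIM
q_elements_full = NUM_HEADS * GLOBAL_HEAD_DIM

# Query heads that share each global KV head on the full layers.
q_heads_per_global_kv_head = NUM_HEADS // NUM_GLOBAL_KV_HEADS
```

This gives 10 full and 50 sliding layers, a 16384-element query on the full layers (twice the 8192 of the sliding layers), and 8 query heads per global KV head.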

Usage

huggingface-cli download mlboydaisuke/Gemma-4-31B-CoreAI \
    --include "gpu-pipelined/gemma4_31b_qat_decode_int4linsym_msdpa_g8/*" \
    --local-dir ./gemma4-31b-coreai

COREAI_CHUNK_THRESHOLD=1 llm-runner \
    --model ./gemma4-31b-coreai/gpu-pipelined/gemma4_31b_qat_decode_int4linsym_msdpa_g8 \
    --prompt "What is the capital of France?" --max-tokens 64 --chunk-size 1

Conversion

Community zoo: github.com/john-rocky/coreai-model-zoo → zoo/gemma4-31b.md.

License

Gemma: governed by the Gemma Terms of Use.
