gemma4-e4b-v13-assistant-rollout-mlx-bf16

MLX-swift bf16 Gemma-4 MTP draft assistant, rollout-distilled for Scribion's German medical fact-extraction target (gemma4-e4b-v13-plainlora-r16). It is a drop-in replacement for the Scribion Gemma4MTPTokenIterator: the key set, shapes, and dtype are identical to mlx-community/gemma-4-E4B-it-assistant-bf16, and only the weights differ.
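The drop-in claim (same keys, shapes, and dtypes) can be checked without loading any weights, because a .safetensors file starts with an 8-byte little-endian header length followed by a JSON header describing every tensor. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not part of any Scribion or MLX API, and the file paths you pass in are your own local copies of the two checkpoints.

```python
import json
import struct


def read_safetensors_header(path):
    """Return {tensor_name: (dtype, shape)} from a .safetensors header."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned 64-bit little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # Optional free-form metadata entry; it does not describe a tensor.
    header.pop("__metadata__", None)
    return {
        name: (info["dtype"], tuple(info["shape"]))
        for name, info in header.items()
    }


def layout_differences(path_a, path_b):
    """Names of tensors whose presence, dtype, or shape differs."""
    a = read_safetensors_header(path_a)
    b = read_safetensors_header(path_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty list from `layout_differences(stock_path, this_model_path)` would mean the two files agree on layout, so any difference in behavior comes from the weight values alone.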

What changed vs the stock assistant

The assistant was trained with EAGLE-style multi-step rollout distillation against the finetuned v13 target on in-domain extraction data, biased toward long dialogue transcripts. The assistant's own post_projection feature is rolled through k=6 draft steps with teacher-forced tokens, and each step is trained to match the target's next-token predictions. This lifts deep-draft acceptance, the regime where the stock assistant falls off on long, less predictable dialogues.
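To make the rollout idea concrete, here is a toy sketch of a multi-step distillation loss in plain Python. It is not the training code used for this model: `draft_step` is a hypothetical stand-in for the draft head, the cross-entropy objective is an assumption (the card only says the draft is trained to match the target's next-token predictions), and real training operates on tensors with gradients. What it illustrates is the structure: the draft's own feature is carried from step to step, while the input tokens are teacher-forced.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, draft_logits):
    """Cross-entropy of the draft distribution against the target distribution."""
    draft_probs = softmax(draft_logits)
    return -sum(p * math.log(q) for p, q in zip(target_probs, draft_probs) if p > 0)


def rollout_loss(draft_step, init_feature, tokens, target_probs, k):
    """Average distillation loss over k rolled-out draft steps.

    draft_step(feature, token) -> (next_feature, logits)  # hypothetical draft head
    tokens[j]       : teacher-forced input token at depth j
    target_probs[j] : the target's next-token distribution at depth j
    """
    feature = init_feature
    total = 0.0
    for j in range(k):
        # The feature fed to step j is the draft's own output from step j-1,
        # not a fresh hidden state from the target model.
        feature, logits = draft_step(feature, tokens[j])
        total += cross_entropy(target_probs[j], logits)
    return total / k


def toy_draft_step(feature, token):
    """Toy draft head over a 3-token vocabulary (illustration only)."""
    next_feature = [0.5 * f + (1.0 if i == token else 0.0) for i, f in enumerate(feature)]
    logits = [4.0 * x for x in next_feature]
    return next_feature, logits
```

Because the loss includes depths 2 through k, errors that accumulate in the draft's own feature are penalized during training, which is the regime that matters for long drafts.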

Speculative decoding is exact, so the output is identical to the target's greedy decode. This is a pure decode-speed change with no effect on quality.
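The exactness property can be seen in a small simulation. The sketch below uses toy stand-ins for both models (`target_next` and `draft_next` are invented for illustration and have nothing to do with Gemma or the Scribion iterator). Every emitted token is the target's own greedy choice, so the draft only affects how many tokens each target step yields, never which tokens.

```python
def target_next(prefix):
    """Toy deterministic 'target model': greedy next token for a prefix."""
    return (sum(prefix) * 31 + len(prefix)) % 7


def draft_next(prefix):
    """Toy draft: agrees with the target except at every third position."""
    guess = target_next(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 7


def greedy_decode(prefix, n):
    out = list(prefix)
    for _ in range(n):
        out.append(target_next(out))
    return out[len(prefix):]


def speculative_decode(prefix, n, k):
    """Greedy speculative decoding with fixed draft length k.

    Returns the first n generated tokens and the number of tokens
    emitted by each target verification step.
    """
    out = list(prefix)
    emitted_per_step = []
    while len(out) - len(prefix) < n:
        # Draft k tokens autoregressively with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # Verify. A real engine scores all k positions in one target pass;
        # here the target is queried position by position for clarity.
        emitted = 0
        for i in range(k):
            t = target_next(out)
            out.append(t)  # always the target's token, accepted or corrected
            emitted += 1
            if t != draft[i]:
                break  # first mismatch: discard the rest of the draft
        else:
            out.append(target_next(out))  # all k accepted: one bonus token
            emitted += 1
        emitted_per_step.append(emitted)
    return out[len(prefix):len(prefix) + n], emitted_per_step
```

Calling `speculative_decode([1, 2], 20, 5)` yields the same 20 tokens as `greedy_decode([1, 2], 20)`. The per-step counts here include the correction or bonus token; the card does not state which counting convention its acceptance figures use.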

Acceptance (transformers reference engine, fixed draft length k, greedy)

froehlich-krause (long-dialogue clip), accepted tokens per target step:

| k | stock | this model |
|---|-------|------------|
| 5 | 3.37 | 3.56 |
| 7 | 3.76 | 4.27 (+13.6%) |
| 9 | 3.88 | 4.34 (+11.9%) |

This model trades a little shallow-draft (k=3) acceptance for the deep-draft gain. On arztbericht, where acceptance is already near-optimal, the result is roughly flat. The acceptance rate transfers to mlx-swift; wall-clock speedup is device-dependent.
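The relative gains quoted for froehlich-krause follow directly from the accepted-token figures. A short check, using only the numbers from the table above:

```python
# Accepted tokens per target step on froehlich-krause: k -> (stock, this model)
rows = {5: (3.37, 3.56), 7: (3.76, 4.27), 9: (3.88, 4.34)}


def relative_gain(stock, new):
    """Percentage change of `new` over `stock`."""
    return (new / stock - 1.0) * 100.0


for k, (stock, new) in rows.items():
    print(f"k={k}: {relative_gain(stock, new):+.1f}%")
# prints:
# k=5: +5.6%
# k=7: +13.6%
# k=9: +11.9%
```

These are gains in accepted tokens per target step, not in wall-clock time; the latter also depends on how expensive each draft step is on the device.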

Model size: 78.8M params (Safetensors). Tensor types: I64, BF16. Format: MLX.