gemma4-e4b-v13-assistant-rollout-mlx-bf16

MLX-swift bf16 Gemma-4 MTP draft assistant, rollout-distilled for Scribion's German medical fact-extraction target (gemma4-e4b-v13-plainlora-r16). It is a drop-in replacement for the Scribion Gemma4MTPTokenIterator: the key set, shapes, and dtype are identical to mlx-community/gemma-4-E4B-it-assistant-bf16, and only the weight values differ.
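One way to sanity-check the drop-in claim is to compare the tensor layout of the two checkpoints without loading any weights. The sketch below reads only the safetensors header (an 8-byte little-endian length followed by a JSON table of names, dtypes, and shapes) using the Python standard library. The function names and the file paths in the usage note are illustrative assumptions, not part of this repository.

```python
import json
import struct


def read_safetensors_header(path):
    """Return {tensor_name: (dtype, shape)} from a .safetensors file header."""
    with open(path, "rb") as f:
        # First 8 bytes: header size as an unsigned little-endian 64-bit integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # Optional free-form metadata entry; not a tensor.
    header.pop("__metadata__", None)
    return {name: (info["dtype"], tuple(info["shape"])) for name, info in header.items()}


def layout_differences(path_a, path_b):
    """List (name, layout_a, layout_b) for every tensor whose presence, dtype, or shape differs."""
    a = read_safetensors_header(path_a)
    b = read_safetensors_header(path_b)
    diffs = []
    for name in sorted(set(a) | set(b)):
        if a.get(name) != b.get(name):
            diffs.append((name, a.get(name), b.get(name)))
    return diffs
```

With both checkpoints downloaded locally (for example to the hypothetical paths stock/model.safetensors and rollout/model.safetensors), an empty list from layout_differences would indicate matching keys, shapes, and dtypes. It says nothing about the weight values, which are expected to differ.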

What changed vs the stock assistant

EAGLE-style multi-step rollout distillation against the finetuned v13 target, on in-domain extraction data (biased toward long dialogue transcripts). The assistant's own post_projection feature is rolled through k=6 draft steps with teacher-forced tokens, and the assistant is trained to match the target's next-token predictions at each step. This lifts deep-draft acceptance, the regime where the stock assistant falls off on long, less-predictable dialogues.
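The shape of such a rollout objective can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the training code used for this model: draft_step is a hypothetical callable that maps (feature, token) to (logits, next_feature), and the per-step loss is taken to be KL(target || draft), which the card does not specify.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def rollout_distillation_loss(draft_step, target_probs, feature, tokens, k=6):
    """Mean KL(target || draft) over up to k rolled-out draft steps.

    draft_step(feature, token) -> (logits, next_feature)   # hypothetical interface
    target_probs[i] is the target's next-token distribution at step i.
    tokens[i] is the teacher-forced input token at step i.
    """
    steps = min(k, len(tokens), len(target_probs))
    total = 0.0
    for i in range(steps):
        # The draft consumes its OWN previous feature, not a fresh target feature,
        # so errors that compound at depth are visible to the loss.
        logits, feature = draft_step(feature, tokens[i])
        total += kl_divergence(target_probs[i], softmax(logits))
    return total / steps
```

The point of rolling the feature forward is that step 6 is trained on the same kind of input it will see at inference time, which is where a draft trained only on single-step prediction tends to degrade.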

Speculative decoding is exact, so the output is identical to the target's greedy decode. This is a pure decode-speed change with no effect on quality.
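The exactness property can be illustrated with a toy greedy speculative decoder. This is a minimal sketch with stand-in models, not the Scribion or mlx-swift implementation: target_next and draft_next are hypothetical functions that return the argmax next token for a given prefix, and verification is written as a per-position loop where a real engine would score all drafted positions in one batched target pass.

```python
def greedy_decode(target_next, prompt, n_new):
    """Plain greedy decoding with the target model only."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq


def speculative_greedy_decode(target_next, draft_next, prompt, n_new, k):
    """Greedy speculative decoding with a fixed draft length k.

    Returns (sequence, number_of_target_verification_steps).
    """
    seq = list(prompt)
    goal = len(prompt) + n_new
    target_steps = 0
    while len(seq) < goal:
        # 1. The draft proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(seq + proposal))
        # 2. One verification step: keep drafted tokens while they match the target.
        target_steps += 1
        for tok in proposal:
            expected = target_next(seq)
            if tok != expected:
                seq.append(expected)  # first mismatch: take the target's token and stop
                break
            seq.append(tok)
        else:
            seq.append(target_next(seq))  # all k accepted: the target adds one more token
    return seq[:goal], target_steps
```

Every token that ends up in the sequence is either confirmed or produced by the target's own argmax, so a weaker draft only lowers the number of tokens gained per verification step and never changes the output.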

Acceptance (transformers reference engine, fixed draft length k, greedy)

froehlich-krause (long-dialogue clip), accepted tokens per target step:

k	stock	this model (relative change)
5	3.37	3.56 (+5.6%)
7	3.76	4.27 (+13.6%)
9	3.88	4.34 (+11.9%)

This trades a little shallow-draft (k=3) acceptance for the deep-draft gain. arztbericht (already near-optimal) is roughly flat. The acceptance rate transfers to mlx-swift; wall-clock speedup is device-dependent.
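The relative changes quoted for the froehlich-krause clip follow directly from the accepted-tokens-per-target-step figures. A small sketch that recomputes them from the table values (the dictionary layout and function name are illustrative):

```python
# Accepted tokens per target step on froehlich-krause: k -> (stock, this model).
acceptance = {5: (3.37, 3.56), 7: (3.76, 4.27), 9: (3.88, 4.34)}


def relative_gain(stock, tuned):
    """Percentage change of the tuned assistant over the stock assistant."""
    return (tuned / stock - 1.0) * 100.0


for k, (stock, tuned) in acceptance.items():
    print(f"k={k}: {relative_gain(stock, tuned):+.1f}%")
# prints:
# k=5: +5.6%
# k=7: +13.6%
# k=9: +11.9%
```

These are gains in acceptance only. How much of that shows up as wall-clock speedup depends on the relative cost of a draft step and a target step on the device in question.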

Safetensors

Model size: 78.8M params
Tensor types: I64, BF16

Model tree for Mediform/gemma4-e4b-v13-assistant-rollout-mlx-bf16

Base model: google/gemma-4-E4B-it-assistant
Finetuned (5): this model