Q8_0 quant of the Google assistant (MTP) model for the 26B-A4B. Works with the latest llama.cpp (9551). All the other uploaded quants seem to have a non-matching architecture, so they failed to load for me [1].

With MTP (--spec-draft-max 2) I get about 100 t/s on a fresh task, up from about 85 t/s without it.

0.38.515.475 I slot print_timing: id  3 | task 0 | prompt eval time =    1147.24 ms /  2419 tokens (    0.47 ms per token,  2108.55 tokens per second)
0.38.515.479 I slot print_timing: id  3 | task 0 |        eval time =   14292.86 ms /  1432 tokens (    9.98 ms per token,   100.19 tokens per second)
0.38.515.481 I slot print_timing: id  3 | task 0 |       total time =   15440.10 ms /  3851 tokens
0.38.515.485 I slot print_timing: id  3 | task 0 |    graphs reused =        668
0.38.515.486 I slot print_timing: id  3 | task 0 | draft acceptance = 0.56231 (  758 accepted /  1348 generated)
0.38.515.506 I statistics        draft-mtp: #calls(b,g,a) =    1    674    674, #gen drafts =    674, #acc drafts =   461, #gen tokens =   1348, #acc tokens =   758, dur(b,g,a) = 0.003, 2567.398, 1.100 ms
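
To make the numbers in the log above easier to compare, here is a rough Python sketch that pulls the eval and draft figures out of those lines and derives throughput, acceptance rate and speed-up. The function name `parse_mtp_stats` and the regular expressions are my own illustration and only assume the log format shown here; the 85 t/s baseline is the approximate non-MTP figure quoted above, not a measured log value.

```python
import re

# Log lines copied from the run above (timestamps dropped).
LOG = """\
slot print_timing: id  3 | task 0 |        eval time =   14292.86 ms /  1432 tokens (    9.98 ms per token,   100.19 tokens per second)
slot print_timing: id  3 | task 0 | draft acceptance = 0.56231 (  758 accepted /  1348 generated)
statistics        draft-mtp: #calls(b,g,a) =    1    674    674, #gen drafts =    674, #acc drafts =   461, #gen tokens =   1348, #acc tokens =   758
"""


def parse_mtp_stats(log: str) -> dict:
    """Hypothetical helper: derive summary figures from the log format shown above."""
    # "| eval time" (not "| prompt eval time"): generation phase only.
    eval_m = re.search(r"\|\s+eval time =\s*([\d.]+) ms /\s*(\d+) tokens", log)
    acc_m = re.search(
        r"draft acceptance =\s*[\d.]+\s*\(\s*(\d+) accepted /\s*(\d+) generated\)", log
    )
    drafts_m = re.search(r"#gen drafts =\s*(\d+)", log)

    eval_ms, n_tokens = float(eval_m.group(1)), int(eval_m.group(2))
    accepted, drafted = int(acc_m.group(1)), int(acc_m.group(2))
    n_drafts = int(drafts_m.group(1))

    return {
        "tokens_per_second": n_tokens / (eval_ms / 1000.0),
        "acceptance_rate": accepted / drafted,
        "draft_tokens_per_round": drafted / n_drafts,
        "output_tokens_per_round": n_tokens / n_drafts,
    }


stats = parse_mtp_stats(LOG)
BASELINE_TPS = 85.0  # approximate non-MTP throughput quoted above (assumption, not from a log)
speedup = stats["tokens_per_second"] / BASELINE_TPS

for name, value in stats.items():
    print(f"{name}: {value:.4f}")
print(f"speedup vs baseline: {speedup:.2f}x")
```

On these figures that is roughly a 1.18x speed-up, with 2 draft tokens proposed per round (matching --spec-draft-max 2) and about 2.12 output tokens per round. The totals are also consistent with each round yielding one token from the main model plus the accepted draft tokens (674 + 758 = 1432), though that reading is my interpretation of the log.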

[1] For example, you get errors like this when you try to load them:

0.01.002.728 I srv    load_model: [mtmd] estimated worst-case memory usage of mmproj is 1279.96 MiB
0.01.314.746 E llama_model_load: error loading model: unknown model architecture: 'gemma4_mtp'
0.01.314.762 E llama_model_load_from_file_impl: failed to load model
0.01.314.822 W srv    load_model: [spec] failed to measure draft model memory: failed to load model

GGUF details: model size 0.4B params, architecture gemma4-assistant, 8-bit.
