Gemma 4 31B Assistant

This is the Gemma 4 31B it assistant converted to GGUF. It is compatible with the latest llama.cpp release (b9549), which introduces support for Gemma 4 MTP (Multi-Token Prediction).

The GGUFs were produced with llama-quantize from the original Gemma 4 31B assistant: https://huggingface.co/google/gemma-4-31B-it-assistant

Example usage

llama-server -m "gemma-4-31B-it-Q8_0.gguf" \
--spec-draft-model "gemma4-31B-it-assistant-Q8_0.gguf" \
--spec-type draft-mtp \
--spec-draft-n-max 4

Recommended --spec-draft-n-max values:

  • --spec-draft-n-max 2
  • --spec-draft-n-max 3
  • --spec-draft-n-max 4

The best value depends on your workload.
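
One way to compare the candidate values is to generate the server command once per value and try each in turn. The following is a minimal sketch, assuming the model file names from the example above. The `build_cmd` helper is a hypothetical name used only for illustration, and the script prints the commands rather than starting the server.

```shell
#!/bin/sh
# Hypothetical helper: print the llama-server command for a given
# --spec-draft-n-max value. File names are assumed from the example above.
build_cmd() {
    printf 'llama-server -m "%s" --spec-draft-model "%s" --spec-type draft-mtp --spec-draft-n-max %s\n' \
        "gemma-4-31B-it-Q8_0.gguf" \
        "gemma4-31B-it-assistant-Q8_0.gguf" \
        "$1"
}

# Candidate values taken from the recommended list.
for n in 2 3 4; do
    build_cmd "$n"
done
```

Running each printed command against the same prompts and comparing the generation speed you observe should show which value suits your workload.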

Multi-GPU example usage

It works with multiple GPUs as well, and I've seen a 2x increase in inference speed. The example below is for a 3-GPU setup.

llama-server -m "gemma-4-31B-it-Q8_0.gguf" \
--spec-draft-model "gemma4-31B-it-assistant-Q8_0.gguf" \
--spec-type draft-mtp \
--spec-draft-n-max 4 \
--main-gpu 0 \
--tensor-split 0.6,0.1,0.3
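
The `--tensor-split` values are proportions of the model placed on each GPU. As a rough starting point, they can be derived from how much memory you want to use on each card. The sketch below assumes that approach; the `split_from_mem` helper and the memory figures are hypothetical and are not measurements from the setup above.

```shell
#!/bin/sh
# Hypothetical helper: turn per-GPU memory budgets (any consistent unit)
# into proportions that sum to 1, formatted for --tensor-split.
split_from_mem() {
    echo "$@" | awk '{
        total = 0
        for (i = 1; i <= NF; i++) total += $i
        for (i = 1; i <= NF; i++)
            printf "%s%.2f", (i > 1 ? "," : ""), $i / total
        printf "\n"
    }'
}

# Example budgets (made-up numbers): 24, 4 and 12 GB.
split_from_mem 24 4 12
```

The output can be passed directly as the `--tensor-split` argument. You may still need to adjust it by hand, since the main GPU can carry extra load beyond its share of the model weights.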

Model details

Format: GGUF
Model size: 0.5B params
Architecture: gemma4-assistant