Can this thing be quantized?

#6
by Winmodel - opened

Can this thing be quantized, or is its recurrent block sensitive to low precision, like Mamba's?

Google org

Hi @Winmodel
I think as long as the model contains torch.nn.Linear layers, it can be quantized with any of these methods: https://huggingface.co/docs/transformers/quantization

Google org

We haven't investigated quantizing RecurrentGemma/Griffin models yet.
I would expect that the linear layers (which hold most of the weights) can be quantized, but it may be harder to quantize the recurrent layer.

Note that for the recurrence, we recommend storing the parameters in bfloat16, but casting them on device to float32 to perform the recurrence in high precision.
Casting the parameters to float32 on device is essentially free, since the recurrence is memory-bound, and it significantly improves numerical precision.
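The storage/compute split described above can be sketched in PyTorch. The scalar recurrence below (h_t = a * h_{t-1} + x_t) and all its shapes are hypothetical stand-ins for Griffin's actual recurrent block; the point is only the dtype handling:

```python
import torch

# Parameters stored in bfloat16, as recommended (toy values, an assumption).
a_bf16 = torch.full((4,), 0.999, dtype=torch.bfloat16)  # recurrence gate
x = torch.randn(16, 4, dtype=torch.bfloat16)            # input sequence

# Cast to float32 on device before the loop: the recurrence is memory-bound,
# so this cast is essentially free, and accumulating in float32 avoids the
# error buildup that bfloat16 accumulation would cause over long sequences.
a = a_bf16.to(torch.float32)

h = torch.zeros(4, dtype=torch.float32)
for t in range(x.shape[0]):
    h = a * h + x[t].to(torch.float32)  # recurrence in high precision
```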

Thanks for this great model! Is there a gguf version of this yet?

Hello @sdalemorrey.

We do not have a gguf version of the model and at the moment we do not plan to release one.

In case it is helpful, it looks like someone has released a quantized version of the model here [1].
NB: This was not released by us and we did not test or evaluate it.

[1] https://huggingface.co/PrunaAI/recurrentgemma-2b-it-bnb-4bit-smashed
