Can this thing be quantized?

#6
by Winmodel - opened

Can this thing be quantized, or is its recurrent block sensitive to low precision, like Mamba's?

Hi @Winmodel
I think as long as the model contains torch.nn.Linear layers it can be quantized with any of these methods: https://huggingface.co/docs/transformers/quantization
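
To make the idea of quantizing a linear layer concrete, here is a minimal, dependency-free sketch of symmetric absmax int8 weight quantization. It is only an illustration: the helper names (`quantize_row`, `dequantize_row`, `linear`) and the toy weights are made up for this example, and the methods on the linked page may use different schemes.

```python
def quantize_row(row):
    """Symmetric absmax int8 quantization of one weight row (hypothetical helper)."""
    absmax = max(abs(w) for w in row)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q, scale):
    """Map int8 codes back to approximate float weights."""
    return [v * scale for v in q]


def linear(weight_rows, x):
    """Plain matrix-vector product, standing in for a bias-free linear layer."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]


# Toy 2x3 weight matrix and input vector (assumed values, not from any model).
weights = [[0.12, -0.5, 0.33], [1.5, 0.02, -0.75]]
x = [1.0, 2.0, -1.0]

quantized = [quantize_row(row) for row in weights]
restored = [dequantize_row(q, scale) for q, scale in quantized]

y_full = linear(weights, x)
y_quant = linear(restored, x)
max_err = max(abs(a - b) for a, b in zip(y_full, y_quant))
print(y_full, y_quant, max_err)
```

Each row gets its own scale, so a row with small weights is not forced onto the coarse grid needed by a row with large ones.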

Google org

We haven't investigated quantizing RecurrentGemma/Griffin models yet.
I would expect that the linear layers (which hold most of the weights) can be quantized, but it might be harder to quantize the recurrent layer.

Note that for the recurrence, we recommend storing the parameters in bfloat16, but casting them on device to float32 to perform the recurrence in high precision.
Casting the parameters to float32 on device is free since the recurrence is memory bound, and it significantly improves numerical precision.
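
As a rough illustration of why the precision of the recurrence matters, the sketch below runs a toy linear recurrence h_t = a * h_{t-1} + x_t with parameters stored in bfloat16, once with the state kept in float32 and once with the state kept in bfloat16. This is not the actual RecurrentGemma/Griffin recurrence: the decay value, the constant input, and the helper names are assumptions chosen for the example, and bfloat16 is emulated with the standard library for finite values only.

```python
import struct


def to_f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Emulate bfloat16 rounding (nearest, ties to even) for finite values."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias for ties-to-even
    bits &= 0xFFFF0000                   # keep only the upper 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def run_recurrence(a, xs, round_fn):
    """Run h_t = a * h_{t-1} + x_t, rounding after each multiply and add."""
    h = 0.0
    for x in xs:
        h = round_fn(round_fn(a * h) + x)
    return h


# Parameters and inputs are *stored* in bfloat16 in all three runs.
a = to_bf16(0.996)           # assumed decay close to 1
xs = [to_bf16(0.01)] * 2000  # assumed constant input

reference = run_recurrence(a, xs, lambda v: v)  # float64 state
h_f32 = run_recurrence(a, xs, to_f32)           # state kept in float32
h_bf16 = run_recurrence(a, xs, to_bf16)         # state kept in bfloat16

err_f32 = abs(h_f32 - reference)
err_bf16 = abs(h_bf16 - reference)
print(reference, h_f32, h_bf16)
```

In this toy setup the bfloat16 state stops growing once each per-step update becomes smaller than half the spacing between neighbouring bfloat16 values, while the float32 state stays close to the float64 reference.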

Thanks for this great model! Is there a gguf version of this yet?

Hello @sdalemorrey.

We do not have a gguf version of the model and at the moment we do not plan to release one.

In case it is helpful, it looks like someone has published a quantized version of the model here [1].
NB: This was not released by us and we did not test or evaluate it.

[1] https://huggingface.co/PrunaAI/recurrentgemma-2b-it-bnb-4bit-smashed
