Training an INT4 version of the 7B model

#14
by Raspbfox - opened

Hey hey, I am currently working on a mobile-friendly runtime for RWKV, and one of the areas I am really curious about is aggressive quantization, since storage and RAM on mobile devices are extremely limited.

At the moment I am experimenting with dynamic INT4 quantization of your pre-trained models, and it made me think: maybe we could natively train at least one decently sized model fully in INT4?
Or, at least, use the existing model as a "teacher" and distill a quantized "student" model from it.

What do you think?

This should, in theory, shrink the 7B model to roughly 3.5 GB of weights (7B parameters × 4 bits), which is in the realm of the acceptable on mobile devices 👀
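To make the idea concrete, here is a rough weight-only sketch of what I mean by INT4 quantization of a layer's weight matrix (plain PyTorch, my own function names, not code from the actual runtime): symmetric per-row 4-bit quantization with two values packed per byte, i.e. ~0.5 bytes per parameter plus one scale per row.

```python
import torch

def quantize_int4(weight: torch.Tensor):
    """Symmetric per-row INT4 quantization of a 2-D weight matrix
    (rough sketch only; assumes an even number of columns).
    Returns the weights packed two-per-byte plus a per-row FP32 scale."""
    scale = weight.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 7
    q = torch.clamp(torch.round(weight / scale), -8, 7).to(torch.int8)
    q = (q + 8).to(torch.uint8)              # shift from [-8, 7] into [0, 15]
    packed = (q[:, 0::2] << 4) | q[:, 1::2]  # two nibbles per byte
    return packed, scale

def dequantize_int4(packed: torch.Tensor, scale: torch.Tensor):
    """Unpack back to FP32 for the matmul (a real runtime would fuse this)."""
    hi = (packed >> 4).to(torch.int8) - 8
    lo = (packed & 0x0F).to(torch.int8) - 8
    q = torch.stack((hi, lo), dim=-1).flatten(start_dim=1)
    return q.to(torch.float32) * scale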

It is not possible to train quantized models directly in INT4, because the weight updates in a single training iteration are smaller than the quantization step, so they get rounded away and training effectively stalls. That's why quantization is normally only used for inference.

@cha0tik , from what I understand, it doesn't necessarily mean training a whole new model internally using INT4 values; rather, it means training (or further training) a model in a way that takes the issues of very-low-precision quantization into account.

https://intellabs.github.io/distiller/quantization.html#quantization-aware-training

As mentioned above, in order to minimize the loss of accuracy from "aggressive" quantization, many methods that target INT4 and lower (and in some cases for INT8 as well) involve training the model in a way that considers the quantization. This means training with quantization of weights and activations "baked" into the training procedure.
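As a concrete (simplified, hypothetical) illustration of what "baking quantization into training" means in practice: the master weights stay in FP32, the forward pass sees fake-quantized INT4 values, and a straight-through estimator passes gradients around the rounding, so the tiny per-step updates mentioned above can still accumulate in the FP32 copy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FakeQuant4(torch.autograd.Function):
    """Symmetric 4-bit fake quantization with a straight-through estimator."""
    @staticmethod
    def forward(ctx, w):
        scale = w.abs().max().clamp_min(1e-8) / 7   # map max |w| to INT4 level 7
        return torch.clamp(torch.round(w / scale), -8, 7) * scale
    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                          # pretend rounding is the identity

class QATLinear(nn.Linear):
    """Linear layer that trains FP32 master weights but computes with fake-INT4 ones."""
    def forward(self, x):
        return F.linear(x, FakeQuant4.apply(self.weight), self.bias)
```

After training, the FP32 master weights are rounded once to real INT4 for deployment.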

https://arxiv.org/abs/1606.06160

As most convolutions during forward/backward passes are now taking low bitwidth weights and activations/gradients respectively, DoReFa-Net can use the bit convolution kernels to accelerate both training and inference process. Our experiments on SVHN and ImageNet datasets demonstrate that DoReFa-Net can achieve comparable prediction accuracy as their 32-bit counterparts. For example, a DoReFa-Net derived from AlexNet that has 1-bit weights, 2-bit activations, can be trained from scratch using 6-bit gradients to get 46.1% top-1 accuracy on ImageNet validation set.
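For reference, the paper's k-bit weight quantization boils down to something like this (my own paraphrase in PyTorch; in the full scheme the rounding is again wrapped in a straight-through estimator):

```python
import torch

def quantize_k(x: torch.Tensor, k: int) -> torch.Tensor:
    # Round a tensor with values in [0, 1] onto 2^k evenly spaced levels.
    n = 2 ** k - 1
    return torch.round(x * n) / n

def dorefa_weights(w: torch.Tensor, k: int) -> torch.Tensor:
    # DoReFa-style k-bit weights: squash with tanh, rescale into [0, 1],
    # quantize, then map back to [-1, 1].
    t = torch.tanh(w)
    x = t / (2 * t.abs().max().clamp_min(1e-8)) + 0.5  # clamp avoids divide-by-zero
    return 2 * quantize_k(x, k) - 1
```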

So why not look into a 4-bit (additionally trained) quantized set of models, or even a combination of 1-, 2- and 4-bit quantization in the appropriate places?

@Raspbfox @BlinkDL Are there any quantization efforts for the RWKV GPU models? I know the ggml port of RWKV already uses ggml's quantization methods. What about RWKV's GPTQ port?

If GPTQ's compression methods apply to RWKV, we should be able to simply quantize the original model with minimal, acceptable accuracy loss. This would help deploy larger models (future 25B, 50B, 100B RWKV models) with a much smaller VRAM footprint, and possibly even faster inference.
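Back-of-the-envelope, weights-only (ignoring activations, the recurrent state, and GPTQ's per-group scales/zero-points, and using hypothetical parameter counts for the larger models):

```python
# Rough weights-only memory estimates; real checkpoints will be somewhat larger.
for params_b in (7, 25, 50, 100):
    fp16_gb = params_b * 2.0   # 2 bytes per parameter
    int4_gb = params_b * 0.5   # 0.5 bytes per parameter
    print(f"{params_b}B params: fp16 ~{fp16_gb:.0f} GB, int4 ~{int4_gb:.1f} GB")
```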
