TGI offers GPTQ and bits-and-bytes quantization to quantize large language models.
GPTQ is a post-training quantization method that makes the model smaller. It quantizes each layer by finding a compressed version of its weights that yields the minimum mean squared error, like below 👇
Given a layer $l$ with weight matrix $W_l$ and layer input $X_l$, find the quantized weight $\hat{W}_l$:

$$\hat{W}_l^{*} = \operatorname{argmin}_{\hat{W}_l} \lVert W_l X - \hat{W}_l X \rVert_2^2$$
TGI allows you to either run an already GPTQ-quantized model (see available models here) or quantize a model of your choice using the quantization script. You can run a quantized model by simply passing --quantize like below 👇
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --quantize gptq
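Once the server is up, you can query it as usual. As a minimal sketch, assuming the container is exposed on port 8080 as in the command above and using TGI's /generate endpoint:

curl 127.0.0.1:8080/generate -X POST -H 'Content-Type: application/json' -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}'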
Note that TGI’s GPTQ implementation doesn’t use AutoGPTQ under the hood. However, models quantized using AutoGPTQ or Optimum can still be served by TGI.
To quantize a given model using GPTQ with a calibration dataset, simply run
text-generation-server quantize tiiuae/falcon-40b /data/falcon-40b-gptq # Add --upload-to-model-id MYUSERNAME/falcon-40b to push the created model to the hub directly
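If you are running TGI through the Docker image rather than a local install, one way to reach the quantization script is to override the image entrypoint. This is only a sketch, assuming text-generation-server is on the image's PATH and $volume is mounted as in the earlier commands:

docker run --gpus all --shm-size 1g -v $volume:/data --entrypoint text-generation-server ghcr.io/huggingface/text-generation-inference:latest quantize tiiuae/falcon-40b /data/falcon-40b-gptq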
This will create a new directory with the quantized files, which you can use with
text-generation-launcher --model-id /data/falcon-40b-gptq/ --sharded true --num-shard 2 --quantize gptq
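The same flags can also be passed when serving through the Docker image; for example (a sketch, assuming the quantized files live under the mounted $volume):

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id /data/falcon-40b-gptq --sharded true --num-shard 2 --quantize gptq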
You can learn more about the quantization options by running text-generation-server quantize --help.
bitsandbytes is a library used to apply 8-bit and 4-bit quantization to models. Unlike GPTQ quantization, bitsandbytes doesn’t require a calibration dataset or any post-processing – weights are automatically quantized on load. However, inference with bitsandbytes is slower than GPTQ or FP16 precision.
8-bit quantization enables multi-billion-parameter-scale models to fit on smaller hardware without degrading performance too much.
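As a rough back-of-the-envelope estimate (ignoring activations, the KV cache, and quantization overhead), a 7-billion-parameter model needs about 7B × 2 bytes ≈ 14 GB of memory for its weights in fp16, about 7 GB in 8-bit, and about 3.5 GB in 4-bit.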
In TGI, you can use 8-bit quantization by adding --quantize bitsandbytes like below 👇
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --quantize bitsandbytes
4-bit quantization is also possible with bitsandbytes. You can choose one of the following 4-bit data types: 4-bit float (fp4) or 4-bit NormalFloat (nf4). These data types were introduced in the context of parameter-efficient fine-tuning, but you can apply them for inference by automatically converting the model weights on load.
In TGI, you can use 4-bit quantization by adding --quantize bitsandbytes-nf4 or --quantize bitsandbytes-fp4 like below 👇
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --quantize bitsandbytes-nf4