4-bit GPTQ-quantized version of https://huggingface.co/tiiuae/falcon-40b-instruct

Make sure to run this model with FlashAttention, as is done in https://github.com/huggingface/text-generation-inference

Also note that the 4-bit GPTQ version seems to run about 2x slower than the 8-bit bitsandbytes version within text-generation-inference. We typically saw about 600-800 ms latency for token generation with 8-bit bitsandbytes, whereas we are seeing about 1.2-1.7 s with the 4-bit GPTQ version.
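
For anyone who wants to reproduce a comparison like this, here is a minimal sketch of one way to time generation against a running text-generation-inference server. It is not the script we used for the numbers above. The base URL is a hypothetical local deployment, and the helper names `tgi_generate` and `seconds_per_token` are made up for illustration. The sketch assumes the server's `/generate` endpoint and divides total wall-clock time by the requested token count, so prompt processing is included and the result is only a rough figure.

```python
import json
import time
import urllib.request
from typing import Callable


def tgi_generate(prompt: str, max_new_tokens: int,
                 base_url: str = "http://localhost:8080") -> str:
    """POST a prompt to a text-generation-inference /generate endpoint.

    base_url is an assumption (a hypothetical local deployment).
    """
    payload = json.dumps({
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["generated_text"]


def seconds_per_token(generate: Callable[[str, int], str],
                      prompt: str,
                      max_new_tokens: int = 20) -> float:
    """Time one generation call and divide by the requested token count.

    If the model stops early (fewer tokens than requested), this
    underestimates the true per-token latency.
    """
    start = time.perf_counter()
    generate(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return elapsed / max_new_tokens
```

With a server running, `seconds_per_token(tgi_generate, "Hello", 20)` gives one sample; averaging several calls against each quantized deployment would give a fairer comparison.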

VRAM usage is a little over 25 GB for this 4-bit quantized version, compared to 47 GB for the 8-bit version and 80 GB for the full model.
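
As a rough sanity check on these figures, the sketch below estimates the memory needed for the weights alone at each bit width. It assumes a nominal 40e9 parameters (taken from the model name, not a measured count) and assumes that "full" means 16 bits per parameter. The gap between the estimate and the reported number is presumably runtime overhead such as activations, the KV cache, and quantization metadata, but we have not measured that breakdown.

```python
# Nominal parameter count, taken from the model name (assumption).
NOMINAL_PARAMS = 40e9


def weight_gb(bits_per_param: float, params: float = NOMINAL_PARAMS) -> float:
    """Return weight storage in GB (1 GB = 1e9 bytes) at a given bit width."""
    return params * bits_per_param / 8 / 1e9


# VRAM figures reported in this card, in GB.
# Mapping "full" to 16 bits is an assumption.
reported_gb = {4: 25, 8: 47, 16: 80}

for bits, observed in reported_gb.items():
    estimate = weight_gb(bits)
    print(f"{bits:>2}-bit: weights ~{estimate:.0f} GB, "
          f"reported {observed} GB, "
          f"difference ~{observed - estimate:.0f} GB")
```

Under these assumptions the weights alone come to about 20 GB, 40 GB, and 80 GB, which is broadly consistent with the reported usage.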

This was quantized using:

text-generation-server quantize tiiuae/falcon-40b-instruct /tmp/falcon40instructgptq --upload-to-model-id AxisMind/falcon-40b-instruct-gptq --trust-remote-code --act-order

Hugging Face's GPTQ implementation can be found here: https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/utils/gptq/quantize.py

We have not yet tested quality degradation thoroughly, but for our use cases we did not notice any significant degradation, which is in line with the claims of the GPTQ paper relative to other low-bit quantization methods.
