Text Generation Inference (TGI) can optimize how a model is loaded and served in several ways: quantization, RoPE scaling, and safetensors weight loading.
TGI supports bits-and-bytes, GPT-Q and AWQ quantization. To speed up inference with quantization, set the --quantize flag to bitsandbytes, gptq or awq, depending on the quantization technique you wish to use. When using GPT-Q quantization, you need to point to a model that has already been quantized with GPT-Q; when using AWQ quantization, you need to point to a model that has already been quantized with AWQ. For more information about quantization, refer to the quantization guide.
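As a rough illustration of the flag described above, the following Python sketch assembles a launcher command line. The helper name build_launcher_args and the model ID are hypothetical placeholders, and the set of accepted values is limited to the three techniques named in this section; the launcher itself may accept more.

```python
# Minimal sketch: build an argv list for text-generation-launcher.
# build_launcher_args and the model ID below are hypothetical, for illustration only.

# Only the three quantization values named in this section are listed here.
SUPPORTED_QUANTIZE = {"bitsandbytes", "gptq", "awq"}


def build_launcher_args(model_id, quantize=None):
    """Return the command as a list of strings, optionally with --quantize."""
    args = ["text-generation-launcher", "--model-id", model_id]
    if quantize is not None:
        if quantize not in SUPPORTED_QUANTIZE:
            raise ValueError(f"unsupported quantize value: {quantize!r}")
        args += ["--quantize", quantize]
    return args


# For gptq and awq, the model ID should point to weights that are already
# quantized with that technique. "your-org/your-model-awq" is a placeholder.
cmd = build_launcher_args("your-org/your-model-awq", quantize="awq")
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run, or joined into a single string and pasted into a shell.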
RoPE scaling can be used to increase the sequence length of the model at inference time without necessarily fine-tuning it. To enable RoPE scaling, pass the --rope-scaling, --max-input-length and --rope-factor flags when running through the CLI. --rope-scaling can take the values linear or dynamic. If your model is not fine-tuned to a longer sequence length, use dynamic. --rope-factor is the ratio between the intended max sequence length and the model’s original max sequence length. Make sure to pass --max-input-length to set the maximum input length for the extended context.
We recommend using dynamic RoPE scaling.
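The ratio behind --rope-factor can be worked through with a small Python sketch. The helper names and the sequence lengths (4096 original, 8192 intended) are illustrative assumptions, not values taken from any particular model.

```python
# Minimal sketch: compute the RoPE factor and the matching CLI flags.
# rope_factor and rope_scaling_args are hypothetical helper names.


def rope_factor(target_len, original_len):
    """Ratio of the intended max sequence length to the model's original one."""
    if original_len <= 0 or target_len < original_len:
        raise ValueError("target_len must be >= original_len > 0")
    return target_len / original_len


def rope_scaling_args(target_len, original_len, max_input_length, mode="dynamic"):
    """Return the three RoPE-related flags as a list of strings."""
    if mode not in ("linear", "dynamic"):
        raise ValueError(f"unsupported rope scaling mode: {mode!r}")
    return [
        "--rope-scaling", mode,
        "--rope-factor", str(rope_factor(target_len, original_len)),
        "--max-input-length", str(max_input_length),
    ]


# Illustrative numbers: extend a 4096-token model to 8192 tokens.
flags = rope_scaling_args(target_len=8192, original_len=4096, max_input_length=8000)
print(" ".join(flags))
```

Here the factor is 8192 / 4096 = 2.0. The max_input_length value of 8000 is an arbitrary example chosen to fit within the extended length.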
Safetensors is a fast and safe persistence format for deep learning models, and is required for tensor parallelism. TGI supports safetensors model loading under the hood. By default, given a repository with both safetensors and pytorch weights, TGI will always load safetensors. If the repository has only pytorch weights, TGI will convert the weights to safetensors format.
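The loading preference described above can be sketched as a small decision function. This is not TGI's actual implementation: plan_weight_loading is a hypothetical helper, and it assumes that safetensors weights end in .safetensors and pytorch weights end in .bin.

```python
# Minimal sketch of the "prefer safetensors, otherwise convert" rule.
# Hypothetical helper; file extensions are assumed, not taken from TGI.


def plan_weight_loading(filenames):
    """Return ("load", files) for safetensors or ("convert", files) for pytorch."""
    safetensors_files = sorted(f for f in filenames if f.endswith(".safetensors"))
    if safetensors_files:
        # safetensors weights are present, so they are always preferred.
        return "load", safetensors_files
    pytorch_files = sorted(f for f in filenames if f.endswith(".bin"))
    if pytorch_files:
        # Only pytorch weights exist, so they need converting first.
        return "convert", pytorch_files
    raise FileNotFoundError("no .safetensors or .bin weights found")


# A repository with both formats resolves to the safetensors file.
action, files = plan_weight_loading(["config.json", "model.safetensors", "pytorch_model.bin"])
print(action, files)
```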