# Efficient Inference on a Single GPU
This document will be completed soon with information on how to run inference on a single GPU. In the meantime you can check out the guide for training on a single GPU and the guide for inference on CPUs.
## BetterTransformer for faster inference
We have recently integrated `BetterTransformer` for faster inference on GPU for text, image and audio models. Check the documentation about this integration here for more details.
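As a quick illustration, the conversion is a one-liner through the `optimum` library. This is a minimal sketch, assuming you have installed `optimum` (`pip install optimum`) in addition to `transformers`; the model checkpoint is just an example:

```python
from optimum.bettertransformer import BetterTransformer
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
# Swap supported layers for their fused BetterTransformer equivalents
model = BetterTransformer.transform(model)
```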
## bitsandbytes integration for Int8 mixed-precision matrix decomposition
Note that this feature can also be used in a multi-GPU setup.
From the paper LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, we support Hugging Face integration for all models in the Hub with a few lines of code.
The method reduces the memory footprint of `nn.Linear` layers by a factor of 2 for `float16` and `bfloat16` weights, and by a factor of 4 for `float32` weights, with close to no impact on quality by operating on the outliers in half-precision.
Int8 mixed-precision matrix decomposition works by separating a matrix multiplication into two streams: (1) a systematic feature-outlier stream multiplied in fp16 (0.01% of values), and (2) a regular stream of int8 matrix multiplication (99.9% of values). With this method, int8 inference with no predictive degradation is possible for very large models. For more details about the method, check out the paper or our blog post about the integration.
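To make the two streams concrete, here is an illustrative PyTorch sketch of the decomposition idea. This is not the actual `bitsandbytes` kernel; the helper name and the `threshold` default are our own, and the "int8" values are kept as floats on the int8 grid for simplicity:

```python
import torch

def mixed_int8_matmul_sketch(x_fp16, w_fp16, threshold=6.0):
    """Illustrative two-stream matmul: fp16 for outlier features, int8 for the rest."""
    # (1) find outlier feature columns whose magnitude exceeds the threshold
    outliers = (x_fp16.abs() > threshold).any(dim=0)
    # outlier stream: keep these features in fp16
    out_fp16 = x_fp16[:, outliers] @ w_fp16[outliers, :]
    # (2) regular stream: quantize to the int8 grid, matmul, then dequantize
    x_reg, w_reg = x_fp16[:, ~outliers], w_fp16[~outliers, :]
    sx = x_reg.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0  # per-row scales
    sw = w_reg.abs().amax(dim=0, keepdim=True).clamp(min=1e-8) / 127.0  # per-column scales
    x_i8 = (x_reg / sx).round().clamp(-127, 127)  # int8-grid values (floats here)
    w_i8 = (w_reg / sw).round().clamp(-127, 127)
    out_int8 = (x_i8 @ w_i8) * sx * sw  # dequantize back
    return out_fp16 + out_int8
```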
Note that you need a GPU to run mixed-8bit models, as the kernels have been compiled for GPUs only. Make sure that you have enough GPU memory to store a quarter of the model (or half of it, if your model weights are in half precision) before using this feature. Below are some notes to help you use this module, or follow the demos on Google Colab.
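As a back-of-the-envelope check, weights take 1 byte per parameter in int8, 2 in fp16/bf16, and 4 in fp32. An illustrative calculation (the parameter count is an example):

```python
# Rough memory estimate for the weights of a 2.5B-parameter model
n_params = 2.5e9
for name, nbytes in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{name}: ~{n_params * nbytes / 1e9:.1f} GB")
```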
### Requirements
- If you have `bitsandbytes<0.37.0`, make sure you run on NVIDIA GPUs that support 8-bit tensor cores (Turing, Ampere or newer architectures, e.g. T4, RTX20s, RTX30s, A40-A100). For `bitsandbytes>=0.37.0`, all GPUs should be supported; see the quick check after this list.
- Install the correct version of `bitsandbytes` by running: `pip install bitsandbytes>=0.31.5`
- Install `accelerate` by running: `pip install accelerate>=0.12.0`
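If you are unsure whether your GPU has 8-bit tensor cores, you can check its compute capability with PyTorch. A quick, illustrative check (Turing reports `(7, 5)`, Ampere reports `(8, 0)`):

```python
import torch

# Turing GPUs (e.g. T4) report (7, 5); Ampere (e.g. A100) reports (8, 0)
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
print("8-bit tensor cores available:", (major, minor) >= (7, 5))
```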
### Running mixed-Int8 models - single GPU setup
After installing the required libraries, the way to load your mixed 8-bit model is as follows:
```python
from transformers import AutoModelForCausalLM

model_name = "bigscience/bloom-2b5"
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
```
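You can then verify how much memory the quantized model actually uses with the model's `get_memory_footprint()` method:

```python
# Reports the model's memory footprint in bytes
print(f"Memory footprint: {model_8bit.get_memory_footprint() / 1e9:.2f} GB")
```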
For text generation, we recommend:
- using the model's `generate()` method instead of the `pipeline()` function. Although inference is possible with the `pipeline()` function, it is not optimized for mixed-8bit models and will be slower than using the `generate()` method. Moreover, some sampling strategies, like nucleus sampling, are not supported by the `pipeline()` function for mixed-8bit models.
- placing all inputs on the same device as the model.
Here is a simple example:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-2b5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)

# Place the inputs on the same device as the model
prompt = "Hello, my llama is cute"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

generated_ids = model_8bit.generate(**inputs)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
```
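Since sampling strategies such as nucleus sampling are available through `generate()` even for mixed-8bit models, you can enable them directly there. A short usage sketch (the sampling parameters are example values):

```python
# Nucleus sampling works with generate() even though pipeline() does not support it here
generated_ids = model_8bit.generate(**inputs, do_sample=True, top_p=0.9, max_new_tokens=20)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
```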
### Running mixed-int8 models - multi GPU setup
The way to load your mixed 8-bit model on multiple GPUs is as follows (it is the same command as in the single GPU setup):
```python
model_name = "bigscience/bloom-2b5"
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
```
But you can control the GPU RAM you want to allocate on each GPU using `accelerate`. Use the `max_memory` argument as follows:
```python
max_memory_mapping = {0: "1GB", 1: "2GB"}
model_name = "bigscience/bloom-3b"
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", load_in_8bit=True, max_memory=max_memory_mapping
)
```
In this example, the first GPU will use at most 1GB of memory and the second one at most 2GB.
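To see how the model was actually split across the two GPUs, you can inspect the model's `hf_device_map` attribute:

```python
# Mapping from module names to the device each was placed on
print(model_8bit.hf_device_map)
```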
### Colab demos
With this method you can run inference on models that previously did not fit on Google Colab. Check out the demo for running T5-11b (42GB in fp32) with 8-bit quantization on Google Colab:
Or this demo for BLOOM-3B: