Using Hugging Face libraries on AMD GPUs
Hugging Face libraries natively support AMD Instinct MI210 and MI250 GPUs. For other ROCm-powered GPUs, support has not been validated yet, but most features are expected to work smoothly.
The integrations are summarized below.
Flash Attention 2
Flash Attention 2 is available on ROCm (validated on MI210 and MI250) through the ROCmSoftwarePlatform/flash-attention library, and can be used in Transformers:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

# Instantiate the model directly on the GPU, with Flash Attention 2 enabled.
with torch.device("cuda"):
    model = AutoModelForCausalLM.from_pretrained(
        "tiiuae/falcon-7b",
        torch_dtype=torch.float16,
        use_flash_attention_2=True,
    )
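Once loaded, a quick generation call can confirm that the model runs end to end. This is a minimal sketch assuming the snippet above has already executed; the prompt is arbitrary:

# Tokenize a prompt, move it to the GPU and generate a short completion.
inputs = tokenizer("What is ROCm?", return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))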
We recommend using this example Dockerfile to run Flash Attention on ROCm, or following the official installation instructions.
GPTQ quantization
GPTQ-quantized models can be loaded in Transformers, using the AutoGPTQ library as the backend:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-Chat-GPTQ")

# Instantiate the GPTQ-quantized model directly on the GPU.
with torch.device("cuda"):
    model = AutoModelForCausalLM.from_pretrained(
        "TheBloke/Llama-2-7B-Chat-GPTQ",
        torch_dtype=torch.float16,
    )
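A simple way to see the effect of quantization is to check the GPU memory footprint right after loading. This is a rough sketch assuming the snippet above has run; on ROCm builds of PyTorch, torch.cuda maps to HIP, and the exact figure depends on the model and environment:

# Report how much GPU memory the quantized weights occupy.
print(f"Allocated GPU memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")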
Hosted wheels are available for ROCm; please check out the installation instructions.
Text Generation Inference library
Hugging Face’s Text Generation Inference (TGI) library is designed for low-latency LLM serving, and natively supports AMD Instinct MI210 and MI250 GPUs from version 1.2 onwards. Please refer to the Quick Tour section for more details.
Using TGI on ROCm with AMD Instinct MI210 or MI250 GPUs is as simple as using the Docker image ghcr.io/huggingface/text-generation-inference:1.2-rocm.
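Once a container based on this image is up and serving a model (assumed here to be reachable at http://localhost:8080; adjust to your setup), it can be queried from Python, for example through huggingface_hub's InferenceClient. A minimal sketch:

from huggingface_hub import InferenceClient

# Assumes a TGI server started from the ROCm image is listening on localhost:8080.
client = InferenceClient(model="http://localhost:8080")
print(client.text_generation("What is ROCm?", max_new_tokens=32))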
Detailed benchmarks of Text Generation Inference on MI250 GPUs will soon be published.
ONNX Runtime integration
🤗 Optimum supports running Transformers and Diffusers models through ONNX Runtime on ROCm-powered AMD GPUs. It is as simple as:
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# Export the model to ONNX on the fly and run it with the ROCm execution provider.
ort_model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    export=True,
    provider="ROCMExecutionProvider",
)
inp = tokenizer("Both the music and visual were astounding, not to mention the actors performance.", return_tensors="np")
result = ort_model(**inp)
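To interpret the output, the logits can be mapped back to the model's labels. This small follow-up is not part of the original example, but relies only on standard Transformers conventions (argmax over the logits and the config's id2label mapping):

# Pick the highest-scoring class and map it to a human-readable label.
predicted_class = int(result.logits.argmax(-1)[0])
print(ort_model.config.id2label[predicted_class])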
Check out this guide for more details about the support.
Bitsandbytes quantization
Bitsandbytes (integrated in HF’s Transformers and Text Generation Inference) currently does not officially support ROCm. We are working towards its validation on ROCm and through Hugging Face libraries.
In the meantime, advanced users may want to use the ROCmSoftwarePlatform/bitsandbytes fork, or a work-in-progress community version.
AWQ quantization
AWQ quantization, which is supported in Transformers and Text Generation Inference, is currently not available on ROCm GPUs.
We look forward to a port, or to the ongoing development of a compatible Triton kernel.