Optimum documentation

Using Hugging Face libraries on AMD GPUs

Hugging Face libraries natively support AMD Instinct MI210 and MI250 GPUs. Support for other ROCm-powered GPUs has not been validated yet, but most features are expected to work smoothly.

The supported integrations are summarized below.

Flash Attention 2

Flash Attention 2 is available on ROCm (validated on MI210 and MI250) through the ROCmSoftwarePlatform/flash-attention library, and can be used in Transformers:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

# Load the model in fp16 directly on the GPU, with Flash Attention 2 enabled.
with torch.device("cuda"):
    model = AutoModelForCausalLM.from_pretrained(
        "tiiuae/falcon-7b",
        torch_dtype=torch.float16,
        use_flash_attention_2=True,
    )
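
As a quick sanity check, the loaded model can then be used for generation as usual (the prompt below is only an illustration):

# Run a short generation to verify that the Flash Attention 2 path works end to end.
inputs = tokenizer("ROCm is an open software platform for", return_tensors="pt").to("cuda")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))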

We recommend using this example Dockerfile to run Flash Attention on ROCm, or following the official installation instructions.

GPTQ quantization

GPTQ-quantized models can be loaded in Transformers, using the AutoGPTQ library as the backend:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-Chat-GPTQ")

# AutoGPTQ is used under the hood to load the quantized weights.
with torch.device("cuda"):
    model = AutoModelForCausalLM.from_pretrained(
        "TheBloke/Llama-2-7B-Chat-GPTQ",
        torch_dtype=torch.float16,
    )
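
As a quick check (the prompt below is only illustrative), the quantized model can be inspected and used like any other Transformers model:

# The GPTQ checkpoint should report a much smaller footprint than the fp16 model.
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")

inputs = tokenizer("What is GPTQ quantization?", return_tensors="pt").to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))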

Hosted wheels are available for ROCm; please check out the installation instructions.

Text Generation Inference library

Hugging Face’s Text Generation Inference (TGI) library is designed for low-latency LLM serving, and natively supports AMD Instinct MI210 and MI250 GPUs from version 1.2 onwards. Please refer to the Quick Tour section for more details.

Using TGI on ROCm with AMD Instinct MI210 or MI250 GPUs is as simple as using the docker image ghcr.io/huggingface/text-generation-inference:1.2-rocm.
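
Once the container is running and serving a model, the endpoint can be queried from Python, for instance with huggingface_hub's InferenceClient (the localhost address and port below are assumptions that depend on how the container was launched):

from huggingface_hub import InferenceClient

# Assumes the TGI container exposes its serving port on localhost:8080.
client = InferenceClient("http://localhost:8080")

output = client.text_generation("What is ROCm?", max_new_tokens=64)
print(output)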

Detailed benchmarks of Text Generation Inference on MI250 GPUs will soon be published.

ONNX Runtime integration

🤗 Optimum supports running Transformers and Diffusers models through ONNX Runtime on ROCm-powered AMD GPUs. It is as simple as:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# Export the checkpoint to ONNX on the fly and run it on the ROCm execution provider.
ort_model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    export=True,
    provider="ROCMExecutionProvider",
)

inp = tokenizer("Both the music and visual were astounding, not to mention the actors' performance.", return_tensors="np")
result = ort_model(**inp)
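
The returned logits can then be mapped back to the model's labels; a minimal sketch using the config shipped with the checkpoint:

import numpy as np

# Pick the highest-scoring class and look up its human-readable label.
predicted_class = int(np.argmax(result.logits, axis=-1)[0])
print(ort_model.config.id2label[predicted_class])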

Check out more details about the support in this guide.

Bitsandbytes quantization

Bitsandbytes (integrated into Hugging Face’s Transformers and Text Generation Inference) does not currently support ROCm officially. We are working towards validating it on ROCm and through the Hugging Face libraries.

In the meantime, advanced users may want to use the ROCmSoftwarePlatform/bitsandbytes fork, or a work-in-progress community version.
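
For reference, these forks plug into the usual Transformers 8-bit loading API; whether the sketch below actually runs on a given ROCm setup depends entirely on the bitsandbytes build installed (the model name is only an illustration):

from transformers import AutoModelForCausalLM

# Requires a bitsandbytes build with ROCm support; not officially validated yet.
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    device_map="auto",
    load_in_8bit=True,
)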

AWQ quantization

AWQ quantization, which is supported in Transformers and Text Generation Inference, is currently not available on ROCm GPUs.

We look forward to a port, or to the ongoing development of a compatible Triton kernel.
