Optimum documentation

Using Hugging Face libraries on AMD GPUs

Hugging Face libraries natively support AMD Instinct MI210 and MI250 GPUs. Support for other ROCm-powered GPUs has not been validated yet, but most features are expected to work smoothly.

The supported integrations are summarized below.

Flash Attention 2

Flash Attention 2 is available on ROCm (validated on MI210 and MI250) through the ROCmSoftwarePlatform/flash-attention library, and can be used in Transformers:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

# Load the model in fp16 directly on the GPU, with Flash Attention 2 enabled.
with torch.device("cuda"):
    model = AutoModelForCausalLM.from_pretrained(
        "tiiuae/falcon-7b",
        torch_dtype=torch.float16,
        use_flash_attention_2=True,
    )
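
As a quick sanity check, the loaded model can then be used for generation as usual (the prompt below is only an illustration):

# Run a short generation to verify that the Flash Attention 2 path works end to end.
inputs = tokenizer("ROCm is an open software platform for", return_tensors="pt").to("cuda")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))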

We recommend using this example Dockerfile to run Flash Attention on ROCm, or following the official installation instructions.

GPTQ quantization

GPTQ-quantized models can be loaded in Transformers, using the AutoGPTQ library as the backend:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-Chat-GPTQ")

# AutoGPTQ is used under the hood to load the quantized weights.
with torch.device("cuda"):
    model = AutoModelForCausalLM.from_pretrained(
        "TheBloke/Llama-2-7B-Chat-GPTQ",
        torch_dtype=torch.float16,
    )
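
As a quick check (the prompt below is only illustrative), the quantized model can be inspected and used like any other Transformers model:

# The GPTQ checkpoint should report a much smaller footprint than the fp16 model.
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")

inputs = tokenizer("What is GPTQ quantization?", return_tensors="pt").to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))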

Hosted wheels are available for ROCm; please check out the installation instructions.

Text Generation Inference library

Hugging Face’s Text Generation Inference (TGI) library is designed for low-latency LLM serving, and natively supports AMD Instinct MI210 and MI250 GPUs from version 1.2 onwards. Please refer to the Quick Tour section for more details.

Using TGI on ROCm with AMD Instinct MI210 or MI250 GPUs is as simple as using the docker image ghcr.io/huggingface/text-generation-inference:1.2-rocm.
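
Once the container is running and serving a model, the endpoint can be queried from Python, for instance with huggingface_hub's InferenceClient (the localhost address and port below are assumptions that depend on how the container was launched):

from huggingface_hub import InferenceClient

# Assumes the TGI container exposes its serving port on localhost:8080.
client = InferenceClient("http://localhost:8080")

output = client.text_generation("What is ROCm?", max_new_tokens=64)
print(output)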

Detailed benchmarks of Text Generation Inference on MI250 GPUs will soon be published.

ONNX Runtime integration

🤗 Optimum supports running Transformers and Diffusers models through ONNX Runtime on ROCm-powered AMD GPUs. It is as simple as:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# Export the checkpoint to ONNX on the fly and run it on the ROCm execution provider.
ort_model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    export=True,
    provider="ROCMExecutionProvider",
)

inp = tokenizer("Both the music and visual were astounding, not to mention the actors' performance.", return_tensors="np")
result = ort_model(**inp)
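
The returned logits can then be mapped back to the model's labels; a minimal sketch using the config shipped with the checkpoint:

import numpy as np

# Pick the highest-scoring class and look up its human-readable label.
predicted_class = int(np.argmax(result.logits, axis=-1)[0])
print(ort_model.config.id2label[predicted_class])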

Check out more details about the support in this guide.

Bitsandbytes quantization

Bitsandbytes (integrated into Hugging Face’s Transformers and Text Generation Inference) does not currently support ROCm officially. We are working towards validating it on ROCm and through the Hugging Face libraries.

In the meantime, advanced users may want to use the ROCmSoftwarePlatform/bitsandbytes fork, or a work-in-progress community version.
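
For reference, these forks plug into the usual Transformers 8-bit loading API; whether the sketch below actually runs on a given ROCm setup depends entirely on the bitsandbytes build installed (the model name is only an illustration):

from transformers import AutoModelForCausalLM

# Requires a bitsandbytes build with ROCm support; not officially validated yet.
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    device_map="auto",
    load_in_8bit=True,
)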

AWQ quantization

AWQ quantization, which is supported in Transformers and Text Generation Inference, is currently not available on ROCm GPUs.

We look forward to a port, or to the ongoing development of a compatible Triton kernel.
