Can this be fine-tuned using Amazon SageMaker or run on an AMD GPU that is not CUDA-enabled?

#18
by Bigshot - opened

I have attempted to run this model on an AMD GPU, but it seems that CUDA is a requirement for Flash Attention.
Because of this, I have resorted to trying to fine-tune it with SageMaker instead.
Will that work?
Can you add support for AMD GPUs as well?
How can I fix this?


See: https://www.reddit.com/r/MachineLearning/comments/xjnozq/d_whats_the_word_on_amd_gpus_these_days/
And: https://www.amd.com/en/graphics/servers-solutions-rocm

In summary:

  • ROCm support at the consumer level is not great, and PyTorch must be compiled from source.
  • At the server level, you can find AMD compute instances with ROCm-enabled PyTorch precompiled, at a price.
  • The Flash Attention runtime kernels are written specifically for NVIDIA server GPUs.

None of this gets you Flash Attention, but there is some good news:

  • Compiling from source gets you AMD-GPU-accelerated PyTorch via ROCm, which does give AMD users an option.
  • Hugging Face Accelerate (the `device_map` machinery in transformers) becomes usable. It uses PyTorch to split the model across several devices when memory runs low, so multiple AMD GPUs can be used in a load-balanced configuration -- just without Flash Attention. Accelerate is hardware agnostic: it is essentially a load balancer with no runtime kernels of its own; it shards the weight blocks and dispatches work to whatever devices are available (see the sketch below).
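Here is a minimal sketch of what that looks like in practice. The model id is a placeholder (substitute the one from this repo), and it assumes you already have a ROCm build of PyTorch plus `transformers` and `accelerate` installed:

```python
# Sketch: loading a model sharded across devices with device_map="auto".
# This path is hardware agnostic -- it works the same on ROCm and CUDA
# builds of PyTorch -- but it does NOT give you Flash Attention.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-model"  # placeholder, not the actual repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" lets Accelerate spread the weights over every visible
# GPU and spill the remainder to CPU RAM when VRAM runs out.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# ROCm builds still expose AMD GPUs under the "cuda" device type.
inputs = tokenizer("Hello, world", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```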

TL;DR:

If you want this locally: from the reported posts, you will run into circular dependency problems, so the best approach is to compile the ROCm-compatible version of PyTorch in a Docker container, then install the resulting .whl (or .egg) package into WSL or a native Ubuntu 22.04 install with your drivers set up. That gives you GPU-accelerated PyTorch plus Hugging Face Accelerate offloading. (This is currently the best consumer-level solution I've found, since I don't need much VRAM, I just need a lot of RAM -- and it makes for cheaper compute instances, too.)
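Once the wheel is installed, a quick sanity check (just a sketch, nothing specific to this model) confirms that the build is actually the ROCm variant and that your AMD GPU is visible. Note that ROCm builds of PyTorch still expose the GPU through the `torch.cuda` API:

```python
# Verify a ROCm-enabled PyTorch build and GPU access.
import torch

print("PyTorch version:", torch.__version__)
print("HIP/ROCm version:", torch.version.hip)        # None on CUDA-only builds
print("GPU available:", torch.cuda.is_available())   # True once ROCm + drivers are set up

if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    # A tiny matmul proves the runtime actually works end to end.
    x = torch.randn(1024, 1024, device="cuda")
    print("Matmul OK:", (x @ x).sum().item())
```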

This says nothing about compatibility with Flash Attention. Users have reportedly gotten txt2img models working this way, but only on a custom-compiled build of PyTorch. In light of that, sad to say, it's highly recommended that you choose an NVIDIA product for this application if you value your time. AMD has not kept up with NVIDIA's level of compatibility and support for running LLMs in the consumer market, and the HF transformers docs do not treat AMD as a priority, since most of the transformer optimizations currently out there depend on NVIDIA-specific tricks and kernel code -- including Flash Attention, ALiBi, xFormers, etc.

In short, you either pay for one of their servers to get ROCm out of the box, or you switch to NVIDIA if you want this at home without spending hours resolving circular dependency problems. You will gain GPU acceleration either way, but you will not get Flash Attention, since it depends on NVIDIA-specific kernel-level optimizations.

NVIDIA is the standard in the ML world and has the broadest compatibility. Running LLMs on AMD is generally not recommended, and is actively discouraged given the time investment needed to get it working. We all had to bite the bullet and accept it, because AMD themselves simply have not kept up.

Sadly, you can't choose Team Red for this application without being prepared to invest months in custom, hardware-tailored solutions.
