Accelerating Embedding & Reranking Models on AMD Using Infinity

Community Article Published December 3, 2024

This guest article is written by the maintainer of michaelfeil/infinity, a popular open-source library for running text-embedding, reranking, vision-embedding (CLIP, ColQwen) and audio-embedding models on AMD with a high-throughput engine. This article contains a tutorial on how to quickly deploy an AMD-based embedding solution on ROCm via PyTorch and ONNX, a comparison with how to accomplish the same thing on NVIDIA, and a quick optimization guide for AMD.

Making AMD more popular?!

Why AMD GPUs? When talking to people at hackathons and community meetups, and when following discussions on the LocalLLaMA subreddit, it often feels as if AMD is forgotten and its capabilities are downplayed, with some even saying it will take years to catch up. Meanwhile, you can go to your favorite website (https://pytorch.org/get-started/locally/) and get torch with ROCm via a one-liner install command. The community is somewhat right, though: according to anonymized user submissions with infinity in December 2024, barely any AMD GPUs are currently used to run infinity. In the last week, only 0.7% of infinity usage was on AMD.

Image: Infinity usage, Nov26-Dec3-2024, excluding MPS and CPU targets.

As a true open-source project, it is interesting to maintain vendor neutrality, as long as the software remains easy to maintain. Infinity has added Apple MPS support, has best-in-class CPU support and also aims to be compatible with AWS Inferentia in the near future. It’s time to tackle AMD!

Tutorial

Recap: How Infinity runs via Docker on CUDA and CPU

To use an accelerated image with Docker, you also need an accelerator-specific installation on the host. This installation exposes the driver-specific details of the Docker host to the container and enables sharing of devices.

For NVIDIA, all you need to do is install the nvidia-container-toolkit.

Once that is done, you can follow the infinity instructions that are already included in a couple of popular model repos, e.g. snowflake-arctic-embed-m.

# --gpus all mounts the NVIDIA GPUs and requires the nvidia-container-toolkit.
# -p 7997:7997 forwards the port.
# michaelf34/infinity:0.0.70 selects the Dockerfile.nvidia image from infinity.
# --engine torch tells infinity to run the model using the pytorch backend engine.
docker run \
  --gpus all \
  -p 7997:7997 \
  michaelf34/infinity:0.0.70 \
  v2 \
  --model-id Snowflake/snowflake-arctic-embed-m \
  --engine torch

And it's running. If we wanted to accomplish the same thing on CPU, we might be better off with ONNX. Thankfully, the arctic model has an extra set of ONNX weights, located in [`./onnx/`](https://huggingface.co/Snowflake/snowflake-arctic-embed-m/tree/main/onnx).

# No GPUs are mounted for the CPU image.
# -p 7997:7997 forwards the port.
# --engine optimum tells infinity to run the model using the onnx/optimum backend engine.
docker run \
  -p 7997:7997 \
  michaelf34/infinity:0.0.70-cpu \
  v2 \
  --model-id Snowflake/snowflake-arctic-embed-m \
  --engine optimum

Tutorial 1: Running Embedding Models on AMD ROCm

Here’s how to reproduce the example on AMD.

First, we need a compatible GPU and the AMD counterpart of the container toolkit installed on the host (https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/docker.html). For AMD, that piece of software is the ROCm kernel-mode driver, amdgpu-dkms.

With amdgpu-dkms and the AMD drivers installed, any AMD PyTorch image is ready to be launched.

In this case, the image is based on [rocm/pytorch:rocm6.2.3_ubuntu22.04_py3.10_pytorch_release_2.3.0](https://github.com/michaelfeil/infinity/blob/main/libs/infinity_emb/Dockerfile.amd_auto), requiring a rocm6.2.3-compatible installation on the host. You can pull the image without login from Docker Hub via michaelf34/infinity:0.0.70-amd.

Instead of mounting the GPUs via --gpus, we add the following to the docker run command.

# relax the default seccomp profile and mount the AMD GPU devices into the container
--security-opt seccomp=unconfined \
--device=/dev/kfd \
--device=/dev/dri \

To put it all together:

docker run -it \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  -p 7997:7997 \
  michaelf34/infinity:0.0.70-amd \
  v2 \
  --model-id Snowflake/snowflake-arctic-embed-m \
  --engine torch # tell infinity to run the model using the pytorch backend engine.
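
The three docker invocations so far differ only in the device flags and the image tag. As a rough sketch of that difference (the helper name `build_docker_cmd` and the two lookup tables are hypothetical, not part of infinity; the flags and image tags are copied from the commands in this article), the commands could be assembled like this:

```python
import shlex

# Device/runtime flags per target, copied from the docker commands in this article.
TARGET_FLAGS = {
    "nvidia": ["--gpus", "all"],
    "cpu": [],
    "amd": [
        "-it",
        "--cap-add=SYS_PTRACE",
        "--security-opt", "seccomp=unconfined",
        "--device=/dev/kfd",
        "--device=/dev/dri",
        "--group-add", "video",
    ],
}

# Image tags used in this article (infinity version 0.0.70).
TARGET_IMAGES = {
    "nvidia": "michaelf34/infinity:0.0.70",
    "cpu": "michaelf34/infinity:0.0.70-cpu",
    "amd": "michaelf34/infinity:0.0.70-amd",
}


def build_docker_cmd(target, model_id, engine, port=7997):
    """Return the argv list for `docker run` on the given target (hypothetical helper)."""
    return [
        "docker", "run",
        *TARGET_FLAGS[target],
        "-p", f"{port}:{port}",
        TARGET_IMAGES[target],
        "v2",
        "--model-id", model_id,
        "--engine", engine,
    ]


if __name__ == "__main__":
    # Print one command per target so the differences are easy to compare.
    for target, engine in [("nvidia", "torch"), ("cpu", "optimum"), ("amd", "torch")]:
        cmd = build_docker_cmd(target, "Snowflake/snowflake-arctic-embed-m", engine)
        print(shlex.join(cmd))
```

The printed commands can be pasted into a shell, or the list can be passed to `subprocess.run` on a host where docker and the matching drivers are installed.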

Tutorial 2: Running Reranking Models on AMD ROCm via ONNX on MI300x

To run the same model on AMD via ONNX, we have to turn on an extra build arg (--build-arg GPU_ARCH=gfx942), which builds onnxruntime for the gfx942 GPU architecture (MI300x only!). We also select --engine optimum for ONNX inference. For other build targets, you can use michaelf34/infinity:0.0.70-amd-gfx94a (MI200x) and michaelf34/infinity:0.0.70-amd-gfx1100 (selected AMD Radeon cards). Special thanks to the contributors at embeddedllm.com for this!

# The --security-opt and --device flags mount the AMD GPU devices into the container.
# michaelf34/infinity:0.0.70-amd-gfx942 selects the Dockerfile.rocm image built with GPU_ARCH=gfx942.
# Repeating --model-id launches multiple models.
docker run -it \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  -p 7997:7997 \
  michaelf34/infinity:0.0.70-amd-gfx942 \
  v2 \
  --model-id Snowflake/snowflake-arctic-embed-m --engine optimum --device cuda \
  --model-id mixedbread-ai/mxbai-rerank-base-v1 --engine optimum --device cuda
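
The command above repeats the same group of flags once per model and picks an image by GPU architecture. A small sketch of that pattern follows; the helper `infinity_v2_args` and the `ONNX_IMAGES` table are hypothetical names for illustration, while the tags and flags are taken verbatim from this article:

```python
# Image tags named in this article for the ONNX (optimum) builds, keyed by tag suffix.
ONNX_IMAGES = {
    "gfx942": "michaelf34/infinity:0.0.70-amd-gfx942",    # MI300x
    "gfx94a": "michaelf34/infinity:0.0.70-amd-gfx94a",    # MI200x, as listed above
    "gfx1100": "michaelf34/infinity:0.0.70-amd-gfx1100",  # selected AMD Radeon cards
}


def infinity_v2_args(model_ids, engine="optimum", device="cuda"):
    """Build the `v2 ...` arguments, repeating the flag group once per model."""
    args = ["v2"]
    for model_id in model_ids:
        args += ["--model-id", model_id, "--engine", engine, "--device", device]
    return args


if __name__ == "__main__":
    models = [
        "Snowflake/snowflake-arctic-embed-m",
        "mixedbread-ai/mxbai-rerank-base-v1",
    ]
    print(ONNX_IMAGES["gfx942"], " ".join(infinity_v2_args(models)))
```

Each model gets its own `--engine` and `--device` in this sketch, mirroring how the command above spells them out per `--model-id`.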

Tutorial 3: Without the docker run setup (Runpod.io)

This blog is not sponsored by Runpod. I used it because, as of 2024, it's a very simple way to run AMD images on an AMD MI300x.

Here is the launch configuration. In the UI, I select the MI300x node and set the port, the image (michaelf34/infinity:0.0.70-amd-gfx942), the infinity CLI version (use v2!) and both models: --model-id Snowflake/snowflake-arctic-embed-m --engine optimum --model-id mixedbread-ai/mxbai-rerank-base-v1 --engine optimum.

As soon as the startup is complete, you should see the following printed to the logs in the console: INFO: Uvicorn running on http://0.0.0.0:7997 (Press CTRL+C to quit)

The log also contains some basic performance stats collected during warmup: between 487 (len=512) and 12021 (len=3) embeddings/second, and up to 4836 rerank requests/second. This excludes the API and dynamic batching overhead, but it is still very decent, especially for short requests! A single replica could potentially handle up to 1000 concurrent RAG users sending short queries. Not bad at all! At this point, other parts of the infrastructure likely become the bottleneck.

Full startup logs:
2024-12-03T08:43:23.843767712Z INFO:     Waiting for application startup.
2024-12-03T08:43:23.847049714Z INFO     2024-12-03 08:43:23,843 infinity_emb INFO:        infinity_server.py:92
2024-12-03T08:43:23.847099298Z          Creating 2engines:
2024-12-03T08:43:23.847104256Z          engines=['Snowflake/snowflake-arctic-embed-m',
2024-12-03T08:43:23.847108382Z          'mixedbread-ai/mxbai-rerank-base-v1']
2024-12-03T08:43:23.847677370Z INFO     2024-12-03 08:43:23,846 infinity_emb INFO: Anonymized   telemetry.py:30
2024-12-03T08:43:23.847687605Z          telemetry can be disabled via environment variable
2024-12-03T08:43:23.847690820Z          `DO_NOT_TRACK=1`.
2024-12-03T08:43:23.852227822Z INFO     2024-12-03 08:43:23,851 infinity_emb INFO:           select_model.py:64
2024-12-03T08:43:23.852244597Z          model=`Snowflake/snowflake-arctic-embed-m` selected,
2024-12-03T08:43:23.852248152Z          using engine=`optimum` and device=`cuda`
2024-12-03T08:43:24.307934415Z INFO     2024-12-03 08:43:24,304 infinity_emb INFO: Found 7 utils_optimum.py:244
2024-12-03T08:43:24.307985381Z          onnx files: [PosixPath('onnx/model.onnx'),
2024-12-03T08:43:24.307990809Z          PosixPath('onnx/model_bnb4.onnx'),
2024-12-03T08:43:24.307994915Z          PosixPath('onnx/model_fp16.onnx'),
2024-12-03T08:43:24.307998901Z          PosixPath('onnx/model_int8.onnx'),
2024-12-03T08:43:24.308002676Z          PosixPath('onnx/model_q4.onnx'),
2024-12-03T08:43:24.308006482Z          PosixPath('onnx/model_quantized.onnx'),
2024-12-03T08:43:24.308010748Z          PosixPath('onnx/model_uint8.onnx')]
2024-12-03T08:43:24.309548020Z INFO     2024-12-03 08:43:24,307 infinity_emb INFO: Using   utils_optimum.py:248
2024-12-03T08:43:24.309575701Z          onnx/model.onnx as the model
2024-12-03T08:43:24.707931233Z The ONNX file onnx/model.onnx is not a regular name used in optimum.onnxruntime, the ORTModel might not behave as expected.
2024-12-03T08:43:33.299022861Z 2024-12-03 08:43:33.298756964 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-12-03T08:43:33.299078624Z 2024-12-03 08:43:33.298775843 [W:onnxruntime:, session_state.cc:1170 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-12-03T08:43:34.174298204Z INFO     2024-12-03 08:43:34,171 infinity_emb INFO: Getting   select_model.py:97
2024-12-03T08:43:34.174339565Z          timings for batch_size=32 and avg tokens per
2024-12-03T08:43:34.174344533Z          sentence=3
2024-12-03T08:43:34.174348629Z                  0.38     ms tokenization
2024-12-03T08:43:34.174352464Z                  2.22     ms inference
2024-12-03T08:43:34.174356300Z                  0.06     ms post-processing
2024-12-03T08:43:34.174360326Z                  2.66     ms total
2024-12-03T08:43:34.174364172Z          embeddings/sec: 12021.05
2024-12-03T08:43:34.326102645Z INFO     2024-12-03 08:43:34,324 infinity_emb INFO: Getting  select_model.py:103
2024-12-03T08:43:34.326140922Z          timings for batch_size=32 and avg tokens per
2024-12-03T08:43:34.326145960Z          sentence=512
2024-12-03T08:43:34.326177467Z                  7.80     ms tokenization
2024-12-03T08:43:34.326181933Z                  57.67    ms inference
2024-12-03T08:43:34.326186220Z                  0.13     ms post-processing
2024-12-03T08:43:34.326190196Z                  65.60    ms total
2024-12-03T08:43:34.326194262Z          embeddings/sec: 487.80
2024-12-03T08:43:34.326716801Z INFO     2024-12-03 08:43:34,325 infinity_emb INFO: model    select_model.py:104
2024-12-03T08:43:34.326738783Z          warmed up, between 487.80-12021.05 embeddings/sec
2024-12-03T08:43:34.326748608Z          at batch_size=32
2024-12-03T08:43:34.332365460Z INFO     2024-12-03 08:43:34,331 infinity_emb INFO:           select_model.py:64
2024-12-03T08:43:34.332386802Z          model=`mixedbread-ai/mxbai-rerank-base-v1` selected,
2024-12-03T08:43:34.332391909Z          using engine=`optimum` and device=`cuda`
2024-12-03T08:43:34.737521909Z INFO     2024-12-03 08:43:34,736 infinity_emb INFO: Found 2 utils_optimum.py:244
2024-12-03T08:43:34.737551914Z          onnx files: [PosixPath('onnx/model.onnx'),
2024-12-03T08:43:34.737554037Z          PosixPath('onnx/model_quantized.onnx')]
2024-12-03T08:43:34.738028694Z INFO     2024-12-03 08:43:34,737 infinity_emb INFO: Using   utils_optimum.py:248
2024-12-03T08:43:34.738033541Z          onnx/model.onnx as the model
2024-12-03T08:43:35.260076178Z The ONNX file onnx/model.onnx is not a regular name used in optimum.onnxruntime, the ORTModel might not behave as expected.
2024-12-03T08:43:42.332373238Z 2024-12-03 08:43:42.332105549 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-12-03T08:43:42.332428290Z 2024-12-03 08:43:42.332129364 [W:onnxruntime:, session_state.cc:1170 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-12-03T08:43:43.528743277Z INFO     2024-12-03 08:43:43,526 infinity_emb INFO: Getting   select_model.py:97
2024-12-03T08:43:43.528793812Z          timings for batch_size=32 and avg tokens per
2024-12-03T08:43:43.528802886Z          sentence=4
2024-12-03T08:43:43.528809295Z                  0.49     ms tokenization
2024-12-03T08:43:43.528814814Z                  6.12     ms inference
2024-12-03T08:43:43.528820282Z                  0.01     ms post-processing
2024-12-03T08:43:43.528825580Z                  6.62     ms total
2024-12-03T08:43:43.528831038Z          embeddings/sec: 4836.28
2024-12-03T08:43:43.859858228Z INFO     2024-12-03 08:43:43,858 infinity_emb INFO: Getting  select_model.py:103
2024-12-03T08:43:43.859906941Z          timings for batch_size=32 and avg tokens per
2024-12-03T08:43:43.859912008Z          sentence=512
2024-12-03T08:43:43.859916225Z                  25.66    ms tokenization
2024-12-03T08:43:43.859920161Z                  106.47   ms inference
2024-12-03T08:43:43.859923996Z                  0.03     ms post-processing
2024-12-03T08:43:43.859927862Z                  132.16   ms total
2024-12-03T08:43:43.859952338Z          embeddings/sec: 242.13
2024-12-03T08:43:43.860544431Z INFO     2024-12-03 08:43:43,859 infinity_emb INFO: model    select_model.py:104
2024-12-03T08:43:43.860571682Z          warmed up, between 242.13-4836.28 embeddings/sec at
2024-12-03T08:43:43.860577340Z          batch_size=32
2024-12-03T08:43:43.861875996Z INFO     2024-12-03 08:43:43,861 infinity_emb INFO:         batch_handler.py:443
2024-12-03T08:43:43.861889265Z          creating batching engine
2024-12-03T08:43:43.865405636Z INFO     2024-12-03 08:43:43,862 infinity_emb INFO: ready   batch_handler.py:512
2024-12-03T08:43:43.865426647Z          to batch requests.
2024-12-03T08:43:43.866914465Z INFO     2024-12-03 08:43:43,866 infinity_emb INFO:         batch_handler.py:443
2024-12-03T08:43:43.866928997Z          creating batching engine
2024-12-03T08:43:43.869706908Z INFO     2024-12-03 08:43:43,867 infinity_emb INFO: ready   batch_handler.py:512
2024-12-03T08:43:43.869716492Z          to batch requests.
2024-12-03T08:43:43.871849030Z INFO     2024-12-03 08:43:43,870 infinity_emb INFO:       infinity_server.py:106
2024-12-03T08:43:43.871875049Z          ♾️  Infinity - Embedding Inference Server
2024-12-03T08:43:43.871881699Z          MIT License; Copyright (c) 2023-now Michael Feil
2024-12-03T08:43:43.871887428Z          Version 0.0.70
2024-12-03T08:43:43.871896681Z          Open the Docs via Swagger UI:
2024-12-03T08:43:43.871900798Z          http://0.0.0.0:7997/docs
2024-12-03T08:43:43.871908990Z          Access all deployed models via 'GET':
2024-12-03T08:43:43.871913136Z          curl http://0.0.0.0:7997/models
2024-12-03T08:43:43.871921018Z          Visit the docs for more information:
2024-12-03T08:43:43.871924893Z          https://michaelfeil.github.io/infinity
2024-12-03T08:43:43.872579980Z INFO:     Application startup complete.
2024-12-03T08:43:43.872964823Z INFO:     Uvicorn running on http://0.0.0.0:7997 (Press CTRL+C to quit)
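
The throughput figures in the logs above appear to follow from the logged timings. As a sanity check, and assuming the reported rate is simply the batch size divided by the total time per batch (an assumption, not confirmed from infinity's source), the numbers can be recomputed:

```python
def embeddings_per_second(batch_size, total_ms):
    """Throughput of one batch: items divided by the total time in seconds."""
    return batch_size / (total_ms / 1000.0)


# (label, batch_size, total ms from the log, embeddings/sec reported in the log)
WARMUP = [
    ("snowflake-arctic-embed-m, 3 tokens", 32, 2.66, 12021.05),
    ("snowflake-arctic-embed-m, 512 tokens", 32, 65.60, 487.80),
    ("mxbai-rerank-base-v1, 4 tokens", 32, 6.62, 4836.28),
    ("mxbai-rerank-base-v1, 512 tokens", 32, 132.16, 242.13),
]

if __name__ == "__main__":
    for label, batch_size, total_ms, reported in WARMUP:
        estimate = embeddings_per_second(batch_size, total_ms)
        print(f"{label}: estimated {estimate:.2f}/s, reported {reported:.2f}/s")
```

The estimates land within 1% of the reported values. The small gaps on the short-sequence runs are presumably due to the timings being printed with only two decimals.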

The Runpod setup tells me that my application is live on https://3e00bu865kgyuf-7997.proxy.runpod.net

A quick curl to the /models endpoint and a POST to the embeddings endpoint show that the models are indeed live!

$ curl https://3e00bu865kgyuf-7997.proxy.runpod.net/models

returns:

{"data":[{"id":"Snowflake/snowflake-arctic-embed-m","stats":{"queue_fraction":0.0,"queue_absolute":0,"results_pending":0,"batch_size":32},"object":"model","owned_by":"infinity","created":1733215830,"backend":"optimum","capabilities":["embed"]},{"id":"mixedbread-ai/mxbai-rerank-base-v1","stats":{"queue_fraction":0.0,"queue_absolute":0,"results_pending":0,"batch_size":32},"object":"model","owned_by":"infinity","created":1733215830,"backend":"optimum","capabilities":["rerank"]}],"object":"list"}

curl -X 'POST' \
  'https://3e00bu865kgyuf-7997.proxy.runpod.net/embeddings' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "Snowflake/snowflake-arctic-embed-m",
  "encoding_format": "float",
  "user": "string",
  "input": [
    "This is a sample sentence, to verify embeddings run on AMD"
  ],
  "modality": "text"
}'
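
The same two calls can be made from Python with only the standard library. This is a minimal sketch: the function names are hypothetical, the base URL is a placeholder for your own deployment, and the payload fields are the ones used in the curl request above. The embedded `/models` JSON is an abridged copy of the response shown earlier.

```python
import json
import urllib.request

# Abridged copy of the /models response shown above (stats and timestamps omitted).
MODELS_RESPONSE = """
{"data": [
  {"id": "Snowflake/snowflake-arctic-embed-m", "backend": "optimum", "capabilities": ["embed"]},
  {"id": "mixedbread-ai/mxbai-rerank-base-v1", "backend": "optimum", "capabilities": ["rerank"]}
], "object": "list"}
"""


def capabilities_by_model(models_json):
    """Map each deployed model id to its list of capabilities."""
    return {m["id"]: m["capabilities"] for m in json.loads(models_json)["data"]}


def build_embeddings_request(base_url, model, sentences):
    """Build (but do not send) a POST request for the /embeddings endpoint."""
    payload = {
        "model": model,
        "encoding_format": "float",
        "input": sentences,
        "modality": "text",
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def embed(base_url, model, sentences, timeout=30.0):
    """Send the request and return the parsed JSON response as-is."""
    request = build_embeddings_request(base_url, model, sentences)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    print(capabilities_by_model(MODELS_RESPONSE))
    # To query a live server, replace the placeholder URL with your deployment:
    # embed("http://localhost:7997", "Snowflake/snowflake-arctic-embed-m",
    #       ["This is a sample sentence, to verify embeddings run on AMD"])
```

Checking the capabilities first is a cheap way to avoid sending embedding requests to a model that was deployed for reranking only.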

Optimization for infinity on AMD:

Some strategies I personally use to maximize the number of requests handled by an instance:

  1. Small batch sizes and small models prefer ONNX (--engine optimum); otherwise use PyTorch (--engine torch). This is due to ONNX's low CPU overhead and latency optimizations, which stem from its history with small computer vision models. Rule of thumb: the larger model_size * batch_size * max_context_length, the more you should use torch.
  2. Try torch compile (--compile) when using --engine torch. Startup will take longer, but it could give you a decent advantage.
  3. Adjust --batch-size depending on your VRAM. If you have a lot of VRAM available, you could potentially use a larger batch size. A larger batch size lowers the proportion of weights relative to intermediate activations and can push even small models into a compute-bound regime (minimizing the proportion of time spent fetching model weights from VRAM). On MI300x, --batch-size 128 could be interesting to try!
  4. Consider adding more GPUs: --device-id in infinity selects the targeted devices. In the best case, you have multiple GPUs and could ~4x your throughput by adding --device-id 0,1,2,3 to the CLI when all GPUs are mounted.
  5. Don't use CPUs for long-context embeddings. Running short queries is often fine, but re-embedding your database will likely take a long time and is not $/token efficient on large CPU clusters.
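
The rule of thumb from the list above can be written down as a small helper. This is only a sketch: the function names are hypothetical, and the threshold is deliberately a required argument because the article gives no cut-off value; it has to come from benchmarking your own models and hardware. The flags (`--engine`, `--compile`, `--batch-size`, `--device-id`) are the ones named in the list.

```python
def workload_score(model_params, batch_size, max_context_length):
    """The product from the rule of thumb: model_size * batch_size * max_context_length."""
    return model_params * batch_size * max_context_length


def suggest_engine(model_params, batch_size, max_context_length, *, threshold):
    """Prefer torch for large workloads, optimum (ONNX) for small ones.

    `threshold` is a hypothetical tuning knob, not a value from infinity.
    """
    score = workload_score(model_params, batch_size, max_context_length)
    return "torch" if score >= threshold else "optimum"


def tuning_flags(engine, batch_size, device_ids, compile_model=False):
    """Assemble the CLI flags discussed above; --compile only applies to torch."""
    flags = ["--engine", engine]
    if compile_model and engine == "torch":
        flags.append("--compile")
    flags += ["--batch-size", str(batch_size)]
    flags += ["--device-id", ",".join(str(i) for i in device_ids)]
    return flags


if __name__ == "__main__":
    # Illustrative numbers only; the threshold below is an arbitrary placeholder.
    engine = suggest_engine(1e9, 128, 512, threshold=1e12)
    print(" ".join(tuning_flags(engine, 128, [0, 1, 2, 3], compile_model=True)))
```

Treat the output as a starting point for a benchmark run rather than a recommendation; the warmup stats printed at startup are a quick way to compare two settings.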

Current downsides of building on AMD

As this post is not sponsored by or affiliated with any organization, here are some personal thoughts on challenges to expect and some lessons learned from making infinity compatible with AMD.

  • Fused torch kernels are sometimes not available. Infinity makes heavy use of nested tensors (i.e. torch._nested_tensor_from_mask) and encoder-model improvements. These are not part of torch's public API and currently have only limited or no support for AMD ROCm and Apple MPS.
  • The rocm-pytorch docker images are huge. 80GB-style decompressed huge. The base image contains 3 existing torch installations, so there is definitely some improvement potential. This also puts them out of reach for automated builds via GitHub CI.
  • The poetry / pip / uv install ecosystem favors and defaults to NVIDIA. Workarounds for CPU / ROCm via custom pip URLs are hard. A bitter lesson learned: pip install --extra-index-url .. is the only thing that works as of December 2024.
  • Occasionally bumpy roads for maintainers. Be prepared to occasionally build onnxruntime from scratch, taking up to 200GiB of tmp storage and up to an hour on a high-end machine to build the image.

Summary:

In this blog post, you learned how to run embedding models such as Snowflake/snowflake-arctic-embed-m on AMD. Feel free to share the tutorial, and to fork and contribute to infinity!