Accelerating Embedding & Reranking Models on AMD Using Infinity
This guest article is written by the maintainer of michaelfeil/infinity, a popular open-source library for serving text-embedding, reranking, vision embedding (CLIP, ColQwen) and audio embedding models on AMD with a high-throughput engine. It contains a tutorial on how to quickly deploy an AMD-based embedding solution on ROCm via PyTorch and ONNX, a comparison with how to accomplish the same thing on NVIDIA, and a quick optimization guide for AMD.
Making AMD more popular?!
Why AMD GPUs? When talking to people at hackathons and community meetups, and when following discussions on the LocalLLaMA subreddit, it often feels as if AMD is forgotten and its capabilities are downplayed, with some even saying it will take years to catch up. Meanwhile, you can go to your favorite website (https://pytorch.org/get-started/locally/) and get torch with ROCm via a one-liner install command. The community is somewhat right, though: according to anonymized user submissions to infinity in December 2024, barely any AMD GPUs are currently used to run infinity. In the last week, only 0.7% of infinity usage was on AMD.
Image: Infinity usage, Nov26-Dec3-2024, excluding MPS and CPU targets.
As a true open-source project, it is worthwhile to maintain vendor neutrality, as long as the software remains easy to maintain. infinity has added Apple MPS support, has best-in-class CPU support and also aims to be compatible with AWS Inferentia in the near future. It’s time to tackle AMD!
Tutorial
Recap: How Infinity runs via Docker on CUDA and CPU
To use an accelerated image on Docker, you need an accelerator-specific installation on the host. This installation shares the driver-specific details of the Docker host with the container and enables sharing of devices.
For NVIDIA, all you need to do is install the nvidia-container-toolkit.
Once that is done, the infinity instructions are already available in a couple of popular model repos, e.g. snowflake-arctic-embed-m.
# --gpus all mounts the NVIDIA GPUs and requires the nvidia-container-toolkit.
# -p 7997:7997 forwards the API port.
# michaelf34/infinity:0.0.70 selects the Dockerfile.nvidia image from infinity.
# --engine torch tells infinity to run the model using the pytorch backend engine.
docker run \
--gpus all \
-p 7997:7997 \
michaelf34/infinity:0.0.70 \
v2 \
--model-id Snowflake/snowflake-arctic-embed-m \
--engine torch
And it's running. If we wanted to accomplish the same thing on CPU, we might be better off with ONNX. Thankfully, the arctic model has an extra set of ONNX weights, located in [`./onnx/`](https://huggingface.co/Snowflake/snowflake-arctic-embed-m/tree/main/onnx).
# No GPUs are mounted for the CPU image.
# -p 7997:7997 forwards the API port.
# --engine optimum tells infinity to run the model using the onnx/optimum backend engine.
docker run \
-p 7997:7997 \
michaelf34/infinity:0.0.70-cpu \
v2 \
--model-id Snowflake/snowflake-arctic-embed-m \
--engine optimum
Tutorial 1: Running Embedding Models on AMD ROCm
Here’s how to reproduce the example on AMD.
First, you need a compatible GPU and the AMD counterpart of the container toolkit installed on the host (https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/docker.html). For AMD, that piece of software is the ROCm kernel-mode driver, amdgpu-dkms.
With amdgpu-dkms and the AMD drivers installed, any AMD PyTorch image is ready to be launched.
In this case, the image is based on [rocm/pytorch:rocm6.2.3_ubuntu22.04_py3.10_pytorch_release_2.3.0](https://github.com/michaelfeil/infinity/blob/main/libs/infinity_emb/Dockerfile.amd_auto), requiring a rocm6.2.3-compatible installation on the host. You can pull the image without login from Docker Hub via michaelf34/infinity:0.0.70-amd.
Instead of mounting the GPUs via --gpus, we add the following to the docker run command.
# Relax the default seccomp profile and pass the AMD GPU device nodes into the container.
--security-opt seccomp=unconfined \
--device=/dev/kfd \
--device=/dev/dri \
To put it all together:
docker run -it \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device=/dev/kfd \
--device=/dev/dri \
--group-add video \
-p 7997:7997 \
michaelf34/infinity:0.0.70-amd \
v2 \
--model-id Snowflake/snowflake-arctic-embed-m \
--engine torch # tell infinity to run the model using the pytorch backend engine.
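The three launch commands so far differ only in the image tag and the device flags passed to Docker. As a minimal sketch of that comparison, the helper below assembles the command for each target from the values used in this article. The names `TARGETS` and `infinity_docker_command` are hypothetical and not part of infinity; the sketch also assumes the container keeps listening on port 7997.

```python
import shlex

# Image tags and Docker flags copied from the commands in this article.
# TARGETS and infinity_docker_command are illustrative names, not infinity APIs.
TARGETS = {
    "nvidia": {
        "image": "michaelf34/infinity:0.0.70",
        "docker_flags": ["--gpus", "all"],
    },
    "cpu": {
        "image": "michaelf34/infinity:0.0.70-cpu",
        "docker_flags": [],
    },
    "amd": {
        "image": "michaelf34/infinity:0.0.70-amd",
        "docker_flags": [
            "-it",
            "--cap-add=SYS_PTRACE",
            "--security-opt", "seccomp=unconfined",
            "--device=/dev/kfd",
            "--device=/dev/dri",
            "--group-add", "video",
        ],
    },
}


def infinity_docker_command(target: str, model_id: str, engine: str,
                            host_port: int = 7997) -> str:
    """Return a copy-pasteable `docker run` command for one target."""
    cfg = TARGETS[target]
    args = [
        "docker", "run", *cfg["docker_flags"],
        "-p", f"{host_port}:7997",  # assumption: container port stays 7997
        cfg["image"], "v2",
        "--model-id", model_id,
        "--engine", engine,
    ]
    return shlex.join(args)


for target, engine in [("nvidia", "torch"), ("cpu", "optimum"), ("amd", "torch")]:
    print(infinity_docker_command(target, "Snowflake/snowflake-arctic-embed-m", engine))
```

Printing the three commands side by side makes it easy to see that the infinity arguments after the image name stay the same across vendors.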
Tutorial 2: Running Reranking Models on AMD ROCm via ONNX on MI300x
To run the same model on AMD via ONNX, we have to turn on an extra build arg (--build-arg GPU_ARCH=gfx942), which builds onnxruntime for the gfx942 GPU arch (MI300x only!). Also, we select --engine optimum for ONNX inference.
For other build targets, you can use michaelf34/infinity:0.0.70-amd-gfx94a (MI200x) and michaelf34/infinity:0.0.70-amd-gfx1100 (selected AMD Radeon cards). Special thanks to the contributors at embeddedllm.com for this!
# --security-opt and --device pass the AMD GPU devices into the container.
# michaelf34/infinity:0.0.70-amd-gfx942 selects the Dockerfile.rocm image built with GPU_ARCH=gfx942.
# Repeating `model-id` launches multiple models.
docker run -it \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device=/dev/kfd \
--device=/dev/dri \
--group-add video \
-p 7997:7997 \
michaelf34/infinity:0.0.70-amd-gfx942 \
v2 \
--model-id Snowflake/snowflake-arctic-embed-m --engine optimum --device cuda \
--model-id mixedbread-ai/mxbai-rerank-base-v1 --engine optimum --device cuda
Tutorial 3: Without the docker run setup (Runpod.io)
This blog is not sponsored by Runpod. I used it because, as of 2024, it's a very simple way to run AMD images on an AMD MI300x.
Here is the launch configuration: in the UI, I select the MI300x node and modify the port, the image (michaelf34/infinity:0.0.70-amd-gfx942), the infinity CLI version (use v2!) and both models: --model-id Snowflake/snowflake-arctic-embed-m --engine optimum --model-id mixedbread-ai/mxbai-rerank-base-v1 --engine optimum.
As soon as the startup is complete, you should see the following printed to the logs in the console: INFO: Uvicorn running on http://0.0.0.0:7997 (Press CTRL+C to quit)
Also, the log contains some basic performance stats collected during warmup: between 487 (len=512) and 12021 (len=3) embeddings/second, and up to 4836 rerank requests/second. This excludes the API and dynamic batching overhead, but is still very decent, especially for short requests! A single replica could potentially handle up to 1000 concurrent RAG users sending short queries. Not bad at all! At this point, other parts of the infrastructure likely become the bottleneck.
Full startup logs:
2024-12-03T08:43:23.843767712Z INFO: Waiting for application startup.
2024-12-03T08:43:23.847049714Z INFO 2024-12-03 08:43:23,843 infinity_emb INFO: infinity_server.py:92
2024-12-03T08:43:23.847099298Z Creating 2engines:
2024-12-03T08:43:23.847104256Z engines=['Snowflake/snowflake-arctic-embed-m',
2024-12-03T08:43:23.847108382Z 'mixedbread-ai/mxbai-rerank-base-v1']
2024-12-03T08:43:23.847677370Z INFO 2024-12-03 08:43:23,846 infinity_emb INFO: Anonymized telemetry.py:30
2024-12-03T08:43:23.847687605Z telemetry can be disabled via environment variable
2024-12-03T08:43:23.847690820Z `DO_NOT_TRACK=1`.
2024-12-03T08:43:23.852227822Z INFO 2024-12-03 08:43:23,851 infinity_emb INFO: select_model.py:64
2024-12-03T08:43:23.852244597Z model=`Snowflake/snowflake-arctic-embed-m` selected,
2024-12-03T08:43:23.852248152Z using engine=`optimum` and device=`cuda`
2024-12-03T08:43:24.307934415Z INFO 2024-12-03 08:43:24,304 infinity_emb INFO: Found 7 utils_optimum.py:244
2024-12-03T08:43:24.307985381Z onnx files: [PosixPath('onnx/model.onnx'),
2024-12-03T08:43:24.307990809Z PosixPath('onnx/model_bnb4.onnx'),
2024-12-03T08:43:24.307994915Z PosixPath('onnx/model_fp16.onnx'),
2024-12-03T08:43:24.307998901Z PosixPath('onnx/model_int8.onnx'),
2024-12-03T08:43:24.308002676Z PosixPath('onnx/model_q4.onnx'),
2024-12-03T08:43:24.308006482Z PosixPath('onnx/model_quantized.onnx'),
2024-12-03T08:43:24.308010748Z PosixPath('onnx/model_uint8.onnx')]
2024-12-03T08:43:24.309548020Z INFO 2024-12-03 08:43:24,307 infinity_emb INFO: Using utils_optimum.py:248
2024-12-03T08:43:24.309575701Z onnx/model.onnx as the model
2024-12-03T08:43:24.707931233Z The ONNX file onnx/model.onnx is not a regular name used in optimum.onnxruntime, the ORTModel might not behave as expected.
2024-12-03T08:43:33.299022861Z 2024-12-03 08:43:33.298756964 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-12-03T08:43:33.299078624Z 2024-12-03 08:43:33.298775843 [W:onnxruntime:, session_state.cc:1170 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-12-03T08:43:34.174298204Z INFO 2024-12-03 08:43:34,171 infinity_emb INFO: Getting select_model.py:97
2024-12-03T08:43:34.174339565Z timings for batch_size=32 and avg tokens per
2024-12-03T08:43:34.174344533Z sentence=3
2024-12-03T08:43:34.174348629Z 0.38 ms tokenization
2024-12-03T08:43:34.174352464Z 2.22 ms inference
2024-12-03T08:43:34.174356300Z 0.06 ms post-processing
2024-12-03T08:43:34.174360326Z 2.66 ms total
2024-12-03T08:43:34.174364172Z embeddings/sec: 12021.05
2024-12-03T08:43:34.326102645Z INFO 2024-12-03 08:43:34,324 infinity_emb INFO: Getting select_model.py:103
2024-12-03T08:43:34.326140922Z timings for batch_size=32 and avg tokens per
2024-12-03T08:43:34.326145960Z sentence=512
2024-12-03T08:43:34.326177467Z 7.80 ms tokenization
2024-12-03T08:43:34.326181933Z 57.67 ms inference
2024-12-03T08:43:34.326186220Z 0.13 ms post-processing
2024-12-03T08:43:34.326190196Z 65.60 ms total
2024-12-03T08:43:34.326194262Z embeddings/sec: 487.80
2024-12-03T08:43:34.326716801Z INFO 2024-12-03 08:43:34,325 infinity_emb INFO: model select_model.py:104
2024-12-03T08:43:34.326738783Z warmed up, between 487.80-12021.05 embeddings/sec
2024-12-03T08:43:34.326748608Z at batch_size=32
2024-12-03T08:43:34.332365460Z INFO 2024-12-03 08:43:34,331 infinity_emb INFO: select_model.py:64
2024-12-03T08:43:34.332386802Z model=`mixedbread-ai/mxbai-rerank-base-v1` selected,
2024-12-03T08:43:34.332391909Z using engine=`optimum` and device=`cuda`
2024-12-03T08:43:34.737521909Z INFO 2024-12-03 08:43:34,736 infinity_emb INFO: Found 2 utils_optimum.py:244
2024-12-03T08:43:34.737551914Z onnx files: [PosixPath('onnx/model.onnx'),
2024-12-03T08:43:34.737554037Z PosixPath('onnx/model_quantized.onnx')]
2024-12-03T08:43:34.738028694Z INFO 2024-12-03 08:43:34,737 infinity_emb INFO: Using utils_optimum.py:248
2024-12-03T08:43:34.738033541Z onnx/model.onnx as the model
2024-12-03T08:43:35.260076178Z The ONNX file onnx/model.onnx is not a regular name used in optimum.onnxruntime, the ORTModel might not behave as expected.
2024-12-03T08:43:42.332373238Z 2024-12-03 08:43:42.332105549 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-12-03T08:43:42.332428290Z 2024-12-03 08:43:42.332129364 [W:onnxruntime:, session_state.cc:1170 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-12-03T08:43:43.528743277Z INFO 2024-12-03 08:43:43,526 infinity_emb INFO: Getting select_model.py:97
2024-12-03T08:43:43.528793812Z timings for batch_size=32 and avg tokens per
2024-12-03T08:43:43.528802886Z sentence=4
2024-12-03T08:43:43.528809295Z 0.49 ms tokenization
2024-12-03T08:43:43.528814814Z 6.12 ms inference
2024-12-03T08:43:43.528820282Z 0.01 ms post-processing
2024-12-03T08:43:43.528825580Z 6.62 ms total
2024-12-03T08:43:43.528831038Z embeddings/sec: 4836.28
2024-12-03T08:43:43.859858228Z INFO 2024-12-03 08:43:43,858 infinity_emb INFO: Getting select_model.py:103
2024-12-03T08:43:43.859906941Z timings for batch_size=32 and avg tokens per
2024-12-03T08:43:43.859912008Z sentence=512
2024-12-03T08:43:43.859916225Z 25.66 ms tokenization
2024-12-03T08:43:43.859920161Z 106.47 ms inference
2024-12-03T08:43:43.859923996Z 0.03 ms post-processing
2024-12-03T08:43:43.859927862Z 132.16 ms total
2024-12-03T08:43:43.859952338Z embeddings/sec: 242.13
2024-12-03T08:43:43.860544431Z INFO 2024-12-03 08:43:43,859 infinity_emb INFO: model select_model.py:104
2024-12-03T08:43:43.860571682Z warmed up, between 242.13-4836.28 embeddings/sec at
2024-12-03T08:43:43.860577340Z batch_size=32
2024-12-03T08:43:43.861875996Z INFO 2024-12-03 08:43:43,861 infinity_emb INFO: batch_handler.py:443
2024-12-03T08:43:43.861889265Z creating batching engine
2024-12-03T08:43:43.865405636Z INFO 2024-12-03 08:43:43,862 infinity_emb INFO: ready batch_handler.py:512
2024-12-03T08:43:43.865426647Z to batch requests.
2024-12-03T08:43:43.866914465Z INFO 2024-12-03 08:43:43,866 infinity_emb INFO: batch_handler.py:443
2024-12-03T08:43:43.866928997Z creating batching engine
2024-12-03T08:43:43.869706908Z INFO 2024-12-03 08:43:43,867 infinity_emb INFO: ready batch_handler.py:512
2024-12-03T08:43:43.869716492Z to batch requests.
2024-12-03T08:43:43.871849030Z INFO 2024-12-03 08:43:43,870 infinity_emb INFO: infinity_server.py:106
2024-12-03T08:43:43.871875049Z ♾️ Infinity - Embedding Inference Server
2024-12-03T08:43:43.871881699Z MIT License; Copyright (c) 2023-now Michael Feil
2024-12-03T08:43:43.871887428Z Version 0.0.70
2024-12-03T08:43:43.871896681Z Open the Docs via Swagger UI:
2024-12-03T08:43:43.871900798Z http://0.0.0.0:7997/docs
2024-12-03T08:43:43.871908990Z Access all deployed models via 'GET':
2024-12-03T08:43:43.871913136Z curl http://0.0.0.0:7997/models
2024-12-03T08:43:43.871921018Z Visit the docs for more information:
2024-12-03T08:43:43.871924893Z https://michaelfeil.github.io/infinity
2024-12-03T08:43:43.872579980Z INFO: Application startup complete.
2024-12-03T08:43:43.872964823Z INFO: Uvicorn running on http://0.0.0.0:7997 (Press CTRL+C to quit)
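The throughput figures in the warmup logs appear to follow from the batch size divided by the total time per batch. The sketch below recomputes them from the logged timings as a sanity check; `items_per_second` is a hypothetical helper, not an infinity function, and small deviations for the short inputs are expected because the logged millisecond values are rounded.

```python
def items_per_second(batch_size: int, total_ms: float) -> float:
    """Throughput of one batch: items processed divided by wall time in seconds."""
    return batch_size / (total_ms / 1000.0)


# (model, avg tokens per sentence, total ms per batch), copied from the logs above.
WARMUP_TIMINGS = [
    ("Snowflake/snowflake-arctic-embed-m", 3, 2.66),
    ("Snowflake/snowflake-arctic-embed-m", 512, 65.60),
    ("mixedbread-ai/mxbai-rerank-base-v1", 4, 6.62),
    ("mixedbread-ai/mxbai-rerank-base-v1", 512, 132.16),
]

BATCH_SIZE = 32  # batch_size reported in the logs

for model, tokens, total_ms in WARMUP_TIMINGS:
    rate = items_per_second(BATCH_SIZE, total_ms)
    print(f"{model} (len={tokens}): {rate:.2f} items/sec")
```

For the len=512 runs this reproduces the logged 487.80 and 242.13 per second; the same formula is handy for estimating how a different batch size or latency would change throughput.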
Checking the Runpod setup, it tells me that my application is live on https://3e00bu865kgyuf-7997.proxy.runpod.net
A quick curl to the /models endpoint and a POST to the embeddings endpoint show that the models are indeed live!
$ curl https://3e00bu865kgyuf-7997.proxy.runpod.net/models
returns:
{"data":[{"id":"Snowflake/snowflake-arctic-embed-m","stats":{"queue_fraction":0.0,"queue_absolute":0,"results_pending":0,"batch_size":32},"object":"model","owned_by":"infinity","created":1733215830,"backend":"optimum","capabilities":["embed"]},{"id":"mixedbread-ai/mxbai-rerank-base-v1","stats":{"queue_fraction":0.0,"queue_absolute":0,"results_pending":0,"batch_size":32},"object":"model","owned_by":"infinity","created":1733215830,"backend":"optimum","capabilities":["rerank"]}],"object":"list"}
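To check programmatically which model serves which task, the /models response can be parsed with a few lines of Python. This is a small sketch that uses a shortened copy of the response above (the stats fields are dropped for brevity); `capabilities_by_model` and `models_with` are hypothetical helper names.

```python
import json

# Shortened copy of the /models response shown above (stats fields omitted).
MODELS_RESPONSE = """
{"data": [
  {"id": "Snowflake/snowflake-arctic-embed-m", "backend": "optimum", "capabilities": ["embed"]},
  {"id": "mixedbread-ai/mxbai-rerank-base-v1", "backend": "optimum", "capabilities": ["rerank"]}
 ],
 "object": "list"}
"""


def capabilities_by_model(body: str) -> dict[str, list[str]]:
    """Map each deployed model id to its list of capabilities."""
    payload = json.loads(body)
    return {entry["id"]: entry["capabilities"] for entry in payload["data"]}


def models_with(body: str, capability: str) -> list[str]:
    """Return the ids of all models that advertise the given capability."""
    return [
        model_id
        for model_id, caps in capabilities_by_model(body).items()
        if capability in caps
    ]


print(models_with(MODELS_RESPONSE, "embed"))
print(models_with(MODELS_RESPONSE, "rerank"))
```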
curl -X 'POST' \
'https://3e00bu865kgyuf-7997.proxy.runpod.net/embeddings' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "Snowflake/snowflake-arctic-embed-m",
"encoding_format": "float",
"user": "string",
"input": [
"This is a sample sentence, to verify embeddings run on AMD"
],
"modality": "text"
}'
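The same request can be sent from Python using only the standard library. The sketch below mirrors the curl call above; `build_embeddings_request` and `embed` are hypothetical helper names, and the response handling assumes an OpenAI-style body with one `embedding` entry per input under `data`, which you should verify against your deployment.

```python
import json
import urllib.request


def build_embeddings_request(base_url: str, model: str,
                             texts: list[str]) -> urllib.request.Request:
    """Build the POST /embeddings request with the same fields as the curl call."""
    body = {
        "model": model,
        "encoding_format": "float",
        "input": texts,
        "modality": "text",
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/embeddings",
        data=json.dumps(body).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def embed(base_url: str, model: str, texts: list[str],
          timeout: float = 30.0) -> list[list[float]]:
    """Send the request and return one vector per input text."""
    request = build_embeddings_request(base_url, model, texts)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # Assumption: OpenAI-style response, {"data": [{"embedding": [...]}, ...]}.
    return [item["embedding"] for item in payload["data"]]


# Example call against a local deployment (assumed URL, requires a running server):
# vectors = embed("http://localhost:7997", "Snowflake/snowflake-arctic-embed-m",
#                 ["This is a sample sentence, to verify embeddings run on AMD"])
```

Splitting request construction from sending keeps the payload easy to inspect without a live server.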
Optimization for infinity on AMD:
Some strategies I personally use to maximize the number of requests handled by an instance:
- Small batch sizes and small models prefer onnx (--engine optimum); otherwise use pytorch (--engine torch). This is due to ONNX's low CPU overhead and latency optimizations, which have their historical background in small computer vision models. Rule of thumb: the larger model_size * batch_size * max_context_length, the more you should use torch.
- Try torch compile (--compile) when using --engine torch. This will take longer for startup, but could give you a decent advantage.
- Adjust --batch-size depending on your VRAM. If you have a lot of VRAM available, you could potentially use a larger batch size. A larger batch size lowers the proportion of weights vs intermediate_activations, and can potentially push even small models into a compute-bound regime (minimizing the proportion of time spent fetching model weights from VRAM). On MI300x, --batch-size 128 could be interesting to try!
- Consider adding more GPUs: --device-id in infinity selects the devices targeted. In the best case, you have multiple GPUs, and you could ~4x your throughput by adding --device-id 0,1,2,3 to the cli when all GPUs are mounted.
- Don't use CPUs for long-context embeddings. Running short queries is often fine, but re-embedding your database will likely take a long time and is not $/token efficient on large CPU clusters.
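The batch-size argument above can be made concrete with a rough roofline-style estimate. The sketch below is a back-of-the-envelope model and not a measurement: it assumes about 2 FLOPs per parameter per token in the forward pass, that the weights are read from VRAM once per batch, and it ignores activation traffic. Under those assumptions the parameter count cancels out, so the function `weight_arithmetic_intensity` (a hypothetical name) depends only on batch size, sequence length and weight precision.

```python
def weight_arithmetic_intensity(batch_size: int, tokens_per_sequence: int,
                                bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs performed per byte of model weights read.

    Assumptions (simplified): ~2 FLOPs per parameter per token, weights are
    read once per batch, activation memory traffic is ignored.
    bytes_per_param=2.0 corresponds to fp16/bf16 weights.
    """
    flops_per_param = 2.0 * batch_size * tokens_per_sequence
    return flops_per_param / bytes_per_param


# Short queries (16 tokens, an illustrative value) at different batch sizes.
for batch_size in (8, 32, 128):
    intensity = weight_arithmetic_intensity(batch_size, tokens_per_sequence=16)
    print(f"batch_size={batch_size}: ~{intensity:.0f} FLOPs per weight byte")
```

Going from --batch-size 32 to --batch-size 128 quadruples the work done per byte of weights fetched in this model, which is the intuition behind trying larger batches when VRAM allows.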
Current downsides of building on AMD
As this post is not sponsored by or affiliated with any organization, here are some personal thoughts on challenges to expect and some lessons learned from making infinity compatible with AMD.
- Fused torch kernels are sometimes not available. infinity makes heavy use of nested tensors (i.e. torch._nested_tensor_from_mask) and encoder-model improvements. These are not part of torch's public API and currently have only limited or no support for AMD ROCm and Apple MPS.
- The rocm-pytorch docker images are huge. 80GB-style decompressed huge. In the base image, we have 3 existing torch installations, so there is definitely some improvement potential. The size also puts them out of reach for automated builds via GitHub CI.
- The poetry / pip / uv install ecosystem favors and defaults to NVIDIA. Workarounds for CPU / ROCm via custom pip URLs are hard, and a bitter lesson learned: pip install --extra-index-url .. is the only thing that works as of December 2024.
- Occasionally bumpy roads for maintainers. Be prepared to occasionally build onnxruntime from scratch, taking up to 200GiB of tmp storage and up to an hour on a high-end machine to build the image.
Summary:
In this blog post you learned how to run embedding models such as Snowflake/snowflake-arctic-embed-m on AMD. Feel free to share the tutorial, fork & contribute to infinity!