Multi-GPU at FP16? Examples. Large memory allocations.

#16
by flash9001 - opened

First - thank you!

I’ve noticed that it seems hard (at least for me) to get this to run across multiple GPUs. I can post my inference script later today (EST).

I’m deploying to two AWS instance types: p3.8xlarge and ml.g6.24xlarge.

I finally got it to run at FP16 using Accelerate mixed precision and model.half(). It seems to always want to load onto device 0; on first load it uses about 5 GB on device 0. The first inference call, with 100 document abstracts of normal length, throws a CUDA out-of-memory error where it tries to allocate 10–16 GB. I’m sending all 100 at once without batching. I’m starting to consider batching in groups of 25, but I also want to address these other issues.
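
Roughly what I have in mind for the batching is below (just a sketch; the score_in_batches helper and the batch size of 25 are placeholders of mine, and it assumes the model and tokenizer are already loaded):

import torch

def score_in_batches(model, tokenizer, pairs, batch_size=25):
    # Score query/passage pairs in chunks instead of sending all 100 at once
    scores = []
    for i in range(0, len(pairs), batch_size):
        batch = pairs[i:i + batch_size]
        with torch.no_grad():
            inputs = tokenizer(batch, padding=True, truncation=True,
                               return_tensors='pt', max_length=512)
            logits = model(**inputs, return_dict=True).logits.view(-1).float()
        scores.extend(logits.cpu().tolist())
    return scores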

Since it’s a Hugging Face-friendly model, I’m using the custom inference template as a basis. Do you have an example script for loading it onto multiple GPUs at FP16? And is there an AWS ECR base image you recommend?

I have feature flags to log nvidia-smi output so I can see the GPU stats.
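
The logging is roughly this (a quick sketch; the LOG_GPU_STATS flag name is just a placeholder for my feature flag):

import os
import subprocess

if os.environ.get("LOG_GPU_STATS") == "1":
    # Dump per-GPU memory usage into the container logs
    print(subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used,memory.total", "--format=csv"],
        capture_output=True, text=True,
    ).stdout)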

Great work on all of this! Thank you!

Beijing Academy of Artificial Intelligence org

@flash9001, thanks for your attention to our work!
You can use device_map in transformers to load the model across multiple GPUs:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-v2-m3')
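# "balanced_low_0" spreads the weights across all visible GPUs while keeping
# as little as possible on GPU 0, so GPU 0 has room left for inputs/outputs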
model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-v2-m3', device_map="balanced_low_0")
model.half()
model.eval()

pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1).float()
    print(scores)
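
If you want to check how the weights were distributed and how much memory each GPU is holding after loading, something like this should work:

print(model.hf_device_map)  # which modules were placed on which device
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i}: {torch.cuda.memory_allocated(i) / 1024**3:.2f} GiB allocated")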

As for AWS ECR, I haven't used it myself, so I can't give you helpful suggestions there.
