OOM when converting the model to TorchScript

#8
by LeeJungHoon - opened

Thank you for releasing such a great model.
I am an AI engineer in Korea, and I plan to use it for Korean embeddings because of its strong performance.

I'm trying to serve the model with Triton.
An out-of-memory (OOM) error occurs while converting the PyTorch model to TorchScript.
It seems to require more than 40 GB of GPU memory.
The tokenizer's max_length is 8192, and padding is also set to max_length.

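import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = AutoModel.from_pretrained("BAAI/bge-m3").to("cuda")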

class BGEM3(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask):
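        # Pass the attention mask so padding tokens are ignored inside the encoder.
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)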
        last_hidden = outputs.last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)

        embedding = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
        embedding = F.normalize(embedding, p=2, dim=1)
        return embedding


sentences = ["안녕 하세요." * 10000]
dummy_input = tokenizer(sentences, max_length=8192, padding="max_length", truncation=True, return_tensors='pt').to("cuda")

dummy_input_ids = dummy_input["input_ids"]
dummy_attention_mask = dummy_input["attention_mask"]


with torch.no_grad():
    torch_model = BGEM3(model)
    torch_model.eval()

    trace_model = torch.jit.trace_module(
        mod=torch_model,
        inputs={"forward": (dummy_input_ids, dummy_attention_mask)},
        check_trace=False,
    )

    trace_model.save("model.pt")

Has anyone experienced this kind of memory issue?

Hello, I tested your code on a single A800 GPU. It needed only 18.8 GB, so 40 GB of memory is enough.
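To see why padding to max_length=8192 is expensive at all, here is a rough back-of-the-envelope sketch of the attention score tensor for a single layer. The head count of 16 is an assumption based on an XLM-RoBERTa-large-style config (please check the model's config.json), and the estimate only applies when the full batch x heads x seq x seq matrix is actually materialized. It is not a measurement and it ignores weights and other activations.

```python
# Assumed values (not taken from this thread): XLM-RoBERTa-large-style attention.
NUM_HEADS = 16   # assumption; verify against config.json
SEQ_LEN = 8192   # max_length used in the question
BATCH = 1        # one dummy sentence


def attention_scores_bytes(batch, heads, seq_len, bytes_per_elem):
    """Size of one layer's attention score tensor: batch x heads x seq x seq."""
    return batch * heads * seq_len * seq_len * bytes_per_elem


GIB = 2 ** 30
fp32_gib = attention_scores_bytes(BATCH, NUM_HEADS, SEQ_LEN, 4) / GIB
fp16_gib = attention_scores_bytes(BATCH, NUM_HEADS, SEQ_LEN, 2) / GIB
short_gib = attention_scores_bytes(BATCH, NUM_HEADS, 512, 4) / GIB

print(f"seq 8192, fp32: {fp32_gib} GiB per layer")
print(f"seq 8192, fp16: {fp16_gib} GiB per layer")
print(f"seq  512, fp32: {short_gib} GiB per layer")
```

Because this term grows with the square of the sequence length, tracing with a shorter dummy input or in half precision may lower the peak considerably. Whether a model traced at a shorter length generalizes to longer inputs should be verified separately.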
In addition, there is an issue with your code: it implements mean pooling, but bge-m3 uses CLS pooling, not mean pooling.
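For reference, a minimal sketch of what a CLS-pooling wrapper could look like. The class name BGEM3Cls is made up for illustration, and TinyEncoder is a hypothetical stand-in so the snippet runs without downloading weights. With the real model you would pass the AutoModel instance from the question instead.

```python
from types import SimpleNamespace

import torch
import torch.nn as nn
import torch.nn.functional as F


class BGEM3Cls(nn.Module):
    """Illustrative wrapper: returns the L2-normalized first-token ([CLS]) embedding."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        # CLS pooling: take the hidden state of the first token only.
        cls_embedding = outputs.last_hidden_state[:, 0]
        return F.normalize(cls_embedding, p=2, dim=1)


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for the real encoder (ignores the mask, demo only)."""

    def __init__(self, vocab_size=100, hidden=8):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)

    def forward(self, input_ids, attention_mask=None):
        return SimpleNamespace(last_hidden_state=self.emb(input_ids))


torch.manual_seed(0)
wrapper = BGEM3Cls(TinyEncoder()).eval()
input_ids = torch.randint(0, 100, (2, 5))
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    emb = wrapper(input_ids, attention_mask)

print(emb.shape)  # one vector per input sentence
```

The only change from the mean-pooling version in the question is the pooling line. The normalization step and the two-input signature stay the same, so the tracing call should not need to change.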
