BAAI/bge-m3 · OOM occurs when converting the model to TorchScript. I have a question about this issue.

Thank you for releasing such a great model.
I am an AI engineer in Korea and plan to use this model for Korean embeddings because of its good performance.

I'm trying to serve the model with Triton.
An OOM error occurs while converting the PyTorch model to TorchScript.
It seems that more than 40 GB of GPU memory is required.
The tokenizer's max_length is 8192, and padding is also set to max_length.
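
As a rough sanity check on that memory figure, the sketch below estimates the size of a single attention-score tensor at different sequence lengths. It is only a back-of-the-envelope calculation under assumptions that are not stated in this post: 16 attention heads (the XLM-RoBERTa-large layout that bge-m3 is understood to use), fp32 weights and activations, and an attention implementation that materializes the full seq_len x seq_len score matrix. The helper name `attention_scores_bytes` is hypothetical.

```python
# Back-of-the-envelope estimate of one attention-score tensor.
# Assumptions (not from the original post): 16 heads, fp32 (4 bytes per element),
# and an attention implementation that stores the full score matrix.

GIB = 1024 ** 3


def attention_scores_bytes(batch_size, num_heads, seq_len, bytes_per_element=4):
    """Bytes needed for one (batch, heads, seq_len, seq_len) score tensor."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element


if __name__ == "__main__":
    for seq_len in (512, 2048, 8192):
        size = attention_scores_bytes(batch_size=1, num_heads=16, seq_len=seq_len)
        print(f"seq_len={seq_len}: {size / GIB:.4f} GiB per score tensor")
```

Under these assumptions a single score tensor at 8192 tokens is already 4 GiB, and the cost grows quadratically with sequence length, so a handful of such tensors alive at once during tracing could plausibly account for tens of gigabytes. Whether that is what actually happens here depends on how the tracer and the attention implementation hold intermediates.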

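import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = AutoModel.from_pretrained("BAAI/bge-m3").to("cuda")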

class BGEM3(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask):
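        # Pass the attention mask so the encoder does not attend to padding tokens.
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)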
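        # Zero out the hidden states at padding positions.
        last_hidden = outputs.last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)

        # Masked mean pooling: sum over tokens, then divide by the number of real tokens.
        embedding = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
        # L2-normalize the pooled embedding.
        embedding = F.normalize(embedding, p=2, dim=1)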
        return embedding


sentences = ["안녕 하세요." * 10000]
dummy_input = tokenizer(sentences, max_length=8192, padding="max_length", truncation=True, return_tensors='pt').to("cuda")

dummy_input_ids = dummy_input["input_ids"]
dummy_attention_mask = dummy_input["attention_mask"]


with torch.no_grad():
    torch_model = BGEM3(model)
    torch_model.eval()

    trace_model = torch.jit.trace_module(
        mod=torch_model,
        inputs={"forward": (dummy_input_ids, dummy_attention_mask)},
        check_trace=False,
    )

    trace_model.save("model.pt")

Has anyone experienced this kind of memory issue?