I am using the intfloat/multilingual-e5-large model through the transformers library, with the same code that is shared in its model card. I dockerized it and started using it. Memory usage increases with each inference call, and after a while it exceeds my memory limit.

Here is my code:

import torch.nn.functional as F
from torch import Tensor, no_grad, cuda, device
from transformers import AutoTokenizer, AutoModel
import gc

class Model:

    def __init__(self, path='resources/intfloat_multilingual-e5-large'):
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        self.model = AutoModel.from_pretrained(path)
        # Keep the device on the instance and place the model on it explicitly
        self.device = device('cpu')
        self.model.to(self.device)

    def average_pool(self, last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
        # Zero out padding positions, then average over the real tokens
        last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
        return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

    def inference(self, texts):
        with no_grad():
            batch_dict = self.tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
            outputs = self.model(**batch_dict)
            embeddings = self.average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
            del outputs
            embeddings = F.normalize(embeddings, p=2, dim=1)
            # Convert to plain Python lists so no tensors are returned to the caller
            embeddings = embeddings.numpy().tolist()
        return embeddings

model = Model()

Here are my docker stats. The container initially uses around 2.6 GB of RAM, but memory usage increases slowly with each iteration.
Please let me know if I can clear cached memory, or if there is any other way to stop this memory leak.
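
One way to narrow this down is to check whether the growth comes from Python objects that are being kept alive between calls. The sketch below uses the standard-library tracemalloc module to record traced memory after repeated calls. The helper name track_python_memory and the two stand-in functions are hypothetical and only illustrate the idea; in practice the callable would wrap a real inference call. Note that tracemalloc only sees allocations made through Python's allocators, so native buffers such as PyTorch tensor storage are generally not counted.

```python
import gc
import tracemalloc


def track_python_memory(fn, iterations=5):
    """Call fn repeatedly and record traced Python memory (bytes) after each call."""
    tracemalloc.start()
    readings = []
    try:
        for _ in range(iterations):
            fn()
            gc.collect()  # drop unreachable objects so only live data is counted
            current, _peak = tracemalloc.get_traced_memory()
            readings.append(current)
    finally:
        tracemalloc.stop()
    return readings


# Hypothetical stand-ins for an inference call (not the real model)
_kept = []


def leaky_call():
    _kept.append([0.0] * 10_000)  # result is retained, so memory grows


def clean_call():
    _ = [0.0] * 10_000  # result is discarded when the call returns


leaky = track_python_memory(leaky_call)
clean = track_python_memory(clean_call)
print("leaky growth (bytes):", leaky[-1] - leaky[0])
print("clean growth (bytes):", clean[-1] - clean[0])
```

If the readings stay flat while the container's memory still climbs, the growth is probably outside Python-level objects, and the docker stats figure is the number to keep watching.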

This looks strange; Python and PyTorch should do garbage collection automatically. Is it possible that you are storing too many embedding vectors, and that is what causes the OOM issue?
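
To get a feel for how quickly stored embeddings add up, here is a rough sketch that estimates the footprint of one vector kept as a Python list of floats (which is what .tolist() returns) against the same values packed as 32-bit floats. The width of 1024 is an assumption made for illustration, and the exact byte counts depend on the Python build.

```python
import sys
from array import array

DIM = 1024  # assumed embedding width, for illustration only


def list_bytes(vec):
    """Approximate size of a list of floats: the list object plus every float object."""
    return sys.getsizeof(vec) + sum(sys.getsizeof(x) for x in vec)


vector = [i / DIM for i in range(DIM)]      # one stand-in embedding
as_list = list_bytes(vector)                # list of Python float objects
as_f32 = sys.getsizeof(array('f', vector))  # same values packed as 32-bit floats

print("bytes per vector as list:   ", as_list)
print("bytes per vector as float32:", as_f32)
print("list vectors per GiB:       ", 2**30 // as_list)
```

If the service keeps every returned embedding in memory, the per-vector list size multiplied by the number of requests gives a quick estimate of whether that alone could explain the growth.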

