How to estimate memory usage?

#17 opened by lingling610

I would like to use sentence-transformers on a low-end, CPU-only machine to load pre-trained models such as paraphrase-multilingual-MiniLM-L12-v2 and compute sentence embeddings.

How do I estimate memory usage? Is there a guideline describing the minimum system requirements for loading pre-trained models?

Sentence Transformers org

Hello!

For embedding models, the memory requirement during inference fairly closely matches the size of the weight file (assuming you're using the model in fp32, the default for loading, and the model was saved in fp32, the default for saving). For example:

from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2", device="cuda")
print(f"{torch.cuda.max_memory_allocated() / 1024**3:.3f}GB in use after loading model")
# Output: 0.438GB in use after loading model

I recognize that this is on GPU rather than CPU, but the memory usage for the model itself should be about the same on both. So, you can look at the weight file size here:

[Screenshot of the model's "Files and versions" tab on the Hugging Face Hub, showing the size of the weight file]
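
If you'd rather estimate it programmatically, a minimal sketch is to count the parameters and multiply by 4 bytes per parameter for fp32 (this loads the model first, so it estimates the weight footprint rather than measuring actual usage):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Each fp32 parameter takes 4 bytes, so the weight footprint is roughly:
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params:,} parameters ~ {num_params * 4 / 1024**3:.3f}GB in fp32")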

You can also load the model on Google Colab with a CPU-only runtime to see if it works well there; those are fairly low-end machines as far as I know. E.g.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
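
If you want to see the actual resident memory on CPU, one option is to compare the process memory before and after loading and encoding (a rough sketch; it assumes the optional psutil package is installed):

import psutil
from sentence_transformers import SentenceTransformer

process = psutil.Process()
before = process.memory_info().rss

# Load the model and encode one sentence on CPU.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
embedding = model.encode("This is a test sentence.")

after = process.memory_info().rss
print(f"{(after - before) / 1024**3:.3f}GB additional memory after loading and encoding")

Note that this also counts tokenizer and framework overhead, so it will typically come out somewhat above the pure weight size.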
- Tom Aarsen
