Any way to 'drop' the model to save GPU RAM?

#12 opened by rag-perplexity

Hi, thank you for the reranker. It's working wonderfully in my RAG process.
One thing I am curious about is whether there is a way to unload the model in Python. In one of my RAG applications I use Streamlit as the front end, and I noticed that GPU memory was constantly getting used up the longer I used the application.

To illustrate: at app initialisation the code below is executed to load the reranker, and my GPU RAM usage increases by 1.5 GB.

from FlagEmbedding import FlagReranker

model = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True)

I then use the app by inputting a prompt, and my GPU RAM usage increases by another 2.5 GB when this part of the code gets executed:

ranks = model.compute_score(sentence_pairs)
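For context, sentence_pairs is just a list of [query, passage] pairs for the reranker to score. A minimal sketch of how it might be built (prompt and retrieved_passages are placeholders of mine, not part of the original snippet):

# one [query, passage] pair per retrieved chunk; compute_score returns one score per pair
sentence_pairs = [[prompt, passage] for passage in retrieved_passages]
ranks = model.compute_score(sentence_pairs)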

The problem is that whenever a new prompt is inputted, GPU RAM usage increases when the code gets to the model.compute_score part; this eventually leads to all the GPU RAM being used and the application becoming unusable.

One way I could deal with this is to 'drop' the model to free up the GPU RAM whenever the model.compute_score execution has finished, but I don't know how. Any advice would be greatly appreciated, thanks!

All good.
Just realised torch.cuda.empty_cache() coupled with aggressive variable deletion and gc.collect() will do the trick.
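For anyone who lands on this later, a rough sketch of that clean-up, assuming the model and sentence_pairs from the snippets above (this is my reading of the fix, not verbatim from the post):

import gc
import torch

ranks = model.compute_score(sentence_pairs)

# drop every Python reference to the model so its CUDA tensors become unreachable
del model
gc.collect()

# hand PyTorch's now-unused cached blocks back to the driver so the memory
# actually shows up as free (e.g. in nvidia-smi)
torch.cuda.empty_cache()

The trade-off is that the reranker then has to be reloaded before the next prompt; if that reload is too slow, calling torch.cuda.empty_cache() after each compute_score (without deleting the model) may at least keep the reported usage from growing with every query.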
