Reranker Model Performance Optimization

#5 by kazmi09

Hello,
We are using the ms-marco-MiniLM-L-6-v2 model for our conceptual search application. Initially we were reranking the top 1000 results, which took an average of 2–2.5 seconds. For deployment, we use Flask with Gunicorn, configured with 5 workers on a single-GPU machine.

Now we are planning to increase the reranking scope from the top 1000 results to the top 3000. However, we are seeing a significant performance hit: the average reranking time has increased to 6–7 seconds. We load the model as described in the model card for the sentence-transformers package.
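For context, our reranking path is roughly the following sketch. A dummy scorer stands in for sentence-transformers' `CrossEncoder.predict` so the batching logic is runnable on its own; the names (`query`, `passages`) and `batch_size=128` are placeholders, not our production values:

```python
# Simplified sketch of the rerank path. The real scorer is a sentence-transformers
# CrossEncoder; a dummy stands in here so the batching logic runs without the model.

def score_batch(pairs):
    # Stand-in for CrossEncoder.predict(pairs): returns one relevance score
    # per (query, passage) pair. Here: longer passage = higher score.
    return [float(len(passage)) for _, passage in pairs]

def rerank(query, passages, batch_size=128):
    # Pair the query with every candidate passage, score in fixed-size batches,
    # then sort candidates by descending score.
    pairs = [(query, p) for p in passages]
    scores = []
    for i in range(0, len(pairs), batch_size):
        scores.extend(score_batch(pairs[i:i + batch_size]))
    return sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)

ranked = rerank("what is a reranker", ["a short passage", "a somewhat longer passage", "p"])
print([p for p, _ in ranked])
```

With the real model, each of the 5 Gunicorn workers would hold its own copy of the scorer on the same GPU, which is part of what we would like advice on.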

Could you please advise if we are doing anything wrong, either in terms of model loading or from a deployment perspective?

Additionally, per the model card, the model is capable of reranking 1800 documents/chunks per second, which seems well above what we are observing.
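A quick back-of-envelope check using only the numbers above (the 6.5 s figure is just the midpoint of our measured 6–7 s range):

```python
# Compare the model card's quoted throughput with our measured latency.
docs_per_sec = 1800                 # throughput quoted in the model card
docs = 3000                         # our new reranking scope
expected = docs / docs_per_sec      # latency implied by the quoted rate
observed = 6.5                      # midpoint of our measured 6-7 s

print(round(expected, 2))           # -> 1.67 seconds expected
print(round(observed / expected, 1))  # -> 3.9, i.e. ~4x slower than quoted
```

So even at the quoted rate, 3000 documents should take well under 2 seconds, which is why we suspect something in our loading or deployment setup.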

Thanks.
