Reducing Latency in Locally Hosted Model
#8 · opened by anshulchandel
I've been hosting the DeepSeek 6.7B model locally on my machine for some time now, and while the results are impressive, I'm seeing higher latency than I'd like. I'm reaching out to gather insights and strategies from this community on how to reduce it.
anshulchandel changed discussion title from "Reducing Latency in Locally Hosted DeepSeek 6.7b LLM Model" to "Reducing Latency in Locally Hosted Model"
@anshulchandel Instead of using GGUF, use exllamav2 if you can fit the model entirely onto your GPU. That will be a bit under 2x faster than llama.cpp.
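For reference, here's a minimal generation sketch using exllamav2's base generator. The model path and checkpoint name are placeholders (you'd need an EXL2-quantized DeepSeek 6.7B, not the GGUF file), and the flow follows the ExLlamaV2BaseGenerator example; newer exllamav2 releases also ship a dynamic generator with a different API.

```python
# Minimal exllamav2 inference sketch.
# Assumes an EXL2-quantized DeepSeek 6.7B checkpoint already sits in
# model_directory -- the path below is a placeholder.
import time

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

model_directory = "/models/deepseek-coder-6.7b-instruct-exl2"  # placeholder path

# Read the model config from the quantized checkpoint directory.
config = ExLlamaV2Config()
config.model_dir = model_directory
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)   # KV cache allocated as layers load
model.load_autosplit(cache)                # split weights across available GPU memory

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9

prompt = "Write a Python function that checks whether a number is prime."
max_new_tokens = 200

generator.warmup()                         # throwaway pass so timing excludes CUDA init
start = time.time()
output = generator.generate_simple(prompt, settings, max_new_tokens)
elapsed = time.time() - start

print(output)
print(f"{max_new_tokens / elapsed:.2f} tokens/second")
```

Note that exllamav2 loads EXL2 (or GPTQ) weights, not GGUF, so you'd either grab a pre-quantized EXL2 upload of the model or convert the original FP16 weights with the repo's conversion script first.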