Reducing Latency in Locally Hosted model

#8
opened by anshulchandel

I've been hosting the DeepSeek 6.7B model locally on my machine for some time now, and while the results are impressive, I'm seeing higher latency than I'd like. I'm reaching out to gather insights and strategies from this community on how to reduce it.

anshulchandel changed discussion title from Reducing Latency in Locally Hosted DeepSeek 6.7b LLM Model to Reducing Latency in Locally Hosted model

@anshulchandel instead of using GGUF, use ExLlamaV2 if you can fit the model onto your GPU. That will be a bit under 2x faster than llama.cpp.
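In case it helps anyone following along, here's a rough sketch of what that looks like with the exllamav2 Python package instead of llama.cpp. The model directory is a placeholder for wherever you've downloaded an EXL2 quant of the 6.7B model, and the class and method names are taken from the exllamav2 examples I've used; double-check them against the version you install, since the API has shifted between releases.

```python
# Minimal sketch: run an EXL2-quantized model with exllamav2 on a single GPU.
# The model path is a placeholder; the API may differ slightly by version.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

model_dir = "/path/to/deepseek-6.7b-exl2"  # placeholder: local EXL2 quant directory

# Build the config from the quantized model directory and load onto the GPU
config = ExLlamaV2Config()
config.model_dir = model_dir
config.prepare()

model = ExLlamaV2(config)
model.load()

tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
generator.warmup()

# Basic sampling settings
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9

prompt = "Write a Python function that checks whether a number is prime."
output = generator.generate_simple(prompt, settings, num_tokens=256)
print(output)
```

As the comment above says, this only pays off if the whole model plus KV cache fits in VRAM; otherwise llama.cpp with GGUF and partial GPU offload remains the practical route.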
