Reducing Latency in Locally Hosted Model
#8 · opened by anshulchandel
I've been hosting the DeepSeek 6.7B model locally on my machine for some time now, and while the results are impressive, I'm seeing higher latency than I'd like. I'm reaching out to gather insights and strategies from this community on how to reduce it.
anshulchandel changed discussion title from "Reducing Latency in Locally Hosted DeepSeek 6.7b LLM Model" to "Reducing Latency in Locally Hosted Model"
@anshulchandel Instead of using GGUF, use exllamav2 if you can fit the model entirely onto your GPU. That will be a bit under 2x faster than llama.cpp.
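For reference, here's a minimal generation sketch using exllamav2's base generator. The model path and checkpoint name are placeholders (you'd need an EXL2-quantized DeepSeek 6.7B, not the GGUF file), and the flow follows the ExLlamaV2BaseGenerator example; newer exllamav2 releases also ship a dynamic generator with a different API.

```python
# Minimal exllamav2 inference sketch.
# Assumes an EXL2-quantized DeepSeek 6.7B checkpoint already sits in
# model_directory -- the path below is a placeholder.
import time

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

model_directory = "/models/deepseek-coder-6.7b-instruct-exl2"  # placeholder path

# Read the model config from the quantized checkpoint directory.
config = ExLlamaV2Config()
config.model_dir = model_directory
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)   # KV cache allocated as layers load
model.load_autosplit(cache)                # split weights across available GPU memory

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9

prompt = "Write a Python function that checks whether a number is prime."
max_new_tokens = 200

generator.warmup()                         # throwaway pass so timing excludes CUDA init
start = time.time()
output = generator.generate_simple(prompt, settings, max_new_tokens)
elapsed = time.time() - start

print(output)
print(f"{max_new_tokens / elapsed:.2f} tokens/second")
```

Note that exllamav2 loads EXL2 (or GPTQ) weights, not GGUF, so you'd either grab a pre-quantized EXL2 upload of the model or convert the original FP16 weights with the repo's conversion script first.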