Enhance response time

#8
by Janmejay123 - opened

I used the llama-2-7b-chat.ggmlv3.q8_0.bin model and tried to get a response from the prompt I provided, but it takes around 2-3 minutes to return a response locally. How can I reduce the response time?

@Situn007
Well, first I'd suggest updating your llama.cpp (or whatever tool you are using). GGML is a very outdated format; you should use GGUF models instead.

If you are using a GPU, install llama.cpp with cuBLAS support and set the number of GPU layers to -1 (offload all layers to the GPU).

Otherwise, install llama.cpp with OpenBLAS, as sketched below.
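
For example, here is a minimal sketch assuming you are using the llama-cpp-python bindings; the install commands, model file name, and prompt are illustrative, not taken from your setup:

```python
# Reinstall llama-cpp-python with GPU (cuBLAS) support, assuming an NVIDIA GPU:
#   CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install --force-reinstall --no-cache-dir llama-cpp-python
# Or, CPU-only with OpenBLAS:
#   CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install --force-reinstall --no-cache-dir llama-cpp-python

from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b-chat.Q8_0.gguf",  # example GGUF file name
    n_gpu_layers=-1,   # offload all layers to the GPU (use 0 for CPU-only)
    n_ctx=2048,        # context window size
)

output = llm("Q: What is the capital of France? A:", max_tokens=64)
print(output["choices"][0]["text"])
```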

@YaTharThShaRma999
Can I use a GGUF model the same way I am currently using GGML?

@Janmejay123 yeah, it's exactly the same thing, except GGUF has a bunch of metadata attached (prompt format, RoPE settings, and more), so it's easier to run. It's still a single file.

And since you'll be updating llama.cpp anyway, it should be much faster, as new optimizations have been introduced.
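
In other words, only the model file changes; the loading code stays the same. A quick sketch, again assuming llama-cpp-python and placeholder file names:

```python
from llama_cpp import Llama

# Before: loading a GGML file (only supported by older llama-cpp-python versions)
# llm = Llama(model_path="llama-2-7b-chat.ggmlv3.q8_0.bin")

# After: loading a GGUF file is the same call; prompt format, RoPE settings,
# etc. are picked up from the file's embedded metadata.
llm = Llama(model_path="llama-2-7b-chat.Q8_0.gguf", n_gpu_layers=-1)
```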
