Response time

#4
by praxis-dev - opened

Hi. This model takes about 1.5 minutes to respond on two Tesla T4s on the Modal (modal.com) platform. How should I interpret this number: is it fast or slow? Can it be optimized?

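Hmm, I think the problem is that you first have to install llama.cpp with cuBLAS support, since without it the model runs on the CPU only. Then also set the number of GPU layers to something like 50, so the layers are actually offloaded to the GPUs.
This should increase speed massively!
However, the fastest way to run a model on GPUs is usually ExLlamaV2, so try that instead.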
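
In case it helps, here is a rough sketch of the GPU-layers suggestion using the llama-cpp-python bindings, plus a small timer so you can compare before and after. A few things here are my assumptions and not from this thread: that you use llama-cpp-python at all, the placeholder path `model.gguf`, the context size of 2048, and the helper names `gpu_offload_kwargs` and `timed_completion`, which are made up for illustration. The package has to be built with CUDA support for the offload to do anything; the build flag name has changed between versions, so check its README.

```python
import time


def gpu_offload_kwargs(model_path, n_gpu_layers=50, n_ctx=2048):
    """Build constructor arguments for llama_cpp.Llama with GPU offload.

    n_gpu_layers=50 follows the suggestion above; n_ctx=2048 is an assumption.
    """
    return {"model_path": model_path, "n_gpu_layers": n_gpu_layers, "n_ctx": n_ctx}


def timed_completion(llm, prompt, max_tokens=128):
    """Run one completion and return (text, seconds, tokens per second)."""
    start = time.perf_counter()
    out = llm(prompt, max_tokens=max_tokens)
    elapsed = time.perf_counter() - start
    n_tokens = out["usage"]["completion_tokens"]
    rate = n_tokens / elapsed if elapsed > 0 else float("inf")
    return out["choices"][0]["text"], elapsed, rate


if __name__ == "__main__":
    # Imported here so the helpers above work without the package installed.
    from llama_cpp import Llama

    # "model.gguf" is a placeholder; point it at your own model file.
    llm = Llama(**gpu_offload_kwargs("model.gguf"))
    text, seconds, rate = timed_completion(llm, "Hello, how are you?")
    print(f"{seconds:.1f} s total, {rate:.1f} tokens/s")
    print(text)
```

Tokens per second is a more useful number than total wait time, because the total depends on how long the reply is. If the rate barely changes after setting the GPU layers, the build is probably still CPU-only.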
