Hardware Requirements for Q4_K_M

by ShivanshMathur007

What is the minimum hardware required, and what hardware is recommended, to run Q4_K_M at a decent speed? Any ideas?

@ShivanshMathur007 I believe it should just barely fit on a 24 GB VRAM GPU, and you should get very fast speeds. However, you might hit an out-of-memory error if you aren't running headless, and you won't really fit much context.
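For a rough sense of why it's borderline, here's a back-of-envelope estimate, assuming Mixtral 8x7B's roughly 46.7B total parameters and an average of about 4.5 bits per weight for Q4_K_M (both approximations):

```python
# Back-of-envelope VRAM estimate; figures are approximate, not exact.
total_params = 46.7e9    # Mixtral 8x7B total parameter count (approx.)
bits_per_weight = 4.5    # rough average for Q4_K_M quantization
weight_bytes = total_params * bits_per_weight / 8
print(f"Weights alone: ~{weight_bytes / 1024**3:.1f} GiB")  # ~24.5 GiB
# The KV cache and CUDA overhead come on top of this,
# so a 24 GB card is tight before you've loaded any context.
```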

You'll probably have to offload some layers to the CPU, and it should still run at a reasonable speed, around 10-15 tokens per second.
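If you're using llama-cpp-python directly, partial offload looks roughly like this; the model path and layer count are placeholders to tune to your VRAM:

```python
from llama_cpp import Llama

# Partial GPU offload: layers beyond n_gpu_layers stay on the CPU.
# Path and layer count are illustrative; adjust for your setup.
llm = Llama(
    model_path="./mixtral-8x7b-v0.1.Q4_K_M.gguf",
    n_gpu_layers=20,   # offload ~20 of Mixtral's 32 layers to the GPU
    n_ctx=4096,        # a smaller context window also saves VRAM
)
out = llm("Q: What is 2 + 2? A:", max_tokens=16)
print(out["choices"][0]["text"])
```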

So far I haven't even managed to run mixtral-8x7b-v0.1.Q3_K_M.gguf on a 3090 Ti (24 GB VRAM); only mixtral-8x7b-v0.1.Q2_K.gguf works. Even with the smallest model, asking questions about a document (embedding_model: intfloat/multilingual-e5-large) is not feasible: the answers don't reference the uploaded files, and the model stalls and has to be "restarted".

That said, this is with a PrivateGPT frontend; perhaps someone knows a workaround.

@Rhylean The problem is that you are using an embedding model, and a large one, so it takes up a decent amount of VRAM as well. Either use a very small one, or none at all, if you want to run Mixtral fully on the GPU; see the sketch below.
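One option along those lines, if your frontend lets you pick the embedding model: load a much smaller one and pin it to the CPU so the GPU stays free for Mixtral. A minimal sketch with sentence-transformers (the model name and device are just one possible choice, not what PrivateGPT uses by default):

```python
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 is on the order of 100 MB, versus roughly 2 GB for
# multilingual-e5-large; device="cpu" keeps it off the GPU entirely.
embedder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")
vectors = embedder.encode(["a chunk of an uploaded document"])
print(vectors.shape)  # (1, 384)
```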

You can also offload to the CPU. PrivateGPT uses llama.cpp under the hood, so it should support that feature, though it may not expose it directly.
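For reference, this is the kind of line to look for. Assuming PrivateGPT reaches llama.cpp through llama-index's LlamaCPP wrapper, the offload count would be passed via model_kwargs; this is a sketch under that assumption, not PrivateGPT's exact code, and the path and values are placeholders:

```python
from llama_index.llms.llama_cpp import LlamaCPP

# n_gpu_layers is forwarded to llama.cpp: -1 offloads every layer,
# while a smaller number splits the model between GPU and CPU.
llm = LlamaCPP(
    model_path="./models/mixtral-8x7b-v0.1.Q4_K_M.gguf",
    context_window=4096,
    model_kwargs={"n_gpu_layers": 20},  # placeholder; tune to your VRAM
)
```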

@YaTharThShaRma999 Thank you for your reply. I found the critical line in the LLM's .py file and got it working.
