Hardware Requirements for Q4_K_M

by ShivanshMathur007

What is the minimum hardware required, and what hardware is recommended, to run Q4_K_M at a decent speed? Any ideas?

@ShivanshMathur007 I believe it should just barely fit on a 24 GB VRAM GPU, and you should get very fast speeds. However, you might hit an out-of-memory error if you aren't running headless, and you won't really fit much context.
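For a rough sense of why it's borderline, here's a back-of-envelope estimate, assuming Mixtral 8x7B's roughly 46.7B total parameters and an average of about 4.5 bits per weight for Q4_K_M (both approximations):

```python
# Back-of-envelope VRAM estimate; figures are approximate, not exact.
total_params = 46.7e9    # Mixtral 8x7B total parameter count (approx.)
bits_per_weight = 4.5    # rough average for Q4_K_M quantization
weight_bytes = total_params * bits_per_weight / 8
print(f"Weights alone: ~{weight_bytes / 1024**3:.1f} GiB")  # ~24.5 GiB
# The KV cache and CUDA overhead come on top of this,
# so a 24 GB card is tight before you've loaded any context.
```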

You'll probably have to offload some layers to the CPU, and it should still run at a reasonable speed, around 10-15 tokens per second.
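If you're using llama-cpp-python directly, partial offload looks roughly like this; the model path and layer count are placeholders to tune to your VRAM:

```python
from llama_cpp import Llama

# Partial GPU offload: layers beyond n_gpu_layers stay on the CPU.
# Path and layer count are illustrative; adjust for your setup.
llm = Llama(
    model_path="./mixtral-8x7b-v0.1.Q4_K_M.gguf",
    n_gpu_layers=20,   # offload ~20 of Mixtral's 32 layers to the GPU
    n_ctx=4096,        # a smaller context window also saves VRAM
)
out = llm("Q: What is 2 + 2? A:", max_tokens=16)
print(out["choices"][0]["text"])
```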

So far I haven't even managed to run mixtral-8x7b-v0.1.Q3_K_M.gguf on a 3090 Ti (24 GB VRAM); only mixtral-8x7b-v0.1.Q2_K.gguf works. Even with the smallest model, asking questions about a document (embedding_model: intfloat/multilingual-e5-large) is not feasible: the answers don't reference the uploaded files, and the model stalls and has to be "restarted".

That said, this is with a PrivateGPT frontend; perhaps someone knows a workaround.

@Rhylean The problem is that you are using an embedding model, and a large one, so it takes up a decent amount of VRAM as well. Either use a very small one, or none at all, if you want to run Mixtral fully on the GPU; see the sketch below.
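One option along those lines, if your frontend lets you pick the embedding model: load a much smaller one and pin it to the CPU so the GPU stays free for Mixtral. A minimal sketch with sentence-transformers (the model name and device are just one possible choice, not what PrivateGPT uses by default):

```python
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 is on the order of 100 MB, versus roughly 2 GB for
# multilingual-e5-large; device="cpu" keeps it off the GPU entirely.
embedder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")
vectors = embedder.encode(["a chunk of an uploaded document"])
print(vectors.shape)  # (1, 384)
```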

You can also offload to the CPU. PrivateGPT uses llama.cpp under the hood, so it should support that feature, though it may not expose it directly.
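For reference, this is the kind of line to look for. Assuming PrivateGPT reaches llama.cpp through llama-index's LlamaCPP wrapper, the offload count would be passed via model_kwargs; this is a sketch under that assumption, not PrivateGPT's exact code, and the path and values are placeholders:

```python
from llama_index.llms.llama_cpp import LlamaCPP

# n_gpu_layers is forwarded to llama.cpp: -1 offloads every layer,
# while a smaller number splits the model between GPU and CPU.
llm = LlamaCPP(
    model_path="./models/mixtral-8x7b-v0.1.Q4_K_M.gguf",
    context_window=4096,
    model_kwargs={"n_gpu_layers": 20},  # placeholder; tune to your VRAM
)
```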

@YaTharThShaRma999 Thank you for your reply. I found the critical line in the LLM's .py file and got it working.
