Q6_K version is broken

#19
by tankstarwar - opened

The Q6_K version seems broken: I get rubbish output from this model, not even readable text. The Q8_0 version works just fine.

No problem with this on my end with mixtral-8x7b-instruct-v0.1.Q6_K.gguf.
I use oobabooga/text-generation-webui on an RTX 3060 12GB and get roughly 2 tokens/sec.

I use "instruct" in chat. it is important.

Hmm, any chance the model could have gotten corrupted during download? I did experience some network interruptions, but usually with the HF CLI I can resume downloading without any issue.
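One way I could rule that out: compare the local file's SHA256 with the checksum listed on the file's page here and re-download if it differs. A rough sketch with huggingface_hub (the repo ID and filename below are my assumptions, adjust to what you actually grabbed):

```python
import hashlib
from huggingface_hub import hf_hub_download

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so a tens-of-GB GGUF never has to sit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha256_of("mixtral-8x7b-instruct-v0.1.Q6_K.gguf"))  # compare with the hash shown on the model page

# If the hashes differ, just pull it again; hf_hub_download picks up partial downloads.
path = hf_hub_download(
    repo_id="TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF",   # assumed repo ID
    filename="mixtral-8x7b-instruct-v0.1.Q6_K.gguf",
)
print(path)
```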

I'm pretty sure it's not a prompting issue; the llama.cpp command line can load the model, but the output is not human-readable at all.
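If it helps anyone reproduce this outside the webui, here is roughly the kind of sanity check I mean, written with llama-cpp-python rather than the exact CLI call I used (model path and GPU layer count are placeholders):

```python
from llama_cpp import Llama

# Load the Q6_K file directly and run one short completion.
llm = Llama(
    model_path="mixtral-8x7b-instruct-v0.1.Q6_K.gguf",  # assumed local path
    n_ctx=2048,
    n_gpu_layers=10,  # whatever fits in your VRAM; 0 = CPU only
)

out = llm("[INST] Name three planets. [/INST]", max_tokens=64)
print(out["choices"][0]["text"])  # readable text means the file itself is probably fine
```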

Anyway, thanks for the repo!

How much system RAM do you have? I have a 3060 12GB too, plus 16 GB RAM, and last I checked even Q4_K_M wouldn't run.
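Rough back-of-envelope, with approximate bits-per-weight for the common quants (these are estimates, not exact file sizes):

```python
# Mixtral 8x7B has roughly 46.7B parameters in total (all experts counted).
PARAMS = 46.7e9
BPW = {"Q4_K_M": 4.5, "Q6_K": 6.6, "Q8_0": 8.5}  # rough effective bits per weight

budget_gb = 12 + 16  # 12 GB VRAM on the 3060 + 16 GB system RAM

for name, bpw in BPW.items():
    size_gb = PARAMS * bpw / 8 / 1e9
    print(f"{name}: ~{size_gb:.0f} GB weights vs {budget_gb} GB total memory")
# Q4_K_M comes out around 26 GB, which leaves almost nothing for the OS and context;
# Q6_K is around 39 GB, clearly more than 12 GB VRAM + 16 GB RAM.
```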

I can't comment much on speed, as it was slow, but Q6_K works for me using ooba (64 GB Xeon box + 20 GB GPU).

In oobabooga/text-generation-webui there are two options to load this mixtral-8x7b-instruct-v0.1.Q6_K.gguf model:
- Model loader: llama.cpp, which is slow (2-3 tokens/sec) but seems by far the best local LLM setup I can run on my hardware (see above).
- Model loader: ctransformers, which is fast (17-29 tokens/sec) but doesn't seem as clever. For example, the snake.py it generated always failed, and I tried many options...

llama.cpp is what I normally use for GGUF. Slow, but reliable.

@robert1968 Hmm, that's interesting. I don't think ctransformers supports Mixtral, and ctransformers is usually noticeably slower than llama.cpp since it uses much older llama.cpp/ggml versions.
So I don't know; what if you're actually running it as plain Mistral in ctransformers?
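If you want to check, ctransformers makes you pick (or it guesses) a model_type when loading; as far as I know there is no "mixtral" type, so the closest you can do is force "mistral", something like this (path and gpu_layers are placeholders, and it may simply fail to load):

```python
from ctransformers import AutoModelForCausalLM

# Force the model_type explicitly; if only plain "mistral" is available,
# Mixtral's MoE routing is not actually supported by this backend.
llm = AutoModelForCausalLM.from_pretrained(
    "mixtral-8x7b-instruct-v0.1.Q6_K.gguf",  # assumed local path
    model_type="mistral",
    gpu_layers=10,  # adjust to your VRAM
)
print(llm("[INST] Name three planets. [/INST]", max_new_tokens=64))
```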
