Q6_K version is broken
The Q6_K version seems broken: I get rubbish output from this model, not even readable text. The Q8_0 version works just fine.
No problem with this on my end with mixtral-8x7b-instruct-v0.1.Q6_K.gguf.
I use oobabooga/text-generation-webui on an RTX 3060 12GB and get about ~2 tokens/sec response speed.
I use "instruct" mode in chat; that is important.
Hmm, any chance the model could have been corrupted during downloading? I did experience some network interruptions, although with huggingface-cli I can usually resume downloading without any issue.
It's not a prompting issue, I'm sure: the llama.cpp command line can load the model, but the output is not human-readable at all.
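One way to rule out a corrupted download is to hash the local file and compare the digest against the SHA256 listed for that file on the model page. A minimal sketch using only the Python standard library; the path is a placeholder.

```python
# Minimal sketch: compute the SHA256 of the local GGUF and compare it
# against the checksum shown on the file's page on the Hub.
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha256_of("mixtral-8x7b-instruct-v0.1.Q6_K.gguf"))  # placeholder path
```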
Anyway, thanks for the repo!
How much system RAM do you have? I have a 3060 12GB too, with 16 GB of RAM, and last I checked even Q4_M wouldn't work.
In oobabooga/text-generation-webui there are two options to load this mixtral-8x7b-instruct-v0.1.Q6_K.gguf model:
- Model loader: llama.cpp, which is slow (2-3 tokens/sec) but seems by far the best local LLM I can run on my hardware (see above).
- Model loader: ctransformers, which is fast (17-29 tokens/sec) but seems less capable. For example, a snake.py generated with it always failed, and I tried many options. (A rough tokens/sec check outside the webui is sketched below.)
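For a rough, reproducible tokens/sec number outside the webui, here is a minimal sketch assuming llama-cpp-python; the model path, prompt, and n_gpu_layers are placeholders.

```python
# Minimal sketch, assuming llama-cpp-python and a local GGUF file:
# rough generation-speed measurement for comparing loaders/offload settings.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-instruct-v0.1.Q6_K.gguf",  # placeholder path
    n_gpu_layers=20,  # placeholder offload value
)

start = time.time()
out = llm("[INST] Count to ten. [/INST]", max_tokens=128)
elapsed = time.time() - start

n_generated = out["usage"]["completion_tokens"]
print(f"{n_generated} tokens in {elapsed:.1f}s -> {n_generated / elapsed:.1f} tokens/sec")
```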
@robert1968
Hmm, that's interesting. I don't think ctransformers supports Mixtral? And ctransformers is usually noticeably slower than llama.cpp, as it uses much older versions.
So I don't know, what if it is actually running the file as plain Mistral in ctransformers?
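For what it's worth, here is a hedged sketch of how ctransformers loads a GGUF: it takes an explicit model_type, and as far as I know it never gained a "mixtral" type, so the file would presumably be treated as plain Mistral, which could explain the quality drop. This assumes the ctransformers package is installed; the path and gpu_layers value are placeholders.

```python
# Minimal sketch, assuming the ctransformers package. The loader needs an
# explicit model_type; with no "mixtral" type available, the file would be
# loaded as if it were plain Mistral, so Mixtral-specific parts are
# presumably not handled correctly.
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "mixtral-8x7b-instruct-v0.1.Q6_K.gguf",  # placeholder local path
    model_type="mistral",                     # no mixtral type, as far as I know
    gpu_layers=20,                            # placeholder offload value
)
print(llm("[INST] Say hello. [/INST]", max_new_tokens=64))
```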