How to launch grok-1-IQ3_XS-split-00001-of-00009.gguf?

#13
by UH7yx - opened

Dear Gurus,
Could you please advise on steps to launch grok-1-IQ3_XS-split-00001-of-00009.gguf?
I see a strange server command line. Where does it come from?
Appreciate any step by step guide.

Owner

Have a look at llama.cpp.

  1. Follow the Build / installation instructions for llama.cpp for your system.
  2. a) Download all the split files into one folder and run llama.cpp (main or server, depending on your needs), specifying the first split file as the model (-m <path/to/first/split/file>); an example is sketched below. Or
    b) just run main or server with those strange command line options from the README to have llama.cpp download the files for you and then run.
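
A minimal sketch of option a (the path, context size, and port below are illustrative, not from the README; -ngl 99 offloads all layers to the GPU, and llama.cpp picks up the remaining splits from the first one):
./main -m /path/to/grok-1-IQ3_XS-split-00001-of-00009.gguf -ngl 99 -p "Hello! Who are you?"
./server -m /path/to/grok-1-IQ3_XS-split-00001-of-00009.gguf -ngl 99 -c 2048 --port 8080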

To make it a little clearer:

  1. Download and compile llama.cpp as usual.
    Note: Use the latest version - Grok support was merged into the main branch three days ago.
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make clean && LLAMA_CUBLAS=1 make
  2. For the 3-bit quant, i.e. IQ3_XS, download all nine split files (a loop version is sketched after this list):
wget https://huggingface.co/Arki05/Grok-1-GGUF/resolve/main/grok-1-IQ3_XS-split-00001-of-00009.gguf
wget https://huggingface.co/Arki05/Grok-1-GGUF/resolve/main/grok-1-IQ3_XS-split-00002-of-00009.gguf
wget https://huggingface.co/Arki05/Grok-1-GGUF/resolve/main/grok-1-IQ3_XS-split-00003-of-00009.gguf
wget https://huggingface.co/Arki05/Grok-1-GGUF/resolve/main/grok-1-IQ3_XS-split-00004-of-00009.gguf
wget https://huggingface.co/Arki05/Grok-1-GGUF/resolve/main/grok-1-IQ3_XS-split-00005-of-00009.gguf
wget https://huggingface.co/Arki05/Grok-1-GGUF/resolve/main/grok-1-IQ3_XS-split-00006-of-00009.gguf
wget https://huggingface.co/Arki05/Grok-1-GGUF/resolve/main/grok-1-IQ3_XS-split-00007-of-00009.gguf
wget https://huggingface.co/Arki05/Grok-1-GGUF/resolve/main/grok-1-IQ3_XS-split-00008-of-00009.gguf
wget https://huggingface.co/Arki05/Grok-1-GGUF/resolve/main/grok-1-IQ3_XS-split-00009-of-00009.gguf
  3. Run:
./main --model grok-1-IQ3_XS-split-00001-of-00009.gguf -ngl 99 -p "Hello! Who are you?"
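
As mentioned in step 2, if typing the nine wget lines gets tedious, a simple shell loop (assuming the filename pattern above) fetches all the splits:
for i in $(seq 1 9); do
  wget "https://huggingface.co/Arki05/Grok-1-GGUF/resolve/main/grok-1-IQ3_XS-split-0000${i}-of-00009.gguf"
done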

A machine with two H100 or A100 80 GB GPUs should work.
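
As a rough sanity check (assuming Grok-1's ~314B parameters and roughly 3.3 bits per weight for IQ3_XS, both approximate): 314e9 × 3.3 / 8 ≈ 130 GB of weights, which fits into 2 × 80 GB = 160 GB of VRAM with some headroom left for the KV cache.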

Must say I am disappointed... I am almost certain there is some bug or something. It was a total failure for me, but maybe it will be fixed or tuned further. Still, I have to say this runs on a machine with two H100s, which costs about $4-6/hour. Not cheap and totally useless currently, but very impressive that it's even possible. Will be waiting for updates. Probably some mid-range big guns, like Hermes and Wizard, will fine-tune these models further.

@Arki05 Many thanks!

P.S. It would be nice to see a Grok Alpaca or Guanaco fine-tune or something.

P.S. On vast.ai I needed the template nvidia/cuda:12.0.1-devel-ubuntu20.04 -> Edit -> Run a jupyter notebook.

P.S. With the newer GPUs, CUDA 11.7 doesn't work. Instead you need a recent CUDA toolkit to compile llama.cpp for H100 and A100.
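
To check what you have before compiling (both are standard commands in a CUDA image):
nvcc --version   # version of the CUDA toolkit that will compile llama.cpp
nvidia-smi       # driver version and the highest CUDA version the driver supports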

Thank you so very much, gentlemen. I will try it on my CUDA-less laptop with 64 GB of RAM. Perhaps it will at least start. The original non-quantized Grok-1 release just told me 'cuda not found' and crashed...

Somewhere else Arki05 said that only two experts are active for each token, so a 2-bit or even 3-bit quant can work by loading, for each token, only the two experts it needs; it will take forever but might not crash. Anyway, for now it mostly returns junk, so we will need to wait for further tuning of the model.

@UH7yx 'cuda not found' sounds like an environment-setup problem. The original release needed JAX and some other installations there. llama.cpp is way simpler to install, or you can download prebuilt binaries. None of which changes the fact that, for now, it's all junk.

The funny thing is that it even works with 16 GB of RAM (presumably because llama.cpp memory-maps the weights and pages them from disk), but it is extremely slow.
BTW, to the test phrase "Hello, who are you?" it answers "Hi! My name is Sara, and I'm a UX designer from San Francisco. I'm a total goofball and a huge bookworm. I love learning, trying new things, and",
which is total bullshit. Can we regulate the temperature somehow? :)

Yes, this model is crazy and isn't worth installing. It jokes and trolls, but never gives useful answers, unlike any other model. A total disappointment and a waste of money.
It's more of a joke generator than a regular AI.

@UH7yx Check out the llama.cpp flags with main --help; --temp is the temperature flag.
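
For example (the same command as above, just with a lower temperature; 0.2 is an arbitrary illustrative value):
./main --model grok-1-IQ3_XS-split-00001-of-00009.gguf -ngl 99 --temp 0.2 -p "Hello! Who are you?"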

But I ran it with temperature 0 and it's the same junk - it just needs a fine-tuning layer on top of this base model. That happens with a lot of base-model releases.

It's not such a good model for now - but it's a base model, which means the next iteration of training on top of it should work pretty well. That will take a while, as the model is huge, so training it will require time and resources.
