Thank you for the hint about fixing the eos_token_id

#1 by whgerber - opened

I could finally test the 70B Llama-3 model with IQ3_XS quants on my MacBook with 36 GB RAM. The fix looked like this in my case:

python3 ./llama.cpp/gguf-py/scripts/gguf-set-metadata.py Meta-Llama-3-70B-Instruct.IQ3_XS.gguf tokenizer.ggml.eos_token_id 128009 --force
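Setting tokenizer.ggml.eos_token_id to 128009 points the stop check at <|eot_id|>, the end-of-turn token the instruct model actually emits. If you want to verify that the write took effect, the same gguf-py scripts directory has a dump tool (a rough sketch; check the flag names against your llama.cpp checkout):

python3 ./llama.cpp/gguf-py/scripts/gguf-dump.py --no-tensors Meta-Llama-3-70B-Instruct.IQ3_XS.gguf | grep eos_token_id

It should now report 128009 instead of the original 128001 (<|end_of_text|>).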

Before applying the fix, the model would only stop when the max_tokens limit was reached. On my MacBook the Q3_K_S quant was too large to load (though I had one that already included the fix), but the IQ3_XS quant worked well after raising the wired-memory limit for the GPU to 32000 MB with this command:

sudo sysctl iogpu.wired_limit_mb=32000
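Two small notes in case anyone copies this: the value is in MB and the setting resets on reboot, and you can check the current limit before changing it (at least on my macOS install) with:

sysctl iogpu.wired_limit_mb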

You're welcome! I now also have fixed 70B quants available: https://huggingface.co/qwp4w3hyb/Meta-Llama-3-70B-Instruct-iMat-GGUF

qwp4w3hyb changed discussion status to closed
