Question about being able to load the model

#2
by vmajor - opened

Since the GGUF/llama.cpp option is broken (for now?), I want to see if transformers has 'advanced' enough to allow loading models in 8-bit when using CPUs for inference.

I have a woeful amount of VRAM, but with 256 GB of RAM a quantized version of the model should fit. Transformers hasn't been relevant to me for a while because of its VRAM-focused development, so I'm out of touch with its ability to use the CPU with quantization methods. I'd be interested if you, or someone else who haunts your page, could chime in and offer some advice.
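For reference, here's a minimal sketch of what a pure-CPU load looks like in current transformers. The model id is an assumption (point it at the actual repo), and note that the 8-bit path (load_in_8bit via bitsandbytes) has historically required a CUDA GPU, which is exactly the open question here; this sketch only shows the plain CPU route that is known to work, given enough RAM.

```python
# Sketch: load a model entirely into system RAM with transformers.
# Assumptions: model id below is hypothetical; you have enough RAM for the
# chosen dtype. True 8-bit (BitsAndBytesConfig(load_in_8bit=True)) has
# historically needed a CUDA device, so it is not shown here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jondurbin/airoboros-180b-2.2.1"  # assumed repo name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # halves memory vs. fp32; CPU bf16 is supported
    device_map={"": "cpu"},       # keep every layer on the CPU
    low_cpu_mem_usage=True,       # stream weights instead of materializing twice
)

inputs = tokenizer("Hello,", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```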

Someone has posted a quant that works with llama.cpp here: https://huggingface.co/imi2/airoboros-180b-2.2.1-gguf
Just make sure you're running the latest version of llama.cpp and follow the instructions for merging the files.
Here's the command I use to run it:
./server --model models/airoboros-180b-2.2.1-Q5_K_M.gguf --n-gpu-layers 128 --ctx-size 4090 --port 5005 --host 0.0.0.0 --parallel 1 --cont-batching --threads 24
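Once the server is up, you can hit it from any HTTP client. A small sketch, assuming it's reachable on localhost:5005 (per the --port flag above) and that your build exposes the standard /completion endpoint; the prompt format is just an assumption:

```python
# Sketch: send a completion request to the llama.cpp server started above.
import requests

resp = requests.post(
    "http://localhost:5005/completion",
    json={
        "prompt": "USER: Why is the sky blue?\nASSISTANT:",  # assumed prompt format
        "n_predict": 128,      # max tokens to generate
        "temperature": 0.7,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["content"])
```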
