Loading the model: 26 GB?

#7
by MooCow27 - opened

I was trying to load the model to integrate it with LlamaIndex, but does running this really use 26 GB of VRAM? Is there a way to reduce that?

Thanks!

The model would likely need to be quantized to use less memory. You could probably load it as-is with the --load-in-8bit flag in text-generation-webui (the 8-bit feature is provided by the bitsandbytes Python dependency).
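If you want roughly the same thing outside the webui, here's a minimal sketch using transformers; it assumes recent transformers, bitsandbytes, and accelerate are installed, and the repo id below is just a placeholder:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-org/your-model"  # placeholder: substitute the actual repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # let accelerate spread layers across available GPUs/CPU
)
```

With 8-bit weights you're at about 1 byte per parameter, so the weight footprint is roughly a quarter of FP32 (or half of FP16), plus some overhead for activations and the KV cache.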

To take it down further, it could be quantized to 4 bits. There's another discussion thread here that covers that.

For 8-bit, you can run the model in its current form. For 4-bit, you'll have to run a quantization step yourself, which takes a while but is entirely doable on a local machine.
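(Aside: newer transformers/bitsandbytes versions can also quantize to 4-bit on the fly at load time, which skips the offline quantization step at some quality/speed cost. A rough sketch, assuming those versions and the same placeholder repo id:)

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # normalized-float 4-bit weights
    bnb_4bit_compute_dtype=torch.float16,  # do the matmuls in FP16
)

model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-model",  # placeholder: substitute the actual repo id
    quantization_config=bnb_config,
    device_map="auto",
)
```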

I bet this model was released as FP32 instead of FP16.
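If that's the case, simply forcing FP16 at load time would roughly halve the memory use even without any quantization. A sketch, assuming a standard transformers checkpoint and a placeholder repo id:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-model",      # placeholder: substitute the actual repo id
    torch_dtype=torch.float16,  # cast FP32 weights down to FP16 on load
    device_map="auto",
)
```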
