Load 13B model with 8-bit/4-bit quantization to support more hardware

#2
by liuhaotian - opened

Hi, LLaVA author here. Thank you for contributing the Huggingface space.

It would be better to keep the model version consistent with the official demo (13B). Quantization can be used to support more hardware; see the discussion here.

I have added support for quantization and the necessary instructions for controlling the quantization bits via the environment variable `bits`.

By default, it is set to 8-bit so that it runs on the A10G (this Space). It can also be set to 4-bit to run on the smaller T4-medium (15G). The quantization bits for the current model are indicated by the model name in the model selector dropdown.
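For reference, here is a minimal sketch of how the `bits` variable could map onto the loader flags. The function name and arguments follow the upstream LLaVA repository's `load_pretrained_model` and the model path is illustrative; treat this as an assumption rather than the exact Space code.

```python
# Sketch (assumption, not the exact Space code): map the `bits` environment
# variable to LLaVA's 8-bit/4-bit quantization flags.
import os
from llava.model.builder import load_pretrained_model

bits = int(os.environ.get("bits", 8))  # 16, 8, or 4

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="liuhaotian/llava-v1.5-13b",  # illustrative 13B checkpoint
    model_base=None,
    model_name="llava-v1.5-13b",
    load_8bit=(bits == 8),   # default: 8-bit for A10G
    load_4bit=(bits == 4),   # 4-bit for T4-medium
)
```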

Thanks.


You can load the model with 8-bit or 4-bit quantization to make it fit on smaller hardware. Set the environment variable `bits` to control the quantization.

Recommended configurations:

| Hardware | Bits |
| --- | --- |
| A10G-Large (24G) | 8 (default) |
| T4-Medium (15G) | 4 |
| A100-Large (40G) | 16 |
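As a sanity check on the table above, a quick back-of-the-envelope estimate of the weight memory for a 13B-parameter model (weights only, ignoring activations and the KV cache) shows why these bit widths fit the listed GPUs:

```python
# Rough weight-memory estimate for a 13B-parameter model (weights only).
params = 13e9
for bits, hardware in [(16, "A100-Large (40G)"),
                       (8, "A10G-Large (24G)"),
                       (4, "T4-Medium (15G)")]:
    gib = params * bits / 8 / 1024**3
    print(f"{bits:>2}-bit: ~{gib:.1f} GiB of weights -> {hardware}")
# 16-bit: ~24.2 GiB, 8-bit: ~12.1 GiB, 4-bit: ~6.1 GiB
```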

Also, please help add the environment variable here to better guide users on where to set the environment variable `bits`.

(Screenshot: 2023-10-10 104513.png)

(Screenshot: 2023-10-10 104519.png)

thank you @liuhaotian!
Previously, I also tried to use 4-bit for it, but there was an issue where bitsandbytes was not configured correctly in the Docker environment of the Space, so it was not possible to use it. Did you have a chance to test that with the changes in this PR?

Yes, I have tested that on a T4-medium here.

Note: the Space is currently compiling/downloading the model because I am trying to see if we can skip the preload part, but it worked before this debugging (which is the version I committed).

Ah, also, do you think the instructions above are taking up too much vertical space? We can change that if it can be made to look better.

tbh, I also dislike the preload part due to:

  • very long build times
  • not being able to cache it

but I mainly did it so that when the Gradio app launches, there is always "a model". If we remove the preload part it will still work, since the worker downloads the model in the background.
However, the user will then see an empty dropdown with no information about the download status, and that felt like bad UX (open to discussing potential solutions :)
Also: I tried the Docker option and was able to cache the downloads, but it couldn't find CUDA, so I gave up on that.
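For the background-download idea, here is a minimal sketch (not the Space's actual code) of starting the model download in a thread and exposing a status flag the UI could read. It assumes huggingface_hub's `snapshot_download` and an illustrative model id:

```python
# Sketch (assumption): download the model in a background thread so the
# Gradio app can launch immediately, while the dropdown label reflects
# the download status instead of staying empty.
import threading
from huggingface_hub import snapshot_download

MODEL_ID = "liuhaotian/llava-v1.5-13b"  # illustrative model id
state = {"status": "downloading"}

def download_model():
    snapshot_download(MODEL_ID)        # fills the local Hugging Face cache
    state["status"] = "ready"

threading.Thread(target=download_model, daemon=True).start()

# The UI can poll state["status"] and show e.g.
# "llava-v1.5-13b (downloading...)" vs "llava-v1.5-13b" in the dropdown.
```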

liuhaotian changed pull request status to closed
liuhaotian changed pull request status to open

The transposed version saves more vertical space but is less intuitive, wdyt?

Recommended configurations:

| Hardware | Bits |
| --- | --- |
| A10G-Large (24G) | 8 (default) |
| T4-Medium (15G) | 4 |
| A100-Large (40G) | 16 |

| Hardware | A10G-Large (24G) | T4-Medium (15G) | A100-Large (40G) |
| --- | --- | --- | --- |
| Bits | 8 (default) | 4 | 16 |

it's looking great! I updated the PR to adopt the transposed layout.

thanks!

badayvedat changed pull request status to merged

One bad thing about the preload:

after removing it, the Space works on even the smallest T4-small.

https://huggingface.co/spaces/liuhaotian/LLaVA

wow, thanks for trying that!
I will look into adding a "downloading" status to the model dropdown component so that the user knows a model download is in progress, and after that we can remove the preload.

that sounds great, thank you!
