At least 1x A100 80G or what?
Maybe use a GGML q4 quant to bring the model size down to around 100GB, then run it on CPU (using bloomz.cpp)? It might be very slow, but it should be possible with 128GB or more of RAM and a powerful CPU.
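As a rough sanity check on the "around 100GB" figure: assuming BLOOMChat has ~176B parameters (it is BLOOM-based; the exact count and the per-block quantization overhead here are approximations, not from this thread), 4-bit weights land in that ballpark:

```python
# Back-of-envelope size estimate for a 4-bit quantized ~176B model.
# 4.5 bits/weight approximates q4_0 (4-bit weights + per-block scales).
params = 176e9          # assumed parameter count (BLOOM-based model)
bits_per_weight = 4.5   # approximate, including quantization overhead
size_gb = params * bits_per_weight / 8 / 1e9
print(f"~{size_gb:.0f} GB")  # ~99 GB, consistent with "around 100GB"
```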
What about int8 mode and disk offloading?
Is there any way to slim the model down with QLoRA-style methods?
Hey folks! Please follow https://github.com/sambanova/bloomchat here. For bf16 inference the minimum requirement is 880GB of A100 memory; for int8 inference the minimum requirement is 480GB of A100 memory. We are exploring other compression techniques and welcome any suggestions/contributions!