How to reproduce quantized memory usage?

#16
by tarasglek - opened

I see that you have figures below 4 GB. How do I reproduce that? When I use BitsAndBytesConfig(load_in_4bit=True), I get memory usage of 4.3 GB of VRAM on CUDA.
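
For reference, here is a minimal sketch of the kind of 4-bit load being described, plus a way to check the resulting footprint. The model id, `trust_remote_code`, and `device_map` settings are assumptions for illustration, not the exact setup behind the published figures.

```python
# Minimal sketch (assumptions noted inline): load in 4-bit via bitsandbytes
# and inspect how much memory the quantized weights actually take.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "cerebras/btlm-3b-8k-base"  # assumed model id for this thread

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",       # place weights on the GPU automatically
    trust_remote_code=True,  # BTLM ships custom modeling code on the Hub
)

# Size of the loaded (quantized) weights as reported by transformers.
print(f"weights: {model.get_memory_footprint() / 1024**3:.2f} GiB")
# Peak CUDA allocation so far; this grows further during generation.
print(f"peak CUDA: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
```

The gap between the weight footprint and the ~4.3 GB observed in practice is usually the CUDA context plus activations and KV cache, so peak VRAM during generation sits above the raw 4-bit weight size.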

We used redpajama.cpp.

I use oobabooga/text-generation-webui.
On the model loading page, select 8-bit or 4-bit, and CPU if you want. I find 8-bit fits in an 8 GB GPU nicely, though not with the full context window...

@daria-soboleva would you be able to release or share the code for the conversion to GGML? It's not clear how to do that conversion for this custom model, which doesn't exactly match any existing architecture.

@andersonbcdefg we understand the demand for a quantized version of BTLM and are definitely scoping out the work to make it accessible to the community!

@tarasglek Would you share your code for loading in 4-bit? I get a Segmentation fault (core dumped) when I load the model in 4-bit.
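
In case it helps with debugging, below is a sketch of a 4-bit load with the bitsandbytes options spelled out explicitly. The quant type, compute dtype, and model id are assumptions rather than settings confirmed in this thread, and this is not a guaranteed fix for the segfault.

```python
# Sketch with explicit 4-bit options (all values here are assumptions).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normal-float 4-bit quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "cerebras/btlm-3b-8k-base",  # assumed model id
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```

If the crash persists, checking that the installed bitsandbytes build matches your CUDA and torch versions is a reasonable first step, since version mismatches there are a common source of low-level failures.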

@daria-soboleva I'm curious about this too. So there is a redpajama.cpp; I didn't know that existed.
Did you do anything special with the library to make your quantization work?
