How to reproduce quantized memory usage?

#16
by tarasglek - opened

I see that you have figures below 4 GB. How do I reproduce that? When I use BitsAndBytesConfig(load_in_4bit=True), I get memory usage of 4.3 GB of VRAM on CUDA.
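
For reference, here is a minimal sketch of the kind of 4-bit load being described, plus a way to check the resulting footprint. The model id, `trust_remote_code`, and `device_map` settings are assumptions for illustration, not the exact setup behind the published figures.

```python
# Minimal sketch (assumptions noted inline): load in 4-bit via bitsandbytes
# and inspect how much memory the quantized weights actually take.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "cerebras/btlm-3b-8k-base"  # assumed model id for this thread

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",       # place weights on the GPU automatically
    trust_remote_code=True,  # BTLM ships custom modeling code on the Hub
)

# Size of the loaded (quantized) weights as reported by transformers.
print(f"weights: {model.get_memory_footprint() / 1024**3:.2f} GiB")
# Peak CUDA allocation so far; this grows further during generation.
print(f"peak CUDA: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
```

The gap between the weight footprint and the ~4.3 GB observed in practice is usually the CUDA context plus activations and KV cache, so peak VRAM during generation sits above the raw 4-bit weight size.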

We used redpajama.cpp.

I use oobabooga/text-generation-webui.
On the model loading page, select 8-bit or 4-bit, and CPU if you want. I find 8-bit fits in an 8 GB GPU nicely, though not with the full context window...

@daria-soboleva would you be able to release or share the code for the conversion to GGML? It's not clear how to do that conversion for this custom model, which doesn't exactly match any existing architecture.

@andersonbcdefg we understand the demand for a quantized version of BTLM and are definitely scoping out the work to make it accessible to the community!

@tarasglek Would you share your code for loading in 4-bit? I get a Segmentation fault (core dumped) when I load the model in 4-bit.
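
In case it helps with debugging, below is a sketch of a 4-bit load with the bitsandbytes options spelled out explicitly. The quant type, compute dtype, and model id are assumptions rather than settings confirmed in this thread, and this is not a guaranteed fix for the segfault.

```python
# Sketch with explicit 4-bit options (all values here are assumptions).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normal-float 4-bit quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "cerebras/btlm-3b-8k-base",  # assumed model id
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```

If the crash persists, checking that the installed bitsandbytes build matches your CUDA and torch versions is a reasonable first step, since version mismatches there are a common source of low-level failures.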

@daria-soboleva I'm curious about this too. So there is a redpajama.cpp; I didn't know that existed.
Did you do anything special with the library to make your quantization work?
