Can you make a 2.4bpw quantization?

#1
by xldistance - opened

Thanks for quantizing the model

I think 2.8 bpw might fit in 24 GB VRAM, but I'm not able to load 3.0 bpw.

You can set max_position_embeddings in config.json to 10000, and then the 3.0 bpw quant fits, but the reply speed is only about 3 tokens/s, which is very slow!
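For anyone trying this, here is a minimal sketch of that config.json edit, assuming a hypothetical local download path (models/my-exl2-quant):

```python
import json
from pathlib import Path

# Hypothetical path to your local copy of the quantized model.
config_path = Path("models/my-exl2-quant/config.json")

# Shrink the declared context window so the loader reserves less VRAM.
config = json.loads(config_path.read_text())
config["max_position_embeddings"] = 10000

config_path.write_text(json.dumps(config, indent=2))
```

Lowering this value reduces the memory reserved for the KV cache, which is presumably why the model then fits in 24 GB.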

With the 2.65 bpw quant, even with max_position_embeddings set to 10000, it takes more than 24 GB of VRAM, so it runs very badly on a 4090.

I generally just take the original model's configuration. You can edit the file locally if you need it different from the base.
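If you'd rather not edit config.json at all, exl2 loaders can usually override the context length at load time instead. A rough sketch with ExLlamaV2, modeled on its example scripts (the model path is a placeholder, and the exact API may vary by version):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "models/my-exl2-quant"  # placeholder path
config.prepare()                           # reads config.json from the model dir
config.max_seq_len = 10000                 # override the context length in memory

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)   # KV cache sized from max_seq_len
model.load_autosplit(cache)                # load weights, splitting across VRAM
tokenizer = ExLlamaV2Tokenizer(config)
```

This keeps the repo's config untouched and applies the smaller context only for that session.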

Thank you very much!
