
Quantization pls?

#3
by Yhyu13 - opened

@TheBloke @LoneStriker

Is this model ready to be quantized, or not?

Thanks!

It's a new architecture, so from the exl2 quant side, Turbo will have to add support for the model before it can be quantized. If it's supported in Transformers, TheBloke may be able to generate GPTQ quants though.

You can use load_in_8bit or load_in_4bit to quantize on the fly (in fact, that's the only way I managed to load it in 16 GB of RAM).
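
For reference, a minimal sketch of on-the-fly 4-bit loading through transformers + bitsandbytes. The repo id is a placeholder, and I'm assuming trust_remote_code is needed since the model ships custom modeling code:

```python
# Sketch: on-the-fly 4-bit quantization with bitsandbytes via transformers.
# "<this-repo-id>" is a placeholder -- substitute the actual model repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "<this-repo-id>"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # or load_in_8bit=True
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype for the 4-bit layers
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,  # StripedHyena uses custom code
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```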

Also, the FFT path is memory hungry. In torch 2.1.1 it can eat up 2 GB of memory for its cache (which is not released by empty_cache) if you are not careful; older versions just leaked that memory outright, and they don't support bf16 for the FFT ops.
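
If the cache in question is PyTorch's cuFFT plan cache (an assumption on my part, not verified for this model), it can be capped or cleared explicitly, since it lives outside the regular CUDA caching allocator:

```python
# Sketch: capping/clearing PyTorch's cuFFT plan cache, assuming that cache is
# what holds on to the memory mentioned above.
import torch

# Limit how many cuFFT plans PyTorch keeps for device 0.
torch.backends.cuda.cufft_plan_cache[0].max_size = 16

x = torch.randn(8, 4096, device="cuda", dtype=torch.float32)
y = torch.fft.rfft(x)

# The plan cache is separate from the CUDA caching allocator, so empty_cache()
# alone does not release it; clear the plan cache explicitly as well.
torch.backends.cuda.cufft_plan_cache[0].clear()
torch.cuda.empty_cache()
```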

Fortunately, FlashFFTConv is coming, and there is also a recurrent prefill mechanism (I haven't tried it).
