Can you quantize further? Like FP4 maybe?
#1, opened by etohimself
FP16 takes around 10 seconds to generate a 1-2 sentence response. If you can quantize it to FP4, I think we can achieve real time.
Hi!
Sure, I was doing some quantisation experiments earlier to convert it to fp8/int8, but I kept running into library issues, so I put it on hold. I’ll try again with fp8 and fp4 later today and report back.
You are doing God's work 🫶🫶
How does one actually approach quantization for this? If you succeed, could you post a guide somewhere or send me a quick rundown?
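For anyone following along, here is a rough sketch of the core idea behind integer quantization, in plain Python. It is not the procedure used for this model, and the function names `quantize_int8` and `dequantize_int8` are made up for illustration. It assumes the simplest scheme (symmetric, one scale for the whole tensor); real fp8/fp4 workflows go through a quantization library and usually use per-channel or per-block scales.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127].

    Hypothetical helper for illustration only, not a library API.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float values from the integers and the scale."""
    return [x * scale for x in q]


# Toy "weights" standing in for a real tensor.
weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Rounding error per value is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The trade-off is visible even in this toy case: small values such as `0.003` collapse to zero, which is why lower-bit formats like FP4 tend to need finer-grained scales to keep output quality acceptable.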