Can you quantize further? Like FP4 maybe?

#1
by etohimself - opened

FP16 takes around 10 seconds to generate a 1-2 sentence response. If you can quantize it to FP4, I think we can achieve real-time generation.

Hi!

Sure, I was doing some quantisation experiments earlier to convert it to FP8/INT8, but I kept running into library issues, so I put it on hold. I’ll try again for FP8 and FP4 later today and report back.

You are doing god's work 🫶🫶

How does one actually approach quantization for this? If you succeed, could you post a guide somewhere or send me a quick rundown?
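
For anyone else wondering the same thing, here is a rough sketch of the basic idea behind weight quantization. It is not the recipe used for this model, only a generic illustration: symmetric per-tensor INT8 quantization on a plain list of floats. The function names `quantize_int8` and `dequantize` are made up for this example, and FP8/FP4 use different number formats that in practice you would get from a quantization library rather than write by hand.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]


# Toy "weights" standing in for one layer of a model (illustrative values only).
weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The stored integers take a quarter of the memory of 32-bit floats (half of FP16), at the cost of a small rounding error per weight. Whether that error is acceptable for this model, and how much faster generation gets, would have to be measured.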
