What performance advantage does the 8-bit model here have over the regular FP8 model?

#16

by demo001s - opened Aug 19

Aug 19

What performance advantage does the 8-bit model here have over the regular FP8 model? Also, why is Q8-0 10 seconds faster than Q5-0? I tested both on an NVIDIA RTX 4090: Q8-0 takes 18 seconds and Q5-0 takes 28 seconds.

demo001s

Aug 19

Shouldn't a smaller model file run faster?

CHNtentes

Aug 19

Lots of people have reported Q5 being much slower. Maybe casting 5-bit values to 16-bit is more complicated.

pandalay

Aug 19

On a 2080 Ti, performance is similar for all quant versions: 2.5 s/it. Weird.

rx808

Aug 19

•

edited Aug 19

On a 2080 Ti, performance is similar for all quant versions: 2.5 s/it. Weird.

You're limited by the slowest link, which in this case is likely the on-the-fly dequantization math (probably some numpy operation or casting).
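To make the "5-bit is more complicated" point concrete, here is a rough NumPy sketch of dequantizing one 32-weight block in each format. It assumes the usual GGML block layouts as I understand them (Q8-0: one fp16 scale plus 32 int8 values, 34 bytes; Q5-0: one fp16 scale, 16 bytes of packed nibbles, and a 32-bit field holding the fifth bit of each weight, 22 bytes). The function names are made up for illustration, and this is not the loader's actual code. It only shows the extra bit work per weight and says nothing about real GPU timings.

```python
import numpy as np

BLOCK = 32  # weights per block in both formats (assumed GGML layout)


def dequant_q8_0(d, qs):
    """Q8-0: qs is int8[32]. One cast and one multiply per weight."""
    return np.float32(d) * qs.astype(np.float32)


def pack_q5_0(q):
    """Pack 32 unsigned 5-bit values (0..31) into 16 nibble bytes plus a
    32-bit field of high bits. Hypothetical helper, for the demo only."""
    q = np.asarray(q, dtype=np.uint8)
    lo, hi = q[:16], q[16:]
    qs = (lo & 0x0F) | ((hi & 0x0F) << 4)       # two low nibbles per byte
    bits = (q >> 4).astype(np.uint32)           # fifth bit of each value
    qh = np.bitwise_or.reduce(bits << np.arange(BLOCK, dtype=np.uint32))
    return qs, np.uint32(qh)


def dequant_q5_0(d, qs, qh):
    """Q5-0: split nibbles, pull each weight's fifth bit out of qh,
    recombine, re-center by 16, then scale."""
    shifts = np.arange(BLOCK, dtype=np.uint32)
    high = ((np.uint32(qh) >> shifts) & 1).astype(np.uint8) << 4
    low = np.concatenate([qs & 0x0F, qs >> 4])  # weights 0..15, then 16..31
    q = (low | high).astype(np.int16) - 16      # signed range -16..15
    return np.float32(d) * q.astype(np.float32)


# Same 32 weights stored both ways, with scale d = 0.5.
d = np.float16(0.5)
qs5, qh5 = pack_q5_0(np.arange(32))
w5 = dequant_q5_0(d, qs5, qh5)
w8 = dequant_q8_0(d, np.arange(-16, 16, dtype=np.int8))
print(np.array_equal(w5, w8))
```

Q5-0 moves fewer bytes per block, but every weight needs a mask, a shift, an OR and a subtract before the multiply, while Q8-0 is a straight cast and multiply. If the dequantization step is the bottleneck, that extra work could outweigh the smaller file.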
