How does the 8-bit model here perform compared to the regular FP8 model?
#16
by
demo001s
- opened
How does the 8-bit model here perform compared to the regular FP8 model? Also, why is Q8_0 10 seconds faster than Q5_0? I tested on an NVIDIA 4090: Q8_0 takes 18 seconds and Q5_0 takes 28 seconds.
Shouldn't a smaller model file be faster?
Lots of people have reported Q5 being much slower. Maybe casting 5-bit to 16-bit is more complicated.
On a 2080 Ti, performance is similar across all quant versions (2.5 s/it), which is weird.
You're limited by the slowest link, which in this case is likely the on-the-fly dequantization math (probably some numpy operation or casting).
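To make the dequantization cost concrete, here is a rough sketch in plain Python of what unpacking one 32-element block might involve. The layouts are loosely modeled on GGML-style Q8_0 and Q5_0 blocks, which is an assumption on my part; the function names are made up for illustration, and the actual loader code may pack or unpack things differently.

```python
# Hypothetical helpers, not the loader's real code.

def dequant_q8_0(scale, qs):
    """qs: 32 signed 8-bit ints. Each weight is one multiply."""
    return [scale * q for q in qs]


def dequant_q5_0(scale, qh, qs):
    """qh: 32-bit int holding one high bit per weight (assumed layout).
    qs: 16 bytes, each holding the low 4 bits of two weights."""
    first, second = [], []
    for j, byte in enumerate(qs):
        lo_a = byte & 0x0F                    # low nibble -> weight j
        lo_b = byte >> 4                      # high nibble -> weight j + 16
        hi_a = ((qh >> j) & 1) << 4           # fifth bit for weight j
        hi_b = ((qh >> (j + 16)) & 1) << 4    # fifth bit for weight j + 16
        # Reassemble the 5-bit value, recenter around zero, then scale.
        first.append(scale * ((lo_a | hi_a) - 16))
        second.append(scale * ((lo_b | hi_b) - 16))
    return first + second
```

In this sketch Q8_0 needs a single multiply per weight, while Q5_0 needs several shifts, masks, and an OR before the multiply. If the unpacking is not done in a fused kernel, that extra work could plausibly outweigh the savings from the smaller file, which would match the timings reported above.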