Why does model work much slower in 8-bit mode?

#4 by SergeyOvchinnikov

I have tried the model with different parameters and found that with 8-bit quantization it runs much slower (3 to 5 times) than with 16-bit quantization.
I'm loading the model with load_in_8bit=True.
Could you please advise why this is?
Is it possible to run inference in 8-bit at a speed comparable to 16-bit mode?
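
For reference, here is a minimal timing sketch of the comparison I'm describing (assuming a standard transformers causal LM on CUDA; the model name and prompt are placeholders):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "model-name-here"  # placeholder, substitute the actual model id
tokenizer = AutoTokenizer.from_pretrained(model_name)

def benchmark(model, prompt="Привет! Как дела?", n_runs=3, max_new_tokens=64):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    model.generate(**inputs, max_new_tokens=max_new_tokens)  # warm-up pass
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs  # mean seconds per run

# 16-bit baseline
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
print(f"fp16: {benchmark(model_fp16):.2f} s per generate")
del model_fp16
torch.cuda.empty_cache()

# 8-bit quantization via bitsandbytes
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_8bit=True, device_map="auto"
)
print(f"int8: {benchmark(model_int8):.2f} s per generate")
```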
Thanks in advance!
