Why does model work much slower in 8-bit mode?

#4 by SergeyOvchinnikov

I have tried the model with different parameters and found that with 8-bit quantization it runs much slower (3 to 5 times) than with 16-bit quantization.
I'm loading the model with load_in_8bit=True.
Could you please advise why this is?
Is it possible to run inference in 8-bit at a speed comparable to 16-bit mode?
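
For reference, here is a minimal timing sketch of the comparison I'm describing (assuming a standard transformers causal LM on CUDA; the model name and prompt are placeholders):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "model-name-here"  # placeholder, substitute the actual model id
tokenizer = AutoTokenizer.from_pretrained(model_name)

def benchmark(model, prompt="Привет! Как дела?", n_runs=3, max_new_tokens=64):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    model.generate(**inputs, max_new_tokens=max_new_tokens)  # warm-up pass
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs  # mean seconds per run

# 16-bit baseline
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
print(f"fp16: {benchmark(model_fp16):.2f} s per generate")
del model_fp16
torch.cuda.empty_cache()

# 8-bit quantization via bitsandbytes
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_8bit=True, device_map="auto"
)
print(f"int8: {benchmark(model_int8):.2f} s per generate")
```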
Thanks in advance!
