Token Speeds for Q5_K_M?

by Dreifort - opened

I am trying out the ~22GB Q5_K_M model on a system with an RTX 3060 (12GB VRAM). What sort of speeds should I expect from a Q5 model that is roughly 2x the size of my VRAM? I currently get 0.8 t/s.

Also, are there any suggestions for improving the speed (without getting a better GPU)?
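For context, here is a rough back-of-the-envelope sketch of the kind of estimate I have in mind. It is not a benchmark: it assumes a dense model where every weight is read once per generated token, so speed is bounded by memory bandwidth. The function name, the 1.5GB VRAM overhead, and the 40 GB/s system RAM figure are my own placeholder assumptions; 360 GB/s is the spec-sheet bandwidth of the 12GB RTX 3060.

```python
def estimate_tokens_per_second(model_gb, vram_gb, gpu_bw_gbps, cpu_bw_gbps,
                               vram_overhead_gb=1.5):
    """Bandwidth-bound upper estimate of generation speed (hypothetical helper).

    Assumes a dense model: all weights are read once per token, the part that
    fits in VRAM is read at GPU bandwidth, and the rest at system RAM bandwidth.
    Compute time, KV cache reads, and PCIe transfers are ignored.
    """
    if model_gb <= 0:
        raise ValueError("model_gb must be positive")
    # VRAM left for weights after context/KV cache and driver overhead (assumed).
    usable_vram_gb = max(vram_gb - vram_overhead_gb, 0.0)
    gpu_part_gb = min(model_gb, usable_vram_gb)
    cpu_part_gb = model_gb - gpu_part_gb
    seconds_per_token = gpu_part_gb / gpu_bw_gbps + cpu_part_gb / cpu_bw_gbps
    return 1.0 / seconds_per_token


# My setup: ~22GB model, 12GB VRAM; 40 GB/s system RAM is a guess.
estimate = estimate_tokens_per_second(model_gb=22, vram_gb=12,
                                      gpu_bw_gbps=360, cpu_bw_gbps=40)
print(f"Estimated upper bound: {estimate:.1f} t/s")
```

If this reasoning is right, the half of the model sitting in system RAM dominates the time per token, so actual RAM bandwidth and how many layers end up on the GPU should matter far more than the GPU itself.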

