Slow prompt processing
#2
by
OrangeApples
- opened
Hi @dranger003! Thanks for the quants. Is prompt processing with the IQ2_XS quant supposed to be this slow (9.83 T/s) even when the model is fully offloaded? I'm using the latest Nexesenex fork of KCPP.
@OrangeApples I guess it really depends on your GPU, but I think 10 t/s is around what I would expect on a 3090. Also note that IQ2_XXS is noticeably faster than IQ2_XS, and quality isn't degraded compared to IQ2_XS (at least not that I could notice).
Thanks! Yes, I'm using a 3090. Will give IQ2_XXS a shot as well.
Edit: Turns out my prompt processing was spilling over into system RAM. After reducing the context from 12k to 10k, I got 227 T/s for prompt processing (PP).
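For anyone hitting the same thing, here is a rough sketch of why a smaller context can help. The KV cache is one of the buffers that grows linearly with context length, so trimming the context frees VRAM. The model dimensions below (`n_layers`, `n_kv_heads`, `head_dim`) are hypothetical placeholders, not the values of the model in this thread, and the estimate assumes an unquantized 16-bit cache and treats "12k" and "10k" as 12 * 1024 and 10 * 1024 tokens. It also ignores compute buffers, which take additional memory.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate the key/value cache size in bytes for a single sequence."""
    # Factor 2: each layer stores one tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Hypothetical dimensions for illustration only; read the real ones
# from your model's metadata before trusting any number printed here.
dims = dict(n_layers=80, n_kv_heads=8, head_dim=128)

for n_ctx in (12 * 1024, 10 * 1024):
    gib = kv_cache_bytes(n_ctx, **dims) / 2**30
    print(f"n_ctx={n_ctx}: ~{gib:.2f} GiB of KV cache")
```

If the weights already fill most of the card, even a difference of a few hundred MiB like this may decide whether everything stays in VRAM or part of it ends up in system RAM.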
OrangeApples
changed discussion status to
closed