This is great.
Works with 20 tokens/sec on my 3090 at 8k context.
Is this without speculative decoding? If you use exui or TabbyAPI and enable speculative decoding with a TinyLLaMA 32K model, you can get even faster inference speeds. I can push 40 t/s with the 5.0bpw quant and this draft model:
https://huggingface.co/LoneStriker/TinyLlama-1.1B-32k-Instruct-3.0bpw-h6-exl2
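For anyone unfamiliar with why a tiny draft model helps: below is a toy sketch of the draft-and-verify idea behind speculative decoding. The functions here are hypothetical stand-ins, not the exui/TabbyAPI or ExLlamaV2 API; it only illustrates why the cheap draft model lets the big model emit several tokens per verification step.

```python
# Conceptual sketch of speculative decoding (greedy draft-and-verify variant).
# draft_next / target_next are toy stand-ins, not real model calls.

def draft_next(tokens):      # tiny, cheap model: guesses the next token
    return (tokens[-1] + 1) % 100

def target_next(tokens):     # big, slow model: the authoritative prediction
    return (tokens[-1] + 1) % 100   # toy: happens to agree with the draft

def speculative_step(tokens, k=4):
    # 1) Draft model proposes k tokens cheaply, one at a time.
    proposal = []
    ctx = list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2) Target model verifies the proposals. In a real engine this is one
    #    batched forward pass over all k positions, which is the speed win.
    accepted = []
    ctx = list(tokens)
    for t in proposal:
        expected = target_next(ctx)
        if expected == t:
            accepted.append(t)         # draft guessed right: kept "for free"
            ctx.append(t)
        else:
            accepted.append(expected)  # first mismatch: take the target's token
            break
    else:
        accepted.append(target_next(ctx))  # all k accepted: target adds one more

    return tokens + accepted

seq = [1, 2, 3]
for _ in range(3):
    seq = speculative_step(seq)
print(seq)
```

When the draft agrees with the target most of the time, each expensive target pass yields several tokens instead of one, which is where the jump from ~20 t/s to ~40 t/s comes from.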

Thanks, but I have a question. How much VRAM is needed to run this model on exui with speculative decoding enabled? Maybe speculative decoding isn't working for me because I have too little VRAM.
Sorry, I didn't read that you only got 8k context with the model. Unfortunately, you won't have room for a speculative decoding draft model unless we drop from 2.4 down to 2.18bpw (the lowest possible quant). But 20 t/s is already very good.
FYI, at full 32K context, loading the 2.4bpw model plus a tiny draft model needs 24 GB + 12 GB of VRAM, so 2x 3090s or 4090s. But you get token speeds of 40-60 t/s on a 4090 + 3090 Ti.
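As a rough sanity check on those numbers, here is a ballpark weight-size calculation. The parameter counts are assumptions (≈70B for the main model, ≈1.1B for the TinyLlama draft); real usage also depends on the loader's KV-cache settings and overhead, so treat this as an estimate only.

```python
# Rough VRAM estimate for the setup above. Parameter counts are assumptions.

def weight_gb(params_b, bpw):
    """Quantized weight size in GB: params (billions) * bits-per-weight / 8."""
    return params_b * 1e9 * bpw / 8 / 1e9

target = weight_gb(70, 2.4)   # ~21 GB of weights for a 2.4bpw ~70B quant
draft  = weight_gb(1.1, 3.0)  # ~0.4 GB for the 3.0bpw TinyLlama draft

print(f"target weights: {target:.1f} GB")
print(f"draft weights:  {draft:.1f} GB")
# On top of the weights comes the KV cache, which grows with context length:
# that is why 8K context fits on a single 24 GB card while 32K spills onto a
# second GPU.
```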