Performance

#1 opened by Hypersniper

Performance is amazing on v1.5, even better than 1.3, with the added bonus of the 16k context limit! I've tried so many other models, but Vicuna seems to work better for tasks/chat. Any thoughts?

It's great! I'm running it on an RTX 3090 with ExLlama and getting about 33-35 tokens per second with the 4-bit quantized model using a group size of 32. So far the inferences have been coherent, but I've yet to really try to break it. Very good model to play with.
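
For anyone wanting to try something similar, here is a minimal sketch of loading a 4-bit GPTQ checkpoint with AutoGPTQ and generating from it. This is not the exact ExLlama setup described above; the repo id, prompt, and sampling settings are only illustrative:

```python
# Minimal sketch: load a 4-bit GPTQ checkpoint with AutoGPTQ and generate.
# Repo id, prompt, and sampling settings are illustrative assumptions.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "TheBloke/vicuna-13B-v1.5-16K-GPTQ"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",
    use_safetensors=True,
)

prompt = "USER: Explain why context length matters.\nASSISTANT:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```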

@Ermarrero @nero-dv What settings are you using for 16k tokens? I played with the ExLlama settings, but I'm getting poor quality at higher token counts. (RTX 3090)

Looks like the max tokens according to the model card is 8k? Should we stick to that max for better coherence?

No, the max is 16K. The 8192 shown in my GPTQ table is the sequence length I quantised at. I wasn't able to quantise at 16K yet.

That doesn't mean the model is limited to 8K; it just means the accuracy at 16K with the quantised version won't be quite as good as if I had been able to quantise at 16K. But it will still be very usable.

This is explained in the "Explanation of GPTQ parameters" section of the README.

Thank you @TheBloke
Could you possibly share the quantization command that you have used to quantize this model?

I use this AutoGPTQ wrapper script, which I wrote a while ago: https://gist.github.com/TheBloke/b47c50a70dd4fe653f64a12928286682#file-quant_autogptq-py

with the parameters shown in the README. To quantise at 8K, I needed to use --cache-examples 0, which sets cache_examples_on_gpu=False in the AutoGPTQ .quantize() call. Otherwise 8K samples cause an out-of-VRAM error on a 48GB GPU (which I used for these), and maybe even on an 80GB GPU. This uses RAM instead of VRAM and slows the process down quite a bit, but it allows me to quantise at a higher sequence length.
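
For illustration only (this is not the linked wrapper script), here is a minimal sketch of where cache_examples_on_gpu=False sits in an AutoGPTQ quantisation run; the base model path, calibration text, and output directory are placeholders:

```python
# Illustrative sketch of a 4-bit, group-size-32 AutoGPTQ quantisation run.
# Paths and calibration data are placeholders, not the actual setup.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_model = "lmsys/vicuna-13b-v1.5-16k"   # assumed source model
out_dir = "vicuna-13B-v1.5-16K-GPTQ"

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=32,
    desc_act=True,   # "act-order"; see the README's GPTQ parameter table
)

tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)
model = AutoGPTQForCausalLM.from_pretrained(base_model, quantize_config)

# Calibration examples, tokenised at the chosen sequence length (8192 here).
examples = [
    tokenizer(text, truncation=True, max_length=8192, return_tensors="pt")
    for text in ["placeholder calibration text ..."]
]

model.quantize(
    examples,
    cache_examples_on_gpu=False,  # keep calibration activations in RAM, not VRAM
)
model.save_quantized(out_dir, use_safetensors=True)
```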

You are the best. May God bless you for your effort and commitment!

I have a repetition problem, the same as the one mentioned here: https://huggingface.co/TheBloke/vicuna-13B-v1.5-16K-GGML/discussions/1

@TheBloke you mentioned that this is related to the RoPE parameters; is it the same for the GPTQ version? How can we adjust it in this version?
(Thank you for the amazing work btw; since I work with limited resources, your models are saving my life :))
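
As a hedged aside, one quick way to see which RoPE scaling the repo's config ships with (assuming it follows the standard Llama config format used by Vicuna v1.5 16K; the repo id is taken from this model card):

```python
# Inspect the RoPE scaling entry shipped in the repo's config.json.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("TheBloke/vicuna-13B-v1.5-16K-GPTQ")
print(config.max_position_embeddings)
print(config.rope_scaling)  # typically a linear-scaling entry for the 16K variants
```

Loaders that read config.json (transformers, AutoGPTQ) should pick this up automatically; with ExLlama the equivalent is usually the compress_pos_emb setting, though whether that is what's behind the repetition here is not certain.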
