Performance

#1 opened by Hypersniper

Performance is amazing on v1.5, even better than 1.3, with the added bonus of the 16k context limit! I've tried so many other models, but Vicuna seems to work better for tasks/chat. Any thoughts?

It's great! I'm running it on an RTX 3090 with ExLlama and getting about 33-35 tokens per second with the 4-bit quantized model using a group size of 32. So far the inferences have been coherent, but I've yet to really try to break it. Very good model to play with.
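
For anyone wanting to try something similar, here is a minimal sketch of loading a 4-bit GPTQ checkpoint with AutoGPTQ and generating from it. This is not the exact ExLlama setup described above; the repo id, prompt, and sampling settings are only illustrative:

```python
# Minimal sketch: load a 4-bit GPTQ checkpoint with AutoGPTQ and generate.
# Repo id, prompt, and sampling settings are illustrative assumptions.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "TheBloke/vicuna-13B-v1.5-16K-GPTQ"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",
    use_safetensors=True,
)

prompt = "USER: Explain why context length matters.\nASSISTANT:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```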

@Ermarrero @nero-dv What settings are you using for 16k tokens? I played with the ExLlama settings, but I'm getting poor quality at higher token counts. (RTX 3090)

Looks like the max tokens according to the model card is 8k? Should we stick to that max for better coherence?

No, the max is 16K. The 8192 shown in my GPTQ table is the sequence length I quantised at. I wasn't able to quantise at 16K yet.

That doesn't mean the model is limited to 8K; it just means the accuracy at 16K with the quantised version won't be quite as good as if I had been able to quantise at 16K. But it will still be very usable.

This is explained in the "Explanation of GPTQ parameters" section of the README.

Thank you @TheBloke
Could you possibly share the quantization command that you have used to quantize this model?

I use this AutoGPTQ wrapper script, which I wrote a while ago: https://gist.github.com/TheBloke/b47c50a70dd4fe653f64a12928286682#file-quant_autogptq-py

with the parameters shown in the README. To quantise at 8K, I needed to use --cache-examples 0, which sets cache_examples_on_gpu=False in the AutoGPTQ .quantize() call. Otherwise 8K samples cause an out-of-VRAM error on a 48GB GPU (which I used for these), and maybe even on an 80GB GPU. This uses RAM instead of VRAM and slows the process down quite a bit, but it allows me to quantise at a higher sequence length.
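
For illustration only (this is not the linked wrapper script), here is a minimal sketch of where cache_examples_on_gpu=False sits in an AutoGPTQ quantisation run; the base model path, calibration text, and output directory are placeholders:

```python
# Illustrative sketch of a 4-bit, group-size-32 AutoGPTQ quantisation run.
# Paths and calibration data are placeholders, not the actual setup.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_model = "lmsys/vicuna-13b-v1.5-16k"   # assumed source model
out_dir = "vicuna-13B-v1.5-16K-GPTQ"

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=32,
    desc_act=True,   # "act-order"; see the README's GPTQ parameter table
)

tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)
model = AutoGPTQForCausalLM.from_pretrained(base_model, quantize_config)

# Calibration examples, tokenised at the chosen sequence length (8192 here).
examples = [
    tokenizer(text, truncation=True, max_length=8192, return_tensors="pt")
    for text in ["placeholder calibration text ..."]
]

model.quantize(
    examples,
    cache_examples_on_gpu=False,  # keep calibration activations in RAM, not VRAM
)
model.save_quantized(out_dir, use_safetensors=True)
```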

You are the best. May God bless you for your effort and commitment!

I have a repetition problem, the same as the one mentioned here: https://huggingface.co/TheBloke/vicuna-13B-v1.5-16K-GGML/discussions/1

@TheBloke you mentioned that this is related to the RoPE parameters; is it the same for the GPTQ version? How can we adjust it in this version?
(Thank you for the amazing work btw; since I work with limited resources, your models are saving my life :))
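
As a hedged aside, one quick way to see which RoPE scaling the repo's config ships with (assuming it follows the standard Llama config format used by Vicuna v1.5 16K; the repo id is taken from this model card):

```python
# Inspect the RoPE scaling entry shipped in the repo's config.json.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("TheBloke/vicuna-13B-v1.5-16K-GPTQ")
print(config.max_position_embeddings)
print(config.rope_scaling)  # typically a linear-scaling entry for the 16K variants
```

Loaders that read config.json (transformers, AutoGPTQ) should pick this up automatically; with ExLlama the equivalent is usually the compress_pos_emb setting, though whether that is what's behind the repetition here is not certain.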
