
Many thanks

#1
by gsaivinay - opened

Hello,

Thanks for providing this quantized model, awesome as always.

Could you share which dataset and sequence length this model was quantized with?

My question is: if we quantize an 8k model with only a 2k sequence length, are we going to lose performance?

I used wikitext2 at 4096 sequence length. I tried at 8192 but I ran out of VRAM even on a 48GB card.

It's a good question and I don't know the answer yet. Since Llama 2 came out I have been quantising all Llama 2 models at 4096 sequence length, which takes longer and uses way more VRAM. But I haven't actually analysed the results to see how much difference it makes. That's something I plan to do when I have time. If I find it makes little or no difference, I'll go back to 2048 because it's much easier to do.

If I find that it does make a useful difference, I might come back to this model and try making an 8192 sequence length quant using an 80GB GPU. But for now I think 4096 will still be fine.

Agreed. If you are able to run the evaluation script used by the Open LLM Leaderboard, that would give a better idea of whether performance is okay or degraded. But if it is too difficult to run such a huge evaluation, then we need to find a better way to quickly evaluate performance.

Actually! I think I just figured out how to do it at 8192. It's running now. I will compare the perplexity so I can do that analysis I mentioned above, and see if it actually makes a difference.

If it does, I will re-do them all at 8K


That is awesome. If you could, please let us know how you did it once it is successful.

I'm not going to do the open eval, it takes forever. I'll just do perplexity. If that shows no difference, we know there is no difference. If it shows a medium-sized difference, we still don't know how much 'real' difference there is, and then maybe an eval would be appropriate, but a small perplexity difference usually means it will be indistinguishable. If it's a larger difference then I'll definitely try to do the full context in all cases.

I changed two things but I think only one matters. The one that matters is setting cache_examples_on_gpu=False in the AutoGPTQ .quantize() call. As the name suggests, this does not store the quantisation samples (the 128 x 8192 token strings taken at random from wikitext) on the GPU; instead it stores them in RAM. I guess it slows it down even more, but it also appears to make a big difference to VRAM usage.

The thing I changed that I think is having little or no effect is adding environment variable PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32. This changes how VRAM is allocated by PyTorch. With this set, it stores the data in smaller chunks. This helps prevent running out of VRAM in situations where you are very close to 100% used.

But with the change to no longer cache examples, I am now only using 50-60% of the 48GB VRAM I have, so I don't think that CUDA_ALLOC change is affecting anything. In fact I'm now using less VRAM at 8192 context than I was at 4096, when the examples were cached on GPU.
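For anyone who wants to try the same thing, here is a rough sketch of what those two changes look like in an AutoGPTQ quantisation script. The model name, quantisation parameters and calibration data are illustrative placeholders, not the exact setup used for these quants; the relevant lines are the PYTORCH_CUDA_ALLOC_CONF setting and cache_examples_on_gpu=False:

import os

# Must be set before PyTorch allocates any CUDA memory, otherwise it has no effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "OpenAssistant/llama2-13b-orca-8k-3319"  # illustrative source model
quantized_model_dir = "llama2-13b-orca-8k-3319-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)

quantize_config = BaseQuantizeConfig(
    bits=4,         # 4-bit quantisation
    group_size=32,  # e.g. the 32g branch
    desc_act=True,  # act-order
)

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# In practice this would be ~128 sequences of 8192 tokens sampled at random from
# wikitext; a single short example is shown here just to keep the sketch self-contained.
examples = [tokenizer("Calibration text goes here.", return_tensors="pt")]

model.quantize(
    examples,
    cache_examples_on_gpu=False,  # keep calibration samples in system RAM, not VRAM
)

model.save_quantized(quantized_model_dir, use_safetensors=True)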

I did the perplexity calculations. I could only do them at 2K because it failed at longer lengths - probably a limitation in the PPL code. But even at 2K it showed a detectable difference.

Testing the best quant - the 32g + act-order file: with wikitext PPL, the same dataset I used for the quant, the difference is about 1.5%. Not much. With PTB it's 7%, which is more significant.

(attached image: perplexity comparison results)

How this translates into real life results, if at all, I do not know. But for now I guess I will be doing 8K models at 8K context. It took 2.5 hours to do using A6000s!

The re-done GPTQs with 8k context are now uploaded, or will be shortly

Thanks! How can I directly download the gptq-8bit-64g-actorder_True version without going through the webui? I'd like to use Internet Download Manager, as it gives me control over download speed and auto-resumability. Thank you.

If you want to use a download manager, I think you'll need to manually switch to the desired branch in the Files and Version tab, and copy the download link for each file (there's a little icon for it, next to each file). Bit tedious.

Generally there are three options for automatic download of branches:

  1. git clone -b branch-name https://huggingface.co/TheBloke/OpenAssistant-Llama2-13B-Orca-8K-3319-GPTQ - not recommended because Git stores twice as much data, taking double disk space. And you need to install Git LFS first. Also it's usually really slow.

  2. Run download-model.py provided with text-generation-webui, with the --branch parameter. That runs the same code as the UI does, but from the command line. eg python3 download-model.py TheBloke/OpenAssistant-Llama2-13B-Orca-8K-3319-GPTQ --branch gptq-4bit-128g-actorder_True which will download the model to models/TheBloke_OpenAssistant-Llama2-13B-Orca-8K-3319-GPTQ_gptq-4bit-128g-actorder_True. You can use the --threads X parameter to use X threads at once, which can help to maximise your download speed. eg add --threads 4

  3. Python code like the following:

import os
from huggingface_hub import snapshot_download

model_name = "TheBloke/OpenAssistant-Llama2-13B-Orca-8K-3319-GPTQ"
branch = "gptq-4bit-32g-actorder_True"
local_base_folder = "/workspace"
local_folder = os.path.join(local_base_folder, f"{model_name.replace('/', '_')}_{branch}")

snapshot_download(repo_id=model_name, local_dir=local_folder, revision=branch, local_dir_use_symlinks=False)

That will download branch gptq-4bit-32g-actorder_True to a folder under /workspace - change /workspace to an appropriate local folder for you.
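And if you only want a single file from a particular branch, a similar approach uses hf_hub_download; the filename below is only an example, so check the branch's file list for the exact name:

import os
from huggingface_hub import hf_hub_download

model_name = "TheBloke/OpenAssistant-Llama2-13B-Orca-8K-3319-GPTQ"
branch = "gptq-4bit-32g-actorder_True"
filename = "gptq_model-4bit-128g.safetensors"  # example only; check the branch for the exact filename
local_folder = os.path.join("/workspace", f"{model_name.replace('/', '_')}_{branch}")

hf_hub_download(repo_id=model_name, filename=filename, revision=branch, local_dir=local_folder, local_dir_use_symlinks=False)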

To maximise your download speed with the Python option, do pip3 install hf_transfer and then set environment variable HF_HUB_ENABLE_HF_TRANSFER=1 while running your script. That will usually max out any internet connection, even 10Gbit/s.

Eg if you saved the above script to download.py, then in Linux you can do: HF_HUB_ENABLE_HF_TRANSFER=1 python3 download.py to get that maximised download.
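The variable can also be set from inside the script itself, as long as that happens before huggingface_hub is imported, since the library reads it when it is first loaded:

import os

# Must be set before importing huggingface_hub, which reads it at import time.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download
# ... then call snapshot_download() exactly as in the script above.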

Thanks for your quick reply Tom. I always use IDM to download models, but there is usually only one file: gptq_model-4bit-128g.safetensors. However, I'd like to download the 8-bit model (gptq-8bit-64g-actorder_True), which is not present in the Files and Versions tab. I would appreciate it if you could direct me to the other quantized versions as well.

Apologies, I didn't notice that I can select the models individually. I was still thinking the GGML way, where all the models are in one folder :) My bad, and you rock.


No worries! Actually that reminded me to roll out a new template feature I'd been meaning to do. From now on, all the branch names under Provided Files are clickable URLs, linking to the appropriate Files and Folders section:

(attached screenshot: clickable branch links under Provided Files)

I noticed that. It's much better and more practical. Thank you very much.


This is excellent. Thanks for this

gsaivinay changed discussion status to closed
