Your converted model performs better, and I don't understand why

#4 opened by Thireus

Hi @TheBloke,

For some reason your model has slightly better PPL than any of the 4bit-128g versions I recently converted. I say "any" because I tried various combinations of GPTQ commits and transformers versions.

Some metrics for some versions I've tried:

| Model | wikitext2 PPL | ptb-new PPL | c4-new PPL |
| --- | --- | --- | --- |
| 4bit-GPTQ - TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g | 7.119165420532227 | 35.637290954589844 | 9.550592422485352 |
| 4bit-GPTQ - Thireus/Vicuna13B-v1.1-4bit-128g (GPTQ commit 508de42ff45ec560a4504e12b0d42114d599cf38) | 7.129854202270508 | 35.848060607910156 | 9.568032264709473 |
| 4bit-GPTQ - Thireus/Vicuna13B-v1.1-4bit-128g (GPTQ commit d89cdcd8b53f61346290a28d326816af6a028434) | 7.137491226196289 | 35.530372619628906 | 9.597953796386719 |
| 4bit-GPTQ - Thireus/Vicuna13B-v1.1-4bit-128g (GPTQ commit f3f7a6910fd6778548cdafe7f0d5155411b1696c) | 7.137701988220215 | 35.52903366088867 | 9.597844123840332 |
| 4bit-GPTQ - Thireus/Vicuna13B-v1.1-4bit-128g (GPTQ commit 49ffd9ab085004978a6bdc8e2dff7510f2458e71) | 7.137701988220215 | 35.52903366088867 | 9.597844123840332 |
pip freeze | grep transformers
transformers @ git+https://github.com/huggingface/transformers@5bb4ec6233d6414a922ad2818f0bcf879de81c28

Would you have any idea why that is, and what could influence it? Did you use Triton or CUDA?

Oh, that's interesting.

To be honest I don't think I did anything special. The commands I use are in my READMEs, but for example:

CUDA_VISIBLE_DEVICES=0 python3 llama.py vicuna-13B-1.1-HF c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors /workspace/vicuna-13B-1.1-GPTQ-4bit-128g.safetensors 

Lately I have always been using the Triton branch for making GPTQs. And I made this GPTQ - and all my recent ones - with commit 58c8ab4c7aaccc50f507fd08cce941976affe5e0 on the qwopqwop repo, which was the last commit on April 13th, before he started the refactor that broke everything.

In terms of pip versions, I can't check precise commits because I do all of this in the cloud and each instance gets destroyed afterwards. Until recently I was pulling peft and transformers from their respective GitHub repos, but after transformers finally released 4.28.0 a couple of days ago I started using the standard versions. I did this GPTQ four days ago, so I think that would have been using GitHub transformers. PyTorch is 2.0.0+cu118. Triton is 2.0.0.
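
For what it's worth, a quick version dump along these lines is enough to compare the relevant bits of two environments (a minimal sketch; the modules checked are just the obvious ones, nothing specific to either of our setups):

```python
# Minimal environment dump for comparing setups; standard version attributes only.
import torch
import transformers
import triton

print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
print("transformers:", transformers.__version__)
print("triton:", triton.__version__)
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")
```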

Here's my initialisation script that installs all my dependencies:

echo -n "PIP OTHERS: "
(pip3 uninstall -qy transformers peft datasets loralib sentencepiece safetensors accelerate triton bitsandbytes huggingface_hub flexgen rwkv quant-cuda && \
pip3 install -q datasets==2.10.1 loralib sentencepiece safetensors==0.3.0 accelerate==0.18.0 triton==2.0.0 huggingface_hub && \
pip3 install -q transformers && \
pip3 install -q peft && \
pip3 install -q bitsandbytes==0.37.2 xformers && \
pip3 install -q markdown pyyaml tqdm requests gradio==3.24.1 flexgen==0.1.7 rwkv==0.7.3 ninja ) >/dev/null 2>errors.pip && echo " DONE" || cat errors.pip

echo -n "GIT SETUP: "
( git config --global credential.helper store && \
git config --global user.email "XXX" && \
git config --global user.name "TheBloke" && \
huggingface-cli login --add-to-git-credential --token 'XXX' && \
git lfs install ) >/dev/null 2>errors.gitsetup && echo " DONE" || cat errors.gitsetup

echo -n "GIT GPTQ: "
( git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa gptq-llama && \
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa -b cuda gptq-llama-cuda && \
git clone https://github.com/oobabooga/GPTQ-for-LLaMa ooba-gptq-llama && \
git clone -n  https://github.com/qwopqwop200/GPTQ-for-LLaMa gptq-safe && cd gptq-safe && git checkout 58c8ab4c7aaccc50f507fd08cce941976affe5e0 ) >/dev/null 2>errors.gitgptq && echo " DONE" || cat errors.gitgptq

echo -n "WEBUI SETUP: "
( rm -rf /content/text-generation-webui && \
git clone https://github.com/oobabooga/text-generation-webui && \
mkdir -p text-generation-webui/repositories && \
ln -s /content/gptq-safe text-generation-webui/repositories/GPTQ-for-LLaMa   ) >/dev/null 2>errors.webui && echo " DONE" || cat errors.webui 

I actually re-made this particular GPTQ four days ago, because I had realised that my original vicuna-13B-1.1-HF repo had been converted to HF with a buggy version of the transformers models/llama/convert_llama_weights_to_hf.py script, which caused the 13B models to use 37GB on disk instead of 26GB. So I re-converted my vicuna-13B-1.1-HF repo, and then, just in case that affected the GPTQ, I also re-made the GPTQs.

No idea if any of that would affect this, but that's what I did! I suppose that might mean I used a later version of GPTQ-for-LLaMa than you did?
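
A quick way to spot that kind of conversion problem is to check the shard sizes and dtypes of the converted repo - a 13B model stored in fp16 should come to roughly 13 billion params × 2 bytes ≈ 26GB. A rough sketch (the local path and the sharded pytorch_model-*.bin layout are assumptions):

```python
# Rough sanity check on a converted HF checkpoint: total shard size and dtypes.
# The path and shard pattern are assumptions; adjust for safetensors shards if needed.
import glob
import os

import torch

ckpt_dir = "vicuna-13B-1.1-HF"  # hypothetical local path
shards = sorted(glob.glob(os.path.join(ckpt_dir, "pytorch_model-*.bin")))

total_bytes = sum(os.path.getsize(p) for p in shards)
print(f"{len(shards)} shards, {total_bytes / 1e9:.1f} GB on disk")

# Peek at the first shard's dtypes (loads one shard into RAM, typically ~10GB).
state = torch.load(shards[0], map_location="cpu")
print("dtypes in first shard:", {str(t.dtype) for t in state.values()})
```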

Question for you if you've got a sec: what code/method do you use to produce those metrics? So far the only evaluation I've done on models has been for GGML models, using llama.cpp's perplexity binary. I've been meaning to try evaluating GPU models but haven't looked into it yet.

Thank you for the detailed answer. I'll look into this!

I am using cuda117 instead of cuda118, but I doubt that could be it... I also use a more recent bitsandbytes, 0.38.1.
All of this is on WSL, but I'm thinking of giving it a try on Google Colab (I believe that's what you're using).

To generate the metrics, enter the directory containing your safetensors file and run:
python /content/text-generation-webui/repositories/GPTQ-for-LLaMa/llama.py . c4 --wbits 4 --groupsize 128 --load vicuna-13B-1.1-GPTQ-4bit-128g.safetensors --new-eval --eval

Instead of --new-eval --eval you can also use --eval alone, or --benchmark 2048 --check.
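
If you ever want a comparable number outside of GPTQ-for-LLaMa (e.g. for the unquantized fp16 model), the usual sliding-window perplexity loop from the transformers docs works too. A minimal sketch, with the model path, dtype and window size as assumptions:

```python
# Minimal sliding-window perplexity sketch over non-overlapping 2048-token windows.
# The model path, dtype and window size are assumptions, not taken from this thread.
import math

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vicuna-13B-1.1-HF"  # hypothetical local path or Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
).eval()

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
enc = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

window = 2048
nll_sum, token_count = 0.0, 0
for begin in range(0, enc.input_ids.size(1) - 1, window):
    ids = enc.input_ids[:, begin : begin + window].to(model.device)
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean NLL over this window's predicted tokens
    n_predicted = ids.size(1) - 1  # labels are shifted internally, so one fewer target
    nll_sum += loss.item() * n_predicted
    token_count += n_predicted

print("wikitext2 PPL:", math.exp(nll_sum / token_count))
```

Note this scores whatever model transformers loads, so it's a cross-check on the fp16 weights; to score the quantized safetensors itself, the llama.py --eval route above is the one that actually loads the GPTQ file.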

I remember our conversation about the size difference. ;)

Oh that was you! Sorry. I've had so many discussions here since I started uploading models that I can't remember all the different usernames :)

Thanks for the details on the evaluation, I'll have to try that.

And yes, I have used Google Colab a lot as I don't have an NVIDIA GPU at home yet - my home GPU is an AMD 6900XT 16GB on macOS, which isn't well supported at all for this sort of thing. Also my internet upload is only 4MB/s, which takes forever when dealing with larger models. Uploads from the cloud are way quicker. And if I need to reboot my PC or something, it won't interrupt anything.

Most of my GPTQs I did with Colab, but lately I've started to move over to Runpod. They have a lot more hardware options, and they support SSH and let you open any TCP ports you want, which Colab doesn't officially support. On Google Colab there are only two GPU options: T4 15GB or A100 40GB. I found that for most of what I was doing, the T4 was too small and slow, and the A100 was more than I needed. On Runpod I can pay $0.29/hr for a 3090, $0.69/hr for a 4090, $0.89/hr for an A100 40GB or $1.89/hr for an A100 80GB. And then when it's done, it's usually way faster to upload to HF from a server than it is from my home internet.
