I fine-tuned llama-2 on my dataset and now I want to convert it to the gptq_model-4bit-128g.safetensors format. Could you please tell me how I can do this? What script or method can I use to achieve this?
Install AutoGPTQ 0.3.2, which I recommend you do from source due to some install issues at the moment:
pip3 uninstall -y auto-gptq
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
pip3 install .
Then here's an AutoGPTQ wrapper script I've written, and which I use myself to make these models: https://gist.github.com/TheBloke/b47c50a70dd4fe653f64a12928286682#file-quant_autogptq-py
python3 quant_autogptq.py /path.to/unquantised-model /path/to/save/gptq wikitext --bits 4 --group_size 128 --desc_act 0 --damp 0.1 --dtype float16 --seqlen 4096 --num_samples 128 --use_fast
The example command uses the wikitext dataset for quantisation. If your model is trained on something more specific, such as code or a non-English language, you may want to use a different dataset; doing that requires editing quant_autogptq.py to load the alternative dataset.
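For context on the --seqlen and --num_samples flags: GPTQ calibration typically works by tokenising the dataset into one long token stream and sampling fixed-length windows from it. Here is a minimal sketch of that idea; the function name and the dummy token-id list are hypothetical illustrations, not code from quant_autogptq.py:

```python
import random

def make_calibration_samples(token_ids, seqlen=4096, num_samples=128, seed=0):
    # Sample num_samples random windows of seqlen tokens from one long
    # token stream -- roughly what GPTQ calibration-set builders do.
    rng = random.Random(seed)
    max_start = len(token_ids) - seqlen
    starts = [rng.randrange(max_start) for _ in range(num_samples)]
    return [token_ids[s:s + seqlen] for s in starts]

# Dummy "tokenised dataset" of 100k token ids, shortened windows for the demo.
samples = make_calibration_samples(list(range(100_000)), seqlen=1024, num_samples=8)
print(len(samples), len(samples[0]))  # 8 1024
```

So swapping datasets mainly means changing where that long token stream comes from; the sampling logic stays the same.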
First of all, I want to thank you for the quick and detailed response. Secondly, I want to thank you for your work; you make an invaluable contribution to the community.
In this message, I wanted to find out how I can convert my model from the hf or q4_0.bin format (for example) to the gptq_model-4bit-128g.safetensors format. Could you please advise me on how I can do this? Thank you very much for your attention.
I already described how to convert from HF to GPTQ. To convert HF to GGML, use this script: https://github.com/ggerganov/llama.cpp/blob/master/examples/make-ggml.py
Great approach, thanks. So, the main part of my question is how to convert my model to the format used in this repository, which means having the model with ".safetensors" at the end and the other accompanying files. Could you please guide me on how to do this?
I already described that in detail here: https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ/discussions/26#64ca8db71af278541d4a53dd
Hey pal, is it possible to fine-tune a 4-bit GPTQ model?
My GPU has limited memory. I'm really unable to fine-tune the original HF model.
Sorry for hijacking the thread.
@TheBloke Hi, I followed your instructions for converting my Llama-2-hf model to a 4-bit 128-group quantised model using the script you posted (https://gist.github.com/TheBloke/b47c50a70dd4fe653f64a12928286682#file-quant_autogptq-py), and it worked fantastically. But when I try to load the model in oobabooga/text-generation-webui I get the following error:
OSError: Can't load tokenizer for 'models/llama-2-7b-hf-GPTQ-4bit-128g'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'models/llama-2-7b-hf-GPTQ-4bit-128g' is the correct path to a directory containing all relevant files for a LlamaTokenizer tokenizer.
Any help would be much appreciated. Just point me in the right direction :D I am having a hard time googling what the problem is...
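For anyone hitting the same error: it usually means the tokenizer files are absent from the quantised model directory, in which case copying them over from the original HF checkpoint resolves it. A quick way to check which files are missing; the directory path is just the one from the error message, and the file list is the usual Llama tokenizer set (assumed, not taken from this thread):

```python
import os

# Path from the error message above; adjust to your own layout.
model_dir = "models/llama-2-7b-hf-GPTQ-4bit-128g"

# Typical Llama tokenizer files that text-generation-webui expects to find.
tokenizer_files = ["tokenizer.model", "tokenizer_config.json", "special_tokens_map.json"]
missing = [f for f in tokenizer_files if not os.path.exists(os.path.join(model_dir, f))]
print("missing tokenizer files:", missing)
```

If any files show up as missing, copy them from the unquantised HF model directory into the GPTQ output directory and retry loading.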