How to convert a model into GGML format?

#13
by zbruceli - opened

Hi, I have fine-tuned a LLaMA-2 7B model using the Philipp Schmid tutorial (https://www.philschmid.de/instruction-tune-llama-2) and have merged the LoRA weights back into the original weights. Now how can I convert the weights into GGML format with 4-bit quantization, so I can run it in llama.cpp?

These are the files in my merged model:

[screenshot: merged model directory listing]
Thanks!

https://github.com/ggerganov/llama.cpp

Ctrl+F for 'convert' and you'll find the conversion scripts

Ok, I've dug in more on this and it's tricky...

  1. I don't know what format the input model to convert.py needs to be in: float32 or bf16? See this new issue

  2. A lot of models on Hugging Face are split into shards of 10GB max. I don't know how to handle shards with the convert.py script.

I tried to do it with bf16 bin files, where I concatenated the shards, but I ran into a key error (running in colab):

!python3 convert.py ../models/
Loading...

Loading model file ../models/pytorch_model.bin
vocabtype: spm
Loading vocab file ../models/tokenizer.model
params: n_vocab:32000 n_embd:4096 n_mult:5504 n_head:32 n_layer:32
Traceback (most recent call last):
  File "/content/llama.cpp/convert.py", line 1326, in <module>
    main()
  File "/content/llama.cpp/convert.py", line 1317, in main
    model = do_necessary_conversions(model, params)
  File "/content/llama.cpp/convert.py", line 1146, in do_necessary_conversions
    model = convert_transformers_to_orig(model, params)
  File "/content/llama.cpp/convert.py", line 737, in convert_transformers_to_orig
    out["tok_embeddings.weight"] = model["model.embed_tokens.weight"]
KeyError: 'model.embed_tokens.weight'

After solving that, I would have to see if the following works in a colab notebook:

!./quantize ./models/ggml-model-f16.bin ./models/ggml-model-q3_K_M.bin q3_K_M

ok, issue resolved here:

https://github.com/ggerganov/llama.cpp/issues/2571

some code snippets here:

python3 convert.py ./ --outtype f16

and

./quantize ./ggml-model-f16.bin ./ggml-model-q3_K_M.bin q3_K_M
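
For anyone landing here later, a minimal end-to-end sketch of that flow (assuming a Linux or Colab environment, with the merged Hugging Face model plus tokenizer.model sitting in ../models/my-model, a placeholder path; on newer llama.cpp versions the converted file is written as .gguf rather than .bin):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make                                  # builds the quantize binary (and main, etc.)
pip install -r requirements.txt       # Python dependencies for convert.py
python3 convert.py ../models/my-model --outtype f16
./quantize ../models/my-model/ggml-model-f16.bin ../models/my-model/ggml-model-q3_K_M.bin q3_K_M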

Thank you @RonanMcGovern

There's a script included with llama.cpp that does everything for you. It's called make-ggml.py. It's based off an old Python script I used to produce my GGML models with.

btw, why does ggml quantization require a tokenizer? Does the tokenizer end up influencing the way the quantization occurs?

(quoting @RonanMcGovern's convert.py and quantize snippets above)

@RonanMcGovern
Thanks a lot for sharing. Do you know why the convert.py script doesn't recognize the pytorch model .bin file here?
It stopped while processing the first of the 7 .bin model files.

(lab) aaron@LIs-MacBook-Pro llama2 % python llama.cpp/convert.py llama-2-7b-liaaron1 --outtype f16
Loading model file llama-2-7b-liaaron1/pytorch_model-00001-of-00007.bin
Traceback (most recent call last):
File "/Users/aaron/Downloads/llama2/llama.cpp/convert.py", line 1112, in
main()
File "/Users/aaron/Downloads/llama2/llama.cpp/convert.py", line 1061, in main
model_plus = load_some_model(args.model)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/aaron/Downloads/llama2/llama.cpp/convert.py", line 985, in load_some_model
models_plus.append(lazy_load_file(path))
^^^^^^^^^^^^^^^^^^^^
File "/Users/aaron/Downloads/llama2/llama.cpp/convert.py", line 720, in lazy_load_file
raise ValueError(f"unknown format: {path}")
ValueError: unknown format: llama-2-7b-liaaron1/pytorch_model-00001-of-00007.bin

Appreciate your help

Aaron

I have a Colab notebook that I used to quantize the LLaMA 2 13B chat model to gguf (available in my repo).

I didn't do Q3 unfortunately. If you want I can share the notebook.

You just have to replace the existing model with the model you want to quantize.

@liaaron1 , there's nothing obviously wrong to me, but it may be worth putting the .bin files into the same folder as the script so you can run the exact command. Another debug option I would try is to just use the raw llama files as a test.

@akarshanbiswas Please do share!

@RonanMcGovern Got the same error after moving the scripts into the same folder as the .bin files. It seems the script is expecting .pt files instead? Which format worked for you?

The model I was trying to convert was fine-tuned on top of https://huggingface.co/guardrail/llama-2-7b-guanaco-instruct-sharded/ in 4-bit precision using QLoRA, by the way.

Problem solved after manually downloading the model files to my local disk again. I had been working on .bin files with invalid contents. Sorry for the confusion.
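
For anyone who hits the same "unknown format" error, a quick sanity check (a sketch, assuming the files came from a git clone of a Hugging Face repo) is to confirm the shards are real weights rather than Git LFS pointer stubs or truncated downloads:

ls -lh pytorch_model-*.bin                     # real shards are several GB each; an LFS pointer stub is only ~130 bytes
head -c 200 pytorch_model-00001-of-00007.bin   # a pointer stub starts with "version https://git-lfs.github.com/spec/v1"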

(quoting the earlier post above that ran into KeyError: 'model.embed_tokens.weight' in convert.py)

@RonanMcGovern Can you help me out with how you resolved this embeddings error? I am stuck :) Also, can convert.py convert the PyTorch model "jphme/Llama-2-13b-chat-german" too? That model has 3 .bin files, so do we need to convert all of them, or is converting just one enough?

@komal-09 actually the script handles all of this (multiple shard files); take a look at this GitHub issue: https://github.com/ggerganov/llama.cpp/issues/2571
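
For reference, a rough sketch of what convert.py expects to find in the model folder for a sharded HF checkpoint like that one (the path below is a placeholder); the script follows the index file, so there is no need to concatenate the shards yourself:

# models/Llama-2-13b-chat-german/ should contain:
#   config.json
#   tokenizer.model                    (SentencePiece vocab; copy from the base Llama-2 repo if missing)
#   pytorch_model-00001-of-00003.bin
#   pytorch_model-00002-of-00003.bin
#   pytorch_model-00003-of-00003.bin
#   pytorch_model.bin.index.json
python3 convert.py models/Llama-2-13b-chat-german --outtype f16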

Yes, that's right, but my question is: as mentioned in the documentation, I added the tokenizer.model file to the model's folder, and when running the convert.py script I passed the path of my .bin file. Even then, this 'tok_embeddings.weight' error comes up.

This is the repo hierarchy. What should I include in the models folder of llama.cpp to get rid of this issue? Is there anything I am missing?

[screenshot: repository file hierarchy]

Hi @RonanMcGovern , currently I am trying to quantize a fine-tuned llama2 13b model with the help of llama.cpp.
But I am only able to execute the convert.py step; I am not able to run the ./quantize command, and when I look at the repo I don't see any such file either.

Maybe quantize has been integrated inside convert.py itself?

But I do see that with the current convert.py I am able to quantize to q8_0, besides fp32 and fp16.
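
That matches recent versions of convert.py, which expose a few output types directly; a quick sketch (the model path is a placeholder):

python3 convert.py ./models/my-13b --outtype q8_0    # --outtype also accepts f32 and f16

The K-quants (q3_K_M, q4_K_M, ...) still need the separate quantize binary, which only exists after building llama.cpp (see the build steps further down in this thread).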

I was able to convert and quantize the fine-tuned model (Llama 2 7B, QLoRA, Dolly-15K dataset). But during inference there is an error:

error loading model: create_tensor: tensor 'output_norm.weight' not found

I'm a bit puzzled and cannot seem to find any info. Did any of you encounter this issue?

I think it's not properly quantized?
What code snippet did you use for the quantization of the model?
And what technique did you use?
Did you try other quant methods and check whether you get the same error? @zbruceli

Hi folks, haven't had time to dig in deep here, but here is a gguf script that may be of some help if you want to quantize with Colab.

GGML is getting deprecated so probably it's best to quantize to gguf.

Thanks for the help @RonanMcGovern , will give it a try!

(quoting the questions above about which code snippet and quant method were used)

I was using the llama.cpp instructions to convert models to gguf format. It works perfectly with the original meta-llama2-7B model, but I had problems when converting the QLoRA-trained model (after merging). I was using the OVH Cloud tutorial and notebook for the QLoRA fine-tuning: https://blog.ovhcloud.com/fine-tuning-llama-2-models-using-a-single-gpu-qlora-and-ai-notebooks/

First step: use llama.cpp's convert.py to convert the model to F16. There was one error:

"Could not find tokenizer.model in models/ovh7b or its parent".

So I copied the tokenizer.model from the original meta-llama2-7B model files. Then the convert script worked correctly.

Then I quantized to q4_0 and it also worked.
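
For reference, roughly the commands that flow corresponds to (a sketch; the base-model path is a placeholder):

cp <path-to-meta-llama2-7b>/tokenizer.model models/ovh7b/
python3 convert.py models/ovh7b --outtype f16
./quantize models/ovh7b/ggml-model-f16.gguf models/ovh7b/ggml-model-q4_0.gguf q4_0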

But when I use llama.cpp to do inference, I get this error:

error loading model: create_tensor: tensor 'output_norm.weight' not found
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'models/ovh7b/ggml-model-q4_0.gguf'

How did you quantize your model to q4_0?
In my case, when I try to execute ./quantize from the llama.cpp repo, I get a "no such file or directory found" error.
Can you help me with this @zbruceli ?

Just a naive question: would changing the model format to .bin work for your inference use?

(quoting @RonanMcGovern's earlier gguf Colab script suggestion)

This works like a charm, thanks for the help @RonanMcGovern . Earlier I had missed building via cmake, hence I was not able to find the quantize binary.

Hi,

I am not able to quantize my model after running convert.py from llama.cpp. The model has been converted into gguf, but while running

./quantize C:\PrivateGPT\privategpt\privateGPT-main\llama.cpp-master\models\ggml-model-f16.gguf C:\PrivateGPT\privategpt\privateGPT-main\llama.cpp-master\models\ggml-model-q4_0.gguf q4_0

an error occurred: ./quantize is not a cmdlet or script function.
Any suggested solutions?
Also, I am trying to work on imartinez/privateGPT and am trying to load the model
[screenshot: model loading code]
but this line is giving me a validation error.
[screenshot: validation error]
Please help, I am a beginner in all of this and need help learning, as there are no professional courses available related to LLMs and GPT.

You guys know I've done all these models in GGUF now? You could just use mine: https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF - no need to make your own if you don't want to.

@komal-09 sorry I have no recent experience with PrivateGPT or GPT4All. But if you're trying to load GGML files with it, it might be that it now only supports GGUF. Try the GGUF files instead; GGML is no longer supported by many tools.

Yes, that's a possible reason, and I have converted my model into GGUF but am not able to quantize it:
[screenshot: quantize attempt and error]
as written in the README of llama.cpp.

@TheBloke I know I am one who has greatly appreciated the work you have been doing for the community. With the recent move to GGUF I started experimenting with doing it myself, basically 'self empowerment' for the next time they change formats on us. And watching what you were doing helped greatly in that venture.

Originally I honestly did not think I had the resources (stuck with an older 12GB vGPU Titan, fried my 24GB Tesla), but the conversion + quantization is not bad at all and only takes a few minutes on my non-GPU machine for a 13B model. (Training, forget it, unless I lease GPU time.)

@komal-09 You did compile it, right? The Python stuff works from the repository out of the box since it's just a script, but that tool isn't an executable until you 'make' it.

@Nurb432 great to hear - and yeah making GGUFs is very light and efficient, needing very few resources. They have done a great job on making it use as little RAM as possible. Pretty much any PC can make GGUFs, even of big models like 70B.

@komal-09 Assuming you compiled it, or downloaded already-compiled binaries, then on Windows it would be quantize.exe rather than ./quantize
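
So with the paths from your earlier command, something like this (a sketch, assuming quantize.exe sits in the current directory):

.\quantize.exe C:\PrivateGPT\privategpt\privateGPT-main\llama.cpp-master\models\ggml-model-f16.gguf C:\PrivateGPT\privategpt\privateGPT-main\llama.cpp-master\models\ggml-model-q4_0.gguf q4_0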

@TheBloke Yes
[screenshot: build steps followed]
Using the above steps I did build it, but 'quantize.exe' also didn't work, giving the same error. The scripts I can see in my directory are as follows:
[screenshot: directory listing]

Is there any way I can download quantize.exe directly?

After going through the entire process, I wrote down the successful path of fine-tuning and then converting to gguf for llama.cpp use:

https://hackernoon.com/the-cheapskates-guide-to-fine-tuning-llama-2-and-running-it-on-your-laptop

(quoting @RonanMcGovern's gguf Colab script suggestion and the "works like a charm" reply above)

Hi @komal-09 , try the installation steps from the script Ronan has provided and see.

@SanjuEpic I did try this approach by setting up again, but whereas in @RonanMcGovern 's Colab file the ls -1 output shows the quantize file just above the readme, in my directory it is not available :)
Please, if there is any source from where I can download the quantize file, do let me know, it's urgent.

Did you try that exact installation process and still get the error?
If so, then I'm not aware of how to resolve your problem :(

@komal-09 try these steps once before doing ./quantize. Even I had a similar issue previously; once you build it, the quantize executable will be visible:

cd llama.cpp/
apt-get update
apt install build-essential git cmake libopenblas-dev libeigen3-dev
make LLAMA_OPENBLAS=1
ls

On Windows the apt command is not valid, can you give an alternative command?

@komal-09 just download a pre-built release for Windows: https://github.com/ggerganov/llama.cpp/releases

It will have main.exe, quantize.exe, and everything else. No need to build it yourself.

If you have an NVidia GPU, pick the cu11.7.1 version if you use CUDA toolkit 11.x, or cu12.1.0 version if you use CUDA toolkit 12.x.

If you don't have an NVidia GPU or don't plan to use it, pick llama-b1215-bin-win-avx2-x64.zip if you have a modern CPU, or llama-b1215-bin-win-avx-x64.zip if you have an older CPU (7+ years old).

@TheBloke Thank you so much, quantization worked 😊

But llama.cpp is still not able to load my model.

[screenshot: model load error]

(quoting the HackerNoon fine-tuning and gguf conversion write-up linked above)

Where can I find those 3 ggml files? Also, the first and second step commands are the same in your post.

My bad, the second step command was a copy paste error. I already updated the article and the correct one should be:

python3 convert.py models/lora

How do I solve this error?
llama_model_quantize: failed to quantize: failed to open ./ggml-model-f16.bin: No such file or directory
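
Most likely the f16 file just isn't where quantize is looking; a sketch of how to check, assuming convert.py wrote its output next to the model weights (the model folder name is a placeholder) and noting that recent convert.py writes a .gguf rather than a .bin:

ls models/my-model/            # look for the exact name convert.py reported writing, e.g. ggml-model-f16.gguf
./quantize models/my-model/ggml-model-f16.gguf models/my-model/ggml-model-q4_0.gguf q4_0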

It's kind of messy @RadarSISA as I don't believe you can use the push_to_hub command.

It is possible by connecting to the repo using git or using the huggingface libraries here: https://huggingface.co/docs/huggingface_hub/v0.16.3/guides/upload
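
A rough sketch of the git route (the repo name and file are placeholders; you will be asked for your Hugging Face username and token on push):

git lfs install
git clone https://huggingface.co/<your-username>/<your-repo>
cd <your-repo>
cp ../ggml-model-q4_0.gguf .
git lfs track "*.gguf"
git add .gitattributes ggml-model-q4_0.gguf
git commit -m "Add q4_0 GGUF"
git push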

Please can anyone tell me: after making the gguf file, my models directory has the following files.

ggml-vocab-llama.gguf
generation_config.json
pytorch_model-00001-of-00002.bin
pytorch_model-00002-of-00002.bin
pytorch_model.bin.index.json
tokenizer.model
ggml-model-f16.gguf
config.json
ggml-model.gguf

What are the necessary files for inference? Actually it is taking the same amount of RAM as without gguf. And how do I do the inference? I'm using the following code for inference.

from transformers import pipeline
pipe = pipeline("text-generation", model="/content/drive/MyDrive/my_llama_cpp/llama.cpp/models")
user_prompt = "What is a SISA Radar?"
system_prompt = "You are a knowledgeable and helpful AI assistant at SISA Information Security Private Limited."
result = pipe(f"[INST] <> {system_prompt} <> {user_prompt} [/INST]")
generated_text = result[0]['generated_text']
print(generated_text)

@RonanMcGovern @zbruceli
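
Note that the transformers pipeline above is almost certainly loading the original pytorch_model-*.bin shards from that folder rather than the GGUF file, which would explain the unchanged RAM usage; GGUF files are meant to be run with llama.cpp itself. A sketch using llama.cpp's CLI from the llama.cpp directory (prompt shortened):

./main -m models/ggml-model-f16.gguf -n 256 -p "[INST] <<SYS>> You are a knowledgeable and helpful AI assistant at SISA Information Security Private Limited. <</SYS>> What is a SISA Radar? [/INST]"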

(reposting the question above about which files are needed for inference after making the gguf file)

@RadarSISA I use llama.cpp for inference, therefore I converted the f16 gguf into a q4_0 gguf. More details in my blog article:
https://hackernoon.com/the-cheapskates-guide-to-fine-tuning-llama-2-and-running-it-on-your-laptop

I came across this discussion because I was experimenting with fine-tuning the llama2 model, and I now have some .bin files that I want to convert to the .gguf file format.

PROMPT> pwd
/Users/username/git
PROMPT> git clone https://huggingface.co/neoneye/llama-2-7b-simonsolver

I also had the problem with a missing tokenizer.model file.
I downloaded the tokenizer.model from the original model Llama-2-7b-chat-hf and placed it inside my own fine-tuned model llama-2-7b-simonsolver.

PROMPT> ls
llama-2-7b-simonsolver
llama.cpp
PROMPT> cd llama-2-7b-simonsolver
PROMPT> python3 ../llama.cpp/convert.py ./ --outtype f16
Loading model file pytorch_model-00001-of-00002.bin
Loading model file pytorch_model-00001-of-00002.bin
Loading model file pytorch_model-00002-of-00002.bin
… snip …
Wrote ggml-model-f16.gguf
PROMPT> ls -la ggml-model-f16.gguf 
13gb ggml-model-f16.gguf

PROMPT> cd /Users/username/git
PROMPT> ./llama.cpp/server --model llama-2-7b-simonsolver/ggml-model-f16.gguf
llama server listening at http://127.0.0.1:8080

Using the llama.cpp web UI, I can verify that the llama2 model has indeed learned several things from the fine-tuning.
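
If the 13 GB f16 file is heavier than needed, an optional extra step (a sketch, assuming the quantize binary was built with make as per the llama.cpp README) is to quantize before serving:

PROMPT> ./llama.cpp/quantize llama-2-7b-simonsolver/ggml-model-f16.gguf llama-2-7b-simonsolver/ggml-model-q4_0.gguf q4_0
PROMPT> ./llama.cpp/server --model llama-2-7b-simonsolver/ggml-model-q4_0.gguf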

My hello world fine-tuned model is here: llama-2-7b-simonsolver.

Also huge thanks to @RonanMcGovern for great videos about fine tuning.

How can I run the conversion on the fine-tuned model?

(pt_source_2.0.1_cu12.2.1_535.86.10_cudnn8.9.5.29_intelpy310) vmirea@vmirea-Z390-GAMING-SLI:/media/vmirea/NTFS_8TB/projects/llama/llama-recipes/llm_qlora$ ls -la models/open_llama_7b_qlora_uncensored_adapter
total 49245
drwxrwxrwx 1 vmirea vmirea 4096 Oct 25 14:37 .
drwxrwxrwx 1 vmirea vmirea 0 Oct 25 13:03 ..
-rwxrwxrwx 1 vmirea vmirea 470 Oct 25 12:54 adapter_config.json
-rwxrwxrwx 1 vmirea vmirea 25234701 Oct 25 12:54 adapter_model.bin
-rwxrwxrwx 1 vmirea vmirea 25178112 Oct 25 14:37 ggml-adapter-model.bin
-rwxrwxrwx 1 vmirea vmirea 853 Oct 25 12:54 README.md
-rwxrwxrwx 1 vmirea vmirea 4091 Oct 25 12:54 training_args.bin
(pt_source_2.0.1_cu12.2.1_535.86.10_cudnn8.9.5.29_intelpy310) vmirea@vmirea-Z390-GAMING-SLI:/media/vmirea/NTFS_8TB/projects/llama/llama-recipes/llm_qlora$ python /media/vmirea/NTFS_8TB/projects/llama.cpp/convert.py models/open_llama_7b_qlora_uncensored_adapter/adapter_model.bin
Loading model file models/open_llama_7b_qlora_uncensored_adapter/adapter_model.bin
Traceback (most recent call last):
File "/media/vmirea/NTFS_8TB/projects/llama.cpp/convert.py", line 1208, in
main()
File "/media/vmirea/NTFS_8TB/projects/llama.cpp/convert.py", line 1157, in main
params = Params.load(model_plus)
File "/media/vmirea/NTFS_8TB/projects/llama.cpp/convert.py", line 292, in load
params = Params.guessed(model_plus.model)
File "/media/vmirea/NTFS_8TB/projects/llama.cpp/convert.py", line 166, in guessed
n_vocab, n_embd = model["model.embed_tokens.weight"].shape if "model.embed_tokens.weight" in model else model["tok_embeddings.weight"].shape
KeyError: 'tok_embeddings.weight'
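
The path being passed in is a LoRA adapter (adapter_config.json plus adapter_model.bin), which convert.py cannot treat as a full model, hence the missing embedding tensor. A sketch of the two usual routes, assuming a llama.cpp checkout that ships convert-lora-to-ggml.py and with the merged-model path as a placeholder (the ggml-adapter-model.bin already in the listing looks like the output of that script):

# Route 1: convert just the adapter and apply it at runtime on top of a base model with --lora
python /media/vmirea/NTFS_8TB/projects/llama.cpp/convert-lora-to-ggml.py models/open_llama_7b_qlora_uncensored_adapter

# Route 2: merge the adapter into the base model first (e.g. with PEFT's merge_and_unload), then run convert.py on the merged full-model folder
python /media/vmirea/NTFS_8TB/projects/llama.cpp/convert.py path/to/merged-model --outtype f16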

(quoting the earlier make-ggml.py suggestion)

https://github.com/ggerganov/llama.cpp/blob/master/examples/make-ggml.py
I guess that's the script. Thanks.
