Instruct versus non-Instruct

#8
by BigDeeper - opened

Are Instruct versions better for use with agents like gpt-pilot? How is function calling in 3-70B?

Yes. I couldn't run the 70B at anything bigger than Q3 on my setup, and it's garbage.
I am working on a llama-cpp-python prompt template for function calling for llama 3.
I am getting good results with the 8B https://github.com/themrzmaster/llama-cpp-python
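
Roughly, the idea looks like this with llama-cpp-python (just a sketch; the model path, tool schema, and prompt wording here are my own illustration, not the template from that repo):

```python
# Sketch: ask Llama 3 for a JSON "tool call" by describing the tools in the
# system turn of the official chat template. Paths and schema are examples only.
import json
from llama_cpp import Llama

llm = Llama(model_path="Meta-Llama-3-8B-Instruct.Q8_0.gguf", n_ctx=8192, n_gpu_layers=-1)

tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
}]

system = (
    "You are a function-calling assistant with access to these tools:\n"
    + json.dumps(tools, indent=2)
    + '\nWhen a tool is needed, reply ONLY with JSON: {"name": ..., "arguments": {...}}'
)

prompt = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    f"{system}<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "What's the weather in Stockholm?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

out = llm(prompt, max_tokens=256, stop=["<|eot_id|>", "<|end_of_text|>"])
print(json.loads(out["choices"][0]["text"]))  # ideally {"name": "get_weather", "arguments": {...}}
```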

I am actually going to test this at 16-bit for function calling (just instruction following first) this weekend. I'll update you here.

Using the Llama 3 template in LM Studio and setting the eos_token_id to 128009 did not help (the model outputs garbage). The 8B version works. Any ideas?

I am downloading the Q2 in LM Studio and will get back to you. Both were made with the same llama.cpp build, so they should either both work or both fail. I'll test and come back to you.

I am testing directly with llama.cpp/main and it outputs responsive content for a while and then starts producing garbage.

P.S. I'm using the 6-bit version.

Here is my 70B Q2 that I just downloaded:
(screenshot)

It stops correctly, and this is the latest release of LM Studio from last night.

I found this change to llama.cpp:

https://github.com/ggerganov/llama.cpp/pull/6745
https://github.com/ggerganov/llama.cpp/pull/6745/files

Here are some files that were made using this method:
https://huggingface.co/lmstudio-community/Meta-Llama-3-70B-Instruct-GGUF

As you can see from the 8B discussions and my 70B screenshots, the quants for these models work perfectly. The changes in that PR are meant to make it easier for people converting models to pick the right BPE pre-tokenizer, and to add the corresponding Llama-3 handling to llama.cpp.

But you don't have to go with llama.cpp's default template taken from the tokenizer; you can provide it yourself, as we do in LM Studio or manually in llama.cpp, and it all works without any issue.

I'll download the one I want and try again; maybe I downloaded from another user... =) But others seem to have the same problem, so I am confused... I got the lmstudio-community one to work, at least.

Downloading and using Meta-Llama-3-70B-Instruct.IQ3_XS.gguf:

(screenshot)

Fixing the eos_token_id to 128009:

(screenshot)

Using a GGUF file from lmstudio-community:

(screenshot)

That's strange! Also, I don't do the "fixing the eos_token_id to 128009" part.

  • Fresh LM Studio,
  • and any GGUF model

(screenshot)

I have 0.2.20 as well. I am downloading a non-"I" quant. Could it be that different quants have different problems?

(But you said that you could load any GGUF...)

That is possible! I usually don't try IQ models and go with _S or _M. Let me know how it goes; if the IQ is not good, I'll make another one.

Loading the Meta-Llama-3-70B-Instruct.Q3_K_S.gguf:

(screenshot)

Fixing the eos problem:

  • Loading: Meta-Llama-3-70B-Instruct.Q3_K_S.gguf
  • Preparing to change field 'tokenizer.ggml.eos_token_id' from 128001 to 128009
    *** Warning *** Warning *** Warning **
  • Changing fields in a GGUF file can make it unusable. Proceed at your own risk.
  • Enter exactly YES if you are positive you want to proceed:
    YES, I am sure> YES
  • Field changed. Successful completion.

(screenshot)
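
For reference, the same metadata edit can be scripted with the gguf Python package that ships with llama.cpp; this is a minimal sketch of what the metadata-editing script in gguf-py appears to do (the field-layout assumption is just that, an assumption, and you should back up the file first):

```python
# Sketch: flip tokenizer.ggml.eos_token_id from 128001 (<|end_of_text|>) to
# 128009 (<|eot_id|>) in place. Assumes a simple scalar field layout.
# Editing metadata can break the file, so keep a copy.
from gguf import GGUFReader

reader = GGUFReader("Meta-Llama-3-70B-Instruct.Q3_K_S.gguf", "r+")  # open read/write
field = reader.get_field("tokenizer.ggml.eos_token_id")
print("current eos_token_id:", field.parts[field.data[0]][0])       # expect 128001
field.parts[field.data[0]][0] = 128009                               # write through the memmap
```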

Seems to be the imatrix quants that are problematic?

Interesting! I'll make an imatrix this weekend and redo the IQ quants with it.

Could you please try this on the existing GGUF models?

./llama.cpp/main -m Meta-Llama-3-70B-Instruct.Q2_K.gguf -r '<|eot_id|>' --in-prefix "\n<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n" -n 1024

(screenshot)
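
If you'd rather test from Python than the CLI, a roughly equivalent llama-cpp-python call (model path and parameters are placeholders) would be:

```python
# Sketch: same Llama 3 prompt as the main invocation above, via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(model_path="Meta-Llama-3-70B-Instruct.Q2_K.gguf", n_ctx=8192, n_gpu_layers=-1)

prompt = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "You are a helpful, smart, kind, and efficient AI assistant. "
    "You always fulfill the user's requests to the best of your ability.<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

out = llm(prompt, max_tokens=1024, stop=["<|eot_id|>", "<|end_of_text|>"])
print(out["choices"][0]["text"])
```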

The modifications introduced in that PR were to fix some issues in converting the model to GGUF, not the prompt template / tokenizer.

I get a trivial "| was unexpected at this time." error.

I usually don't play with these tools, so I am a bit lost.

Here is a quick demo to show how to use it and you can see the response: https://colab.research.google.com/drive/1HD-_evvGo1l1B-imVfQP7BKfDe-7BbE-?usp=sharing

Hmm... I changed '<|eot_id|>' to "<|eot_id|>" instead, but it seems to work. Please note I used my fixed-eos GGUF.

(screenshot)

So it works from the command line....

Hmm, this was the K_S that worked all along...

The broken one (eos fixed) IQ3_XS:

(screenshot)

I have removed all the IQ quants from all my GGUF repos. I forgot to do an imatrix, so their quality was not good. The rest have been tested in llama.cpp, llama-cpp-python, and LM Studio without changing metadata (with the actual prompt), similar to the demo in Colab.

I am currently running gpt-pilot with the ollama import of Llama-3-70B. The model as imported is about 40GB. Ollama is able to distribute it over 4 GPUs with 12.2GiB VRAM each.

It seems to be "working" in the sense that it does similar thing that I saw when I ran gpt-pilot with OpenAI API. Sometimes it starts outputting junk, on the screen and into files, and I have to cancel and restart for it to behave reasonably again. The key here is that gpt-pilot is able to create files, and mostly is doing reasonable things. What I don't know is whether any of this code will work or not.

@SvenJoinH I have now uploaded 5 new IQ quants for 70B based on the imatrix; they are pretty good, even the IQ1_S, which is the smallest quant.

I am downloading now; if they work, I (and others, I assume) will be very thankful for your effort.

Thank you so much!

I've been trying an IQ2-XS quant from another repo, and it indeed works eerily well. I can't wait to get my hands on a functioning IQ3-XS as well, much appreciated!

I have tested it both locally and in LM Studio. The one before was made without any imatrix, so all the IQ- quants were bad. These ones are tested and they are up to the task:

(screenshot)

The same problem is happening for me with the Q4_M: it's unable to stop, so I don't think it's a problem with the IQ quants only.

@yehiaserag it depends on where you use these GGUF models; if you don't follow the correct template, it fails to stop. Here is a live demo with the smallest Q2 GGUF, downloaded right in the Colab, and you can see the response stopped perfectly fine. The important part is that I used the correct chat template and didn't rely on llama.cpp to provide one for me:

https://colab.research.google.com/drive/1HD-_evvGo1l1B-imVfQP7BKfDe-7BbE-?usp=sharing

Do I understand correctly that the error was/is due to an incorrect EOS being defined? Because the Q4_K_S model still shows EOS token = 128001 '<|end_of_text|>', which is incorrect, instead of <|eot_id|> as it should be.

At the application level, when the <|eot_id|> string is generated, they all stop generating. If you blindly take the stop string from eos_token_id, then yes, that must be fixed. But if you just say "I know <|eot_id|> is the stop string, stop when you see it", then everything should be fine (my example, LM Studio, etc.).
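
In llama-cpp-python terms, a minimal sketch of that application-level approach (the model path is a placeholder, and chat_format="llama-3" assumes a recent llama-cpp-python release):

```python
# Sketch: stop on the strings themselves instead of trusting the GGUF's eos_token_id.
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3-70B-Instruct.Q4_K_S.gguf",
    n_ctx=8192,
    chat_format="llama-3",   # renders messages with the official Llama 3 template
)

resp = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hi! How are you?"},
    ],
    max_tokens=256,
    stop=["<|eot_id|>", "<|end_of_text|>"],  # halt on either string, whatever the metadata says
)
print(resp["choices"][0]["message"]["content"])
```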

So what is the final, correct way to prompt these Q models? I'd like to use Ollama to import a quantized model. Should the system prompt be used only once, or with every single query? Should "<|end_of_text|>" be used at all, if it is interactive?

Thanks.

This is how I use it in llama.cpp (in other apps the stop strings are set to ["<|eot_id|>", "<|end_of_text|>"] and it stops perfectly regardless of what's in the tokenizer):

'<|eot_id|>' --in-prefix "\n<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nHi! How are you?<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n" -n 1024

https://colab.research.google.com/drive/1HD-_evvGo1l1B-imVfQP7BKfDe-7BbE-?usp=sharing#scrollTo=R88QtCrMUraW

OK, but is that single-shot or interactive use? For single-shot you would pass your system prompt to the model every time. If one uses the model interactively, you want to pass the system prompt only once. Also, if one is using the model interactively, how does one end the assistant output? Or do you just leave it open?

You follow the official template, always. At some point, when it thinks the generation is over, the model will generate the EOS token. If the software supports stop strings or stop-sequence strings, it will stop the generation when it sees that (which is the only way to stop any LLM from generating indefinitely).

This is a multi-turn:

(screenshot)
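
In plain Python, the multi-turn idea boils down to something like this (a pure sketch with made-up messages): the system prompt appears once at the top, and each new turn re-sends the whole history wrapped in the same template:

```python
# Sketch: render an OpenAI-style message history into the Llama 3 chat template.
history = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Hi! How are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help?"},
    {"role": "user", "content": "Write a haiku about GGUF quants."},
]

def to_llama3_prompt(messages):
    """System turn once at the top, then every turn wrapped in header/eot markers."""
    parts = ["<|begin_of_text|>"]
    for m in messages:
        parts.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
        )
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")  # cue the next reply
    return "".join(parts)

print(to_llama3_prompt(history))
```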

Perhaps it's best if you share what you are doing, so we can test it on our side and see why it fails.

Hello, can you make IQ quants of Non-Instruct 70B as well? I really want to test it but my own attempt of doing it failed.

Hi,
Just out of curiosity, what are the use cases of the base model in GGUF? Can you fine-tune based on GGUF quant models?

No, it isn't very useful; I'm just curious how it reacts to various prompting. It also seems uncensored, unlike Instruct (in the case of the 8B, that is).

That's OK, let me see what others have done in the last few days. I'll do the ones that are missing :)

I suspect it's because it's a base model. The IQ-1 GGUF here actually works surprisingly well!

How did you upload Q6 and Q8? I am trying to upload but am getting a 50GB limit error. Can you share the script to split the shards?

@aaditya you have to split anything that is larger than Hugging Face's 48GB limit. You can do that with a simple split/cat on Linux, or use the native split/merge in llama.cpp: https://github.com/ggerganov/llama.cpp/discussions/6404
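
If you don't want to depend on the llama.cpp tooling, a plain-Python sketch of the split/cat approach (paths and sizes are placeholders):

```python
# Sketch: split a big GGUF into ~40 GiB parts and reassemble them later.
PART_SIZE = 40 * 1024**3   # stay comfortably under Hugging Face's per-file limit
BUF = 1024**3              # copy 1 GiB at a time to keep memory use bounded

def split_file(path: str) -> None:
    """Write path.part00, path.part01, ... each roughly PART_SIZE bytes."""
    idx = 0
    with open(path, "rb") as src:
        chunk = src.read(BUF)
        while chunk:
            with open(f"{path}.part{idx:02d}", "wb") as out:
                written = 0
                while chunk and written < PART_SIZE:
                    out.write(chunk)
                    written += len(chunk)
                    chunk = src.read(BUF)   # carry the next chunk into the next part
            idx += 1

def merge_file(path: str, n_parts: int) -> None:
    """Equivalent of `cat path.part* > path` on Linux."""
    with open(path, "wb") as out:
        for idx in range(n_parts):
            with open(f"{path}.part{idx:02d}", "rb") as src:
                while piece := src.read(BUF):
                    out.write(piece)
```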

I posted some tests in the discussion. It looks like something is wrong with the imatrix. I created an issue on GitHub:
https://github.com/ggerganov/llama.cpp/issues/6841

Hi, I encountered the non-stop generation problem with models smaller than 24GB in LM Studio. I managed to resolve the issue by adding the word 'assistant' as a stop string.
(screenshot)

Hi,
Just out of curiosity, what are the use cases of the base model in GGUF? Can you fine-tune based on GGUF quant models?

Oh yes! You can actually fine-tune GGUF quant models with llama.cpp.


https://github.com/ggerganov/llama.cpp/issues/6804
Could be relevant, it seems that imatrix has some issues after all.

Could be; I only use the imatrix for IQ- quants and not for Q- quants. I checked the IQ- models myself and they worked very well (but I had the prompt template set correctly).

There is some problem with these quantized models. I was using the 6-bit version, and no matter what format I used for the prompt, it was getting into infinite loops, so I had to switch over to another repo.
I was using Ollama to serve it.

The stuff below does NOT loop infinitely.

FROM /opt/data/QuantFactory/Meta-Llama-3-70B-Instruct-GGUF/Meta-Llama-3-70B-Instruct.Q5_K_M.gguf

TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""

PARAMETER num_ctx 8192
PARAMETER temperature 0.2
PARAMETER num_gpu 73

PARAMETER stop "<|eot_id|>"
PARAMETER stop '<|end_of_text|>'
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop '<|begin_of_text|>'

SYSTEM "You are a helpful AI which can plan, program, and test."

@BigDeeper
I don't know about Ollama, but it works fine in Llama.cpp (latest) and LM Studio (latest). Just make sure you have the latest version of these applications.

MaziyarPanahi changed discussion status to closed
