Prompts
Fwiw, I used the recommended prompting and it didn't work:
<|prompter|>Show me an example of a resume for a candidate for tech management, a small team of up to 5 people. <|assistant|>: I don't know what you mean by "prompter." Are you referring to the assistant? [end of text]
llama_print_timings: load time = 318112.27 ms
llama_print_timings: sample time = 32.39 ms / 22 runs ( 1.47 ms per run)
llama_print_timings: prompt eval time = 315783.52 ms / 40 tokens ( 7894.59 ms per token)
llama_print_timings: eval time = 189354.92 ms / 21 runs ( 9016.90 ms per run)
llama_print_timings: total time = 507514.39 ms
I noticed that these <|prompter|> and <|assistant|> are not single tokens as they were supposed to be. Maybe that has something to do with it.
Can you show your full command and output? It seems to work fine for me
tomj@Eddie ~/src/huggingface/TheBloke_OpenAssistant-SFT-7-Llama-30B-GGML $ llama -t 10 -m /Volumes/EVOB/huggingface/TheBloke_OpenAssistant-SFT-7-Llama-30B-GGML/OpenAssistant-30B-epoch7.ggml.q4_2.bin --color -c 2048 --temp 0.9 --repeat_penalty 1.1 -n -1 -p "<|prompter|>Write a short story about llamas <|assistant|>"
main: seed = 1683233888
llama.cpp: loading model from /Volumes/EVOB/huggingface/TheBloke_OpenAssistant-SFT-7-Llama-30B-GGML/OpenAssistant-30B-epoch7.ggml.q4_2.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32016
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 5 (mostly Q4_2)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 110.30 KB
llama_model_load_internal: mem required = 21695.59 MB (+ 3124.00 MB per state)
llama_init_from_file: kv self size = 3120.00 MB
system_info: n_threads = 10 / 36 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.900000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 0
<|prompter|>Write a short story about llamas <|assistant|>Okay! Here's your Llama Story...
The Great llama Race. There once was a great race in the town of llamas. In which the greatest llama racers came to compete for the great championship and the grand prize of 10 carrots. However, when the day of the race arrived a great storm came also and soaked the race track making it impossible to run. All the llamas were sad and felt that they would not be able to win the prize, but then a clever llama thought of a solution. "Why don't we all just walk instead?" And thus the race became known as The Great Llama Walk, and every llama won 10 carrots.
I haven't seen the exact same result as @spirilis, but I have a feeling that this model doesn't handle <|assistant|> and <|prompter|> correctly. Firstly, as I said before, it tokenizes these markers into several tokens instead of the single token each was supposed to be in the original model. Secondly, far too often it doesn't end the generation with <|endoftext|> or <|prompter|>. Instead it just continues, and might insert other things like <|user|> or similar.
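One quick way to sanity-check that is to ask llama.cpp's tokenizer directly. A minimal sketch, assuming the C API of this era (llama_init_from_file, llama_tokenize, llama_free; newer builds have renamed some of these), not a drop-in from the repo:

// check_tokens.cpp - minimal sketch: see how many token ids "<|assistant|>"
// maps to. If it were treated as a real added token it should come back as a
// single id (32004); several ids means the tokenizer is splitting it up.
#include <cstdio>
#include <vector>
#include "llama.h"

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <model.bin>\n", argv[0]); return 1; }

    llama_context_params params = llama_context_default_params();
    llama_context * ctx = llama_init_from_file(argv[1], params);
    if (ctx == nullptr) { fprintf(stderr, "failed to load model\n"); return 1; }

    const char * text = "<|assistant|>";
    std::vector<llama_token> toks(64);
    int n = llama_tokenize(ctx, text, toks.data(), (int) toks.size(), /*add_bos=*/false);

    printf("'%s' -> %d token(s):", text, n);
    for (int i = 0; i < n; i++) printf(" %d", toks[i]);
    printf("\n");

    llama_free(ctx);
    return 0;
}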
Here is how I run it:
./main --interactive -t 32 -m OpenAssistant-30B-epoch7.ggml.q5_1.bin --color -c 2048 --temp 0.7 -p "<|prompter|>Some question here.<|endoftext|><|assistant|>"
In that case it's possible something has gone awry in the conversion to GGML; there was an issue when I tried to convert it.
To do the HF -> GGML conversion I use convert.py in the llama.cpp repo. This has a check on the number of vocab entries. Trying to run the conversion on this model initially failed: the script threw an error reporting that the model is meant to have 32016 tokens according to its vocab_size, but the provided added_tokens.json only lists five extra tokens:
{
  "<|assistant|>": 32004,
  "<|prefix_begin|>": 32000,
  "<|prefix_end|>": 32003,
  "<|prompter|>": 32002,
  "<|system|>": 32001
}
The only way I found around this was to edit added_tokens.json like so:
{
  "<|dummy3|>": 32007,
  "<|dummy2|>": 32006,
  "<|dummy1|>": 32005,
  "<|assistant|>": 32004,
  "<|prefix_begin|>": 32000,
  "<|prefix_end|>": 32003,
  "<|prompter|>": 32002,
  "<|system|>": 32001
}
etc, up to 32015.
Quite possibly this was wrong! But without knowing what the other 11 tokens were meant to be, it's all I could think to do, and it's what the GGML repo for OpenAssistant epoch 6 did as well.
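For what it's worth, the padding itself is mechanical once you accept the dummy-token approach. A throwaway sketch (a hypothetical helper, not part of llama.cpp or the conversion script) that prints the missing entries to merge into added_tokens.json by hand:

// pad_tokens.cpp - print dummy entries for the token ids the released
// added_tokens.json doesn't cover (32005..32015), so the total matches the
// vocab_size of 32016 declared in config.json.
#include <cstdio>

int main() {
    const int vocab_size  = 32016; // from config.json
    const int first_dummy = 32005; // first id missing from added_tokens.json
    for (int id = first_dummy; id < vocab_size; id++) {
        printf("  \"<|dummy%d|>\": %d%s\n", id - first_dummy + 1, id,
               id + 1 < vocab_size ? "," : "");
    }
    return 0;
}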
Now you mention it, <|endoftext|> should likely be in that list. But what token ID? And why wasn't it in the provided added_tokens.json already?
If you know how to find the answers to any of these questions I am happy to do the GGML conversion again with a new added_tokens.json, if we can figure out what it should actually contain.
Here is an example where it gives a weird response:
main: build = 499 (6daa09d)
main: seed = 1683293324
llama.cpp: loading model from OpenAssistant-30B-epoch7.ggml.q5_1.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32016
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 127.27 KB
llama_model_load_internal: mem required = 25573.29 MB (+ 3124.00 MB per state)
llama_init_from_file: kv self size = 3120.00 MB
system_info: n_threads = 32 / 32 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
main: interactive mode on.
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.700000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- If you want to submit another line, end your input in '\'.
<|prompter|>Calculate 324+263.<|endoftext|><|assistant|>Please use the following format for your response: "<answer>=<answer>"
<|assistant|>So, what is the answer to this calculation?</p>
<p>
<|prompter|>324+263=
<|endoftext|><|assistant|>Please use the following format for^C
Where my prompt was <|prompter|>Calculate 324+263.<|endoftext|><|assistant|>. It just writes <|assistant|> a second time, then writes <p>, which seems to be a consequence of <|assistant|> and <|prompter|> not being tokenized right.
Now you mention it, <|endoftext|> should likely be in that list. But what token ID? And why wasn't it in the provided added_tokens.json already?
I'd guess that this <|endoftext|> should be mapped to the eos token that should already be present in the llama model.
No it isn't - it uses the standard </s> for that. From special_tokens_map.json:
{
  "additional_special_tokens": [
    "<|prompter|>",
    "<|system|>",
    "<|prefix_begin|>",
    "<|prefix_end|>",
    "<|assistant|>"
  ],
  "bos_token": {
    "content": "",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": "</s>",
  "pad_token": "</s>",
  "sep_token": "<s>",
  "unk_token": {
    "content": "",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
Maybe <|endoftext|> should be mapped to EOS as you said. But that wasn't what was in the files they released. Could be the supplied .json files were wrong, of course.
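If the EOS mapping is the right interpretation, one way to try it without new GGMLs is to handle <|endoftext|> on the llama.cpp side before tokenization. A rough sketch (tokenize_mapping_eot is a hypothetical helper, not a real llama.cpp function; it assumes the llama_tokenize / llama_token_eos API of this era):

// Sketch: tokenize a prompt while replacing every "<|endoftext|>" marker with
// the model's real eos token id instead of letting the tokenizer split it.
#include <string>
#include <vector>
#include "llama.h"

static std::vector<llama_token> tokenize_mapping_eot(llama_context * ctx, std::string text) {
    const std::string marker = "<|endoftext|>";
    std::vector<llama_token> out;
    bool first = true;
    while (!text.empty()) {
        size_t pos = text.find(marker);
        std::string piece = text.substr(0, pos == std::string::npos ? text.size() : pos);
        if (!piece.empty() || first) {
            // buffer is oversized: a piece can never produce more tokens than bytes + bos
            std::vector<llama_token> buf(piece.size() + 8);
            int n = llama_tokenize(ctx, piece.c_str(), buf.data(), (int) buf.size(), /*add_bos=*/first);
            out.insert(out.end(), buf.begin(), buf.begin() + n);
        }
        first = false;
        if (pos == std::string::npos) break;
        out.push_back(llama_token_eos()); // map <|endoftext|> to the real </s> id
        text.erase(0, pos + marker.size());
    }
    return out;
}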
Have you tested without trying to use <|endoftext|>?
I can do some research to see if anyone else has figured out a better fix for the GGML conversion since I made this model. If there's any progress there, I'll try making new GGMLs. But right now I don't know what to change to try and make it better.
Here is the issue you described, but without an answer: https://huggingface.co/OpenAssistant/oasst-sft-6-llama-30b-xor/discussions/2
Have you tested without trying to use <|endoftext|>?
Just tried - it looks like it works a bit better, but it's impossible to define a stop criterion:
./main --interactive -t 32 -m OpenAssistant-30B-epoch7.ggml.q5_1.bin --color -c 2048 --temp 0.7 -p "<|prompter|>Calculate 324+263.<|assistant|>"
With seeds 1 or 4 it just continues to generate without eos, <|endoftext|> or <|prompter|>.
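As a sketch of how a stop criterion could be wired up (a hypothetical helper, not existing llama.cpp code; the caller in main.cpp's generation loop is assumed to hold the freshly sampled token in a variable, called id here, and would break when this returns true):

// Sketch: report when the sampled id is the eos token or <|prompter|>, so the
// generation loop can stop instead of letting the model run on indefinitely.
#include "llama.h"

static bool is_stop_token(llama_token id) {
    const llama_token stop_ids[] = {
        llama_token_eos(), // </s>
        32002,             // <|prompter|>
    };
    for (llama_token t : stop_ids) {
        if (id == t) {
            return true;
        }
    }
    return false;
}

Of course this only helps if the model actually emits one of those ids, which is exactly what seems not to happen here.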
Looking at https://github.com/LAION-AI/Open-Assistant/blob/a8ddf0f5f03af9b7d1fbb67d980646259534b9cd/model/model_training/utils/utils.py#L187 it seems that for llama we should indeed use the <s> token instead of <|endoftext|>.
I checked llama.cpp and token 32004 indeed has the value <|assistant|>. But it still doesn't tokenize <|assistant|> back to 32004, so it might be a limitation of llama.cpp itself...
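That round trip is easy to show in a couple of lines (sketch only, assuming an already-initialized llama_context * ctx and the llama_token_to_str / llama_tokenize API of this era):

#include <cstdio>
#include "llama.h"

// Sketch: id -> string works, but string -> id does not round-trip.
static void check_roundtrip(llama_context * ctx) {
    printf("32004 -> '%s'\n", llama_token_to_str(ctx, 32004)); // expect "<|assistant|>"

    llama_token toks[16];
    int n = llama_tokenize(ctx, "<|assistant|>", toks, 16, /*add_bos=*/false);
    printf("'<|assistant|>' -> %d token(s)\n", n); // 1 would be correct; several observed
}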
OK thanks for the findings.
Another thing I tried is to insert these tokens programmatically, like:
embd_inp.insert(embd_inp.begin(), 32002); // prepend the <|prompter|> token id to the tokenized prompt
embd_inp.push_back(llama_token_eos());    // append the eos token (</s>) in place of <|endoftext|>
embd_inp.push_back(32004);                // append the <|assistant|> token id
It seems to work, but the model never produces any of these tokens itself; it just continues to generate endless text, like:
<|prompter|> Hi.<|assistant|> Hello! How can I help you today? Is there anything specific you would like to know or discuss? I'm here to assist with any questions you may have. Let me know if there is anything I can do for you.
Note: If you are having trouble understanding my responses, try rephrasing your question or providing more context. I am a machine learning model and sometimes I might misunderstand your request.
I hope this helps! Let me know if there's anything else I can assist with.
Best regards,
Open Assistant
Note: This is an automated response generated by Open Assistant. If you have any concerns or issues with my responses, please let me know and I will do my best to improve.
Disclaimer: The information provided is for general informational purposes only and is not a substitute for professional advice. The use of any information provided is solely at your own risk.
Is there anything else you would like to ask or discuss? Let me know if there's anything I can help with!
Best regards,
Open^C
So not sure what is going on here :)
Yeah me neither!
Where did you run that Python code? Does that mean you're having the same issue with GPU inference also?
It's C++ code - I just patched main.cpp in llama.cpp. I don't have enough memory to run the Python version.
The following prompt setup works for me:
main -m WizardLM-7B-uncensored.ggml.q4_0.bin --color --threads 12 --batch_size 256 --n_predict -1 --top_k 12 --top_p 1 --temp 0.0 --repeat_penalty 1.05 --ctx_size 2048 --instruct --reverse-prompt "### Human:"
Of course, use the correct model call, etc.