Text Generation
Transformers
English
llama
sft
text-generation-inference

The model is broken

#2 opened by YAKOVNUKJHJ

I tried several times, and it replies with blank text: a long loop of empty output without any characters.

On short questions it was fine, but as soon as an answer should have been long, it just printed spaces and empty paragraph breaks over and over.

I had the same problem; longer answers became garbled for me. The solution is in a comment by TheBloke in the second discussion thread for this model.

In short: the current instructions in this model's README on how to run it are faulty.

This is an 8K model, so the following line from the README is wrong:

./main -t 10 -ngl 32 -m openassistant-llama2-13b-orca-8k-3319.ggmlv3.q4_0.bin --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<|system|>You are a story writing assistant<|prompter|>write a story about llamas<|assistant|>"

Instead you should use (info provided by TheBloke) and update it to:
./main -t 10 -ngl 32 -m openassistant-llama2-13b-orca-8k-3319.ggmlv3.q4_0.bin --color -c 8192 --rope-freq-base 10000 --rope-freq-scale 0.5 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<|system|>You are a story writing assistant<|prompter|>write a story about llamas<|assistant|>"
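
For background (my own illustration, not something stated in the README): the fine-tune stretches Llama 2's 4096-token RoPE window to 8192 tokens by linear position scaling, so llama.cpp has to compress positions by the same factor. A rough Python sketch of the arithmetic behind --rope-freq-scale 0.5:

```python
# Illustration only (my own names and numbers): why --rope-freq-scale 0.5
# accompanies -c 8192 for this 8K fine-tune.
base_ctx = 4096       # context window the original Llama 2 RoPE assumes
target_ctx = 8192     # context window this fine-tune was trained for
rope_freq_scale = base_ctx / target_ctx   # = 0.5, the value passed to llama.cpp

# With linear RoPE scaling, token positions are compressed by that factor
# before the rotary embedding is applied, so position 8191 is treated
# roughly like position 4095.5 by the original frequencies.
def scaled_position(pos: int, scale: float = rope_freq_scale) -> float:
    return pos * scale

print(rope_freq_scale)        # 0.5
print(scaled_position(8191))  # 4095.5
```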

Does that fix the problem, @dzupin? You are right that those params should be in there for 8K usage, but I did not think they were also needed for <4K usage.

I just tried ./main -t 10 -ngl 50 -m /workspace/openassistant-llama2-13b-orca-8k-3319.ggmlv3.q4_0.bin --color -c 8192 --rope-freq-base 10000 --rope-freq-scale 0.5 --temp 0.7 --repeat_penalty 1.1 -n -1 -p with a long prompt (summarise an existing story) and it still returned nothing, like others are reporting.

It could be a problem with the custom tokens of this model in GGML - I have had reports of that before. I don't think there's anything I can do about that until GGML changes/improves its tokenisation. But there might be some llama.cpp options that would improve it; let me know if anyone finds anything.
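
As an illustration only (not something verified in this thread), one way to check whether the special markers survive GGML tokenisation is to dump their token ids with llama-cpp-python, assuming a build that already exposes the rope parameters on the Llama class:

```python
# Rough sketch (my own code, not from this thread): inspect how the GGML
# tokenizer handles the OpenAssistant markers.
from llama_cpp import Llama

llm = Llama(
    model_path="openassistant-llama2-13b-orca-8k-3319.ggmlv3.q4_0.bin",
    n_ctx=8192,
    rope_freq_base=10000,
    rope_freq_scale=0.5,
)

for marker in ("<|system|>", "<|prompter|>", "<|assistant|>", "</s>"):
    ids = llm.tokenize(marker.encode("utf-8"), add_bos=False)
    print(marker, "->", ids)
```

If a marker comes back as a long run of ordinary sub-word ids rather than a single special-token id, that would be consistent with the tokenisation problem described above.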

Well, thanks at least for all your effort. I download at least one new model to play with every day, so it's never boring.

YAKOVNUKJHJ changed discussion status to closed

@TheBloke
Using the parameters -c 8192 --rope-freq-base 10000 --rope-freq-scale 0.5 fixed several problems for me.

e.g.
main.exe -t 16 -ngl 44 -m C:\AI\WIP\8K\openassistant-llama2-13b-orca-8k-3319.ggmlv3.q3_K_M.bin -s 7 --color -c 8192 --rope-freq-base 10000 --rope-freq-scale 0.5 --temp 0.7 --repeat_penalty 1.1 -p "### Instruction: Explain in great detail the reason for number 42 to be an important number\n### Response:" -ins

It fixed the following problems for me:

  1. The same prompt with the same seed produces longer, more insightful/interesting output with the 8K settings (-c 8192 --rope-freq-base 10000 --rope-freq-scale 0.5).

  2. I can use instruction mode (-ins) to enter follow-up requests, and the model produces replies without any issues if I use -c 8192 --rope-freq-base 10000 --rope-freq-scale 0.5.
    But if I use only -c 4096, only the first reply is generated in -ins mode, and after that the model produces an infinite loop of empty lines (-c 4096 is therefore unusable for me when -ins is specified).

  3. Occasionally/randomly, even a prompt of only around 500 tokens, after initially correct output, generates something like:
    previuoiououiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiuiui^C
    I have difficulty reproducing this third issue reliably. Most prompts work fine with -c 4096 or even with -c 2048, until they don't.

When using -c 8192 --rope-freq-base 10000 --rope-freq-scale 0.5 I never ran into the garbled-text problem.

I work with LLamaSharp in C#, so none of this helps me... but thanks for the advice.

@dzupin OK great, thanks for the details. I will update the README.

@YAKOVNUKJHJ I had a quick look at LlamaSharp and yes it seems they have not added RoPE yet.

By the way @dzupin, I see you're using the wrong prompt template. At least, it's not the prompt template OpenAssistant says you should use. But I guess Alpaca still works, if that's working for you!

Have you tried the correct prompt template, <|system|>{system_message}</s><|prompter|>{prompt}</s><|assistant|> ?
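
For anyone scripting this, a tiny helper (the names are my own, purely illustrative) that assembles that template, e.g. to pass via llama.cpp's -p flag or a binding:

```python
# Illustrative helper for the OpenAssistant prompt template quoted above.
def build_oasst_prompt(system_message: str, user_prompt: str) -> str:
    return (
        f"<|system|>{system_message}</s>"
        f"<|prompter|>{user_prompt}</s>"
        f"<|assistant|>"
    )

print(build_oasst_prompt(
    "You are a story writing assistant",
    "write a story about llamas",
))
```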

Well, thank you very much, I didn't know about that at all.
Thank you for all your efforts; I have no idea how you manage so much in a day.
