It's a runaway train at times...

#2
by mancub - opened

Using llama.cpp in WSL... I've been playing with this model, and for the most part it works as expected when I run it with the ### Instruction: / ### Response: template.

Although on some questions it just goes ballistic and won't stop. It keeps repeating the answer, just worded differently each time, and unless I CTRL+C out it would keep going until the world ends.

Also, is there a way to run it in llama.cpp's interactive mode? (For example, with the Wizard-Mega model it was pretty neat, as it remembered the prior conversation.)

I tried --interactive-first with -r "user " as a reverse prompt to make it stop, but it only works the first time. After that it keeps repeating "user" followed by some random question (though it does not answer it). Plain -i(nteractive) does not work either; it just takes off writing code in various languages. It's amusing but not useful. :)

I also noticed that if I ask the same question in --ins(truct) mode it gives me a different (and apparently longer) answer than in --interactive-first mode.
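Roughly, these are the two kinds of runs I'm comparing (the model file, thread count and other flags below are just what I happen to use, nothing special):

./main -t 10 --color -c 2048 -m models/thebloke_manticore-13b.ggml.v3.q5_1.bin --instruct
./main -t 10 --color -c 2048 -m models/thebloke_manticore-13b.ggml.v3.q5_1.bin --interactive-first -r "user "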

Hi mancub,

Did you already play around with "temperature" and "repeat_penalty"?

kind regards

With --temp 0.7, --repeat_penalty 1.7 and --top_p 0.7, I have no problem using it in --interactive-first mode with -r "User:" (note the colon here).
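Put together, that's something along these lines (the model path and -t value are just examples):

./main -t 8 --color -c 2048 --temp 0.7 --top_p 0.7 --repeat_penalty 1.7 -m models/Manticore-13B.ggmlv3.q5_1.bin --interactive-first -r "User:"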

Yeah many people have been reporting this follow-on-answer issue.

I've just updated the main branch with new ggmlv3 models from the latest version of Manticore, epoch 3

I have tested 10 prompts and haven't got a single follow-on answer

@mancub and others, please re-test and let me know if you spot a difference.

Hi,

This is my command line:

GGML_CUDA_NO_PINNED=1 ./main -t 10 -ngl 40 --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -m models/thebloke_manticore-13b.ggml.v3.q5_1.bin --interactive-first -p "User "

I don't have a colon in the prompt "User", but does it really matter? It's just a stop word.


@TheBloke

Shouldn't the version number change though? Yesterday it was v3 when I downloaded it (12 hours ago), and now I see the files have been updated 1 hr ago.

Perhaps make it v3.x, or add a date, e.g. Manticore-13B.ggmlv3.20230520.q5_1.bin.

Firstly, I now see the issue described when I use your command line. Using --interactive-first -p "User " it won't shut up.

I was testing with single prompts, like:

-p "###Instruction: explain in detail the differences between C, C++ and Objective C\n### Response:"
-p "###Instruction: what is pythagorus theorem? Give some examples\n### Response:"
-p "###Instruction: Write an essay comparing France and Germany\n### Response:

And it works perfectly with those.

So it can work with the right prompt template, but there does seem to be an issue with the USER: prompt.
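In other words, plugging one of those prompts into roughly your command line works fine for me:

./main -t 10 -ngl 40 --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -m models/thebloke_manticore-13b.ggml.v3.q5_1.bin -p "###Instruction: explain in detail the differences between C, C++ and Objective C\n### Response:"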

Secondly, the ggmlv3 refers to the GGML file-format version, i.e. it indicates compatibility with llama.cpp; it's not the version of the model itself. When I did the new GGMLv3 models last night I forgot to update to the latest version of the base Manticore model, so this latest push fixes that.

I'd recommend re-downloading the latest pushed files for general quality, but it won't fix this prompt template issue.

I'll update the README to mention this is now epoch 3.

mancub changed discussion status to closed
mancub changed discussion status to open

@)#(%#$ fingers faster than the brains sometimes.

I re-downloaded the model, though the file size is the same as what I had before.

Not sure what it is, but this version acts biased and restricted on me more often than before; the previous one did not make an issue of anything I asked it. It still rambles at the end, first with some overly cordial self-chatter, and then just keeps repeating what it previously wrote. This is in interactive mode.

With Instruction/Response it is concise, but totally incorrect. As a matter of fact it's restricted and refuses to provide a real response like it did previously.

It also seems slower by 10 ms per token or so than the previous one, according to the figures I get after the run completes (looking at eval time: ~100 ms/t now vs ~90 ms/t before).

I can't load the new model with the updated llama.cpp:

(base) :~/llama.cpp$ ./main -t 4 -m ./models/Manticore-13B.ggmlv3.q4_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -ngl 40 -p "### Instruction: ### Response:"
main: build = 552 (b5c9295)
main: seed = 1684599696
llama.cpp: loading model from ./models/Manticore-13B.ggmlv3.q4_0.bin
error loading model: unknown (magic, version) combination: 67676a74, 00000003; is this really a GGML file?
llama_init_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './models/Manticore-13B.ggmlv3.q4_0.bin'
main: error: unable to load model

@MatsuCA Firstly please double check the SHA256SUM matches what's shown on the file page - maybe the file download didn't complete, or the file got corrupted.

If that matches, please double check you're definitely using the latest llama.cpp, recompiled. I've tested the models on Linux and macOS and they work fine.
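Something like this should cover both checks (assuming you cloned llama.cpp with git and build with the plain Makefile; adjust paths as needed):

sha256sum models/Manticore-13B.ggmlv3.q4_0.bin
git pull
make clean && make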

I downloaded both q5_1 and q8_0 now and I notice a slight difference in speed.

For example using:
-p "###Instruction: explain in detail the differences between C, C++ and Objective C\n### Response:"

q5_1:
llama_print_timings: load time = 6902.60 ms
llama_print_timings: sample time = 500.14 ms / 805 runs ( 0.62 ms per token)
llama_print_timings: prompt eval time = 638.39 ms / 25 tokens ( 25.54 ms per token)
llama_print_timings: eval time = 89991.04 ms / 804 runs ( 111.93 ms per token)
llama_print_timings: total time = 97673.19 ms

q8_0:
llama_print_timings: load time = 9330.53 ms
llama_print_timings: sample time = 397.94 ms / 641 runs ( 0.62 ms per token)
llama_print_timings: prompt eval time = 633.99 ms / 25 tokens ( 25.36 ms per token)
llama_print_timings: eval time = 67060.31 ms / 640 runs ( 104.78 ms per token)
llama_print_timings: total time = 77010.65 ms

This is somewhat surprising to me because the larger model appears faster, but I don't know enough about all of this to make any determination. Besides, I'm not complaining: ~10 t/s is great!
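(From the eval times above, that's roughly 1000 / 111.93 ≈ 8.9 tokens/s for q5_1 and 1000 / 104.78 ≈ 9.5 tokens/s for q8_0.)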


I could not find the source of the issue, so I reset my WSL instance and installed llama.cpp from scratch. All working now, thank you.
