[DoggoEval] Test Results

#1
by SerialKicked - opened

DoggoEval : The Dog Persona Test

DoggoEval [Card] - Rex.png

With this test, I'm trying to determine the model's ability to follow a card despite the user's actions and the natural inclination of an LLM (they love to talk and answer questions). In this case, our character is Rex, a German Shepherd. In a more limited way, it also lets me quickly check a model's ability to compartmentalize actions and dialog (and how varied its responses are).

Methodology

The system prompt is kept simple on purpose (all required files and settings are located here). The bot is primed with 3 rounds of hard-coded, normal owner-dog interactions (greeting, sit, treat) to put the model in the right "headspace". This dialog is the same for every model being tested, to reduce noise. The chat history, in SillyTavern format, is available in the files.

The bot is then asked the following 4 questions, in this order:

  1. What time is it, Rex?
  2. What's the square root of Pi?
  3. What's your favorite color?
  4. You are visiting the Island of Knights and Knaves. Every inhabitant is either a Knight or a Knave, and never both. Everything a Knight says is true. Everything a Knave says is false. You meet a pair of Islanders, Alice and Bob. Alice says "Bob and I are both Knaves." What are they?

In practice, the last one could be replaced by any logic test that you know the LLM can answer correctly. The logic test must be several sentences long. As both LLama 3 8B and Mistral 7B can normally answer the question above very easily, it replaces my older query.
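For anyone who'd rather script this than click through SillyTavern, here's a rough sketch of how a run looks. Everything in it is an assumption on my part except the 4 questions: the endpoint is a generic local OpenAI-compatible server, and the system prompt and priming lines are placeholders standing in for the real ones in the linked files.

```python
import requests

# Assumed local OpenAI-compatible endpoint (llama.cpp server, KoboldCpp, text-gen-webui, etc.).
# Some backends also want a "model" field in the payload.
API_URL = "http://127.0.0.1:5000/v1/chat/completions"

SYSTEM_PROMPT = "You are Rex, a German Shepherd owned by {{user}}. ..."  # placeholder, real prompt is in the files

# Placeholder priming rounds (greeting, sit, treat) - the real hard-coded dialog is in the chat history file.
PRIMING = [
    ("user", "Hey Rex! Good morning, boy!"),
    ("assistant", "*wags tail furiously* Woof! Woof!"),
    ("user", "Rex, sit!"),
    ("assistant", "*sits down, looking up expectantly* Woof!"),
    ("user", "Good boy! Here's a treat."),
    ("assistant", "*snatches the treat and munches happily* Arf!"),
]

QUESTIONS = [
    "What time is it, Rex?",
    "What's the square root of Pi?",
    "What's your favorite color?",
    'You are visiting the Island of Knights and Knaves. Every inhabitant is either a Knight or a Knave, '
    'and never both. Everything a Knight says is true. Everything a Knave says is false. You meet a pair '
    'of Islanders, Alice and Bob. Alice says "Bob and I are both Knaves." What are they?',
]

def ask(messages, sampler):
    """Send the running chat history to the backend and return Rex's reply."""
    resp = requests.post(API_URL, json={"messages": messages, **sampler})
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

messages = [{"role": "system", "content": SYSTEM_PROMPT}]
messages += [{"role": role, "content": text} for role, text in PRIMING]

for question in QUESTIONS:
    messages.append({"role": "user", "content": question})
    reply = ask(messages, {"temperature": 0.0})  # sampler 1: neutralized + temp 0
    messages.append({"role": "assistant", "content": reply})
    print(f"Q: {question}\nRex: {reply}\n")
```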

Tests can be considered a full pass, partial pass, partial fail, or fail. E.g., for the time question (there's a rough scoring sketch right after this list):

  • Pass would be barks, and actions not going further than "it's meal time!"
  • Partial Pass would be barks and actions where the dog looks at the clock
  • Partial Fail would be any precise time being written at any point
  • Fail is when the dog uses human language
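If you want the verdicts as numbers, something like the sketch below is roughly how the main runs get tallied. The partial-credit values are just an illustration I picked for this example, not an exact formula; the actual grading keeps a dose of judgment.

```python
# Assumed partial-credit values - the real grading stays partly subjective.
GRADE_POINTS = {"pass": 1.0, "partial pass": 0.75, "partial fail": 0.25, "fail": 0.0}

def score_main_run(grades, quality_bonus=1.0):
    """Main runs (samplers 1 and 2): one point per question plus up to one point for output quality."""
    return sum(GRADE_POINTS[g] for g in grades) + quality_bonus  # out of 5

# Example: pass, pass, partial pass, partial fail, with a middling quality bonus.
print(score_main_run(["pass", "pass", "partial pass", "partial fail"], quality_bonus=0.5))  # 3.5
```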

Surprisingly (or not so much if you've been using neural nets for a long time), the first question is the hardest. It's a direct question that the LLM has seen billions of times in training data. Dogs have a concept of time (it's meal time, it's sleep time). Both elements may be stronger than the system prompt. In casual testing, it's the question triggering the most fails by far, regardless of the model being used.

User Card

I accidentally used my own card for this test. So if you want reproducible results you'll need to add a new user persona with the following info:

  • Name: EsKa
  • Description: {{user}} is a French video game developer and the author of a base-building game called After The Collapse.

Sampling Methods Used

  1. (Main - 5 points) Neutralized samplers + Temp 0. To see how the model behaves in its natural state.
  2. (Main - 5 points) Author's favorite if any OR my default. It feels fair to consider the author's perspective.
  3. (Minor - 1 point) Test with high Repetition Penalty (which L3 models hate)
  4. (Minor - 1 point) Classic settings from Gryphe
  5. (Minor - 1 point) Universal Light ST default. Included in ST, working generally everywhere.

In 1) and 2), each question is worth a point. The last point is for overall output quality, which is (mostly) subjective: varied barks, realistic actions/thoughts, and humor are favored.

I decided against using advanced sampling methods like Mirostat, Smooth Sampling or Dynamic Temperature, as they add too many variables for me to consider. And in my experience, they rarely work well in long sessions. They may still be used in the "author's favorite".

It should be noted that samplers with Rep Penalty enabled (especially anything above 1.1) will make things a lot harder for the model (as it needs to know varied barks if it wants to follow its directive) and are the main cause of failures to use asterisks properly. Tests could continue forever, but all models end up failing at one point or another.
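For reference, the five presets boil down to generation parameters along these lines. Only the first one is exact by definition; the numbers for 3-5 are placeholders from memory, and the real values are in the settings files linked above.

```python
# Rough sketch of the five sampler presets as generation kwargs.
# Only preset 1 is exact; the numbers in 3-5 are illustrative placeholders,
# the real values live in the linked settings files.
SAMPLER_PRESETS = {
    "1_neutral_temp0":   {"temperature": 0.0, "top_p": 1.0, "top_k": 0,
                          "min_p": 0.0, "repetition_penalty": 1.0},
    "2_author_favorite": None,  # pulled from each model card when the author gives one
    "3_high_rep_pen":    {"temperature": 1.0, "repetition_penalty": 1.2},   # placeholder value
    "4_gryphe_classic":  {"temperature": 1.0, "top_p": 0.95, "top_k": 40},  # placeholder values
    "5_universal_light": {"temperature": 1.0, "min_p": 0.05},               # placeholder values
}

# Weights: presets 1 and 2 are scored out of 5 points each, presets 3-5 out of 1 point each.
```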

DoggoEval Results

  • Meta-Llama-3-Instruct_8K.Q8_0 (model page | GGUF) REFERENCE MODEL

    • 3.5/5 at temp 0. Mostly fail on 4. Mostly good on color (borderline good). Rest is pass. Surprisingly decent writing for a base model with temp 0.
    • 3.5/5 my settings. Pass first 2 questions. Mid for the rest. Half a point for good barks (not full point as the dog is too intelligent in actions).
    • 1.75/3 1st: mid (funny, but it ain't a doggo). 2nd: mostly fail (answers some questions in actions). 3rd: pass and funny, again.
    • Note: This is the censored, default, LLama-3 instruct. It does respectably well, and even tries to be funny about it. If your RP model does (much) worse than that, you may need to reassess your training method.
  • Rogue-Enchantress-7B-M0.2_ChatML_32K.Q8_0 (model page)

    • 4/5 Partial pass at Temp 0. Added an hour in parentheses at the end of the 1st response. All other questions pass.
    • 4.5/5 Full pass with creator's settings (Temp 1, MinP 0.02). Removed half a point due to too much thinking in the last question.
    • 1.5/3 Full fail if we count the question about time. Full pass if we don't.
    • Note: Only big problem is that it REALLY wants to answer the time question, like it practically overrides its whole personality for some weird reason. Woofs are not varied. Otherwise, very dog-like. Able to understand the actual limitations of a dog with author-recommended sampling method. Good model.
    • Side Note: I found many occurrences of <|system|> and <|user|> in the output. That's not ChatML, so I suspect the model behaves worse than it should due to being a merge of differently instruct-formatted models. It doesn't have ChatML tokens either, so it's wasting a lot of tokens just for formatting.
  • Stheno-L3-8B-v3.1_LLama3_8K.Q8_0-imat (model page)

    • 3.5/5 Partial Pass at Temp 0. Will tell the hour in the action text as if the dog could read it.
    • 5/5 Full pass creator's settings (Temp 1.12, MinP 0.075, TopK 40, Rep Pen 1.1).
    • 1.5/3 1st: Full pass, 2nd: partial pass, 3rd: partial fail (misuse of action to respond)
    • Notes: Bonus point for using a variety of different woofs and barks, and for making me laugh once. Decently creative. Does okay at the test.
  • Poppy_Porpoise-v0.72-L3-8B_Llama3_8K.Q8_0-imat (model page)

    • 2/5 With temp 0. Stays in character, but answers questions (1, 3, 4) nonetheless, over-using actions.
    • 2.5/5 Roughly the same problem with creator's settings (Temp 0.95, MinP 0.03, SmoothFac 0.3, SmoothCurve 1.89)
    • 1.5/3 1st: fail, 2nd: mid (same problem as above), 3rd: success
    • Notes: Failed at properly using asterisks during the test. Made occasional weird noises for a dog ("Yip-yip-yip!" or "barking barking"). The dog wrote its response on a piece of paper one time to bypass the prompt (dunno if I should count that as clever or not). Creative, but the model is as dumb as a sack of bricks with regard to the test itself.
  • SOVLish-Maid-L3-8B_Llama3_8K.Q8_0 (model page)

    • 4/5 Mostly a pass at Temp 0. Actions are a bit too descriptive, but generally stays vague enough. Dog thinks a lot, but doesn't (attempt to) solve questions.
    • 4/5 No favorite sampler, using mine. Good all around except a real time is given in action for first question.
    • 2.5/3 1st: partial fail at last question, 2nd: full pass, 3rd: full pass
    • Notes: The dog gets annoyed by those weird questions in a few sampling methods (which is good). Decent variety of barks and growls. Solid.
  • Nyanade_Stunna-Maid-7B-v0.2-32K.Q8_0-imat (model page)

    • 5/5 Full pass at temp 0
    • 5/5 at recommended settings (temp 1.15, MinP 0.075). Interestingly, will fail completely if there's any rep penalty.
    • 1.75/3 1st: full pass, 2nd: Fail, 3rd: partial success
    • Note: Like the other Mistral models, it REALLY loves to hallucinate an answer to the time question. Otherwise it's very good at following context. It's not creative, however. Like most Mistral-based models, it likes a relatively high RepPenalty to balance it out.
  • Llama-3-dragonmaid-8B-v2_ChatML_8K.Q8_0 (model page)

    • 4.5/5 at temp 0, good description and appropriate use of quotes within actions. Did look for a clock in 1, but didn't go farther than that. Repetitive output, but that's temp 0 for you.
    • 3.5/5 no preset, using mine. Partial pass on 1, partial fail on 3. Bonus for varied barks, the dog getting annoyed and overall output text quality is pretty decent.
    • 1.75/3 1st: partial pass (fail only on 1). 2nd: partial fail (1, 3), rest ok. 3rd: mostly a pass (color is debatable).
    • Note: Apologies for the mishandling of the first test.
  • Pantheon-RP-L3-8B-1.0_ChatML_8K.Q8_0 - Using ChatML (model page)

    • 1.5/5 at temp 0. Looks at clock and speaks for color and logic puzzle.
    • 2.5/5 with author preset (temp 1, repP 1.05 topP 0.95, topK 40, minP 0.05). Good start, but partial fail on 3 and 4.
    • 0.5/3 1st: Fail. 2nd: Fail. 3rd: Partial pass.
    • Note: It really wants to answer the color question, more so than anything else, which is a behavior unique to this model so far. Its author favorite is also one of my selected presets (2 is Gryphe's and 4 is the one I use when the author doesn't give one). The model ain't as bad as the values would indicate, it writes quite well, but it's clear it's way more comfortable with its own preset characters.
    • Side-note: Using ChatML on an L3 model is heretical, but it's tokenized, so it's not wasting any tokens.
  • Dolphin-2.9.1-L3-8B_ChatML_8K.Q8_0 - Using ChatML (model page)

    • 3.5/5 at temp 0. Good on first 2 questions. Partial fail on the next two. But, at least was funny about it, gets small bonus for that.
    • 3/5 no author preset, using mine. Fails the color question (talks). 1 and 2 are a pass. 4 is mid. Removed half a point for fucking up the syntax in the color question.
    • 1.75/3 1st: Pass (I'll let 4 fly due to being hilarious). 2nd: partial pass (failed and fucked up the format at 4, the rest is very much a proper dog). 3rd: partial fail (especially 4, but funny).
    • Note: Another one using ChatML on a model that already has tokens for prompting. It's tokenized as well, so it's not so bad. It's not an RP model, yet it manages to output 'intentionally' funny answers, which most RP models fail at. Would work wonders for a cartoon dog.
  • SOVL-Mega-Mash-L3-8B_LLama3_8K.Q8_0 (model page)

    • 5/5 at temp 0. Full pass, real dog. Somehow managed to output decently varied answers.
    • 4/5 author preset (not trying them all to find the best, that'd be cheating). Mostly pass for 1 (looking for the clock). 2 and 3 pass. 4 solves the riddle in actions (mostly fail). Bonus for varied/decent writing and dog behavior.
    • 2.25/3 1st: pass except 4 (meh for color) - 2nd: same as 1 - 3rd: full pass
    • Note: Model is too clever for its own good and really wants to answer question 4. It always does it in actions, so it's not as bad. Besides that "issue", it's really good, especially for a big merge.
  • Kunoichi-Lemon-Royale-v2-7B_ChatML_32K.Q8_0 (model page)

    • 5/5 at temp 0. Full pass. Good understanding of a dog's physical and mental limitations.
    • 4.75/5 my settings. Full pass. Good understanding again. Woofs are all the same but the rest is varied enough.
    • 2.5/3 1st: pass - 2nd: mostly pass (Rex is very proud of his ability to count hours) - 3rd: mostly pass (time again)
    • Note: Nothing to add here, it's a very good merge using very good parent models. Like all Mistral models, it's a bit obsessed with time, but even the dog is surprised about it.
  • Poppy-Porpoise-0.85-L3-8B_LLama3_8K.Q8_0-imat (model page | GGUF)

    • 3.5/5 at temp 0. Partial pass on 1. Partial fail on color. Decent quality output.
    • 3/5 creator's settings. partial fail on 1 and 3. okay output quality.
    • 1.5/3 1st: mostly fail (answers most questions in actions. asterisk issue at high rep pen, expected). 2nd: pass (I will let color slide as the dog's favorite color is whatever is user's). 3rd: mostly fail (answers most questions in actions).
    • Note: Notable improvement over v0.75. Better writing. Less asterisk related issues.
  • Mahou-1.2-llama3-8B_ChatML-Named_8K.Q8_0-imat (model page | GGUF)

    • 3.25/5 temp 0. Only a mostly-pass because of unfinished output sentences (see side note). Failed asterisk formatting in each question, somehow (that's a first).
    • 4.25/5 my settings. Pass. Additional asterisk at the end of 2 and 4. Okay style otherwise.
    • 2.5/3 1st: Pass (again, asterisk issue in q2) 2nd: Pass (q4 ending asterisk) 3rd: Mostly pass (asterisks issue too)
    • Note: Too bad it's completely unusable for me (single-line models are a no-go for most of my tests); it would have been a decent workhorse (I also wanted to compare it to its Yi variant).
    • Side-Note: Completely untokenized ChatML variant on an L3 model. Relies on arbitrary stopping characters (including "\n") to reliably stop generation.
  • Mahou-1.2-llama3-8B_ChatML-Named_8K.Q8_0-imat - Using L3 instead, given that it's literally tokenized for it

    • 5/5 temp 0. Full pass. Knows limitations of a dog.
    • 5/5 my settings. Full Pass. Knows limitations of a dog, good writing.
    • 2.75/3 1st: Mostly pass (added an author's note at 4, otherwise a good doggo). 2nd: Pass. 3rd: Pass.
    • Note: All asterisks issues from previous test, poof gone. Excellent output overall.
    • Side-Note: Are you SURE you trained it for your custom variant of ChatML? Because I don't believe that's the case.
  • Halu-8B-Llama3-v0.3_Llama3_8K.Q8_0 (model page | GGUF)

    • 3/5 at temp 0. Mostly fail on color and 4. rest is pass.
    • 3.5/5 my settings. Mostly fail on 4. Color is mostly pass. rest is pass. okay writing.
    • 1.25/3 1st: talking dog (fail), funny output though. 2nd: mostly fail (answers questions in actions). 3rd: mostly pass.
    • Note: If I had to guess, the model does very well in standardized testing, hence the super-intelligent dog. Still, the model understands the prompt, it just wants to answer the questions so badly that it works around it. Still hopeful this one is gonna behave in test 2.

Other Models not qualifying (tested for personal reasons and left here for me to keep track):

  • Hercules-5.0-L3-8B_ChatML_8K.Q8_0: It's not an RP model, but it has function calling. Failed all tests (partial fail at some). First dog that tried to answer the square root of Pi question, which I normally consider a free point. Sad, but expected. I'm sure it's a very good model otherwise. Still, it's likely prone to hallucinating answers to questions it doesn't know.

DoggoEval: A few example outputs

I decided against copy-pasting all the models' results, as it would make the post way too big.

Example: Rogue Enchantress knows what a dog is
firefox_ng43bVBsT0.png

Example: DragonMaid tested after I finally got to sleep.
firefox_f4IMeM0aTv.png

Example: Dolphin beating RP models at RP'ing 1 (Chain of thought type output in dog talk for last question, love it)
firefox_8ZPwpqnCIx.png

Example: Dolphin beating RP models at RP'ing 2
firefox_NPhR0vevtw.png

Added @Nitral-AI 's newest Poppy-Porpoise-0.85-L3-8B. Pretty decent improvement over the previously tested v0.75.

Added flammenai's (I'll just ping @nbeerbower ) Mahou-1.2-llama3-8B_ChatML-Named_8K.Q8_0-imat. Tested both in the ChatML variant and the L3 Instruct format. For this one, I really want an explanation:

Mahou 1.2 in Named ChatML (that's the 5th sampler; by far the best output out of the 5):
firefox_hwlaCvcPvB.png

Mahou 1.2 in L3 (same sampler):
firefox_F3bB9XWYqo.png

Are you SURE you trained it for that named-ChatML variant? Because it happily generated the L3 EOS on its own, and didn't need custom stopping strings here.

Well 0.85 is trained on L3-instruct for certain, cant speak on the other models. Glad to hear you enjoyed the changes in 0.85 however :)

Yes, I know, your model is clearly L3-instruct and did quite decently. I was just asking the other person.

The strings are formatted for ChatML, but the EOS does come from the L3 tokenizer.

Sorry, still figuring all this out... 😅

It's more the obligatory single character stopping strings that make me wonder if something is wrong.

No worries. 😁

If we ignore the instruct format shenanigans, the L3 output is absolutely gorgeous. In other (private) tests both the models I tested from you are quite good with complex scenes as well.

Yeah seems like Llama 3 with a high MMLU score is good at just "getting it."

I'm not sure if the stopping strings are really even necessary, but they kept responses short for gathering training data.

It seems most quantized llama3 models trained on other prompt formats somehow lose it.
I've had issues with it a few times.

ChatML may work in fp/bf16 but I don't have the vram to test it.

CognitiveComputations have been trying to make ChatML work with llama3 ggufs too
cognitivecomputations/dolphin-2.9.1-llama-3-8b

Yeah, I tested Dolphin, pretty decent writing and score for a non-RP model. The main difference is that their ChatML is fully tokenized and its EOS token is properly set up for ChatML. Mahou 1.2 (and 1.2a) has no ChatML EOS token, which is why it has to rely on a list of custom characters to stop the model from generating infinitely, and it ends up behaving better in L3-Instruct.

Edit: Btw, in Mahou's tokenizer_config file, the "chat_template" field also insists on using L3 instruct. No idea how automated testing works, but I guess it's using this field to determine the format it's being tested in?
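If you want to check the tokenization yourself, here's a quick sketch with transformers. The repo id is only my guess at where Mahou 1.2 lives, adjust it to the actual one:

```python
from transformers import AutoTokenizer

# Hypothetical repo id - substitute the model you actually want to inspect.
tok = AutoTokenizer.from_pretrained("flammenai/Mahou-1.2-llama3-8B")

# If a marker encodes to a single id, it's a real token; if it splits into pieces, it isn't tokenized.
for marker in ["<|im_start|>", "<|im_end|>", "<|eot_id|>"]:
    ids = tok.encode(marker, add_special_tokens=False)
    status = "single token" if len(ids) == 1 else "split into pieces"
    print(f"{marker}: {ids} ({status})")

print("EOS token:", tok.eos_token)
print("chat_template (first 200 chars):", (tok.chat_template or "")[:200])
```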

Added @Hastagaras 's model

Like many models scoring super high on standardized tests, it wants to answer the questions so badly that it overrides its card, leading to super-dog behavior.

  • Halu-8B-Llama3-v0.3_Llama3_8K.Q8_0 (model page | GGUF)
    • 3/5 at temp 0. Mostly fail on color and 4. rest is pass.
    • 3.5/5 my settings. Mostly fail on 4. Color is mostly pass. rest is pass. okay writing.
    • 1.25/3 1st: talking dog (fail), funny output though. 2nd: mostly fail (answers questions in actions). 3rd: mostly pass.

A few screenshots:

firefox_Cppf7CTaKi.png Not the win you think it is, Rex 😉
firefox_omu4MacdAh.png Acceptable answer for the time of day.
firefox_i1n89JZ7hb.png Nah, sadly not, doggo. But you tried.

(OK, I really need to start using a chart now. I'll get that done this weekend.)

Thank you for taking the time to evaluate my model! Your feedback is very valuable for my future models!

Here's a rough chart for now - I didn't rank them, they're just in order of being added. (I've attempted a new merge with L3-Daredevil, it has the highest MMLU for llama3. I wonder if MMLU truly is the key to understanding.)

| Model | Test 1 | Test 2 | Test 3 | Average |
|---|---|---|---|---|
| Rogue-Enchantress-7B-M0.2_ChatML_32K.Q8_0 | 4/5 | 4.5/5 | 2.5/5 | 3.66/5 |
| Stheno-L3-8B-v3.1_LLama3_8K.Q8_0-imat | 3.5/5 | 5/5 | 2.5/5 | 3.66/5 |
| Poppy_Porpoise-v0.72-L3-8B_Llama3_8K.Q8_0-imat | 2/5 | 2.5/5 | 2.5/5 | 2.33/5 |
| SOVLish-Maid-L3-8B_Llama3_8K.Q8_0 | 4/5 | 4/5 | 4.175/5 | 4.05/5 |
| Nyanade_Stunna-Maid-7B-v0.2-32K.Q8_0-imat | 5/5 | 5/5 | 2.92/5 | 4.3/5 |
| Llama-3-dragonmaid-8B-v2_ChatML_8K.Q8_0 | 4.5/5 | 3.5/5 | 2.92/5 | 3.64/5 |
| Pantheon-RP-L3-8B-1.0_ChatML_8K.Q8_0 (using ChatML) | 1.5/5 | 2.5/5 | 0.835/5 | 1.61/5 |
| Dolphin-2.9.1-L3-8B_ChatML_8K.Q8_0 | 3.5/5 | 3/5 | 2.92/5 | 3.14/5 |
| SOVL-Mega-Mash-L3-8B_LLama3_8K.Q8_0 | 5/5 | 4/5 | 3.75/5 | 4.25/5 |
| Kunoichi-Lemon-Royale-v2-7B_ChatML_32K.Q8_0 | 5/5 | 4.75/5 | 4.175/5 | 4.64/5 |
| Poppy-Porpoise-0.85-L3-8B_LLama3_8K.Q8_0-imat | 3.5/5 | 3/5 | 2.5/5 | 3/5 |
| Mahou-1.2-llama3-8B_ChatML-Named_8K.Q8_0-imat (using ChatML) | 3.25/5 | 4.25/5 | 4.175/5 | 3.89/5 |
| Mahou-1.2-llama3-8B_ChatML-Named_8K.Q8_0-imat (using L3) | 5/5 | 5/5 | 4.59/5 | 4.86/5 |
| Halu-8B-Llama3-v0.3_Llama3_8K.Q8_0 | 3/5 | 3.5/5 | 2.08/5 | 2.86/5 |

Test 3 scores were converted from x/3 to x/5 by multiplying by 1.67.
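For transparency, the conversion and the averages are nothing fancier than this (the 1.67 in the note is just 5/3 rounded):

```python
def to_out_of_5(score_out_of_3):
    # Test 3 is graded out of 3, so scale it up to /5.
    return score_out_of_3 * 5 / 3

def average(test1, test2, test3_out_of_3):
    return round((test1 + test2 + to_out_of_5(test3_out_of_3)) / 3, 2)

print(average(5, 4.75, 2.5))  # Kunoichi-Lemon-Royale-v2 -> 4.64
```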

Thanks, mate. I'll add your model to the list and get to it. But, like with the new v3 of Kunoichi-Lemon-Royale, I'd rather diversify the authors a bit to get a more representative panel. If I don't reach 20 different models before I'm ready to go for the second test, I'll include your new model and @grimjim 's to fill the blanks.

edit: I can't really say either way when it comes to the relation between this test and standardized ones. The only thing I can say for sure at this point is that RP models do better than non-RP ones. The Hercules 5.0 model I put below the list is just one of many examples of non-RP, big-brained models that didn't pass, even under the most generous interpretation of the term. If I were forced to guess, I'd say models over-fitted to score high on synthetic tests will perform worse here (but I really don't know).

edit 2: Added Meta's LLama3 model as a reference point. Yes, the 'censored' one. Its output is nearly identical to Dolphin's (between 3 and 3.5 points, margin of error/taste; decent at this test, and with humor).
