Unpopular feedback after some tests.

#1
by Ks01 - opened

Dawn v2 wasn't what I expected.
It didn't follow complex prompts (e.g., RPG status in character prompts).
Its answers were short compared to other models', with little detailed depiction.
Character role-play was not bad, but not extraordinary either.

Compared to what I experienced with your 13B and 20B models, this was an unexpected result.
I know Dawn 0.1 is a failed model, but somehow it worked better on complex prompts, and the output was even better than using Xwin alone.

After a few more tests, and after digging into what's going on with Dawn v2 myself, I found that it wasn't built from strong models to begin with.

ORCA_LLaMA_70B
airoboros-l2-c70b
Nous-Hermes-Llama2-70b
Samantha-1.11-70b

Those are the underwhelming models I had tested before. I can't say much about qCammel since I didn't test it, but I know those four were not good at understanding prompts.
ORCA_LLaMA is not good at RP at all, even with its high score on the Ayumi leaderboard.
Airoboros 3.1.2 was a bit underwhelming compared to 2.2.1: it gave me short outputs and extremely sparse depictions. It did well at role-playing, but 2.2.1 was better at both RP and understanding complex prompts.
Nous-Hermes-Llama2 writes well but isn't very intelligent, and Dawn doesn't produce enough output to show off that writing strength. It feels like the merge inherited the weakness, not the strength.
Samantha has intelligence comparable to ORCA_LLaMA or FashionGPT, but it isn't suitable for RP.

I just assume that's why it isn't working the way I expected.
Maybe others have found more interesting strengths in this model, so take this as just one unpopular opinion.
Thanks for experimenting with 70B though. I hope you stay interested in 70B, not only 7B or 13B.

Hi!
It's okay, feedback is feedback, and I thank you for the time you took to write it up.
Dawn v1 is usable, but as you know it was just LimaRP on top of Xwin, and not something I wanted.
If this model doesn't serve you well, that's okay; I will do better next time. I'm currently still working on making a usable EXL2 quant, so I will be able to check the model's real limits once I succeed, because f16 and GGUF are really too slow for me. Still, I've found the output to be decent so far.
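
For anyone curious, the conversion step is just the standard exllamav2 flow. A rough sketch (this uses convert.py from the exllamav2 repo; the flags may change between versions, so check there before copying):

```
# Sketch of an exllamav2 quantization run:
#   -i  : input fp16 model directory
#   -o  : working/scratch directory for measurement state
#   -cf : output directory for the finished quant
#   -b  : target bits per weight
python convert.py -i /path/to/Dawn-v2-70B -o /tmp/exl2-work \
    -cf /path/to/Dawn-v2-70B-exl2 -b 4.0
```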
I'll keep what you said in mind about the choice of models if I do a v3 someday!

My experience is diametrically opposite, though I just started using SillyTavern, so I'm sure my prompting, etc. is suboptimal. Nevertheless, its prose was orders of magnitude better than some other 70B models I've tried, especially with metaphors (though that could partly be down to SillyTavern itself). Did OP specify whether they were using a quant or not?

There is AzureBlack/Dawn-v2-70B-exl2 (IIRC a full spectrum of quants is in there), so you don't have to quantize nothin' yourself, unless you do it for the thrill of it. ;-)

Frankly, I suspect part of the reason it was better for me had to do with SillyTavern vs. KoboldAI and the flexibility in prompting, but I doubt that was all of it.

Oh, BTW: IMHO you should provide specific sampler settings, etc., or maybe even a preset file for people to try with the model. That way, at least SOME of the variables will be controlled.
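
Even something rough would do. As a purely illustrative sketch (the numbers are placeholders, not recommendations, and the exact field names depend on the frontend), a text-completion preset could be as simple as:

```json
{
  "temperature": 0.9,
  "top_p": 0.9,
  "top_k": 40,
  "repetition_penalty": 1.1,
  "repetition_penalty_range": 1024,
  "max_new_tokens": 400
}
```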

Oh, also: you might want to try a merge like this one, except with chronos-70b substituted for one or more of the more generic instruct models.
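
To sketch what I mean in mergekit terms (purely hypothetical; the method, weights, and model IDs below are illustrative, not Dawn's actual recipe):

```yaml
# Hypothetical recipe: chronos-70b swapped in for a generic instruct model.
merge_method: linear
models:
  - model: elinas/chronos-70b            # the substitution suggested above
    parameters:
      weight: 0.3
  - model: NousResearch/Nous-Hermes-Llama2-70b
    parameters:
      weight: 0.3
  - model: Xwin-LM/Xwin-LM-70B-V0.1      # keep the strong base
    parameters:
      weight: 0.4
dtype: float16
```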

It's early days for me. I found it wasn't that inspiring for chat or chat-instruct on ooba, but when I switched to the Alpaca instruct format with a bit of prompt tweaking it did really well; I hardly had to regenerate at all, so that's a real plus!
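
For anyone who hasn't tried it, the standard Alpaca template is just:

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{your prompt here}

### Response:
```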
