Feedback

#1
by MarinaraSpaghetti - opened

Hey guys, firstly, thank you for the model! You're doing an amazing job out there, and I've heard much praise about the bigger Magnum, but sadly, it's a bit too big for me to run (24 GB crowd, but preferring longer contexts). I was thrilled to try out this one, but unfortunately, I found it a bit… lacking.

When compared to Nemo Instruct, it's considerably less smart, regardless of the temperature (I was testing it with different temps ranging from 0.5 to 1.2 and a low Min P between 0.1 and 0.3 to control it). It quickly forgets things and smaller details from both the character card and the chat, or makes some funny anatomical errors, such as a character suddenly having three hands. Honestly, it reminds me of the older 13B models, back in the era of the Llama-2 ones. The prose is okay, even slightly better in some cases (especially during ERP parts) than Nemo's, but it lacks overall creativity, which is strange given the datasets you trained it on. It worked at 64,000 context, but again, it seemed dumber than the Instruct Nemo, generally lacking awareness and being illogical at times.
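For reference, the sampling setup above translates to roughly this outside of ST (a rough sketch only; the quant filename and prompt are placeholders, and it assumes a llama-cpp-python build that exposes min_p):

```python
# Sketch of the sampling sweep described above (temperature 0.5-1.2, low min-p).
# The GGUF filename and prompt are placeholders, not the exact files I used.
from llama_cpp import Llama

llm = Llama(model_path="mini-magnum-12b-v1.1.Q8_0.gguf", n_ctx=65536)  # 64k context test

for temp in (0.5, 0.8, 1.2):
    out = llm.create_completion(
        prompt="[INST] Describe the character's appearance. [/INST]",
        temperature=temp,  # values I swept during testing
        min_p=0.1,         # low min-p to keep the tail in check
        top_p=1.0,         # other truncation samplers left off
        max_tokens=256,
    )
    print(temp, out["choices"][0]["text"][:80])
```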

Noticeably, the model would sometimes produce strange artifacts, like forgetting spaces or capitalizing words that were not supposed to be capitalized. It happened on lower temps too, not to mention weird formatting from time to time. This happened on a fresh chat test.

[Screenshot 2024-07-24 at 15.19.52.png]

After chatting with the folks over on Drummer's server, they confirmed having similar thoughts and issues.

Keep in mind that I was testing the model using the Quant-Cartel/mini-magnum-12b-v1.1-exl2-longcal quant at 8.0 bpw, so there is a possibility that the quant itself was simply busted. I'll test it again once I get one from Bartowski or Turboderp.

Here are all the settings on ST I was using (just in case I got them wrong and that's why I was getting subpar results):
Parameters: https://files.catbox.moe/jqhm32.json
Story String: https://files.catbox.moe/bxuywb.json
Instruct: https://files.catbox.moe/cqpl56.json

Overall, I hope my feedback does not come off as too negative!!! It's really great to see the first fine-tunes of Nemo, and I can't wait to see what you'll do in the future! For now, I'll stick to the classic Instruct, but I'll be more than happy to test potential future updates. Keep up the great work!

Thanks for the feedback!
Usually when people do finetunes like this, they build off of Instruct. Open source is lacking hard in good instruction-tuning data. I am aiming to bridge that gap with more high-quality / tailored instruction-following data instead of just "training on the official instruct model", which I feel is a shortcut; it does help with the intelligence of the resulting model, but it will not let us be independent of official finetuning biases in the long term. We should not have to "burn through" the censorship of instruction tunes to get a better trade-off between intelligence and creative-writing quality; ideally, we optimize for both in the same run instead of relying on merges / training on top of Instruct.

Of course, accomplishing this is not trivial, and what you're seeing are artifacts reminiscent of a model before PPO / reward-model optimization (i.e., what almost all good instruction finetunes use to maximize coherence / correctness, rather than DPO, which is... questionable in terms of how effectively it generalizes).

I would also say that the data from c2 is quite biased and sometimes improperly tagged, which absolutely doesn't help.
I plan on creating more Opus instruction data beyond the 25k samples I already made for this, especially data with prefills for specific instructions and more complex sysprompts, so the model adheres to its context better.
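To give a concrete (hypothetical) picture of what prefill-driven data generation can look like, here's a minimal sketch using the Anthropic Messages API, where a trailing assistant turn acts as the prefill; the model name, system prompt, user turn, and prefill text are made-up examples, not the actual pipeline:

```python
# Sketch: generating an instruction sample whose response is steered by an assistant prefill.
# System prompt, user message, and prefill text are illustrative placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

prefill = "Understood. Staying strictly in character, here is the scene:"

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    system="You are a roleplay partner. Follow the character card exactly and never break character.",
    messages=[
        {"role": "user", "content": "Continue the scene from where we left off."},
        # A trailing assistant turn is treated as a prefill; the model continues from it.
        {"role": "assistant", "content": prefill},
    ],
)

completion = prefill + response.content[0].text  # full assistant turn for the dataset
```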

I know for a fact that Anthropic's RL guidelines include data specifically tailored towards anti-incoherence, i.e., given a prompt with questionable / impossible events happening in it, the model will point these things out. I think creating synthetic instruction data of this nature and covering more "specific" prompting is what will generalize well for intelligence.

Another thing I can do is intentionally prefill Anthropic models with weird outputs; the model usually recovers quickly and gets back on track. So there is some degree of resistance built in that a quality instruction finetune generalizes to, thanks to typical reward-modeling objectives, and I think we can partially emulate this behavior with prefilling approaches. If we train on data like this and mask the "weird" parts out so they don't contribute to the loss during training (i.e., they are not learned; only the output after the prefill is), it probably helps implicitly bias the model towards what we want (an implicit assumption of coherence).
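A minimal sketch of that masking idea, assuming a standard Hugging Face causal-LM setup (the model ID, text spans, and token-boundary handling are simplified placeholders, not the real training code):

```python
# Sketch: mask the injected "weird" prefill span out of the loss so only the recovery is learned.
# -100 is the ignore index used by the cross-entropy loss in HF causal LMs.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "mistralai/Mistral-Nemo-Base-2407"  # placeholder base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "[INST] Continue the scene. [/INST]"
weird_prefill = " THE the  sky turned plaid and"  # intentionally broken text injected as a prefill
recovery = " but the evening settled back into place, and she continued calmly."

ids = tokenizer(prompt + weird_prefill + recovery, return_tensors="pt").input_ids
labels = ids.clone()

# Everything up to and including the weird prefill is context only: ignore it in the loss.
# (Real code would align token boundaries more carefully than this prefix re-tokenization.)
masked_len = tokenizer(prompt + weird_prefill, return_tensors="pt").input_ids.shape[1]
labels[:, :masked_len] = -100

loss = model(input_ids=ids, labels=labels).loss  # gradient flows only from the recovery tokens
```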

Your proposal is excellent, I get what you're trying to achieve, and all the power to you!

The thing with Nemo Instruct, however, is that it is actually less censored than the Base model. Even from my tests with Mini Magnum, I could tell it was more reserved than the classic Instruct. So I'll risk saying that it might be the only model out there that's properly safe to fine-tune without needing much de-censoring.

I did some tests with the new iteration of Magnum thanks to intervitens (I think it was version 8) yesterday, and it was way better at instruction-following, but still lacking the overall smartness of the official Instruct. Interestingly enough, it also straight up ignored some of my instructions to pause the roleplay and describe the character's appearance for me.

Regardless, I’m super excited to see what you’re going to cook and once again, thank you for doing the god’s work. 🙏

I'll risk saying that it might be the only model out there that's properly safe to fine-tune without needing much de-censoring.

Maybe so, but I also want to build off of models like Qwen 32b in the future. 32b is a great base model, but the official finetune sucks really badly, especially for English.

I wouldn't say it's less censored (it seems more "neutral" if anything, which is good!), but you are completely correct that any alignment they did is not ridiculously baked in (like Google's models are). But Nemo 12b is small enough to iterate quickly and get a good "formula" established, while not being as small as L3 8b (and presumably with less pretraining filtering of creative writing than L3, which focused hard on coding).

I want to be able to "magnum-ify" any new base model that comes out and have it work as a fine and proper generalist model (like Claude!), not just for writing / RP, so it's a good testing ground for trying to match the quality of the official Instruct.

Oh dear gods, yes, base Qwen is absolutely abysmal to use and I hated it, haha. I totally understand what you're going for now, and fingers crossed you come up with the perfect formula!
I'll be more than happy to continue testing and comparing Magnums with the official Instruct! It would be really nice if Mistral's team shared their datasets like the Nous team did; I'm sure that would help a lot.

I came back to make amends, because I tried the model via GGUF, and it worked MUCH better. No more weird outputs, and I even raised the temperature to 1.05. It looks like exl2 is not yet fully working with Nemo models as intended, from my experience.

Mini Magnum is currently my go-to model, given that it writes excellently. I am in the process of experimenting with merging it with the basic Instruct, to see if I can retain the instruction-following capabilities of the original but with the writing style of Magnum, for best results. Honestly, the style reminds me of the good ol' days back when C.AI was good, making me question whether I was writing with a real human bean or an AI. It doesn't produce prose as unique as RP-Stew's (with its countless hilarious similes and abstract jokes), but it feels natural, and there are barely any GPTisms, if any at all.
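Conceptually, the merge I'm experimenting with boils down to interpolating the two checkpoints' weights; here's a naive sketch of the idea (the model IDs, output directory, and 0.5 ratio are just placeholders, and proper merge tooling handles per-layer ratios and config mismatches much better than this loop):

```python
# Sketch: naive linear interpolation between Instruct and mini-magnum weights.
# Model IDs and the blend ratio are placeholders, not the exact recipe.
import torch
from transformers import AutoModelForCausalLM

instruct = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Nemo-Instruct-2407", torch_dtype=torch.bfloat16
)
magnum = AutoModelForCausalLM.from_pretrained(
    "intervitens/mini-magnum-12b-v1.1", torch_dtype=torch.bfloat16
)

alpha = 0.5  # blend ratio: 0.0 = pure Instruct, 1.0 = pure mini-magnum
magnum_state = magnum.state_dict()
merged_state = {
    name: (1 - alpha) * tensor + alpha * magnum_state[name]
    for name, tensor in instruct.state_dict().items()
}

instruct.load_state_dict(merged_state)
instruct.save_pretrained("nemo-instruct-mini-magnum-merge")  # placeholder output directory
```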

[Screenshot 2024-07-29 at 10.00.45.png]

It still has issues with following character cards, especially at longer contexts, when the character simply mellows out, but that might have to do with the instruct format itself, given that it lacks a proper system prompt (I will forever hate Mistral's cursed ideas). There is also the issue of repeated phrases, but from my tests so far, this happens on all Nemo-based models; I think it's something the base model needs to address.
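To illustrate the system-prompt complaint: since the Mistral template has no dedicated system role, the card / system text just gets folded into the first user turn, roughly like this (a sketch only; exact whitespace and EOS placement vary between tokenizer versions, and the card text and turns are placeholders):

```python
# Sketch: Mistral-style templates have no system role, so the "system prompt"
# (character card, instructions) is usually prepended to the first user message.
def build_prompt(system_text: str, turns: list[tuple[str, str]]) -> str:
    prompt = "<s>"
    for i, (user, assistant) in enumerate(turns):
        user_block = f"{system_text}\n\n{user}" if i == 0 else user
        prompt += f"[INST] {user_block} [/INST]{assistant}</s>"
    return prompt

print(build_prompt(
    "{{char}} is a stoic knight. Stay in character.",  # the de facto "system prompt"
    [("Describe the courtyard.", " The courtyard lay silent under the rain.")],
))
```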

Overall, it's a great model, and I can't wait to see how it improves further!

I agree; after using GGUF, the model seems excellent! It has some issues, like being a bit too aggressive (though some may like this?) and consistency, where it can mix up / lose track of some details, kind of like other smaller models, but overall it's quite intelligent and creative. I like it a lot.
