question

#1
by WesPro - opened

I just wanted to say that I love the magnum series, and I appreciate the decision to expand the range of base models being used. I assume 9b is probably Gemma2, 12b Mistral-Nemo, 22b Mistral Small, 27b Gemma2, 72b Qwen2.5 and 123b Mistral Large. I'm working on a dataset/LoRA to train Qwen2.5 14b/32b myself right now, and the first test was already really promising, even though the dataset is small, needs better formatting, and still needs a lot of work on the dialog portion, which is mostly unedited SillyTavern transcripts that were "donated" to me (a rough sketch of the kind of training run I mean is at the end of this message). Since your only Qwen2.5 finetune is the 72b, which sadly is not an option for me, I would love to see Qwen2.5 14b and 32b finetuned on the magnum datasets. In my opinion the 32b model is definitely much better than Yi-34b, which was part of v3, and the 14b is generally also really impressive. If you decide to give them a shot, I would definitely appreciate it.

I'll give the 22b/27b versions a try now... Have a nice Sunday :)
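
For reference, here is a rough sketch of the kind of LoRA run I mean, assuming the transcripts have already been flattened into a `text` field in a JSONL file; the model name, rank, file path and hyperparameters are only illustrative, not my actual setup:

```python
# Hedged sketch of a LoRA finetune of Qwen2.5-14B-Instruct with transformers + peft.
# Everything concrete here (paths, rank, hyperparameters) is a placeholder.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL = "Qwen/Qwen2.5-14B-Instruct"
tok = AutoTokenizer.from_pretrained(MODEL)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

# LoRA over the usual Qwen2-style attention and MLP projections.
lora = LoraConfig(r=32, lora_alpha=64, lora_dropout=0.05, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                                  "gate_proj", "up_proj", "down_proj"])
model = get_peft_model(model, lora)

ds = load_dataset("json", data_files="transcripts.jsonl")["train"]  # hypothetical file
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=8192),
            remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qwen14b-rp-lora", num_train_epochs=1,
                           per_device_train_batch_size=1, gradient_accumulation_steps=8,
                           learning_rate=2e-5, bf16=True, logging_steps=10),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
model.save_pretrained("qwen14b-rp-lora")
```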

Anthracite org

We kept trying to finetune Qwen 32B and 14B, but we found that they were just not good enough haha. We're still trying some experiments but no promises. Thank you for the support and good luck.

That's interesting. Can you elaborate on what you felt was lacking? Did you try them without any further finetuning? I feel like the step from the 1.5 to the 2.5 generation is a big improvement, but the step from 1.5 to 2.0 wasn't really. Do you like the untuned 14b/32b 2.5 models in general and it's just not taking the finetuning data as expected/intended/hoped, or do you think they're simply not as good as the other currently available models of similar size? Do you use any kind of special formatting for your dataset, and do you make custom edits to your data for each model? (I've put a small sketch of the formatting I mean after this message.) Did you think the 1.5 gen of Qwen 32b was doing better in any regard, and what about the 72b? My first experience with Qwen 14b Instruct was, as I already mentioned, really positive, and even though it was just one epoch on an uncleaned "dataset" of dialog plus the character cards involved, it made a pretty big difference when I started to test topics that were part of the data.

Good luck to you too, and lots of fun and success with your project.
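
Since you asked about formatting: this is roughly what I've been doing with the donated transcripts, assuming SillyTavern's JSONL chat export (one message per line with "is_user"/"mes" fields plus a metadata header line); the file path and system prompt are placeholders:

```python
# Sketch: turn a SillyTavern chat export into the ChatML text Qwen2.5 expects.
import json
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")

def transcript_to_text(path, system_prompt):
    messages = [{"role": "system", "content": system_prompt}]
    with open(path, encoding="utf-8") as f:
        for line in f:
            msg = json.loads(line)
            if "mes" not in msg:                 # skip the metadata header line
                continue
            role = "user" if msg.get("is_user") else "assistant"
            messages.append({"role": role, "content": msg["mes"]})
    # apply_chat_template renders the <|im_start|>/<|im_end|> ChatML markup.
    return tok.apply_chat_template(messages, tokenize=False)

print(transcript_to_text("chat.jsonl", "You are {{char}} in a roleplay with {{user}}."))
```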

Anthracite org

"its just not taking the finetuning data as expected/intended/hoped"
That pretty much sums the experience up: 32B was just not coming out right and Gemma looked like the better option. As for 14B, even the base was rife with refusals, so we went with Gemma and NeMo instead.

"Do you like the untuned 14b 32b 2.5 models in general"

While I can't speak for everyone, in my experience 14B Instruct was just rife with refusals and slop. 32B is intelligent and makes for a good assistant, but it's not at all suited for roleplay / creative writing.

"Did you think the 1.5 Gen of 32b Qwen was doing better any regard and what about the 72b?"
Just a thought, but possibly the red teaming done for the 2.5 generation hurt its performance compared to the 1.5 generation. 72B does seem to be an outlier in that it's quite smart and not prone to refusing the most basic requests.

That's weird, since I've never had any refusals. Do you mean a direct refusal like "I cannot engage in XY since I'm just an LLM and my developers said it's bad", or do you mean it doesn't properly follow the prompts? Maybe it's because I've only ever used it inside SillyTavern so far. You could read up on orthogonal activation steering, or use one of the already published abliterated versions either as a base, or extract the difference against the base model into a LoRA with mergekit and apply that LoRA to your model. Training on data about censored/refusal-inducing topics does work to some extent, but the orthogonal activation steering/abliteration method should be able to remove refusals more completely, which would make it possible, and easier, to get a good result in terms of the model actually adapting to your finetuning. Supposedly it can not only decensor but also steer the model toward specific behavior, at least that's what I've read and what people claim to have achieved. I don't know how much it affects other characteristics of the model, but it's the only method that is supposed to change refusal behaviour more reliably.
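
In case it helps, here's a very rough sketch of the orthogonal activation steering / abliteration idea as I understand it, assuming a Hugging Face causal LM; the layer index, the placeholder prompt sets and the choice to only orthogonalize the MLP down-projections are all assumptions, not a tested recipe:

```python
# Untested sketch of orthogonal activation steering ("abliteration"):
# estimate a "refusal direction" from activation differences, then project it
# out of weights that write into the residual stream.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-14B-Instruct"   # illustrative choice
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

LAYER = 20  # hypothetical layer whose residual-stream activations we probe

def mean_hidden(prompts):
    """Mean activation at LAYER over the last token of each prompt."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[LAYER][0, -1].float())
    return torch.stack(acts).mean(dim=0)

# Placeholder probe sets: prompts the model refuses vs. comparable benign ones.
refused = ["<prompt the model refuses>", "<another refused prompt>"]
benign  = ["<comparable benign prompt>", "<another benign prompt>"]

# The "refusal direction" is the normalized difference of mean activations.
direction = mean_hidden(refused) - mean_hidden(benign)
direction = direction / direction.norm()

# Orthogonalize: W <- W - d d^T W, here only for the MLP down-projections
# (a fuller version would also treat the attention output projections).
with torch.no_grad():
    for layer in model.model.layers:
        W = layer.mlp.down_proj.weight           # shape: (hidden, intermediate)
        proj = torch.outer(direction, direction) @ W.float()
        W -= proj.to(W.dtype)
```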

lucyknada changed discussion status to closed
Anthracite org

We'll possibly revisit these models in the future; for now, after many attempts, the 32B came out significantly worse than the other models in that range.

Feels like this version is particularly good!

Unfortunately the quality drops off steeply around 8k context, since it loops heavily after that... but before that it's truly phenomenal.

Anthracite org

Glad you like it! We trained it at 8k because Gemma itself is also only 8k.

I just tried the new 32b model aya-expanse from C4AI. For my specific use cases it actually performed significantly better than Qwen 32b, so I thought I'd mention it here. Maybe it's a better fit than Qwen; it seemed to me to be less aligned/censored than Qwen2.5, and that could be at least one of the reasons why Qwen 32b finetuning doesn't produce good enough results compared to other architectures of similar size.
