Interesting methods and results
It's really interesting to watch this model family evolve. You got a strong result with Wernicke for all but ifeval, which is inching upward on the subsequent merges! How to get strong ifeval without compromising other metrics? I've been working to merge Wernicke with select layers of tanliboy/lambda-qwen2.5-14b-dpo-test. Feel free to use any of that which helps you!
I've been trying to use evolution merging, but with datasets that are not used in the official leaderboard benchmarks, so as to uphold the model's integrity.
One approach, I guess, would be to find datasets that test for similar things to the ifeval benchmark and tell it to focus on improving its score on those!
Haven't really tried that yet though.
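For what it's worth, that idea maps fairly directly onto a mergekit-evolve config: score each candidate genome against an ifeval-style task that isn't on the leaderboard. A rough sketch, assuming the usual genome/tasks schema from mergekit-evolve; `ifeval_proxy` is a placeholder for whatever custom lm-eval-harness task you'd register, and the model list is just an example:

```yaml
# Hypothetical mergekit-evolve config: optimize against an ifeval-like
# proxy task instead of the official leaderboard benchmark.
genome:
  models:
    - CultriX/Qwen2.5-14B-Wernicke
    - v000000/Qwen2.5-Lumen-14B
  merge_method: dare_ties
  base_model: Qwen/Qwen2.5-14B
  layer_granularity: 8        # evolve weights in 8-layer chunks
tasks:
  - name: ifeval_proxy        # hypothetical custom lm-eval task
    weight: 1.0
```

The proxy task keeps the leaderboard's actual test set out of the optimization loop, which is the integrity concern above.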
Edit: by the way so far the best performer by far has been the first attempt: CultriX/SeQwence-14Bv1
Interesting! CultriX/SeQwence-14Bv1's ifeval performance likely comes from a fresh injection of v000000/Qwen2.5-Lumen-14B. That's one of three models I favor for the 14B ifeval. The other two? tanliboy/lambda-qwen2.5-14b-dpo-test trained on ultrafeedback_binarized (same as Lumen!) and sthenno-com/miscii-14b-1028 trained on HelpSteer2.
If overfitting for the ifeval benchmark is a top concern, I can see good reason to choose Lumen, because it starts with a merge from so many models.
You've hit some great high notes with your dare_ties evolutionary approach. If I'm right in my ideas about how to capture and merge the best features, there should be some benefit to using AgoraMix's recipe, but sticking to Lumen and CultriX/SeQwence-14Bv1 to get ifeval, and CultriX/SeQwence-14B-EvolMerge + Wernicke for reasoning. I'll get started.
Lamarck-14B-v0.1-experimental, AgoraMix's recipe used to merge Lumen and your models, has passed its initial checks. See what you think!
I've some more experiments I want to try with Lamarck while restricting it to models you're using, to assess the value of the DELLA, SLERP, and other merge methods to come.
I'll be keeping an eye on the results you get with those! Great work so far! :)
Just an idea I came up with if you want to try it out:

```yaml
# Final Hybrid Model: Lamarck-14B (Balanced)
name: lamarck-14b-hybrid
merge_method: della
base_model: merges/lamarck-14b-base
tokenizer_source: Qwen/Qwen2.5-14B-Instruct
parameters:
  int8_mask: false
  normalize: true
  rescale: false
  density: 0.40
  weight: 0.60
  epsilon: 0.08
  lambda: 0.92
models:
  - model: merges/lamarck-14b-if-della
    parameters:
      density: 0.60
      weight: 0.80
  - model: merges/lamarck-14b-reason-della
    parameters:
      density: 0.70
      weight: 1.00
dtype: bfloat16
out_dtype: bfloat16
```

```yaml
# Base Model Preparation: Lamarck-14B Base (Hybrid of Qwen Variants)
name: lamarck-14b-base
merge_method: slerp
base_model: Qwen/Qwen2.5-14B
tokenizer_source: base
parameters:
  t: [ 0.00, 0.40, 0.60, 0.80, 0.90 ]
slices:
  - sources:
      - model: Qwen/Qwen2.5-14B-Instruct
        layer_range: [ 0, 8 ]
      - model: Qwen/Qwen2.5-14B
        layer_range: [ 0, 8 ]
    parameters:
      t: [ 0.40 ]
  - sources:
      - model: Qwen/Qwen2.5-14B-Instruct
        layer_range: [ 8, 16 ]
      - model: Qwen/Qwen2.5-14B
        layer_range: [ 8, 16 ]
    parameters:
      t: [ 0.60 ]
  - sources:
      - model: Qwen/Qwen2.5-14B
        layer_range: [ 16, 24 ]
      - model: Qwen/Qwen2.5-14B-Instruct
        layer_range: [ 16, 24 ]
    parameters:
      t: [ 0.80 ]
  - sources:
      - model: Qwen/Qwen2.5-14B
        layer_range: [ 24, 32 ]
      - model: Qwen/Qwen2.5-14B-Instruct
        layer_range: [ 24, 32 ]
    parameters:
      t: [ 0.90 ]
dtype: bfloat16
out_dtype: bfloat16
```

```yaml
# Instruction Following Module: Lamarck-14B-IF
name: lamarck-14b-if-della
merge_method: della
base_model: merges/lamarck-14b-base
tokenizer_source: base
parameters:
  int8_mask: false
  normalize: true
  rescale: false
  density: 0.30
  weight: 0.50
  epsilon: 0.09
  lambda: 0.95
models:
  - model: CultriX/SeQwence-14Bv1
    parameters:
      density: 0.80
      weight: 1.00
  - model: CultriX/SeQwence-14B-v5
    parameters:
      density: 0.50
      weight: [ 0.20, 0.40, 0.50, 0.60, 0.70 ]
dtype: bfloat16
out_dtype: bfloat16
```

```yaml
# Reasoning Module: Lamarck-14B-Reason
name: lamarck-14b-reason-della
merge_method: della
base_model: merges/lamarck-14b-base
tokenizer_source: base
parameters:
  int8_mask: false
  normalize: true
  rescale: false
  density: 0.30
  weight: 0.50
  epsilon: 0.08
  lambda: 0.92
models:
  - model: CultriX/Qwen2.5-14B-Wernicke
    parameters:
      density: 0.90
      weight: 1.00
  - model: CultriX/SeQwence-14B-EvolMerge
    parameters:
      density: 0.70
      weight: 0.80
dtype: bfloat16
out_dtype: bfloat16
```

```yaml
# Final Refinement: Lamarck-14B-Finalize
name: lamarck-14b-finalize
merge_method: ties
base_model: merges/lamarck-14b-hybrid
tokenizer_source: Qwen/Qwen2.5-14B-Instruct
parameters:
  int8_mask: false
  normalize: true
  rescale: false
  density: 1.00
  weight: 1.00
models:
  - model: merges/lamarck-14b-hybrid
dtype: bfloat16
out_dtype: bfloat16
```
Interesting! I hadn't thought to try a merge for the base model. I have to test it, to see if it behaves as I'd expect with re-instructs at the TIES merge in the end. After all, I want to try Coder merges later on.
I had tried using DELLA for the merge of the IF and reason modules, but DELLA can be less forgiving than SLERP with gradients on weight and density. I do want to re-emphasize high-ranked weights, though. Very interesting feedback, thank you!
I don't know if it could help your research, but I ran a ton of benchmarks on several of the models to find their strengths and weaknesses, and you can find those here:
https://privatebin.net/?fc8f0f093a7fadc3#6EAp57vYDeKjeFS6BVX7pZmj3vqETMnDvg27tPwqn4hj
password is: benchmarks-hf
Got it. This is fascinating! I'm gratified by your inclusion of Lamarck along with its ancestors and other models of yours - and wow. Lamarck is closest to SeQwence-14B-EvolMerge: versus that, it has a 2.7% boost to Winogrande, and 0.3% drops on MMLU and TruthfulQA. Otherwise, they score identically!
However, you have some interesting models I didn't include in the merge - SeQwence-14B-EvolMergev1 has a 0.6% advantage over Lamarck on Winogrande, which is Lamarck's strongest gain, and only loses to Lamarck very slightly on MMLU and TruthfulQA.
I have to hand it to you, CultriX/SeQwence-14Bv1 is a fine model. Lamarck's approach didn't capture its top-row performance on Hellaswag or Wernicke's on Arc - but overall, this is gratifying, and there's plenty of takeaway. I hope it's helped you!
This is looking better the more I look at it. Lamarck lands very near EvolMerge except in the one area in which it beats all evolution-merged ancestors. That almost has to be because it successfully captured Lumen's advantages without regressions. I believe that advantage can be retained while tuning other parts of the model, and feeding back into the tree.
If you wish to retain this, I suggest splitting the first 2-4 layers from the evolutionary process, while continuing to evolve the others.
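A passthrough slice merge would be one way to pin those early layers. A minimal sketch, assuming Qwen2.5-14B's 48 decoder layers; `merges/evolved-candidate` is a hypothetical path standing in for the output of each evolution step:

```yaml
# Hypothetical sketch: freeze the first 4 layers from the current best model
# and let evolution act only on the remaining layers.
merge_method: passthrough
slices:
  - sources:
      - model: sometimesanotion/Lamarck-14B-v0.1-experimental
        layer_range: [ 0, 4 ]    # pinned layers, excluded from evolution
  - sources:
      - model: merges/evolved-candidate    # hypothetical evolved checkpoint
        layer_range: [ 4, 48 ]   # Qwen2.5-14B has 48 decoder layers
dtype: bfloat16
```

Running this after each evolution step would keep the layers carrying Lumen's advantage fixed while the rest of the stack keeps evolving.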
Based on these encouraging results, I have started another round of Lamarck's merge process, with these selections:
IF module: Lamarck-14B-v0.1-experimental dominant, SeQwence-14B-EvolMergev1 background
Reason module: SeQwence-14Bv1 dominant, SeQwence-14B-EvolMergev1 background
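Concretely, the IF module could reuse the DELLA recipe from earlier in the thread. A rough sketch; the name, density, and weight values here are illustrative placeholders, not the settings I'll actually run:

```yaml
# Hypothetical next-round IF module: Lamarck dominant, EvolMergev1 background
name: lamarck-14b-if-della-v2    # hypothetical name
merge_method: della
base_model: merges/lamarck-14b-base
tokenizer_source: base
models:
  - model: sometimesanotion/Lamarck-14B-v0.1-experimental
    parameters:
      density: 0.80    # dominant contributor
      weight: 1.00
  - model: CultriX/SeQwence-14B-EvolMergev1
    parameters:
      density: 0.50    # background contributor
      weight: 0.40
dtype: bfloat16
```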
Next up, finding a way to get Wernicke's exceptional Arc and GPQA merged with its peers.
You are starting to talk in ways that are quite hard for my amateur/hobbyist brain to follow now (I don't actually do anything in the field of AI or ML haha, also I suck at math 😉), but wouldn't it be an idea to just try and run a not-too-invasive finetune after the merging?
So instead of just trying to squeeze out the absolute highest benchmark scores purely by merging (which reeks of overfitting on the benchmark tasks instead of actually becoming a way better model):
- Create a very strong merge (like yours or my seqwencev1)
- Run some RLHF finetuning on it, for example some light human preference finetuning (for example with axolotl or llamafactory)
- Use DPO or ORPO to create a LoRA adapter that you can then easily benchmark: if it improved your desired scores, merge it back into the base model; if not, try again (mess around with LoRA alpha, LoRA rank, dataset size and learning rate to see what works without deteriorating the strong points of the base model). You could even freeze certain strong layers, or target only layers that are relatively "weak" in that area
- (Optionally) after merging the LoRA adapter back into the base model, run evolution merging again, making sure that the benchmarks the model did well on before finetuning are still tested for, but to a lesser degree than your desired new improvements.
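The middle steps above can be sketched as an axolotl-style config. This is only a rough sketch assuming axolotl's usual key names; the base model choice, the `your-org/preference-data` dataset path, and all hyperparameters are hypothetical starting points, not a tested recipe:

```yaml
# Hypothetical axolotl config: light DPO pass producing a LoRA adapter
base_model: CultriX/SeQwence-14Bv1
rl: dpo                      # preference finetuning via DPO
datasets:
  - path: your-org/preference-data    # hypothetical preference dataset
    split: train
adapter: lora
lora_r: 16                   # small rank to keep the finetune "not-too-invasive"
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
sequence_len: 2048
micro_batch_size: 1
gradient_accumulation_steps: 8
num_epochs: 1
learning_rate: 5.0e-6        # conservative LR to preserve merge strengths
output_dir: ./dpo-lora-out
```

The resulting adapter can be benchmarked on its own, then merged back into the base model only if the desired scores improved.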
With my very limited knowledge, this seems like an idea that is feasible for a normal person (compute- and time-wise), largely automatable, and pretty fast to tell whether results are promising or not.
I mean, yeah, it's more work than simply merging, but I feel it's within reach of what's doable, and it might actually be easier than looking for the perfect configuration for the perfect merge. Besides, no matter how perfect the merge, it might just need some additional data to truly improve.
I'm a software dev but also a hobbyist in this field, no worries! Quite right that merging only goes so far, and finetuning an adapter is the way to go after merging plateaus. It takes a lot more compute, though, and I think we've got some things left to try yet.
Time to see how Arcee's new Virtuoso Small does as part of the merge!
This is proving to be an interesting merge: sometimesanotion/Lamarck-14B-v0.2-experimental