Interesting methods and results
It's really interesting to watch this model family evolve. You got a strong result with Wernicke for all but ifeval, which is inching upward on the subsequent merges! How to get strong ifeval without compromising other metrics? I've been working to merge Wernicke with select layers of tanliboy/lambda-qwen2.5-14b-dpo-test. Feel free to use any of that which helps you!
I've been trying to use evolution merging, but with datasets that are not used in the official leaderboard benchmarks, so as to uphold the model's integrity.
One approach, I guess, would be to find datasets that test for similar things to the ifeval benchmark and tell it to focus on improving its score on those!
Haven't really tried that yet though.
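For what it's worth, that idea maps fairly directly onto a mergekit-evolve config: score each candidate genome against an ifeval-style task that isn't on the leaderboard. A rough sketch, assuming the usual genome/tasks schema from mergekit-evolve; `ifeval_proxy` is a placeholder for whatever custom lm-eval-harness task you'd register, and the model list is just an example:

```yaml
# Hypothetical mergekit-evolve config: optimize against an ifeval-like
# proxy task instead of the official leaderboard benchmark.
genome:
  models:
    - CultriX/Qwen2.5-14B-Wernicke
    - v000000/Qwen2.5-Lumen-14B
  merge_method: dare_ties
  base_model: Qwen/Qwen2.5-14B
  layer_granularity: 8        # evolve weights in 8-layer chunks
tasks:
  - name: ifeval_proxy        # hypothetical custom lm-eval task
    weight: 1.0
```

The proxy task keeps the leaderboard's actual test set out of the optimization loop, which is the integrity concern above.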
Edit: by the way so far the best performer by far has been the first attempt: CultriX/SeQwence-14Bv1
Interesting! CultriX/SeQwence-14Bv1's ifeval performance likely comes from a fresh injection of v000000/Qwen2.5-Lumen-14B. That's one of three models I favor for the 14B ifeval. The other two? tanliboy/lambda-qwen2.5-14b-dpo-test trained on ultrafeedback_binarized (same as Lumen!) and sthenno-com/miscii-14b-1028 trained on HelpSteer2.
If overfitting for the ifeval benchmark is a top concern, I can see good reason to choose Lumen, because it starts with a merge from so many models.
You've hit some great high notes with your dare_ties evolutionary approach. If I'm right in my ideas about how to capture and merge the best features, there should be some benefit to using AgoraMix's recipe, but sticking to Lumen and CultriX/SeQwence-14Bv1 to get ifeval, and CultriX/SeQwence-14B-EvolMerge + Wernicke for reasoning. I'll get started.
Lamarck-14B-v0.1-experimental, AgoraMix's recipe used to merge Lumen and your models, has passed its initial checks. See what you think!
I've some more experiments I want to try with Lamarck while restricting it to models you're using, to assess the value of the DELLA, SLERP, and other merge methods to come.
I'll be keeping an eye on the results you get with those! Great work so far! :)
Just an idea I came up with if you want to try it out:

```yaml
# Final Hybrid Model: Lamarck-14B (Balanced)
name: lamarck-14b-hybrid
merge_method: della
base_model: merges/lamarck-14b-base
tokenizer_source: Qwen/Qwen2.5-14B-Instruct
parameters:
  int8_mask: false
  normalize: true
  rescale: false
  density: 0.40
  weight: 0.60
  epsilon: 0.08
  lambda: 0.92
models:
  - model: merges/lamarck-14b-if-della
    parameters:
      density: 0.60
      weight: 0.80
  - model: merges/lamarck-14b-reason-della
    parameters:
      density: 0.70
      weight: 1.00
dtype: bfloat16
out_dtype: bfloat16
```

```yaml
# Base Model Preparation: Lamarck-14B Base (Hybrid of Qwen Variants)
name: lamarck-14b-base
merge_method: slerp
base_model: Qwen/Qwen2.5-14B
tokenizer_source: base
parameters:
  t: [ 0.00, 0.40, 0.60, 0.80, 0.90 ]
slices:
  - sources:
      - model: Qwen/Qwen2.5-14B-Instruct
        layer_range: [ 0, 8 ]
      - model: Qwen/Qwen2.5-14B
        layer_range: [ 0, 8 ]
    parameters:
      t: [ 0.40 ]
  - sources:
      - model: Qwen/Qwen2.5-14B-Instruct
        layer_range: [ 8, 16 ]
      - model: Qwen/Qwen2.5-14B
        layer_range: [ 8, 16 ]
    parameters:
      t: [ 0.60 ]
  - sources:
      - model: Qwen/Qwen2.5-14B
        layer_range: [ 16, 24 ]
      - model: Qwen/Qwen2.5-14B-Instruct
        layer_range: [ 16, 24 ]
    parameters:
      t: [ 0.80 ]
  - sources:
      - model: Qwen/Qwen2.5-14B
        layer_range: [ 24, 32 ]
      - model: Qwen/Qwen2.5-14B-Instruct
        layer_range: [ 24, 32 ]
    parameters:
      t: [ 0.90 ]
dtype: bfloat16
out_dtype: bfloat16
```

```yaml
# Instruction Following Module: Lamarck-14B-IF
name: lamarck-14b-if-della
merge_method: della
base_model: merges/lamarck-14b-base
tokenizer_source: base
parameters:
  int8_mask: false
  normalize: true
  rescale: false
  density: 0.30
  weight: 0.50
  epsilon: 0.09
  lambda: 0.95
models:
  - model: CultriX/SeQwence-14Bv1
    parameters:
      density: 0.80
      weight: 1.00
  - model: CultriX/SeQwence-14B-v5
    parameters:
      density: 0.50
      weight: [ 0.20, 0.40, 0.50, 0.60, 0.70 ]
dtype: bfloat16
out_dtype: bfloat16
```

```yaml
# Reasoning Module: Lamarck-14B-Reason
name: lamarck-14b-reason-della
merge_method: della
base_model: merges/lamarck-14b-base
tokenizer_source: base
parameters:
  int8_mask: false
  normalize: true
  rescale: false
  density: 0.30
  weight: 0.50
  epsilon: 0.08
  lambda: 0.92
models:
  - model: CultriX/Qwen2.5-14B-Wernicke
    parameters:
      density: 0.90
      weight: 1.00
  - model: CultriX/SeQwence-14B-EvolMerge
    parameters:
      density: 0.70
      weight: 0.80
dtype: bfloat16
out_dtype: bfloat16
```

```yaml
# Final Refinement: Lamarck-14B-Finalize
name: lamarck-14b-finalize
merge_method: ties
base_model: merges/lamarck-14b-hybrid
tokenizer_source: Qwen/Qwen2.5-14B-Instruct
parameters:
  int8_mask: false
  normalize: true
  rescale: false
  density: 1.00
  weight: 1.00
models:
  - model: merges/lamarck-14b-hybrid
dtype: bfloat16
out_dtype: bfloat16
```
Interesting! I hadn't thought to try a merge for the base model. I have to test it, to see if it behaves as I'd expect with re-instructs at the TIES merge in the end. After all, I want to try Coder merges later on.
I had tried using DELLA for the merge of the IF and reason modules, but DELLA can be less forgiving than SLERP with gradients on weight and density. I do want to re-emphasize high-ranked weights, though. Very interesting feedback, thank you!
I don't know if it could help your research, but I ran a ton of benchmarks on several of the models to find their strengths and weaknesses, and you can find those here:
https://privatebin.net/?fc8f0f093a7fadc3#6EAp57vYDeKjeFS6BVX7pZmj3vqETMnDvg27tPwqn4hj
password is: benchmarks-hf
Got it. This is fascinating! I'm gratified by your inclusion of Lamarck along with its ancestors and other models of yours - and wow. Lamarck is closest to SeQwence-14B-EvolMerge: versus that, it has a 2.7% boost to Winogrande, and 0.3% drops on MMLU and TruthfulQA. Otherwise, they score identically!
However, you have some interesting models I didn't include in the merge - SeQwence-14B-EvolMergev1 has a 0.6% advantage over Lamarck on Winogrande, which is Lamarck's strongest gain, and only loses to Lamarck very slightly on MMLU and TruthfulQA.
I have to hand it to you, CultriX/SeQwence-14Bv1 is a fine model. Lamarck's approach didn't capture its top-row performance on Hellaswag or Wernicke's on Arc - but overall, this is gratifying, and there's plenty of takeaway. I hope it's helped you!
This is looking better the more I look at it. Lamarck lands very near EvolMerge except in the one area in which it beats all evolution-merged ancestors. That almost has to be because it successfully captured Lumen's advantages without regressions. I believe that advantage can be retained while tuning other parts of the model, and feeding back into the tree.
If you wish to retain this, I suggest splitting the first 2-4 layers from the evolutionary process, while continuing to evolve the others.
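A passthrough slice merge would be one way to pin those early layers. A minimal sketch, assuming Qwen2.5-14B's 48 decoder layers; `merges/evolved-candidate` is a hypothetical path standing in for the output of each evolution step:

```yaml
# Hypothetical sketch: freeze the first 4 layers from the current best model
# and let evolution act only on the remaining layers.
merge_method: passthrough
slices:
  - sources:
      - model: sometimesanotion/Lamarck-14B-v0.1-experimental
        layer_range: [ 0, 4 ]    # pinned layers, excluded from evolution
  - sources:
      - model: merges/evolved-candidate    # hypothetical evolved checkpoint
        layer_range: [ 4, 48 ]   # Qwen2.5-14B has 48 decoder layers
dtype: bfloat16
```

Running this after each evolution step would keep the layers carrying Lumen's advantage fixed while the rest of the stack keeps evolving.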
Based on these encouraging results, I have started another round of Lamarck's merge process, with these selections:
IF module: Lamarck-14B-v0.1-experimental dominant, SeQwence-14B-EvolMergev1 background
Reason module: SeQwence-14Bv1 dominant, SeQwence-14B-EvolMergev1 background
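Concretely, the IF module could reuse the DELLA recipe from earlier in the thread. A rough sketch; the name, density, and weight values here are illustrative placeholders, not the settings I'll actually run:

```yaml
# Hypothetical next-round IF module: Lamarck dominant, EvolMergev1 background
name: lamarck-14b-if-della-v2    # hypothetical name
merge_method: della
base_model: merges/lamarck-14b-base
tokenizer_source: base
models:
  - model: sometimesanotion/Lamarck-14B-v0.1-experimental
    parameters:
      density: 0.80    # dominant contributor
      weight: 1.00
  - model: CultriX/SeQwence-14B-EvolMergev1
    parameters:
      density: 0.50    # background contributor
      weight: 0.40
dtype: bfloat16
```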
Next up, finding a way to get Wernicke's exceptional Arc and GPQA merged with its peers.
You are starting to talk in ways that are quite hard for my amateur/hobbyist brain to follow now (I don't actually do anything in the field of AI or ML haha, also I suck at math 😉), but wouldn't it be an idea to just try and run a not-too-invasive finetune after the merging?
So instead of just trying to squeeze out the absolute highest benchmark scores purely by merging (which reeks of overfitting on the benchmark tasks instead of actually becoming a way better model):
- Create a very strong merge (like yours or my seqwencev1)
- Run some RLHF finetuning on it, for example some light human preference finetuning (for example with axolotl or llamafactory)
- Use DPO or ORPO to create a LoRA adapter that you can then easily benchmark: if it improved your desired scores, merge it back into the base model; if not, try again (mess around with LoRA alpha, LoRA rank, dataset size and learning rate to see what works without deteriorating the strong points of the base model). You could even freeze certain strong layers, or target only layers that are relatively "weak" in that area
- (Optionally) after merging the LoRA adapter back into the base model, run evolution merging again, making sure that the benchmarks the model did well on before finetuning are still tested for, but to a lesser degree than your desired new improvements.
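The middle steps above can be sketched as an axolotl-style config. This is only a rough sketch assuming axolotl's usual key names; the base model choice, the `your-org/preference-data` dataset path, and all hyperparameters are hypothetical starting points, not a tested recipe:

```yaml
# Hypothetical axolotl config: light DPO pass producing a LoRA adapter
base_model: CultriX/SeQwence-14Bv1
rl: dpo                      # preference finetuning via DPO
datasets:
  - path: your-org/preference-data    # hypothetical preference dataset
    split: train
adapter: lora
lora_r: 16                   # small rank to keep the finetune "not-too-invasive"
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
sequence_len: 2048
micro_batch_size: 1
gradient_accumulation_steps: 8
num_epochs: 1
learning_rate: 5.0e-6        # conservative LR to preserve merge strengths
output_dir: ./dpo-lora-out
```

The resulting adapter can be benchmarked on its own, then merged back into the base model only if the desired scores improved.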
With my very limited knowledge, this seems like an idea that is feasible for a normal person (compute- and time-wise), largely automatable, and pretty fast to tell whether results are promising or not.
I mean, yeah, it's more work than simply merging, but I feel it's within reach of what's doable, and it might actually be easier than looking for the perfect configuration for the perfect merge. Besides, no matter how perfect the merge, it might just need some additional data to truly improve.
I'm a software dev but also a hobbyist in this field, no worries! Quite right that merging only goes so far, and finetuning an adapter is the way to go after merging plateaus. It takes a lot more compute, though, and I think we've got some things left to try yet.
Time to see how Arcee's new Virtuoso Small does as part of the merge!
This is proving to be an interesting merge: sometimesanotion/Lamarck-14B-v0.2-experimental