Extra SLERP parameters

#1 opened by sometimesanotion

These are interesting SLERP directives you've used! I've tried your recipe with minor tweaks at sometimesanotion/Qwen2.5-14B-MinusLike-Slerp-Experimental, using Arcee's mergekit-gui space. Any guesses how these SLERP merges will score?
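
For readers new to the method, this is roughly the spherical interpolation a SLERP merge applies tensor by tensor - a minimal numpy sketch of the math, not mergekit's actual implementation; the flatten-and-normalize step and the linear fallback are simplifications.

```python
import numpy as np

def slerp(t: float, a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Spherical linear interpolation between two weight tensors.

    t=0 returns a, t=1 returns b; intermediate values follow the great-circle
    arc between the two tensors (flattened and normalized for the angle).
    """
    a_flat, b_flat = a.ravel(), b.ravel()
    a_unit = a_flat / (np.linalg.norm(a_flat) + eps)
    b_unit = b_flat / (np.linalg.norm(b_flat) + eps)

    theta = np.arccos(np.clip(np.dot(a_unit, b_unit), -1.0, 1.0))
    if np.sin(theta) < eps:
        # Nearly colinear tensors: fall back to plain linear interpolation.
        return (1.0 - t) * a + t * b

    w_a = np.sin((1.0 - t) * theta) / np.sin(theta)
    w_b = np.sin(t * theta) / np.sin(theta)
    return (w_a * a_flat + w_b * b_flat).reshape(a.shape)

# Toy usage: blend two random "layers" three-quarters of the way toward b.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
print(slerp(0.75, a, b).shape)
```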

I must say that the new techniques used in these projects are truly impressive to me; I have only recently learned about them. In some of my new experimental projects, like tempesthenno-nuslerp-001, I've drawn significant inspiration from your remarkable project Lamarck-14B-v0.6, and I believe @bamec66557 's Qwen-2.5-14B-MINUS will also serve as a role model for my learning in the next steps.

However, I have some personal concerns. In an era where computational costs are consistently decreasing, can we push the boundaries even further? While @arcee-ai's research and work are highly valuable references, I'm concerned their approach may eventually reach an optimization limit (regardless of evaluation methods) in terms of "real performance" - perhaps we're already approaching that edge, at least for 14B models. So what will our next direction for advancement be - Reinforcement Learning, or simply expanding model size? (Personally, I don't think the latter is a reliable approach, since once the size has been increased, we likely have no way to scale it back down.)

[UPDATE] 2024-01-22, perhaps I need to @sometimesanotion :)

Thank you for nudging me! This question got a bit more interesting these last few days. Sthenno, I believe your models had a positive integrating effect on my project, particularly for IFEVAL and BBH where your focus lies.

I am curious about these novel SLERP parameters, but they are not yet part of my methods. SLERP's place in my process is to find the right mix between stable DELLA merges from the base model and aggressive breadcrumbs branches that modify a model_stock (which you see as Qwenvergence). Yes, breadcrumbs. It's a risky but rewarding approach, to be used sparingly and targeted at the right layers once model_stock has already done most of the work. I take guidance from papers on LoRAs showing that rank-512 extracts capture around 30% of the model yet achieve nearly 90% of the source model's results - these final touches can be light.
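
For anyone curious what "breadcrumbs" means in practice, here is a toy illustration of the idea as I understand it from the Model Breadcrumbs approach - not the mergekit implementation, and the drop fractions are illustrative numbers only: per tensor, trim both the largest-magnitude and the near-zero entries of the finetune's delta before adding a scaled remainder back to the base.

```python
import numpy as np

def breadcrumbs_delta(base: np.ndarray, tuned: np.ndarray,
                      drop_top: float = 0.01, drop_bottom: float = 0.90,
                      scale: float = 1.0) -> np.ndarray:
    """Keep only the mid-magnitude band of a finetune's delta (illustrative).

    drop_top trims outlier updates, drop_bottom trims near-zero noise; what
    remains is a sparse "trail of breadcrumbs" added back onto the base.
    """
    delta = tuned - base
    mag = np.abs(delta)
    lo = np.quantile(mag, drop_bottom)        # below this: treated as noise
    hi = np.quantile(mag, 1.0 - drop_top)     # above this: treated as an outlier
    mask = (mag >= lo) & (mag <= hi)
    return base + scale * delta * mask

# Toy usage on a single tensor; in practice this runs per layer across a model,
# and only on the layers where the aggressive branch is actually wanted.
rng = np.random.default_rng(1)
base = rng.normal(size=(8, 8))
tuned = base + 0.05 * rng.normal(size=(8, 8))
print(breadcrumbs_delta(base, tuned).shape)
```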

These methods have more room for success when you have varied finetunes with different specializations to draw the right mix from. Krystalan/DRT-o1-14B definitely made Lamarck's translations consistently better.

I think you're right, @sthenno - we are nearing limits. However, even before DeepSeek R1 arrived, I was seeing signs that we've got headroom for more MATH and MUSR without much tradeoff, and that those two are synergistic. Look up sometimesanotion/Qwenvergence-14B-v9 in comparator to see what I mean. DeepSeek R1 just underlines that point massively.

I think 14B has terrific bang for the buck, and I'd sooner have an agentic mixture of models - a few small specialists along with a good 14B - than the sum of their parameters in a single model.

DeepSeek-R1 appears to have used a large amount of Chain-of-Thought reasoning and constant self-checking to improve its accuracy on the MATH dataset evaluation.
When I did an initial comparison between gemini-exp-1206 and DeepSeek R1 using high-difficulty math problems from the AIME dataset for local testing, gemini-exp-1206 achieved higher accuracy despite having response lengths only about 1/5 of DeepSeek-R1's.
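
Concretely, that local comparison boils down to something like this minimal sketch; the JSONL file names and the exact-match answer check are placeholders rather than the real grading setup, which needs proper answer normalization for AIME.

```python
import json

def summarize(results_path: str) -> tuple[float, float]:
    """Return (accuracy, mean response length in characters) for one model.

    Each JSONL line is assumed to look like:
    {"predicted": "...", "reference": "...", "response": "..."}
    """
    correct, total, chars = 0, 0, 0
    with open(results_path, encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            total += 1
            chars += len(row["response"])
            # Placeholder check; real AIME grading needs answer normalization.
            correct += row["predicted"].strip() == row["reference"].strip()
    return correct / total, chars / total

for name, path in [("gemini-exp-1206", "gemini_aime.jsonl"),
                   ("DeepSeek-R1", "r1_aime.jsonl")]:
    acc, avg_len = summarize(path)
    print(f"{name}: accuracy={acc:.2%}, mean response length={avg_len:.0f} chars")
```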

Interestingly, when I used gemini-exp-1206 to analyze the shortcomings of DeepSeek-R1-Distilled-14b in solving math problems, it seemed to identify the core issue:

The key to problem-solving lies in understanding the essence of the problem and its mathematical principles, rather than blindly pursuing complex computational processes. If one focuses too much on numerical calculations while ignoring the underlying mathematical principles, it's like "blind men touching an elephant" - only seeing the surface without grasping the whole picture. When encountering difficulties in problem-solving, one should return to the root of the problem and seek elegant, concise solutions rather than getting lost in tedious calculations.

So, returning to our questions, what do you think about the current 14B models:

  1. Does the model's depth (48 hidden layers) limit its semantic understanding capabilities? (A quick config check follows this list.)
  2. If so, in which range do you think the optimal parameter count should fall? (e.g., [14, 32] or [32, 78] billion)
  3. Based on the gap in MMLU-PRO evaluation scores relative to model parameter size, do you believe that gap is caused by (1)?
  4. Should we collaborate to build a new model architecture and maintain it going forward?
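
For reference on question 1, the depth and width numbers can be read straight from the published config - a minimal check, assuming the transformers library is installed and Qwen/Qwen2.5-14B is the base in question:

```python
from transformers import AutoConfig

# Read the published architecture numbers rather than quoting them from memory.
cfg = AutoConfig.from_pretrained("Qwen/Qwen2.5-14B")

print("hidden layers:", cfg.num_hidden_layers)          # transformer depth
print("hidden size:", cfg.hidden_size)                  # model width
print("intermediate (MLP) size:", cfg.intermediate_size)
print("attention heads:", cfg.num_attention_heads)
```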

Mmm, these are good questions that will have different answers based on goals. My goal is to have just enough AI, paired with tools and RAG/CAG, to produce classical, QA'ed code and data for a reasoning generalist capability grounded in hard data. You are ambitious in the realms of rich persona and language understanding! Is this why your models score higher than mine in some areas? :)

So, we can start from one clear overlap: what simple things can we do to improve, while leaving bigger options open? Here's a simple idea: how about looking at the merge Sorawiz/Enricha-14B-B-Test, between Lamarck and one of the strongest prose models, in 32-bit? There will be differences. I'd be curious to see whether an imatrix quant should keep more tensors in F32 for prose (I sketch the quant experiment I have in mind after the numbered replies). So:

  1. Likely yes; I say this as a very engaged layman and software engineer, not as an LLM architect. The leaderboard shows a clear gradient of top performers by model size and depth, both for MMLU and GPQA. Past a point, these are very difficult scores to improve.

  2. Because I value efficiency and want to make the most of limited local compute, I've chosen 14B as the biggest model size in a mixture-of-agents framework, with other small specialists trained by others or merged/upcycled from other work. Even if we promote those specialized models to a size equal to the generalist, I think 14B is still a sweet spot for distilling specialists. Does that further your goal enough? If not, 32B is clearly a strong tier of model, though I haven't really had the resources at hand to investigate it.

  3. Definitely. I believe backward-pass and other techniques can buy us a bit more runway, though those are on my far horizon and I can't speak to them yet. If your goal is translation and language comprehension, though, don't miss https://arxiv.org/abs/2412.17498 - I really think the combination of DRT and medius-erebus-magnum can produce creative, if not always precise or verifiable, output. Recommended if you want rich games or fiction!

  4. Let's ponder goals. For the moment I follow a "just enough" approach with an eye toward training and distilling very small models; if 32B parameters and above for utmost quality in a single model becomes a goal, a phase of improving small models will help us rapidly iterate and refine the tools and methods for bigger projects later on. I wish Mixture-of-Adapters were a more mainstream architectural feature, though my understanding is that it severely restricts quantization. I'm also toying with MoE concepts, chiefly to make the most of upcycled small models.
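
On the MoE side, the routing idea behind upcycling small dense models is compact enough to sketch - a toy, single-token illustration with random "experts", nothing like a production MoE layer or mergekit-moe's output:

```python
import numpy as np

def moe_forward(x: np.ndarray, gate_w: np.ndarray, experts: list, top_k: int = 2) -> np.ndarray:
    """Route one token's hidden state to its top-k experts and mix the outputs.

    x: (hidden,) hidden state; gate_w: (hidden, n_experts) router weights;
    experts: callables, each standing in for a small FFN upcycled from a dense model.
    """
    logits = x @ gate_w
    chosen = np.argsort(logits)[-top_k:]              # indices of the selected experts
    weights = np.exp(logits[chosen] - logits[chosen].max())
    weights /= weights.sum()                          # softmax over the selected experts
    return sum(w * experts[i](x) for w, i in zip(weights, chosen))

# Toy usage: four "experts" that are just random linear maps of the hidden state.
rng = np.random.default_rng(2)
hidden, n_experts = 16, 4
expert_mats = [rng.normal(size=(hidden, hidden)) for _ in range(n_experts)]
experts = [lambda h, m=m: h @ m for m in expert_mats]
gate_w = rng.normal(size=(hidden, n_experts))
print(moe_forward(rng.normal(size=hidden), gate_w, experts).shape)
```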

I'll pause here so we can research and ponder.
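
To make the imatrix idea above concrete, the experiment I have in mind would look roughly like this with llama.cpp; the --imatrix and --leave-output-tensor flags exist in llama-quantize, but the file names are placeholders and I haven't run this exact command, so treat it as a sketch.

```python
import subprocess

# Sketch only: quantize the 32-bit merge with an importance matrix while
# leaving the output tensor at its source precision, then compare prose
# quality against a fully quantized build. File names are placeholders.
subprocess.run([
    "./llama-quantize",
    "--imatrix", "prose-imatrix.dat",   # importance matrix from a prose-heavy corpus
    "--leave-output-tensor",            # keep output.weight un(re)quantized
    "merged-f32.gguf",                  # the 32-bit merge, converted to GGUF
    "merged-q4_k_m.gguf",               # quantized output to compare for prose
    "q4_k_m",
], check=True)
```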

Since you're making a point about clear initial understanding being better than a long chain of thought, differential attention seems to be an important topic to brush up on: https://medium.com/@isaakmwangi2018/intro-to-differential-transformers-a-new-attention-mechanisms-for-large-language-models-llms-9d977b5857ae
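
For anyone skimming that article, the core of differential attention is small enough to sketch. This is a single-head toy with a fixed λ, whereas the paper learns λ and adds per-head normalization - a rough illustration, not the reference implementation:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def differential_attention(x, wq1, wk1, wq2, wk2, wv, lam=0.5):
    """Single-head differential attention, in miniature.

    Two attention maps from two query/key projections are subtracted, which
    cancels common-mode "attention noise" before weighting the values.
    """
    d = wq1.shape[1]
    a1 = softmax((x @ wq1) @ (x @ wk1).T / np.sqrt(d))
    a2 = softmax((x @ wq2) @ (x @ wk2).T / np.sqrt(d))
    return (a1 - lam * a2) @ (x @ wv)

# Toy usage: a 6-token sequence with 16-dim embeddings projected to 8 dims.
rng = np.random.default_rng(3)
seq, dim, d_head = 6, 16, 8
x = rng.normal(size=(seq, dim))

def proj() -> np.ndarray:
    return rng.normal(size=(dim, d_head))

print(differential_attention(x, proj(), proj(), proj(), proj(), proj()).shape)
```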

Hi, @sthenno ! Let me point you to a YouTube video that gave an example of duplicating the first two layers of an SLM to use them for a sort of differential attention. I haven't gotten around to trying this with a Qwen model, but if it works in this case, it might be cost-effective, and it seems to align with your goal of comprehending the core questions.
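
In mergekit terms, I'd expect that trick to look roughly like a passthrough config that repeats the first two decoder layers ahead of the full stack. The model name and 48-layer count below are assumptions for a Qwen2.5-14B-class base, and I haven't verified this exact recipe:

```python
# Sketch of the layer-duplication idea as a mergekit passthrough config.
# The base model and layer count are assumptions; adjust for the actual model.
config = """\
merge_method: passthrough
dtype: bfloat16
slices:
  - sources:
      - model: Qwen/Qwen2.5-14B-Instruct
        layer_range: [0, 2]      # duplicated early layers
  - sources:
      - model: Qwen/Qwen2.5-14B-Instruct
        layer_range: [0, 48]     # the original full stack
"""

with open("dup-first-layers.yaml", "w", encoding="utf-8") as f:
    f.write(config)

# Then, roughly: mergekit-yaml dup-first-layers.yaml ./output-model
```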

@sometimesanotion Got your messages. I'll reply in detail in several days.
