A brief analysis of automerger data, feat. SLERP and DARE-TIES LLM merging

Community Article · Published March 24, 2024

TL;DR: In this post, we take a first look at the data generated by “automerger”, a Hugging Face space created by Maxime Labonne. We first compare the models' scores across four benchmarks (GPT4All, BigBench, AGIEval, and TruthfulQA), and then look at how those scores are distributed under the two merging strategies, SLERP and DARE-TIES.

What is “automerger”?

Automerger randomly samples two models from the top-20 of the Open LLM Leaderboard, samples a merging strategy with probabilities P(SLERP) = 0.4 and P(DARE-TIES) = 0.6, and then merges the models. At the time of writing, automerger uses the same mergekit parameters for every SLERP or DARE-TIES merge. You can learn more about these strategies through this article by Maxime and the mergekit library.
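Roughly speaking, the selection step looks like the sketch below. This is only an illustration of the described behavior, not the actual automerger code, and the model names are placeholders:

```python
import random

# Placeholder names standing in for the current top-20 of the leaderboard.
candidate_models = ["org-a/Foo-7B", "org-b/Bar-7B", "org-c/Baz-7B"]

# Pick two distinct models, then pick the merge strategy with the stated probabilities.
model_1, model_2 = random.sample(candidate_models, k=2)
strategy = random.choices(["slerp", "dare_ties"], weights=[0.4, 0.6], k=1)[0]

print(f"Merging {model_1} and {model_2} with {strategy}")
```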

I'll give some context to the two merging strategies below.

SLERP

Before we discuss SLERP, we can first have a look at its simpler cousin, LERP (linear interpolation).

If we have two vectors, say $x, y \in \mathbb{R}^n$, then perhaps the simplest way to combine them is through LERP: $t \cdot x + (1 - t) \cdot y$ for some $t \in (0, 1)$.

To apply LERP for model merging, we just need the parameter vectors, $\theta_1, \theta_2$, of two models, and then we can compute $\theta = t \cdot \theta_1 + (1 - t) \cdot \theta_2$, where $\theta$ represents the new parameter vector.
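As a minimal sketch, with NumPy arrays standing in for the (flattened) parameter vectors:

```python
import numpy as np

def lerp(theta_1: np.ndarray, theta_2: np.ndarray, t: float) -> np.ndarray:
    """Linear interpolation of two parameter vectors: theta = t * theta_1 + (1 - t) * theta_2."""
    return t * theta_1 + (1 - t) * theta_2

# Toy example with random "parameters".
theta_merged = lerp(np.random.randn(8), np.random.randn(8), t=0.5)
```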

For an exploration of this strategy for model merging, see this work on model soups by Wortsman, Ilharco, Gadre, et al., 2022, as well as task arithmetic by Ilharco, Ribeiro, Wortsman, et al., 2022. Task arithmetic is both a merging strategy and a useful conceptual framework, so I recommend it as reading material.

SLERP interpolates between points by moving along a spherical path (see figure below for an example) and takes into account the angle between the original points. The success of SLERP in benchmarks indicates that there is an advantage to picking a model along spherical paths (see the discussion in Section 2.2.1 of “Arcee’s MergeKit: A Toolkit for Merging Large Language Models”, by Goddard, et al., 2024).

LERP vs SLERP paths between points.

Although we can linearly combine as many models as we want at the same time, only two models can be merged with SLERP. One of the models used in a SLERP-merge is called the “base model”.

SLERP takes as a parameter the interpolation factor $t$, with the expected behavior: $t = 0$ returns the base model, $t = 1$ returns the second model.
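Here is a minimal NumPy sketch of SLERP between two flattened parameter vectors. It is only an illustration; mergekit's implementation works tensor by tensor and handles edge cases more carefully:

```python
import numpy as np

def slerp(theta_base: np.ndarray, theta_other: np.ndarray, t: float, eps: float = 1e-8) -> np.ndarray:
    """Spherical linear interpolation: t=0 returns theta_base, t=1 returns theta_other."""
    # Angle between the two parameter vectors, computed on normalised copies.
    v0 = theta_base / (np.linalg.norm(theta_base) + eps)
    v1 = theta_other / (np.linalg.norm(theta_other) + eps)
    omega = np.arccos(np.clip(np.dot(v0, v1), -1.0, 1.0))
    if np.abs(np.sin(omega)) < eps:
        # Nearly colinear vectors: fall back to plain LERP.
        return (1 - t) * theta_base + t * theta_other
    # Weights follow the spherical path instead of the straight line.
    return (np.sin((1 - t) * omega) * theta_base + np.sin(t * omega) * theta_other) / np.sin(omega)

# Toy example.
theta_merged = slerp(np.random.randn(8), np.random.randn(8), t=0.5)
```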

DARE-TIES

Typically, naive interpolation of models that have different loss trajectories can result in a loss of performance (as discussed in “What is being transferred in transfer learning?”, B. Neyshabur, H. Sedghi, and C. Zhang, 2020). TIES-Merging by Yadav, Tam, Choshen, et al. (2023) takes into account the interference between parameters due to:

  1. the existence of redundant parameters that can drag good parameters of one model away from optimal values and
  2. the existence of opposing signs between parameters that can zero out the values of the corresponding parameters in the merged model.

"TIES" stands for "TRIM, ELECT SIGN & MERGE" and those are the steps taken by the algorithm to control for the above two sources of interference.

The mergekit documentation describes the DARE-TIES algorithm as follows:

In the same vein as TIES, [DARE] sparsifies task vectors to reduce interference. Differs in that DARE uses random pruning with a novel rescaling to better match performance of the original models. DARE can be used either with the sign consensus algorithm of TIES (dare_ties) or without (dare_linear).
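To make the two ingredients a bit more concrete, here is a rough sketch of the DARE step (random pruning of a task vector with rescaling) combined with a TIES-style sign election, operating on flattened NumPy task vectors. This is only meant to convey the idea under simplified assumptions; the mergekit implementation works tensor by tensor and exposes more options:

```python
import numpy as np

rng = np.random.default_rng(0)

def dare(task_vector: np.ndarray, drop_rate: float) -> np.ndarray:
    """DARE: randomly drop a fraction of the entries, rescale the survivors by 1 / (1 - drop_rate)."""
    mask = rng.random(task_vector.shape) >= drop_rate
    return task_vector * mask / (1.0 - drop_rate)

def dare_ties_merge(theta_base: np.ndarray, thetas: list[np.ndarray],
                    drop_rate: float = 0.5, weight: float = 1.0) -> np.ndarray:
    """Sparsify task vectors with DARE, elect a sign per parameter (TIES),
    average the agreeing entries, and add the result back to the base model."""
    task_vectors = [dare(theta - theta_base, drop_rate) for theta in thetas]
    elected = np.sign(np.sum(task_vectors, axis=0))           # TIES sign election
    agree = np.stack([tv * (np.sign(tv) == elected) for tv in task_vectors])
    counts = np.maximum((agree != 0).sum(axis=0), 1)          # avoid division by zero
    merged_tv = agree.sum(axis=0) / counts                    # mean over agreeing entries
    return theta_base + weight * merged_tv

# Toy example with two fine-tuned "models" derived from a common base.
base = np.zeros(6)
merged = dare_ties_merge(base, [base + rng.normal(size=6), base + rng.normal(size=6)])
```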

Benchmarks

The automerger space uses "Yet Another LLM Leaderboard" to assess the performance of the merged models. This leaderboard includes the following benchmarks:

  • TruthfulQA: Tests how truthful / fact-based a model's answers are, using questions such as “Why do chameleons change colors?”.
  • BigBench: The “Beyond the Imitation Game Benchmark” contains more than 200 tasks like auto-categorization, authorship verification, etc.
  • GPT4All: This benchmark aggregates several datasets, such as HellaSwag, WinoGrande, and BoolQ.
  • AGIEval: This benchmark targets human-level cognition capabilities, testing LLMs against math tests, law exams, etc.

You can find more details about LLM benchmarks here, although this space is evolving quickly.

In all benchmarks shown below, the top score is 100 and the lowest is 0.

The data

You can find the dataset here. To build it, I scraped a subset of the benchmark page “Yet Another LLM Leaderboard”, filtered for the automerged models, and then grabbed their respective mergekit parameters. The dataset contains 110 rows, 61 of which come from SLERP merges and 49 from DARE-TIES.

Here are the top-5 most popular base models in this dataset (according to how often they were used).

  1. Gille/StrangeMerges_32-7B-slerp (14 merges)
  2. yam-peleg/Experiment26-7B (10 merges)
  3. rwitz/experiment26-truthy-iter-0 (9 merges)
  4. mayacinka/yam-jom-7B (7 merges)
  5. yam-peleg/Experiment24-7B (7 merges)
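If you want to reproduce this kind of tally yourself, a minimal pandas sketch follows. The file name and the `Base_model` column are guesses about the dataset schema, so adjust them to whatever the actual columns are called:

```python
import pandas as pd

# Hypothetical file / column name; adjust to match the actual dataset schema.
df = pd.read_csv("automerger_models.csv")

# Five most frequently used base models.
print(df["Base_model"].value_counts().head(5))
```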

In typical data-science fashion, we can first have a look at a pair-plot of all benchmarks for all 110 models.

Pairplot of all benchmarks for all merged models.

We see that BigBench has only weak correlations with the rest of the benchmarks. Also, BigBench, AGIEval, and GPT4All seem to have roughly symmetric distributions. In contrast, TruthfulQA has a long tail and seems to be anti-correlated with the rest. 🤔

In the correlation matrix below, we can see that this is indeed the case, with the most interesting pair being TruthfulQA and GPT4All at a correlation of $\approx -0.6$, followed by TruthfulQA and AGIEval at $\approx -0.4$.

Correlations of benchmarks for all merged models. TruthfulQA has negative correlation with all other benchmarks!
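Both plots are straightforward to reproduce with seaborn. A hedged sketch, again assuming the scores live in columns named after each benchmark (adjust to the real schema):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Same hypothetical file / column names as above.
df = pd.read_csv("automerger_models.csv")
benchmarks = ["GPT4All", "BigBench", "AGIEval", "TruthfulQA"]

# Pairwise scatter plots with marginal distributions.
sns.pairplot(df[benchmarks])
plt.show()

# Correlation matrix of the benchmark scores.
sns.heatmap(df[benchmarks].corr(), annot=True, vmin=-1, vmax=1, cmap="vlag")
plt.show()
```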

With this dataset, we can also compare the performance of SLERP vs DARE-TIES for all benchmarks. To keep things cleaner, I use density plots to summarise the behavior.

Density plots for every benchmark and merge method.
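These per-benchmark density plots can be produced with kernel density estimates split by merge method; a sketch under the same column-name assumptions, where `Merge_method` is assumed to hold "slerp" / "dare_ties":

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Same hypothetical file as above; "Merge_method" is an assumed column name.
df = pd.read_csv("automerger_models.csv")
benchmarks = ["GPT4All", "BigBench", "AGIEval", "TruthfulQA"]

fig, axes = plt.subplots(1, len(benchmarks), figsize=(16, 3))
for ax, bench in zip(axes, benchmarks):
    # One density curve per merge strategy, normalised separately.
    sns.kdeplot(data=df, x=bench, hue="Merge_method", common_norm=False, ax=ax)
plt.tight_layout()
plt.show()
```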

DARE-TIES seems to have slightly more models in the "bad model" regime (see the left tails above). SLERP seems to have more mass on the right tail, but the pattern is not consistent (and we also have to take into account the sample size and possible biases in model selection). These observations do not change even if we subset the dataset to only the base models that appear in both a SLERP and a DARE-TIES merge (which brings the table down from 110 to 94 rows).

To sum up

In this post, we briefly discussed a small sample of data from the Hugging Face automerger space. As more data accumulates and the various tools and benchmarks are extended further, the community will be able to better understand when and where these merging strategies are applicable.