Request for separate Leaderboard for Merged Models

#473
by ajibawa-2023 - opened

Hello,
First of all, I am extremely thankful to the HF team for contributing so much to the OSS community.
It would be great if you (the HF team) could create a separate leaderboard for merged models. We put in all the effort to fully finetune a model, and soon someone merges those models with others, which places the merged models at the very top. I am not against merging models, nor against the ranking, but a separate board would be a great initiative.
Looking forward to a positive response.

deleted

@ajibawa-2023 Slerp merging 2+ quality fine-tuned models with non-overlapping blind spots, such as Starling/Neural with OpenHermes/Dolphin, results in a better performer overall, so I too am glad they exist.
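For context, SLERP interpolates between two weight tensors along an arc rather than a straight line, which tends to preserve each parent's geometry better than plain averaging. A minimal, illustrative sketch of the idea in Python (not mergekit's actual implementation; applying it per-parameter across two fine-tunes of the same base model is an assumption of the example):

```python
import torch

def slerp(w_a: torch.Tensor, w_b: torch.Tensor, t: float = 0.5, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors, treated as flat vectors."""
    a, b = w_a.flatten().float(), w_b.flatten().float()
    # Angle between the two weight vectors, computed on normalized copies
    cos_omega = torch.clamp((a / (a.norm() + eps)) @ (b / (b.norm() + eps)), -1.0, 1.0)
    omega = torch.arccos(cos_omega)
    if omega.abs() < eps:  # nearly parallel: fall back to linear interpolation
        return (1 - t) * w_a + t * w_b
    so = torch.sin(omega)
    out = (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b
    return out.reshape(w_a.shape).to(w_a.dtype)

# merged[name] = slerp(state_dict_a[name], state_dict_b[name], t=0.5) for each parameter
```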

But it would be nice to at least identify LLMs that piggy-back on others: not just merges, but also those that apply additional fine-tuning to someone else's fine-tune. While most uploaders either name them appropriately or put the info in their model cards, it would be handy to identify them at a glance, whether with the merge leaderboard you mentioned or simply with different text within the same leaderboard (e.g. bolder text for non-merged or unmodified fine-tunes of foundational models).

This is kind of a gray area, as some merges are finetuned after the merge. DPO-finetuning merges seems to be particularly popular right now. And then there's the case of "finetuned finetunes", as Phil points out, which I expect to become more common anyway. If you can build a better OSS model on someone else's shoulders, why not do it?

And it looks like merging is about to get a lot more complicated thanks to Mixtral. What if, for instance, someone includes a 1st-generation finetune in a Mixtral-style mix and then finetunes the selector: https://goddard.blog/posts/clown-moe/
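For the curious, the "clown car" trick in that post amounts to reusing existing fine-tunes as frozen experts and training only the gate that routes between them. A toy sketch (the class and shapes here are assumptions for illustration; real Mixtral-style MoE uses per-layer top-k routing):

```python
import torch
import torch.nn as nn

class ClownCarMoELayer(nn.Module):
    """Toy MoE layer: expert FFNs copied from existing fine-tunes stay frozen;
    only the router (the "selector") is trained."""
    def __init__(self, experts: list[nn.Module], hidden_size: int):
        super().__init__()
        self.experts = nn.ModuleList(experts)
        for p in self.experts.parameters():
            p.requires_grad = False  # keep the donor fine-tunes intact
        self.router = nn.Linear(hidden_size, len(experts))  # the only trainable part

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.router(x), dim=-1)            # (batch, seq, n_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)   # (batch, seq, hidden, n_experts)
        return (outs * weights.unsqueeze(-2)).sum(dim=-1)          # weighted expert mix
```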

...I think the fundamental problem you are trying to get at is that HuggingFace has a discoverability problem for novel LLMs, for a number of reasons. So people kind of use the Open LLM Leaderboard to find models, which is also problematic.

Thanks @Phil337 . Some merged models are amazing & fun to use/try. Another reason for a separate leaderboard is quick discoverability, as mentioned by @brucethemoose . I also feel we need broader evaluation metrics, especially for merged models.
@brucethemoose I am in favor of standing on the shoulders of giants, and that is surely how we progress. A separate leaderboard would be useful even to a layman who may want to try a model for fun, and such a leaderboard would spur competition as well.
Thanks @Phil337 & @brucethemoose for expanding the topic.

@ajibawa-2023 As someone who enjoys testing LLMs, I find the flood of mods by inexperienced contributors with egos vying to be #1 is undeniably making the leaderboard less useful, and it seems inevitable that the situation will only get worse. Some are making v1, v2, v3... within days of each other, trying to squeeze out a fraction of a point more so they can climb higher on the board, even if it means deliberate contamination. And when I test their LLMs they actually perform worse, especially on tasks that standardized LLM tests can't evaluate (e.g. storytelling).

However, I'm a little confused by what you're proposing. Are you saying all merges should be REMOVED from the main leaderboard and given their own? And does this also include additional fine-tuning of fine-tunes? Personally I'd be OK with just having everything clearly labeled and filterable: foundational models get a green icon, first-level fine-tunes a distinct color, fine-tunes of someone else's first-level fine-tune another color, and merges yet another. And if you could filter out the merges and piggy-backing fine-tunes with a click, it would be like having a separate leaderboard for them.

Edit: And I'm also for only having foundational and first-level fine-tunes selected by default. Otherwise they would become buried by the flood of merges and piggy-backing fine-tunes.

deleted

Should the following Mistral 7B appear at the top of the leaderboard, above all the Llama 2 70Bs, Yi-34Bs and Mixtral Instruct?

https://huggingface.co/rwitz2/go-bruins-v2.1.1

Are you saying all merges should be REMOVED from the main leaderboard and given their own? And does this also include additional fine-tuning of fine-tunes?
Edit: And I'm also for only having foundational and first-level fine-tunes selected by default. Otherwise they would become buried by the flood of merges and piggy-backing fine-tunes.

@Phil337 Thanks for making it clear, for everyone.

Yeah, I'm not saying it's not a good idea... just that it's not necessarily practical.

In addition to the gray-area concerns, many uploaders barely even make model cards. I think most aren't going to label whether theirs is a first-order finetune or not.

I agree, @brucethemoose , but a leaderboard would be a first step, and it can subsequently be improved.

deleted

@brucethemoose The world is filled with bad-faith actors, as well as mistake-prone noobs like me. It's time for HF to accept this. A reasonable (non-severe) punishment system should be considered. For example, being caught merging or fine-tuning a non-base model without labeling it appropriately, even by accident, could result in the LLM being removed, a warning being sent, and a temporary upload freeze enacted, one that becomes longer with each offense. Plus, all newly created non-institutional accounts should start with an upload freeze to minimize immediate resurrections.

Hello @SaylorTwift , any views from the HF side? I am not sure if you are the right person.
Thanks

deleted

@ajibawa-2023 HF tends to wait a while before responding to feature requests, so that people have time to add their opinions, the team can consider the options, and so on.

Thanks @Phil337

Open LLM Leaderboard org

Hi!
Thank you all for such a quality discussion.
I don't think that we want to create a separate leaderboard for all merges, since, as mentioned by other users, some can really bring something to the table and be of high quality.

However, I agree that changing which models are displayed by default could make the leaderboard more useful at first glance. We could add a category for merged models; what would you folks think? (We would still rely on users' good faith when tagging their models, though.)

We also added some checks on the model cards and are considering adding more, and we are also working on adding more contamination checks. One thing that would be really helpful would be a good way to detect the taxonomy of models, so we could see the "level" at which a model sits (pretrained = 0, finetuned/merged once = 1, twice = 2, ...).
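A rough sketch of how that level could be derived from existing metadata, assuming uploaders declare a parent in the `base_model` field of their model cards (both the field being filled in and it being truthful are assumptions):

```python
from huggingface_hub import ModelCard

def lineage_depth(repo_id: str, max_depth: int = 10) -> int:
    """Estimate how many finetune/merge steps separate a model from a
    pretrained base: 0 = pretrained, 1 = first-level finetune/merge, ..."""
    depth, current = 0, repo_id
    while depth < max_depth:
        base = getattr(ModelCard.load(current).data, "base_model", None)
        if not base:
            break  # no declared parent: treat this model as the root
        if isinstance(base, list):  # merges may declare several parents; follow the first
            base = base[0]
        current = base
        depth += 1
    return depth
```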

@davanstrien , I think you worked on this a bit, do you have some insights?

Open LLM Leaderboard org

(Regarding answer times: I don't work on Saturday/Sunday, so I'll usually come back to these discussions on Mondays)

As mentioned here: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/474#657f7a0dcec775bfe00e7143
I think requiring models to indicate their lineage makes a lot of sense.
Potential issues can then be traced back to the root much more easily, to identify where the problem originated. As well as providing more transparency about the model, this also makes it easier to find exactly which datasets it was finetuned on (if that has been disclosed by the models).

If this lineage is added, merged models can be easily categorized as such by simply looking at the lineage. However, this does not solve the issue of dishonest users lying about their model's lineage. Even if this doesn't catch all bad actors, I do believe this improves the readability of the leaderboard.

I don't know if it's feasible, but it would be great to link all child models to their parent model(s). For example, users could then filter on all models that stem from "dolphin-2.1-mistral-7b" if they want an uncensored model, or from "yi-34B-200k" if they want a 200k-context model stemming from Yi.
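I don't know the Hub's internals either, but if the `base_model` card field were indexed as a filterable tag (an assumption, mirroring URLs like https://huggingface.co/models?other=merge), a filter like this sketch could list a model's declared children:

```python
from huggingface_hub import HfApi

api = HfApi()
parent = "ehartford/dolphin-2.1-mistral-7b"  # owner prefix assumed for illustration
# Hypothetical filter: assumes the Hub exposes `base_model:<repo>` as a tag
for child in api.list_models(filter=f"base_model:{parent}"):
    print(child.id)
```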

Yeah, +1 to more contamination checks and a merge model category.

I think a lineage requirement is a good idea too, but I have to wonder how many users will actually implement it, especially retroactively.

For the issue of model merges, we're working on some better ways of knowing whether a model is a merge or not. I've already started adding a merge tag to models on the Hub. You'll then be able to find merged models via https://huggingface.co/models?other=merge. This could be used as a filter on the leaderboard in the future (if adoption is high enough!).

You can make a PR to add this tag to any models you come across where this tag might be relevant. See example pull request.
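For those who prefer the API to the web UI, the same tag should be queryable with `huggingface_hub` (a small sketch; the `filter` and `limit` arguments are standard, but treat the exact result set as dependent on tag adoption):

```python
from huggingface_hub import HfApi

api = HfApi()
# Programmatic equivalent of browsing https://huggingface.co/models?other=merge
for model in api.list_models(filter="merge", limit=10):
    print(model.id)
```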

I am very glad to know that we are making some good progress.

Open LLM Leaderboard org

You can now see whether models are merges, if the model authors used the merge tag!


Closing, as it should solve part of this discussion! Many thanks to @davanstrien for adding the tag to many models!

clefourrier changed discussion status to closed

Thank you very much @clefourrier . Great work @davanstrien . Thanks everyone!

ajibawa-2023 changed discussion status to open
ajibawa-2023 changed discussion status to closed
deleted

Thanks. I respect the open nature of HF and the mature way leaderboard discussions are handled by its staff despite the often mean-spirited, ignorant and long-winded comments, many of which have come from me (sorry).

However, I still think this won't make any real difference. A voluntary tag with no filtering still means merges can't be easily identified (e.g. by a colored icon next to their names), nor filtered out to focus on the non-merged LLMs, so the major fine-tunes stay buried in a flood of merged/modified LLMs. There's simply no way any of the Mistral 7B merges should be scored higher than the Mixtrals and Llama 2 70Bs, let alone several points higher.

Perhaps contamination testing will save the day, but I predict that in the near future you'll be forced to take a more aggressive stance against bad-faith actors vying to be #1. But for now, adding the tag was a good start, so thanks.
