Too many zeros for GSM8K; the eval prompt is not suitable for CHAT models.

#360
by JosephusCheung - opened

The median GSM8K score is:

[screenshot: GSM8K score distribution on the leaderboard]

That's not fair. It should be excluded from the average.

and DROP:

[screenshot: DROP score distribution on the leaderboard]

If this is what you're expecting, then I think you're effectively encouraging people to train on the benchmark's input-output format, for both BASE and CHAT models.

Open LLM Leaderboard org

Hi!
I'm not entirely sure what you are asking, but I wanted to clarify that we deliberately added much harder evaluations to the leaderboard, in order to make it more relevant for studying the state of LLMs, as the field is progressing very rapidly.
The fact that some/many models perform badly on DROP and GSM8K is not a bug but a feature: these tasks discriminate much better between models that are merely OK and those that are actually good.

Be like me:

  1. Look for a model.
  2. Do not care if it knows maths.
  3. Remove GSM8K from the displayed columns.
  4. Expect the average to be calculated from the remaining columns.

Reality: the average still includes the GSM8K score (a sketch of the recalculation I expected is below).

(Obviously, it would be nice if a model had it all, but sometimes it just isn't a requirement.)
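To be concrete, here is a minimal sketch of the recalculation I expected the UI to do, assuming a CSV export of the leaderboard with one column per benchmark (the file name and column names are hypothetical):

```python
import pandas as pd

# Hypothetical export of the leaderboard: one row per model, one column per benchmark.
df = pd.read_csv("open_llm_leaderboard.csv")

benchmarks = ["ARC", "HellaSwag", "MMLU", "TruthfulQA", "Winogrande", "GSM8K", "DROP"]
kept = [b for b in benchmarks if b != "GSM8K"]  # the columns I actually care about

# Recompute the average over the remaining benchmarks only.
df["Average (no GSM8K)"] = df[kept].mean(axis=1)
print(df.sort_values("Average (no GSM8K)", ascending=False).head())
```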

The score is highly dependent on the input format. Chat models, for example, may fail every test if you use an unexpected format, because they overfit to the input formats in their SFT datasets.
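A rough illustration of the mismatch (the model name is just an example, and the "Question:/Answer:" completion prompt below is my assumption about a fixed harness-style format, not the leaderboard's exact template):

```python
from transformers import AutoTokenizer

# Any chat model with a chat template in its tokenizer config works here;
# zephyr-7b-beta is just an example.
tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

question = "A store sells pens at 3 for $2. How much do 12 pens cost?"

# Roughly what a fixed completion-style prompt looks like (assumed format):
raw_prompt = f"Question: {question}\nAnswer:"

# What the same model sees in deployment, via its own chat template:
chat_prompt = tok.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)

print(raw_prompt)
print(chat_prompt)  # wrapped in the special tokens the model was fine-tuned on
```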

Open LLM Leaderboard org

This raises an interesting question, though: should we consider that models are "good" when they can only succeed at a given task if prompted with a very specific formulation?
Don't we want our models to be, in the end, better than that?

However, philosophy aside, we agree that in the current state of LLMs, system prompts are quite important! We're going to work on adding them quite soon :)

Why do OpenCompass, AlpacaEval, and MT-Bench allow models to run with custom templates, but this LLM benchmark refuses to do so? The model should always be given its expected input format; even for GPT-4, the API we use wraps inputs in a ChatML-like format.
For some models, even omitting the [BOS] token makes a difference, let alone chat templates.
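A minimal sketch of the [BOS] point, assuming a Llama-style tokenizer (the model name is just an example):

```python
from transformers import AutoTokenizer

# The model is just an example; any Llama-style tokenizer shows the same effect.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

text = "Question: What is 17 * 23?\nAnswer:"

with_bos = tok(text, add_special_tokens=True).input_ids
without_bos = tok(text, add_special_tokens=False).input_ids

print(with_bos[:3])     # typically starts with the BOS id
print(without_bos[:3])  # same text, no BOS token prepended
```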

I do respect your work, but people are already wasting training effort just to match this prompt format.

> This raises an interesting question, though: should we consider that models are "good" when they can only succeed at a given task if prompted with a very specific formulation?
> Don't we want our models to be, in the end, better than that?
>
> However, philosophy aside, we agree that in the current state of LLMs, system prompts are quite important! We're going to work on adding them quite soon :)

@clefourrier I think models should be evaluated based on how they are deployed. For example, OpenAI always applies conversation templates to ChatGPT, and we also apply them when using chat models ourselves. From this perspective, we should abandon perplexity-based evaluation (most of the time, logits are hidden from users anyway) and add the correct templates. Most benchmark setups have the model generate answers and then match the option letters, as AGIEval does. Additionally, FastEval is a good evaluation suite that follows these principles.
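A rough sketch of that generate-and-match style of evaluation (the model name, question, and regex are only illustrative, not AGIEval's or FastEval's actual implementation):

```python
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model name, question, and option-letter regex are only illustrative.
name = "HuggingFaceH4/zephyr-7b-beta"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")

question = (
    "Which gas makes up most of Earth's atmosphere?\n"
    "A. Oxygen\nB. Nitrogen\nC. Carbon dioxide\nD. Argon\n"
    "Answer with the letter of the correct option."
)
prompt = tok.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
answer = tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

# Match the option letter in the generated text instead of comparing logprobs.
match = re.search(r"\b([ABCD])\b", answer)
print(answer, "->", match.group(1) if match else None)
```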

@clefourrier By the way, I have some ideas for the next version of the HF leaderboard.

(1) Allow submission of OpenAI-compatible API endpoints and run the eval through the API, or allow users to upload model answers (as AlpacaEval does) and publish the answers on the leaderboard.
(2) Allow users to set templates and parameters such as precision and generation length when submitting.

This could save a lot of computing resources, as evaluation would be carried out in a distributed manner. Besides, I don't think it would raise more concerns about cheating, since training on benchmark test sets is just as easy today, and we can make the model answers public and let the community check them.
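To illustrate idea (1), here is a minimal sketch of running a question through an OpenAI-compatible endpoint; the base_url and model name are placeholders, not a real service:

```python
from openai import OpenAI

# Sketch of idea (1): query any OpenAI-compatible endpoint a submitter hosts.
# base_url and model name are placeholders.
client = OpenAI(base_url="https://example.com/v1", api_key="EMPTY")

question = "Natalia has 5 boxes of 12 pencils each. How many pencils does she have?"
resp = client.chat.completions.create(
    model="my-chat-model",
    messages=[{"role": "user", "content": question}],
    temperature=0,
    max_tokens=256,
)

# The generated answer (not logits) is what would be uploaded and published
# alongside the leaderboard entry for the community to check.
print(resp.choices[0].message.content)
```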

Open LLM Leaderboard org
edited Nov 12, 2023

> Why do OpenCompass, AlpacaEval, and MT-Bench allow models to run with custom templates, but this LLM benchmark refuses to do so? The model should always be given its expected input format; even for GPT-4, the API we use wraps inputs in a ChatML-like format.
> For some models, even omitting the [BOS] token makes a difference, let alone chat templates.

Hi @JosephusCheung, thank you for your concerns. The Open LLM Leaderboard uses the EleutherAI Eval Harness as its backend, which does not allow custom prompts. We will, however, add this in the future; keep in mind that it would be a lot of work, since we would potentially need to re-run many models for fairness reasons, and we do not have an ETA yet.
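For reference, this is roughly how the harness gets invoked programmatically; it's only a sketch, and the exact arguments may differ from the harness version/fork the leaderboard actually runs:

```python
import lm_eval

# Only a sketch: argument names follow a recent lm-eval release and may not
# match the exact version/fork used by the leaderboard.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=mistralai/Mistral-7B-v0.1,dtype=bfloat16",
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"]["gsm8k"])
```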

Open LLM Leaderboard org

> (1) Allow submission of OpenAI-compatible API endpoints and run the eval through the API, or allow users to upload model answers (as AlpacaEval does) and publish the answers on the leaderboard.

Our philosophy is to evaluate models that are available in transformers, so allowing submissions of OAI-compatible endpoints is not in the works. Moreover, we only display results from the benchmarks we evaluate ourselves on the Leaderboard, but I agree it would be a good idea to have a leaderboard compiling results from many types of benchmarks.

> (2) Allow users to set templates and parameters such as precision and generation length when submitting.

> This could save a lot of computing resources, as evaluation would be carried out in a distributed manner. Besides, I don't think it would raise more concerns about cheating, since training on benchmark test sets is just as easy today, and we can make the model answers public and let the community check them.

Users can already select the model precision when submitting. As for chat templates, it's a good idea and, as I said above, it's in our backlog :)

Open LLM Leaderboard org

Closing this issue for now, as @SaylorTwift answered quite in depth about where we currently stand. To sum up his already excellent answers:

  • we won't add support for closed-source models or APIs (this is already explained in more detail in the FAQ, by the way)
  • we'll do our best to add system prompts/chat templates in the near future, as we agree they would be very useful for the community.
clefourrier changed discussion status to closed

But the current average scores are not fair to some chat models, which would score much higher on tasks like GSM8K with a properly formatted input.
I suggest you consider a weighted average for now, to reduce the temptation for people to finetune their models to a specific format just for the benchmark.
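As a sketch of the weighted-average idea (all scores and weights below are made up purely for illustration):

```python
# All scores and weights below are made up purely for illustration.
scores = {"ARC": 61.2, "HellaSwag": 83.5, "MMLU": 62.4, "TruthfulQA": 45.1,
          "Winogrande": 77.0, "GSM8K": 12.3, "DROP": 8.7}
weights = {"ARC": 1.0, "HellaSwag": 1.0, "MMLU": 1.0, "TruthfulQA": 1.0,
           "Winogrande": 1.0, "GSM8K": 0.5, "DROP": 0.5}  # down-weight the format-sensitive tasks

plain_avg = sum(scores.values()) / len(scores)
weighted_avg = sum(scores[t] * weights[t] for t in scores) / sum(weights.values())
print(f"plain: {plain_avg:.2f}  weighted: {weighted_avg:.2f}")
```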
