Is it realistic that GSM8K would fail so extremely on a Solar-based merge?

#670
by ABX-AI - opened

[screenshot of the model's leaderboard scores]

I'm very surprised to see the GSM8K result come out so low (it basically shaved 10 points off the average). When I tried questions from the dataset manually, the model answered every one I tried accurately.

Any idea why the eval failed so badly here?

Hugging Face H4 org

Most likely a formatting issue, but you can check the detailed generations in the details dataset.

clefourrier changed discussion status to closed

Yeah, that's the thing: the details show all the answers in the right format. I don't understand how it could score that low. Everything else is within the norm. Solar itself scores normally, and this is a Solar merge. It doesn't make any sense, but thanks.

I ran some of the questions through GPT-4 to compare answers. My merges' answers all matched GPT-4's and were correct. However, there is always a last question with no output at all.

It's not a formatting issue; in my opinion something went wrong with how the eval was executed. This is not a realistic failure. I also asked the models many of these questions manually and they answer correctly.
You closed the issue too hastily @clefourrier

Details look like this otherwise:

Question: Wendy's truck has a gas tank that can hold 20 gallons. She also has a car with a gas tank that holds 12 gallons. The truck's tank is half full. The car's tank is 1/3 full. If she fills them both up completely, how many gallons does she add?
Answer: The truck tank has 10 gallons in it because 20 x .5 = <<20*.5=10>>10
The car tank has 4 gallons in it because 12 x (1/3) = <<12*(1/3)=4>>4
She needs to add 10 gallons to the truck because 20 - 10 = <<20-10=10>>10
She needs to add 8 gallons to the car because 12 - 4 = <<12-4=8>>8
She needs to add 18 gallons in total because 10 + 8 = <<10+8=18>>18

18

Question: Jean is a customer service rep and answered 35 phone calls on Monday. On Tuesday, she answered 46 and took 27 calls on Wednesday. On Thursday she answered 61 calls and finished off answering 31 calls on Friday. What’s the average number of calls she answers per day?
Answer: During the week she answered 35 on Mon, 46 on Tue, 27 on Wed, 61 on Thurs and 31 on Fri for a total of 35+46+27+61+31 = <<35+46+27+61+31=200>>200 calls
She answered 200 calls over 5 days so on average, she answered 200/5 = <<200/5=40>>40 calls a day

40

All questions except the last one have output, and from what I checked it is correct.

Hugging Face H4 org

Hi again @ABX-AI !

Feel free to reopen if needed!

Some things which could help:

  • I don't know if you are aware of this, but GSM8K expects answers in a very specific format - if your model doesn't support it or doesn't apply it correctly, answers will be counted as incorrect
  • you could compute the average accuracy from the details: since you have all the per-sample scores along with all the answers, that would let us see whether there is a mismatch between the average on the leaderboard and the average in the details (make sure to look at the latest results); a sketch of this is shown right after this list
  • more generally, you can also try to reproduce our results by following the steps in the About tab of the leaderboard.
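
For the second point, a minimal sketch of how that recomputation could look (the repo id, config name, and field names below are assumptions - check your model's actual details dataset on the Hub for the exact ones):

```python
# A rough sketch, assuming the repo id, config name, and field names below -
# adjust them to match your model's actual details dataset.
from datasets import load_dataset

details = load_dataset(
    "open-llm-leaderboard/details_YourOrg__YourModel",  # hypothetical repo id
    "harness_gsm8k_5",                                  # hypothetical config name
    split="latest",
)

# Average the per-sample accuracy and compare it to the leaderboard number.
scores = [row["metrics"]["acc"] for row in details]
print(f"GSM8K accuracy recomputed from details: {sum(scores) / len(scores):.4f}")
```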

I simply compared the data from models scoring 60+ points and saw the exact same formatting. My "successful" details also had the same formatting as all the failures. That's why it seems weird. I'm not a specialist at this at all; I'm just comparing successful eval formatting with mine and checking random answers from random test runs on my evals, and they seem correct.

What I did was remove the GSM8K score from my results and compute a new average. Unsurprisingly, it's only slightly lower than SOLAR-10.7B Instruct v0.1, which makes complete sense, so I assume the failure is faulty for some reason. A Solar-based model should be able to handle this and score a 70+ overall average.

The LB itself has been restarting and launching with runtime errors - do you think that could have led to this test being incomplete or faulty? I can't really re-run the models, so I don't think I'll be able to easily retry it.

Hugging Face H4 org

Hi!

If you recompute the average score on GSM8K from the details, we'll be able to see precisely what worked and what didn't. Did you look at the actual metric for each sample in the details?

The LB itself has been restarting and launching with runtime errors - do you think that could have led to this test being incomplete or faulty? I can't really re-run the models, so I don't think I'll be able to easily retry it.

No, this is completely unrelated, the frontend works separately from the backend.


Thank you for the information. Maybe I'm reading the information wrong (I'm using a Parquet viewer). Is the actual model answer in the "full_prompt" column or in "predictions"? In predictions, I actually do see a lot of repetition and broken responses. If that is the actual answer, then the failure was indeed due to some issue with the output - which I am not experiencing at all when testing these models manually with the same questions. They do not repeat themselves like that or produce broken responses. I guess I should find a way to run this with specific sampler settings if that is what happened. And thanks again for taking the time!

Hugging Face H4 org

full_prompt is the input prompt, and predictions is the model output after the prompt, so it would seem you found your problem :)
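
If you want to skim a few samples side by side, here is a small pandas sketch (the parquet file name is only an example - point it at the file you downloaded from the details dataset):

```python
import pandas as pd

# Hypothetical file name - use the details parquet you downloaded from the Hub.
df = pd.read_parquet("details_harness_gsm8k_5.parquet")

for _, row in df.head(5).iterrows():
    print("=== full_prompt (input) ===")
    print(row["full_prompt"][-400:])  # tail of the few-shot prompt
    print("=== predictions (model output) ===")
    print(row["predictions"])
    print()
```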

No problem, happy it helped!

Thanks for your help on this, I appreciate it. In that case I apologize for the thread. The model hasn't done well with its output here, but I'm not sure how the normal Solars managed it. When I prompt it cleanly without extra instructions, this is a non-issue: it responds very much like the prompt examples. Shouldn't there be some kind of option to use the correct instruction format for such cases? Otherwise it seems like an unfair rating system: either a model fits the test or it just fails, even though it is capable of answering correctly in actual practice. Again, I'm no specialist; it just feels strange that it would output such garbage during the test but do fine in real usage.

Hugging Face H4 org

Adding the option to take into account system prompts/instruction formats and so forth is an extension we'll add to the leaderboard :)
The leaderboard, for now, is a good way to 1) rank pretrained models 2) get an idea of the performance of community fine-tunes/merges/... in comparison with the associated original models (for example, you can compare an official mistral finetune with your finetune)


That's awesome, I'm happy to hear that :)) The LB is absolutely useful, and obviously this is a problem that doesn't affect many models - maybe just specific ones trained on specific instruction formats. The merges I tested prefer Alpaca/Vicuna formats, and maybe that's the reason, not sure. It's a bit crazy how many variables are involved, so I understand it's not ideal for every case, but you already seem headed in the right direction with your plans. I've had similar issues on the Chaiverse RP leaderboard with 11B models, since the instructions ruin them and send them to the bottom. With the right format in real use, the models behave far differently.

Thanks again for your time and the clarifications!

GSM8K requires the #### format in the LLM's answer for the answer to be counted correctly.

Recently, I fine-tuned the MetaMath model with additional new math datasets that were not available when MetaMath released their model. It scored 3 points on the leaderboard due to the strict GSM8K requirements.

There is something called flexible-extract in the LM Evaluation Harness. If you really want to know your model's performance, you should check that: for example, my model answered many questions correctly, but because of the #### requirement they didn't count toward the strict score; however, they did count in the flexible-extract section.
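
To illustrate the difference (the regexes below are only an illustration, not the harness's exact patterns):

```python
import re

generation = (
    "She answered 200 calls over 5 days so on average she answered "
    "200/5 = 40 calls a day.\n40"
)

# Strict-style extraction: only accept an answer introduced by "####".
strict = re.search(r"####\s*(-?[\d,.]+)", generation)
print("strict-match:", strict.group(1) if strict else None)   # None, no "####" marker

# Flexible-style extraction: fall back to the last number in the text.
numbers = re.findall(r"-?\d[\d,.]*", generation)
print("flexible-extract:", numbers[-1] if numbers else None)  # 40
```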

More details with an example here:

@clefourrier IMO, it may be worth considering basing the leaderboard on flexible-extract instead of strict-match. Many people (including me) may not want to fine-tune their models to include #### just to look good in the evaluations.

I will be very happy to discuss this issue in a new or different discussion too :)

Have a nice day everyone :)


Thank you for sharing this. I had presumed it was already doing the smart extraction (because the format requirement is nonsensically strict on its own). I think it's a great idea to use the extractor as a "fix" for this, but there is also something wrong with how GSM8K gets prompted, because the answers were full of garbage and repetition - something that simply never happens in my tests with these models. The extractor won't help much if the original generation is broken, and it's broken only with this type of test for some reason. The way the prompts are built needs an overhaul, or something.

It's probably worth making a dedicated suggestion thread for that as well, for visibility @Weyaxi

Oh, then there is another problem, I think. I will probably open a different thread for this like you suggested.

Hugging Face H4 org

Hi!

@Weyaxi We provide GSM8K in few-shot, and expect high-quality models to be able to follow a prompt template, which is usually not a problem.
I agree that the GSM8K format is specific, but that is precisely why we are providing format examples first (using few-shot).
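
For illustration, the few-shot prompt is built roughly like this: solved examples that end in `#### <answer>`, followed by the new question (a sketch - the exact wording and number of shots used on the leaderboard may differ):

```python
# A sketch of few-shot GSM8K prompting; the exact template may differ.
shots = [
    {
        "question": (
            "Natalia sold clips to 48 of her friends in April, and then she "
            "sold half as many clips in May. How many clips did Natalia sell "
            "altogether in April and May?"
        ),
        "answer": (
            "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
            "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
            "#### 72"
        ),
    },
    # ...more solved examples would follow in a real 5-shot prompt
]

def build_prompt(shots, new_question):
    # Each shot shows the expected "Question: ... / Answer: ... #### N" format.
    parts = [f"Question: {s['question']}\nAnswer: {s['answer']}\n" for s in shots]
    parts.append(f"Question: {new_question}\nAnswer:")
    return "\n".join(parts)

print(build_prompt(shots, "Wendy's truck has a gas tank that can hold 20 gallons. ..."))
```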

We won't change the way models are evaluated for now, and I'm unsure flexible-extract is even present in the harness version we use, iirc (I think it was added later) - updating this scoring would require us to re-evaluate more than 6K models, and we will not do so before our next leaderboard update.

@ABX-AI Regarding your comment, we provide absolutely all the details of the generations, so you can compare for yourself what your model generates from the exact same full_prompts (minus the possible truncation, which I think we report too).

@clefourrier I am using the details in particular (and it's quite awesome that you provide them, btw - thanks for that). The merge I'm testing answers fine. I'm trying it out in LM Studio with default sampler settings and the Alpaca preset, nothing else changed.

Here are a few failed examples:

[Here, it answered with just "Tim" in the eval]
[screenshot of my manual test, where the model answers correctly]

[Here, it answered with just "Jordan" in the eval]
[screenshot of my manual test, where the model answers correctly]

[Here, it answered with: Jared starts with a speed of 47 WPM.
He increases it to 5 6 WPM.
The average of the three measurements is 47+5+6=28/2=14/2=1/2=1/2=1/2=2/2=1/2=2/2=1/2=2/2=1/2=2/2=2/2=2/2=2/2=2/2=2/2=2/2=2/2=2/2=2/2=2/2=2/2=2/2=2/2=2/2=2/2=2/2=2/2=2/2=2/2=2/2=2/2=2/2=2/2=2/2=2/2=2/2=2/2=2/2=2/2=2/2=2/2=2/2/2/2/2/2/2/2/2/2/2/2/2/2/2/2/2/2/2/2/2]

[screenshot of my manual test, where the model answers correctly]

This is on a Q6_K quant, by the way - not even full precision - and all answers are single attempts. So far I have not once seen it answer the way the details data shows. Not only that, the answers it gives me are correct on the first try. I'm sure it will make mistakes - I saw it make one once - but it has never produced such broken output. Formatting is one issue, and then the way the samplers are configured must also be wrong. How come I get only real, normal responses with the default Alpaca preset in LM Studio without changing anything else, while the eval gets these garbage responses 99% of the time?

I cannot in all fairness consider this eval adequate, because this model is obviously far from broken. Otherwise, I would be getting the same garbage the eval shows, and that simply isn't happening.

How the formatting requirement makes any sense is beyond me; it serves to fail models that are actually answering correctly. Using an answer extractor would be the obvious way to go about it. Maybe there should be a separate test specifically for formatting, apart from the math questions.

Thanks again for taking the time to review the feedback!

PS: A note on formatting. I know this is not quite relevant to the sterile environment of these eval setups, but I do have a character JSON card that I use in some frontends, which contains example conversations with specific formatting. Every response contains [Analysis: ...] and then [Output: ...]. The model follows that very consistently when prompted like that, so I know it's not that bad at formatting when instructed with a character card that includes examples.

Hugging Face H4 org

Regarding JSON output, we actually have a blog post on something similar coming up soon, with partners :)

For the difference, I think it could be due to the default generation parameters in the harness, which might use a lower temperature than what you are using. Did you try reproducing (even on some samples, like the first 10) with the same command as us, using the HF model?
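
As a rough sketch (assuming greedy decoding for the comparison, which may not match the harness's exact defaults; the model id and prompt below are placeholders), reproducing a single sample from the details could look like this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "YourOrg/your-solar-merge"  # hypothetical - use your merge's repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Paste a full_prompt copied verbatim from the details dataset here.
full_prompt = "..."

inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False,  # greedy decoding, i.e. no temperature/top_p sampling
)
# Print only the newly generated tokens, after the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```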

Sadly, I cannot load this model in HF format; my only option is GGUF with offloading, since I'm constrained by VRAM. :/ Based on common sense, a bit-crushed quant should perform a bit worse, not better, so if anything the model should give better responses in HF format.

I am doing the manual tests on a Q6_K quant, with the baseline Alpaca LM Studio settings:

[screenshot of the LM Studio sampler settings]

I'm very interested to see the blog post when you release it - thanks a lot for your work as well! I don't want to sound like I'm diminishing it; this LB is amazing. Even if benchmarks in isolation aren't ideal, that is true beyond this LB itself; after all, no method is perfect for rating all models.
