HuggingFaceH4/open_llm_leaderboard · Question about "until token" of huggingface inference method.

kimdeokgi

27 days ago

edited 27 days ago

During GSM8K inference, the generation is truncated at the first ":" character.

[Example]
Full reasoning produced by the model:

First, we need to find the third typing speed, which is 52 + 5 = 57 WPM.\nTo find the average, we add the three speeds and divide by 3: (47 + 52 + 57) / 3 = 156 / 3 = 52 WPM.\n#### 52

Reasoning as recorded by the Hugging Face evaluation:

First, we need to find the third typing speed, which is 52 + 5 = 57 WPM.\nTo find the average, we add the three speeds and divide by 3:

Even though the reasoning is correct, the answer is treated as incorrect because the generation is truncated before the final answer.
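To illustrate why truncation leads to a wrong score, here is a minimal sketch. It assumes the final answer is read from a "#### <number>" marker, as in the GSM8K reference solutions. The helper name `extract_answer` and the regular expression are my own illustration, not necessarily what the leaderboard runs.

```python
import re

# Assumption: the scorer looks for "#### <number>" in the generation.
ANSWER_RE = re.compile(r"#### (-?[0-9.,]+)")


def extract_answer(generation):
    """Return the number after '####' as a string, or None if it is missing."""
    match = ANSWER_RE.search(generation)
    if match is None:
        return None
    # Drop thousands separators so "1,000" compares equal to "1000".
    return match.group(1).replace(",", "")


full = (
    "To find the average, we add the three speeds and divide by 3: "
    "(47 + 52 + 57) / 3 = 156 / 3 = 52 WPM.\n#### 52"
)
truncated = "To find the average, we add the three speeds and divide by 3:"

print(extract_answer(full))       # the marker is present
print(extract_answer(truncated))  # the marker was cut off
```

With the truncated text there is no "####" marker left to parse, so no answer can be compared against the reference.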

From reading the code, the "until" tokens (stop sequences) are as follows:
":", "Question:", "Question"
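A rough sketch of how such stop sequences cut a generation, under the assumption that the text is cut at the earliest occurrence of any stop string. The function `truncate_at_stop` is hypothetical and only mimics the behavior. Whether the stop string itself is kept in the output depends on the implementation; this sketch drops it.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut text at the earliest occurrence of any stop sequence (hypothetical helper)."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


generation = (
    "First, we need to find the third typing speed, which is 52 + 5 = 57 WPM.\n"
    "To find the average, we add the three speeds and divide by 3: "
    "(47 + 52 + 57) / 3 = 156 / 3 = 52 WPM.\n#### 52"
)

with_colon = truncate_at_stop(generation, [":", "Question:", "Question"])
without_colon = truncate_at_stop(generation, ["Question:", "Question"])

print(with_colon)     # stops right before the first ":"
print(without_colon)  # keeps the full reasoning, including "#### 52"
```

Removing ":" from the list leaves the whole reasoning intact in this example, while "Question:" and "Question" would still stop the model from starting a new problem.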

Why was ":" designated as an "until" token?
If removing it causes no problems, how can I exclude it?

clefourrier

Hugging Face H4 org 27 days ago

Hi!
This is how GSM8K was originally implemented in the EleutherAI Harness when we started the leaderboard. Since our goal is to provide entirely reproducible evaluations, we used it as is, with the good and the less good, so that anyone could reproduce our results precisely.

However, it has since been changed in the Harness, and we will update this aspect of the scoring in our next iteration of the leaderboard (coming in a month or so). For now, though, we can't re-run more than 6K models on this evaluation at once, as we don't have the compute.

clefourrier changed discussion status to closed 27 days ago