Detailed results are inconsistent
I downloaded the detailed results from the Leaderboard using the following code:
import datasets
import transformers
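# Per-sample details for Meta-Llama-3-8B-Instruct on Winogrande (5-shot)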
data = datasets.load_dataset(
    "open-llm-leaderboard/details_meta-llama__Meta-Llama-3-8B-Instruct",
    name="harness_winogrande_5",
    split="latest"
)
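# Tokenizer of the evaluated model, used to decode the stored token IDs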
tokenizer = transformers.AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
df = data.to_pandas()
first_row = df.iloc[0]
I am interested in retrieving the complete data sample. Based on this discussion, I decode the cont_tokens column to get the choices. As a sanity check, I also looked into the example, full_prompt, and input_tokens columns. However, the results are inconsistent:
>>> print(first_row["example"])
People think _ is embarassed, because Samantha made snide comments about the shirt Rebecca was wearing.
>>> print(tokenizer.decode(first_row["cont_tokens"][0]))
is embarassed, because Samantha made snide comments about the shirt Rebecca was wearing.
>>> print(tokenizer.decode(first_row["cont_tokens"][1]))
is embarassed, because Samantha made snide comments about the shirt Rebecca was wearing.
>>> print(first_row["full_prompt"])
Natalie took basic French lessons from Betty after school because Betty is strong at that language.
My friends tried to drive the car through the alleyway but the car was too wide.
Sarah didn't practice ballet much but Mary practiced all the time. Sarah wasn't chosen to be a lead dancer.
The trainer tried to put the exercise equipment in the van but it wouldn't fit; the van was too small.
Natalie never checks the air in the tires while Tanya does and you just knew Natalie would have flat tires.
People think Rebecca
>>> print(tokenizer.decode(first_row["input_tokens"][0]))
Natalie took basic French lessons from Betty after school because Betty is strong at that language.
My friends tried to drive the car through the alleyway but the car was too wide.
Sarah didn't practice ballet much but Mary practiced all the time. Sarah wasn't chosen to be a lead dancer.
The trainer tried to put the exercise equipment in the van but it wouldn't fit; the van was too small.
Natalie never checks the air in the tires while Tanya does and you just knew Natalie would have flat tires.
People think Samantha is embarassed, because Samantha made snide comments about the shirt Rebecca was wearing!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>>> print(tokenizer.decode(first_row["input_tokens"][1]))
Natalie took basic French lessons from Betty after school because Betty is strong at that language.
My friends tried to drive the car through the alleyway but the car was too wide.
Sarah didn't practice ballet much but Mary practiced all the time. Sarah wasn't chosen to be a lead dancer.
The trainer tried to put the exercise equipment in the van but it wouldn't fit; the van was too small.
Natalie never checks the air in the tires while Tanya does and you just knew Natalie would have flat tires.
People think Rebecca is embarassed, because Samantha made snide comments about the shirt Rebecca was wearing!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Some surprises:
- the choices are identical
- full_prompt and input_tokens differ
- full_prompt has the blank filled
This is the case for all samples from Winogrande across the few models I tried.
Based on this, I assume the evaluation works as follows:
Given the sentence "People think _ is embarassed, because Samantha made snide comments about the shirt Rebecca was wearing.", we prompt the model with:
- "People think Samantha "
- "People think Rebecca "
and in both cases measure the probability of the continuation: "is embarassed, because Samantha made snide comments about the shirt Rebecca was wearing".
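In code, the comparison I have in mind looks roughly like this (my own sketch, not the actual lm-evaluation-harness implementation; the few-shot context is omitted and length normalization / token-boundary handling are simplified):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def continuation_logprob(prefix: str, continuation: str) -> float:
    # Sum of log-probabilities the model assigns to the continuation tokens given the prefix.
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    full_ids = tokenizer(prefix + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    cont_start = prefix_ids.shape[1]
    # Logits at position i predict token i + 1, so shift the slice by one.
    log_probs = torch.log_softmax(logits[0, cont_start - 1:-1], dim=-1)
    cont_ids = full_ids[0, cont_start:]
    return log_probs[torch.arange(cont_ids.shape[0]), cont_ids].sum().item()

continuation = " is embarassed, because Samantha made snide comments about the shirt Rebecca was wearing."
scores = {
    option: continuation_logprob(f"People think {option}", continuation)
    for option in ["Samantha", "Rebecca"]
}
prediction = max(scores, key=scores.get)  # the higher-likelihood filling wins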
Is this accurate? What would be the best way to get the full example with choices? I'm currently leaning towards matching df["example"] back to the original dataset.
As an aside, I've been wondering whether the values for full_prompt are consistent across models under the same evaluation task. I've been assuming they are. Based on your exploration, can you confirm?
For example, the case you've presented is for model Meta-Llama-3-8B-Instruct under winogrande. If I were to look up the winogrande dataset for a different model, I'm wondering whether that dataset would have a "Natalie..." full_prompt that is exactly the same. I'm essentially trying to find a unique key for each question. Some datasets have prompt hashes, which in theory could be that key, but I think only if prompts are string-consistent.
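One pragmatic option, assuming the example text itself is stored verbatim (unlike full_prompt, which also bakes in the sampled few-shot examples and so may differ between runs), would be to key each row by a hash of example, e.g.:

import hashlib

def example_key(example: str) -> str:
    # Stable key derived from the raw example text only.
    return hashlib.sha256(example.strip().encode("utf-8")).hexdigest()

df["example_key"] = df["example"].map(example_key)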
Hi @sbdzdz ,
Based on your examination of the dataset and tokenization, it seems that the Winogrande evaluation indeed builds one prefix per option and scores the likelihood of the shared continuation under each prefix independently.
To get the full example with choices, your approach of matching df["example"] back to the original dataset might indeed be necessary, given how the data is structured and tokenized.
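A minimal sketch of that matching step, assuming the Leaderboard examples come from the winogrande_xl validation split and that the example column stores the sentence field verbatim (both assumptions worth verifying on your data):

import datasets

winogrande = datasets.load_dataset("winogrande", "winogrande_xl", split="validation")
# Index the original rows by their raw sentence text.
by_sentence = {row["sentence"]: row for row in winogrande}

source = by_sentence.get(first_row["example"])
if source is not None:
    choices = [source["option1"], source["option2"]]
    gold = choices[int(source["answer"]) - 1]  # "answer" is the string "1" or "2"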
For more details on how models are evaluated, you might find this discussion on the EleutherAI GitHub repository useful: LM Evaluation Harness Issue #978