Detailed results are inconsistent

#734
by sbdzdz - opened

I downloaded the detailed results from the Leaderboard using the following code:

import datasets
import transformers

# Per-sample details for Meta-Llama-3-8B-Instruct on 5-shot Winogrande
data = datasets.load_dataset(
    "open-llm-leaderboard/details_meta-llama__Meta-Llama-3-8B-Instruct",
    name="harness_winogrande_5",
    split="latest",
)
tokenizer = transformers.AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
df = data.to_pandas()
first_row = df.iloc[0]

I want to retrieve the complete data sample (the question together with its answer choices) for each row. Based on this discussion, I decode the cont_tokens column to get the choices. As a sanity check, I also looked at example, full_prompt, and input_tokens. However, the results are inconsistent:

>>> print(first_row["example"])
People think _ is embarassed, because Samantha made snide comments about the shirt Rebecca was wearing.

>>> print(tokenizer.decode(first_row["cont_tokens"][0]))
is embarassed, because Samantha made snide comments about the shirt Rebecca was wearing.

>>> print(tokenizer.decode(first_row["cont_tokens"][1]))
is embarassed, because Samantha made snide comments about the shirt Rebecca was wearing.

>>> print(first_row["full_prompt"])
Natalie took basic French lessons from Betty after school because Betty is strong at that language.

My friends tried to drive the car through the alleyway but the car was too wide.

Sarah didn't practice ballet much but Mary practiced all the time. Sarah wasn't chosen to be a lead dancer.

The trainer tried to put the exercise equipment in the van but it wouldn't fit; the van was too small.

Natalie never checks the air in the tires while Tanya does and you just knew Natalie would have flat tires.

People think Rebecca

>>> print(tokenizer.decode(first_row["input_tokens"][0]))
Natalie took basic French lessons from Betty after school because Betty is strong at that language.

My friends tried to drive the car through the alleyway but the car was too wide.

Sarah didn't practice ballet much but Mary practiced all the time. Sarah wasn't chosen to be a lead dancer.

The trainer tried to put the exercise equipment in the van but it wouldn't fit; the van was too small.

Natalie never checks the air in the tires while Tanya does and you just knew Natalie would have flat tires.

People think Samantha is embarassed, because Samantha made snide comments about the shirt Rebecca was wearing!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

>>> print(tokenizer.decode(first_row["input_tokens"][1]))
Natalie took basic French lessons from Betty after school because Betty is strong at that language.

My friends tried to drive the car through the alleyway but the car was too wide.

Sarah didn't practice ballet much but Mary practiced all the time. Sarah wasn't chosen to be a lead dancer.

The trainer tried to put the exercise equipment in the van but it wouldn't fit; the van was too small.

Natalie never checks the air in the tires while Tanya does and you just knew Natalie would have flat tires.

People think Rebecca is embarassed, because Samantha made snide comments about the shirt Rebecca was wearing!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Some surprises:

  • the two decoded cont_tokens entries (the choices) are identical
  • full_prompt and the decoded input_tokens differ
  • full_prompt already has the blank filled in

This is the case for all samples from Winogrande across the few models I tried.

Based on this, I assume the evaluation works as follows:

Given the sentence "People think _ is embarassed, because Samantha made snide comments about the shirt Rebecca was wearing.", we prompt the model with:

  1. "People think Samantha "
  2. "People think Rebecca "

and in both cases measure the probability of the continuation: "is embarassed, because Samantha made snide comments about the shirt Rebecca was wearing".
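
To make sure I'm reading this right, here is a minimal sketch of the scoring I'm assuming (my own reconstruction, not necessarily what the harness does internally, and I drop the five-shot prefix for brevity):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def continuation_logprob(prefix: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to the continuation tokens."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    cont_ids = tokenizer(continuation, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, cont_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position i predict the token at position i + 1, so shift by one.
    log_probs = torch.log_softmax(logits[0, :-1].float(), dim=-1)
    positions = range(prefix_ids.shape[1] - 1, input_ids.shape[1] - 1)
    tokens = input_ids[0, prefix_ids.shape[1]:]
    return sum(log_probs[pos, tok].item() for pos, tok in zip(positions, tokens))

continuation = " is embarassed, because Samantha made snide comments about the shirt Rebecca was wearing."
scores = {
    option: continuation_logprob(f"People think {option}", continuation)
    for option in ("Samantha", "Rebecca")
}
print(max(scores, key=scores.get))  # the option whose prefix makes the continuation more likely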

Is this accurate? What would be the best way to get the full example with choices? I'm currently leaning towards matching df["example"] back to the original dataset.
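
Concretely, I'm thinking of something like this for the matching (assuming the leaderboard uses the winogrande_xl validation split and that df["example"] matches the original sentence field verbatim; df and first_row come from the snippet above):

import datasets

# Depending on your datasets version you may need trust_remote_code=True here.
winogrande = datasets.load_dataset("allenai/winogrande", "winogrande_xl", split="validation")
lookup = {row["sentence"]: row for row in winogrande}

original = lookup.get(first_row["example"])
if original is not None:
    print(original["option1"], original["option2"], original["answer"])
else:
    print("no exact string match; the example text may have been normalized")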

sbdzdz changed discussion title from Detailed results seem broken to Detailed results seem wrong
sbdzdz changed discussion title from Detailed results seem wrong to Detailed results are inconsistent

This is the case for all samples from Winogrande across the few models I tried.

As an aside, I've been wondering whether the full_prompt values are consistent across models for the same evaluation task. I've been assuming they are. Based on your exploration, can you confirm?

For example, the case you've presented is for Meta-Llama-3-8B-Instruct under winogrande. If I were to look up the winogrande details for a different model, would that dataset contain a "Natalie..." full_prompt that is exactly the same? I'm essentially trying to find a unique key for each question. Some datasets have prompt hashes, which in theory could be that key, but I think only if the prompts are string-consistent across models.
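
In case it's useful, here is roughly how I'd check it myself: hash the full_prompt strings from two detail datasets for the same task and compare the sets (the second repository name below is just an example of another evaluated model):

import hashlib
import datasets

def prompt_hashes(repo: str) -> set:
    # Hash each full_prompt so the set can serve as a candidate per-question key.
    data = datasets.load_dataset(repo, name="harness_winogrande_5", split="latest")
    return {hashlib.sha256(p.encode("utf-8")).hexdigest() for p in data["full_prompt"]}

hashes_a = prompt_hashes("open-llm-leaderboard/details_meta-llama__Meta-Llama-3-8B-Instruct")
hashes_b = prompt_hashes("open-llm-leaderboard/details_mistralai__Mistral-7B-Instruct-v0.2")
print(f"{len(hashes_a & hashes_b)} of {len(hashes_a)} prompts are string-identical")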

Open LLM Leaderboard org

Hi @sbdzdz ,

Based on your examination of the dataset and the tokenization, it seems that the Winogrande evaluation indeed builds one prompt per answer option and scores the likelihood of the shared continuation for each prompt independently.

For getting a full example with choices, your approach of matching df["example"] back to the original dataset might indeed be necessary due to how the data is structured and tokenized.

For more details on how models are evaluated, you might find this discussion on the EleutherAI GitHub repository useful: LM Evaluation Harness Issue #978

alozowski changed discussion status to closed
