How are eval details formatted?

#567
by feiba54 - opened

I was digging into open-llm-leaderboard/details_01-ai__Yi-34B, the dataset automatically created during the evaluation run of the model 01-ai/Yi-34B on the Open LLM Leaderboard.

I used the script from the dataset README to load the details of this run, like this:

from datasets import load_dataset

# Load the 5-shot Winogrande details for the latest evaluation run.
data = load_dataset("open-llm-leaderboard/details_01-ai__Yi-34B_public",
    "harness_winogrande_5",
    split="latest")
print(data[1])

and it shows:

Generating latest split: 1267 examples [00:00, 114566.85 examples/s]
{'choices': [], 'cont_tokens': [[978, 4841, 43163, 98], [978, 4841, 43163, 98]], 'example': 'For her birthday gifts, Sarah was upset with the pearls, but felt the opposite about the rings she received. The _ were fancier.', 'full_prompt': "Ryan has more stress levels in their body than Nelson because Ryan works a very fast paced job.\n\nI removed the water from the pool into the ditch until the ditch was full.\n\nBeing a web developer was great work for Megan but not Maria because Maria hated computers.\n\nDerrick was better at caring for a PICC line than Joseph because Joseph hadn't dealt with a catheter before.\n\nGoing for a swim was bad at the beach not because of the jellyfish but of the garbage since the jellyfish was far away.\n\nFor her birthday gifts, Sarah was upset with the pearls, but felt the opposite about the rings she received. The rings", 'gold': [], 'gold_index': [], 'input_tokens': [[59615, 7806, 815, 863, 5261, 4154, 594, 881, 2534, 989, 22774, 1199, 11909, 2872, 562, 1196, 3615, 56280, 2123, 98, 144, 144, 59597, 6567, 567, 2127, 742, 567, 6504, 1029, 567, 44158, 2167, 567, 44158, 717, 1973, 98, 144, 144, 48376, 562, 3447, 12259, 717, 1392, 932, 631, 49533, 796, 728, 23050, 1199, 23050, 24208, 11942, 98, 144, 144, 59614, 56191, 717, 1665, 702, 19054, 631, 562, 694, 2532, 59608, 1641, 989, 13015, 1199, 13015, 12595, 59610, 59570, 19327, 651, 562, 4519, 43687, 1405, 98, 144, 144, 10098, 583, 631, 562, 9695, 717, 2637, 702, 567, 9137, 728, 1199, 593, 567, 43275, 16984, 796, 593, 567, 20422, 1634, 567, 43275, 16984, 717, 2126, 2072, 98, 144, 144, 4260, 1019, 10496, 13113, 97, 16016, 717, 13595, 651, 567, 29396, 9940, 97, 796, 4107, 567, 8472, 883, 567, 15559, 1105, 3921, 98, 707, 29396, 9940, 978, 4841, 43163, 0, 0, 0, 0, 0], [59615, 7806, 815, 863, 5261, 4154, 594, 881, 2534, 989, 22774, 1199, 11909, 2872, 562, 1196, 3615, 56280, 2123, 98, 144, 144, 59597, 6567, 567, 2127, 742, 567, 6504, 1029, 567, 44158, 2167, 567, 44158, 717, 1973, 98, 144, 144, 48376, 562, 3447, 12259, 717, 1392, 932, 631, 49533, 796, 728, 23050, 1199, 23050, 24208, 11942, 98, 144, 144, 59614, 56191, 717, 1665, 702, 19054, 631, 562, 694, 2532, 59608, 1641, 989, 13015, 1199, 13015, 12595, 59610, 59570, 19327, 651, 562, 4519, 43687, 1405, 98, 144, 144, 10098, 583, 631, 562, 9695, 717, 2637, 702, 567, 9137, 728, 1199, 593, 567, 43275, 16984, 796, 593, 567, 20422, 1634, 567, 43275, 16984, 717, 2126, 2072, 98, 144, 144, 4260, 1019, 10496, 13113, 97, 16016, 717, 13595, 651, 567, 29396, 9940, 97, 796, 4107, 567, 8472, 883, 567, 15559, 1105, 3921, 98, 707, 15559, 978, 4841, 43163, 0, 0, 0, 0, 0, 0]], 'instruction': '', 'metrics': {'acc': True}, 'num_asked_few_shots': 5, 'num_effective_few_shots': 5, 'padded': [5, 6], 'pred_logits': [], 'predictions': [-13.8125, -12.625], 'truncated': [0, 0]}

I am now wondering about the meaning of each (key, value) pair, and especially where to find the prediction and the gold answer for this data sample.
Thanks!

Open LLM Leaderboard org

Hi!
For Winogrande, we compare the log likelihoods of the 2 possible choices following the example (in example for the text, in input_tokens for the tokenized version).
The choices are, in this case, stored tokenized in cont_tokens.
The logprobs associated with both choices are in predictions.
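
In case it helps, here is a minimal sketch of how you could recover the model's pick from a row. It assumes the padded field counts the right-padding tokens appended to each input sequence (which matches the trailing zeros in the dump above), and uses the model's own tokenizer to decode:

from datasets import load_dataset
from transformers import AutoTokenizer

data = load_dataset("open-llm-leaderboard/details_01-ai__Yi-34B_public",
    "harness_winogrande_5",
    split="latest")
row = data[1]

# The predicted choice is the one whose continuation got the higher logprob.
logprobs = row["predictions"]
pred_idx = max(range(len(logprobs)), key=logprobs.__getitem__)

# Strip the right padding (assumed to be counted in "padded") and decode
# the winning prompt to see which filling of the blank the model preferred.
tok = AutoTokenizer.from_pretrained("01-ai/Yi-34B")
tokens = row["input_tokens"][pred_idx]
print(tok.decode(tokens[:len(tokens) - row["padded"][pred_idx]]))

Note that both entries of cont_tokens are identical in the dump above: for Winogrande the two candidates share the same continuation (the text after the blank) and differ only in the prompt, where the blank is filled with "pearls" or "rings".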

clefourrier changed discussion status to closed
