[FLAG] Voicelab/trurl-2-13b: training data surely includes the test data, right?

#202
by TNTOutburst - opened

There's no way that trurl-2-13b, a 13B model, beats the best 70B models on MMLU by FAR. It seems like it might know the MMLU test data, based on its absurdly high score.

They do disclose it in their model card:

> Training data
> The training data includes Q&A pairs from various sources including Alpaca comparison data with GPT, Falcon comparison data, Dolly 15k, Oasst1, Phu saferlfhf, ShareGPT version 2023.05.08v0 filtered and cleaned, Voicelab private datasets for JSON data extraction, modification, and analysis, CURLICAT dataset containing journal entries, dataset from Polish wiki with Q&A pairs grouped into conversations, MMLU data in textual format, Voicelab private dataset with sales conversations, arguments and objections, paraphrases, contact reason detection, and corrected dialogues.

We should probably add a column with a dataset contamination warning... Nobody can rationally judge this to be the best 13B model going simply by the leaderboard average. @clefourrier

The interesting thing is that on ARC it gets 60.07, which is 37th among 13B models. The median is around 57.94 and the max is held by Orca Mini at 63.14.

HellaSwag: 80.23, which is 144th, horribly bad among 13B models. In fact the median is 81.23, so it did worse than the median performance. The max is held by beaugogh/Llama2-13b-sharegpt4 at 84.53.

MMLU, at 78.59, is an extreme outlier, dramatically surpassing the max among 13B models, OpenOrca Platypus, which got 59.39. Highly abnormal, and yes, as @felixz mentioned, they do disclose the test data contamination in the model card.

Maybe add a column to detect outliers for each parameter size, i.e. do a groupby and flag anything above mean + 3*std, which for MMLU would have been 73.94, yet this model got 78.59. For skewed distributions, maybe a median + dispersion based approach instead; a rough sketch is below.
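For reference, here is a minimal sketch of what that per-group check could look like with pandas. The `params_bucket`/`mmlu` column names are assumptions, and all scores except trurl-2-13b's 78.59 and OpenOrca Platypus's 59.39 (quoted above) are made up for illustration:

```python
import pandas as pd

def flag_outliers(df: pd.DataFrame, metric: str = "mmlu",
                  group_col: str = "params_bucket") -> pd.DataFrame:
    """Flag suspiciously high scores within each parameter-size group."""
    out = df.copy()
    grp = out.groupby(group_col)[metric]
    mean, std = grp.transform("mean"), grp.transform("std")
    med = grp.transform("median")
    # Median absolute deviation per group, a robust spread for skewed scores.
    mad = out[metric].sub(med).abs().groupby(out[group_col]).transform("median")
    out["outlier_std"] = out[metric] > mean + 3 * std   # mean + 3*std rule
    out["outlier_mad"] = out[metric] > med + 3 * mad    # median + 3*MAD rule
    return out

# Toy example: three illustrative 13B scores plus trurl-2-13b's reported 78.59.
scores = pd.DataFrame({
    "model": ["typical-13b-a", "typical-13b-b", "typical-13b-c", "trurl-2-13b"],
    "params_bucket": ["13B"] * 4,
    "mmlu": [55.2, 56.8, 59.39, 78.59],
})
print(flag_outliers(scores)[["model", "mmlu", "outlier_std", "outlier_mad"]])
```

On a tiny sample like this the single outlier inflates the group std so much that only the robust median + MAD rule fires; on the full leaderboard, with many 13B entries, the mean + 3*std cutoff would land around the 73.94 estimated above and flag the model as well.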

Any model that is trained on the test sets should be removed from the leaderboard manually. An automated process to detect it doesn't seem necessary unless it becomes more common.

clefourrier changed discussion title from trurl-2-13b's data surely includes the test data, right? to [FLAG] Voicelab/trurl-2-13b: training data surely includes the test data, right?
Open LLM Leaderboard org

Hi! We introduced a flagging system to make it more obvious to users which models' results can't be fully trusted! Thank you all for your interest in this issue!

FLAG: This model has been flagged because it was trained on the test data (MMLU).

clefourrier changed discussion status to closed

@clefourrier Oh cool, great idea on the flagging!
