Scores of GPT3.5 and GPT4 for comparison

#30
by gsaivinay - opened

Hello,

Wondering if it is possible to evaluate and include GPT-3.5 and GPT-4 scores in this benchmark, so that we have a point of comparison between open-source models and OpenAI's.

Here are the benchmarks that were included in the "Technical Report":
[screenshot: benchmark table from the GPT-4 technical report]
It's not surprising that they dominate in multi-shot settings since they have much larger context lengths. We don't know exactly how their evaluation setup compares to this one, so it might still be worth running the evals through their API.

In their code:
gpt4_values = {
"Model": f'gpt4',
"Revision": "tech report",
"8bit": None,
"Average ⬆️": 84.3,
"ARC (25-shot) ⬆️": 96.3,
"HellaSwag (10-shot) ⬆️": 95.3,
"MMLU (5-shot) ⬆️": 86.4,
"TruthfulQA (0-shot) ⬆️": 59.0,
}
gpt35_values = {
"Model": f'gpt3.5',
"Revision": "tech report",
"8bit": None,
"Average ⬆️": 71.9,
"ARC (25-shot) ⬆️": 85.2,
"HellaSwag (10-shot) ⬆️": 85.5,
"MMLU (5-shot) ⬆️": 70.0,
"TruthfulQA (0-shot) ⬆️": 47.0,
}
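For anyone who wants to surface these reference rows in their own clone of the Space, here is a minimal sketch. It assumes the leaderboard table is held as a pandas DataFrame with the column names from the dicts above; the actual internals of the Space may differ.

```python
import pandas as pd

# Reference rows quoted above (numbers as reported in the GPT-4 technical report).
reference_rows = [
    {"Model": "gpt4", "Revision": "tech report", "8bit": None,
     "Average ⬆️": 84.3, "ARC (25-shot) ⬆️": 96.3, "HellaSwag (10-shot) ⬆️": 95.3,
     "MMLU (5-shot) ⬆️": 86.4, "TruthfulQA (0-shot) ⬆️": 59.0},
    {"Model": "gpt3.5", "Revision": "tech report", "8bit": None,
     "Average ⬆️": 71.9, "ARC (25-shot) ⬆️": 85.2, "HellaSwag (10-shot) ⬆️": 85.5,
     "MMLU (5-shot) ⬆️": 70.0, "TruthfulQA (0-shot) ⬆️": 47.0},
]

def add_reference_rows(leaderboard_df: pd.DataFrame) -> pd.DataFrame:
    """Append the closed-model reference rows and re-sort by the average column."""
    merged = pd.concat([leaderboard_df, pd.DataFrame(reference_rows)], ignore_index=True)
    return merged.sort_values("Average ⬆️", ascending=False)
```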

Are the scores not included because they were not verified? Would love to see 3.5 on there as a control or comparison.

Ah! These are the numbers I was looking for. Thank you for sharing.

Open LLM Leaderboard org
edited Jul 3, 2023

These scores are not included because the GPT-3.5 and GPT-4 models are not open, and it is an "Open LLM Leaderboard" after all :)

clefourrier changed discussion status to closed

it is an "Open LLM Leaderboard" after all :)

Is there a reason not to include closed-source models in the evals/leaderboard? The EleutherAI lm-evaluation-harness mentions support for OpenAI models (see the sketch below).

I would personally find it useful to have a comparison between all leading models (both open and closed source), to be able to make design/implementation tradeoff decisions.

It would also be helpful to be able to access the raw eval output for a given row in the table and dig into the raw model outputs, but that's a nice-to-have, I guess.
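Since the harness came up: a rough, hedged sketch of what scoring an OpenAI model with lm-evaluation-harness could look like is below. The backend name and arguments are assumptions that depend on the harness version, so check the docs of the version you have installed.

```python
# Hypothetical sketch: evaluating an OpenAI model with EleutherAI's lm-evaluation-harness.
# Assumes a recent harness (v0.4+) and an OPENAI_API_KEY set in the environment.
import lm_eval

results = lm_eval.simple_evaluate(
    model="openai-chat-completions",   # backend name may differ in older harness versions
    model_args="model=gpt-3.5-turbo",
    tasks=["gsm8k"],                   # a generation-based task
    num_fewshot=5,
)
print(results["results"]["gsm8k"])

# Caveat: chat endpoints don't return token log-probabilities, so the
# loglikelihood-based tasks the leaderboard uses (ARC, HellaSwag, MMLU,
# TruthfulQA-MC) can't be scored this way, which is one more reason the
# numbers wouldn't be directly comparable.
```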

Open LLM Leaderboard org

Hi! The main reason is that this is a leaderboard for open models, both for philosophical reasons (openness is cool) and for practical ones: we want to ensure that the results we display are accurate and reproducible. However, 1) commercial closed models can change behind their API at any time, which would render any score obtained at a given moment incorrect, and 2) we re-run everything on our cluster to ensure all models are evaluated with the same setup, which you can't do for models such as OpenAI's.

Reading the GPT-4 Technical Report may help you dig deeper: https://cdn.openai.com/papers/gpt-4.pdf (page 7).

If anybody just wants a quick comparison of the GPT-3.5 and GPT-4 models alongside the other open LLMs, I've cloned this space and made those OpenAI models visible.

https://huggingface.co/spaces/gsaivinay/open_llm_leaderboard

I've also added counts of models in the queue and in finished status.

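If anyone wants to reproduce those counts outside the Space, here is a hypothetical sketch. It assumes the evaluation queue lives in the open-llm-leaderboard/requests dataset as JSON files with a "status" field (PENDING / RUNNING / FINISHED); verify that layout against the Space's actual code before relying on it.

```python
import json
from collections import Counter
from pathlib import Path

from huggingface_hub import snapshot_download

# Assumed location and layout of the evaluation queue (check the Space's code).
queue_dir = snapshot_download(repo_id="open-llm-leaderboard/requests", repo_type="dataset")

counts = Counter()
for path in Path(queue_dir).rglob("*.json"):
    with open(path) as f:
        counts[json.load(f).get("status", "UNKNOWN")] += 1

print(dict(counts))  # e.g. {"FINISHED": ..., "PENDING": ..., "RUNNING": ...}
```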

Open LLM Leaderboard org

@gsaivinay do you want to open a PR on the leaderboard to add the count of models in the queue?

Sure, will open a PR

@clefourrier Hello,
How did you get the scores for GPT-3.5 and GPT-4? Are they from the technical report?

Thanks a lot.

Open LLM Leaderboard org

@wonhosong These numbers do come from the technical report; however, they are not necessarily reported with the same number of few-shot examples as ours.

I've done it manually; I think the numbers are right, but I'm not 100% sure.

Here it is:

GPT-4:
ARC (25-shot): 96.3
HellaSwag (10-shot): 95.3
MMLU (5-shot): 86.5
TruthfulQA (0-shot): 59.0
Winogrande (5-shot): 87.5
GSM8K (5-shot): 97.0
DROP (3-shot): 80.9

Avg GPT-4: 86.07

GPT-3.5:
ARC (25-shot): 85.2
HellaSwag (10-shot): 85.5
MMLU (5-shot): 70.0
TruthfulQA (0-shot): 47.0
Winogrande (5-shot): 81.6
GSM8K (5-shot): 57.1
DROP (3-shot): 61.4

Avg GPT-3.5: 69.69
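To double-check the arithmetic, here is a quick sketch using the per-task numbers above (unweighted mean over the 7 tasks):

```python
# Recompute the simple averages from the per-task scores listed above.
gpt4 = {"ARC": 96.3, "HellaSwag": 95.3, "MMLU": 86.5, "TruthfulQA": 59.0,
        "Winogrande": 87.5, "GSM8K": 97.0, "DROP": 80.9}
gpt35 = {"ARC": 85.2, "HellaSwag": 85.5, "MMLU": 70.0, "TruthfulQA": 47.0,
         "Winogrande": 81.6, "GSM8K": 57.1, "DROP": 61.4}

for name, scores in [("gpt-4", gpt4), ("gpt-3.5", gpt35)]:
    print(f"{name}: {sum(scores.values()) / len(scores):.2f}")
# gpt-4: 86.07
# gpt-3.5: 69.69
```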

It's easier to compare different models against GPT4 using the model card available on LLM Explorer. See, for example, this link: https://llm.extractum.io/model/mistralai%2FMixtral-8x7B-Instruct-v0.1,8t3fi9hMpQjLjo3YWPGwQ.

Is anyone aware of detailed benchmark results for the OpenAI models, for comparison (i.e. what you get when setting the log_samples parameter, and what we have in the Open LLM Leaderboard details datasets, e.g. https://huggingface.co/datasets/open-llm-leaderboard/details_abacusai__Smaug-72B-v0.1/tree/main/2024-02-04T04-59-32.876763)?
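For the open models, the per-sample outputs are published in those details datasets; below is a hedged sketch of pulling them. Config and split names are read from the Hub at runtime because I'm not certain of the exact naming, and there is no equivalent dataset for the OpenAI models.

```python
# Hypothetical sketch: load the per-sample details the leaderboard publishes for an open model.
from datasets import get_dataset_config_names, load_dataset

repo = "open-llm-leaderboard/details_abacusai__Smaug-72B-v0.1"

configs = get_dataset_config_names(repo)   # roughly one config per task/shot setting
print(configs)

# The details repos expose a "latest" split per their dataset cards; if it's absent,
# fall back to one of the timestamped splits.
details = load_dataset(repo, configs[0], split="latest")
print(details[0])   # prompt, model output, and per-sample metrics for one example
```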
