Benchmarks for GPT-3.5 & GPT-4 for comparison

#145
by mantrakp - opened

Is it possible to also add GPT-3.5 and GPT-4 benchmarks for comparison purposes?


Hi. A clone of this space includes GPT-3.5 and GPT-4. It can be found here: https://huggingface.co/spaces/gsaivinay/open_llm_leaderboard

@jaspercatapang Wow. That's wonderful. Thanks!

@jaspercatapang That is cool, but reproducibility is in question since we have no idea how OpenAI ran its benchmarks. I wonder if they can be reproduced with the same evaluation script as the rest of the leaderboard.


@felixz Agreed, it would be better if they could evaluate them. However, they already declined to run these evaluations in a previous thread, emphasizing that this leaderboard is for open LLMs.

But I agree with you: now that open LLMs are reaching proprietary-LLM levels of performance, it is important to have both categories validated.

Hugging Face H4 org

Hi!
We won't add GPT-3.5 and GPT-4, for two reasons: 1) as @jaspercatapang mentioned, this is a leaderboard for open LLMs; 2) more importantly, the main reason for not including models behind closed APIs such as GPT-3.5 is the well-known fact that these APIs change through time, so any evaluation we did would only be valid on the precise day we did it.

This would not give reproducible results, and reproducibility is very important to us.

clefourrier changed discussion status to closed

Can you believe that new models are now beating GPT-3.5 in average scores?


I think it's possible. It's also true on other leaderboards like https://tatsu-lab.github.io/alpaca_eval/.


The rationale that GPT "isn't open" or changes over time makes no sense to me. Half the reason people want this benchmark in the first place is specifically to compare models to ChatGPT and see "whether we're there yet", so to speak. If all we have is OpenAI's self-reporting of how their model performed three months ago, that's still an extremely valuable data point. OpenAI could shut down their entire company tomorrow and take all of their models offline, but the fact that somebody at some point in history created a language model which performed like that gives us an anchor for what's possible.

Furthermore, I see no reason why you couldn't run this same set of benchmarks through their API independently and then mark down the scores as "GPT-4-2308" or "GPT-4-2309", which would get you more objective and useful results. In fact, you could even do it over time to prove that their model changes, and exactly how much.
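Something along those lines would be simple enough. Here is a rough sketch, assuming the pre-1.0 `openai` Python package with `OPENAI_API_KEY` set; the prompt format, answer matching, and the tiny benchmark slice are simplified placeholders, not the leaderboard's actual harness:

```python
# Sketch: snapshot a closed-API model's score under a date-stamped label.
# Assumes the pre-1.0 `openai` package and OPENAI_API_KEY in the environment;
# the prompt format and answer matching are deliberately simplified.
import datetime
import json

import openai

MODEL = "gpt-4"  # the bare alias; the label below records *when* it was queried


def ask(question: str, choices: list[str]) -> str:
    """Send one multiple-choice item and return the raw model answer."""
    letters = "ABCD"[: len(choices)]
    prompt = (
        question
        + "\n"
        + "\n".join(f"{letter}. {choice}" for letter, choice in zip(letters, choices))
        + "\nAnswer with a single letter."
    )
    resp = openai.ChatCompletion.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp["choices"][0]["message"]["content"].strip()


def run_benchmark(items) -> float:
    """items: list of (question, choices, correct_letter). Returns accuracy."""
    correct = 0
    for question, choices, gold in items:
        correct += ask(question, choices).upper().startswith(gold)
    return correct / len(items)


if __name__ == "__main__":
    # Hypothetical one-item benchmark slice, just to show the record format.
    items = [("What is 2 + 2?", ["3", "4", "5", "6"], "B")]
    record = {
        "model": f"GPT-4-{datetime.date.today():%y%m}",  # e.g. "GPT-4-2309"
        "score": run_benchmark(items),
    }
    print(json.dumps(record))
```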

> Can you believe that new models are now beating GPT-3.5 in average scores?

The Llama 2 base model gets 69 on MMLU and similar on HellaSwag. I think the only thing these fine-tunes improved on is TruthfulQA.
So yes, you could say Llama 2 is on the level of GPT-3.5.
Still, all these benchmarks miss instruction following, conversation, and coding abilities, so it is hard to make a strong statement.
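To make the arithmetic concrete: the leaderboard average is just the mean of the four benchmarks (ARC, HellaSwag, MMLU, TruthfulQA), so a large TruthfulQA gain alone can push the average past GPT-3.5 even when MMLU doesn't move. The scores below are invented for illustration, not real results:

```python
# Illustrative only: how the four-benchmark average can flip even when MMLU doesn't.
# These numbers are invented for the example; they are not actual leaderboard scores.
BENCHMARKS = ["ARC", "HellaSwag", "MMLU", "TruthfulQA"]

def average(scores: dict) -> float:
    return sum(scores[b] for b in BENCHMARKS) / len(BENCHMARKS)

closed_model = {"ARC": 85.0, "HellaSwag": 85.0, "MMLU": 70.0, "TruthfulQA": 47.0}
finetune     = {"ARC": 71.0, "HellaSwag": 87.0, "MMLU": 69.0, "TruthfulQA": 65.0}

print(f"closed-model average: {average(closed_model):.2f}")  # 71.75
print(f"fine-tune average:    {average(finetune):.2f}")      # 73.00
```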

> The rationale that GPT "isn't open" or changes over time makes no sense to me. Half the reason people want this benchmark in the first place is specifically to compare models to ChatGPT and see "whether we're there yet", so to speak. If all we have is OpenAI's self-reporting of how their model performed three months ago, that's still an extremely valuable data point. OpenAI could shut down their entire company tomorrow and take all of their models offline, but the fact that somebody at some point in history created a language model which performed like that gives us an anchor for what's possible.
>
> Furthermore, I see no reason why you couldn't run this same set of benchmarks through their API independently and then mark down the scores as "GPT-4-2308" or "GPT-4-2309", which would get you more objective and useful results. In fact, you could even do it over time to prove that their model changes, and exactly how much.

Good arguments. The argument about closed models changing over time is pretty weak; I mean, you snapshot the result, as you said.
A better argument could be that HF does not really want to pay to benchmark GPT-3.5 or 4. The cost is not trivial, and you can get into many thousands of dollars quickly.
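A back-of-the-envelope sketch of that cost; every number here (example counts, tokens per prompt, per-token prices) is an assumption for illustration, not a measured figure:

```python
# Back-of-envelope cost of one leaderboard run through a paid API.
# Every figure is an assumption for illustration; real counts, prompt lengths,
# and prices differ, and scoring each answer option separately would
# multiply the number of calls.
examples = 3_500 + 10_000 + 14_000 + 800   # rough ARC + HellaSwag + MMLU + TruthfulQA sizes
prompt_tokens = 1_500                      # few-shot prompts get long
completion_tokens = 10
price_in_per_1k = 0.03                     # illustrative GPT-4-class input price, USD
price_out_per_1k = 0.06

cost = examples * (
    prompt_tokens / 1000 * price_in_per_1k
    + completion_tokens / 1000 * price_out_per_1k
)
print(f"~${cost:,.0f} per run")            # roughly $1,290 with these assumptions
```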
