MMLU of ChatGPT/GPT3.5-turbo is 69~70, GSM8K 78.2
#1 opened by JosephusCheung
See MMLU 69.1 and GSM8K 78.2 on https://opencompass.org.cn/leaderboard-llm (updated 2023/9/1); other sources report an MMLU score of 70.
JosephusCheung changed discussion title from "MMLU of ChatGPT/GPT3.5-turbo is 69~70" to "MMLU of ChatGPT/GPT3.5-turbo is 69~70, GSM8K 78.2"
Our MMLU and GSM8K results come from Chain-of-Thought Hub.
We use the same prompts and answer matching as Chain-of-Thought Hub, so the comparison should be fair.
Model | # Params | Average | MT-Bench | AGIEval | BBH MC | TruthfulQA | MMLU | HumanEval | BBH CoT | GSM8K |
---|---|---|---|---|---|---|---|---|---|---|
OpenChat-3.5 | 7B | 61.6 | 7.81 | 47.4 | 47.6 | 59.1 | 64.3 | 55.5 | 63.5 | 77.3 |
ChatGPT (Yours) | ? | 61.5 | 7.94 | 47.1 | 47.6 | 57.7 | 67.3 | 48.1 | 70.1 | 74.9 |
ChatGPT (Other Sources*) | ? | 65.3 | 7.94 | 47.1 | 47.6 | 57.7 | 69.1* | 73.2* | 70.1 | 78.2* |
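The Average column is not defined in this thread. A minimal sketch, assuming it is the plain mean of the eight benchmark scores with MT-Bench (a 0-10 scale) multiplied by 10 first; that scaling is inferred from the numbers in the table, not stated by anyone here, and the names `rows` and `average` are only illustrative:

```python
# Scores copied from the table above, in column order:
# MT-Bench, AGIEval, BBH MC, TruthfulQA, MMLU, HumanEval, BBH CoT, GSM8K
rows = {
    "OpenChat-3.5": [7.81, 47.4, 47.6, 59.1, 64.3, 55.5, 63.5, 77.3],
    "ChatGPT (Yours)": [7.94, 47.1, 47.6, 57.7, 67.3, 48.1, 70.1, 74.9],
    "ChatGPT (Other Sources*)": [7.94, 47.1, 47.6, 57.7, 69.1, 73.2, 70.1, 78.2],
}


def average(scores, mt_bench_scale=10.0):
    """Mean of all scores, with the first entry (MT-Bench) rescaled.

    Assumption: MT-Bench is multiplied by 10 so it sits on the same
    0-100 scale as the other benchmarks before averaging.
    """
    mt_bench, *rest = scores
    return (mt_bench * mt_bench_scale + sum(rest)) / len(scores)


# Round to one decimal place, matching the precision used in the table.
averages = {name: round(average(scores), 1) for name, scores in rows.items()}

for name, value in averages.items():
    print(f"{name}: {value}")
```

Under that assumption the computed values match the Average column for all three rows (61.6, 61.5, 65.3), so the 3.8-point gap between the two ChatGPT rows comes entirely from the three starred columns (MMLU, HumanEval, GSM8K).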
Thank you for your interest in our results. As you've rightly pointed out, the performance of ChatGPT has evolved over time, and there are numerous reports from different time periods. For a clearer comparison, our reported results are based on the data available around March, which we label as ChatGPT (March), sourced from Chain-of-Thought Hub and OpenAI's technical report.
imone changed discussion status to closed