🏆 LLM-Leaderboard

A joint community effort to create one central leaderboard for LLMs. Contributions and corrections welcome!

Interactive Dashboard

https://llm-leaderboard.streamlit.app/

Leaderboard

Model Name	Commercial Use?	Chatbot Arena Elo	HellaSwag (few-shot)	HellaSwag (zero-shot)	HumanEval-Python (pass@1)	LAMBADA (zero-shot)	MMLU (zero-shot)	MMLU (few-shot)	TriviaQA (zero-shot)
alpaca-13b	no	1008
bloom-176b	yes				0.155
cerebras-gpt-7b	yes			0.636		0.636	0.259		0.141
cerebras-gpt-13b	yes			0.635		0.635	0.258		0.146
chatglm-6b	yes	985
chinchilla-70b	no			0.808		0.774		0.675
code-cushman-001	no				0.335
code-davinci-002	yes				0.658
codegen-16B-mono	yes				0.293
codegen-16B-multi	yes				0.183
codegx-13b	no				0.229
codex-12b	no				0.288			0.685
dolly-v2-12b	yes	944
eleuther-pythia-7b	yes			0.667		0.667	0.265		0.198
eleuther-pythia-12b	yes			0.704		0.704	0.253		0.233
fastchat-t5-3b	yes	951
gal-120b	no						0.526
gpt-3-175b	yes		0.793	0.789				0.439
gpt-3.5-175b	yes		0.855		0.481	0.762		0.700
gpt-4	yes		0.953		0.670			0.864
gpt-neox-20b	yes			0.719		0.719	0.269	0.336	0.347
gpt-j-6b	yes			0.683		0.683	0.261		0.234
koala-13b	no	1082
llama-7b	no			0.738	0.105	0.738	0.302		0.443
llama-13b	no	932		0.792	0.158
llama-33b	no			0.828	0.217
llama-65b	no			0.842	0.237			0.634
mpt-7b	yes			0.761		0.702	0.296		0.343
oasst-pythia-12b	yes	1065
opt-7b	no			0.677		0.677	0.251		0.227
opt-13b	no			0.692		0.692	0.257		0.282
palm-540b	no		0.838	0.834	0.262	0.779		0.693
replit-code-v1-3b	yes				0.219
stablelm-base-alpha-7b	yes			0.533		0.533	0.251		0.049
stablelm-tuned-alpha-7b	no	858
starcoder-base-16b	yes				0.304
starcoder-16b	yes				0.336
starcoder-16b (prompted)	yes				0.408
vicuna-13b	no	1169
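
To work with the table programmatically, here is a minimal sketch. It assumes the leaderboard has been saved as tab-separated text with the header row shown above. The sample string is a shortened excerpt of that table, and the helper name `load_leaderboard` is illustrative, not part of this repository.

```python
import csv
import io

# Illustrative excerpt of the table above (tab-separated, empty cells kept).
SAMPLE_TSV = (
    "Model Name\tCommercial Use?\tChatbot Arena Elo\tHumanEval-Python (pass@1)\n"
    "alpaca-13b\tno\t1008\t\n"
    "bloom-176b\tyes\t\t0.155\n"
    "llama-13b\tno\t932\t0.158\n"
)

# Columns that hold text rather than scores.
TEXT_COLUMNS = ("Model Name", "Commercial Use?")


def load_leaderboard(text):
    """Parse tab-separated leaderboard text into a list of dicts.

    Empty score cells become None; filled score cells become floats.
    """
    rows = []
    for raw in csv.DictReader(io.StringIO(text), delimiter="\t"):
        row = {}
        for key, value in raw.items():
            if key is None:
                # Extra cells beyond the header are ignored.
                continue
            value = (value or "").strip()
            if key in TEXT_COLUMNS:
                row[key] = value
            else:
                row[key] = float(value) if value else None
        rows.append(row)
    return rows


rows = load_leaderboard(SAMPLE_TSV)
```

Keeping missing scores as `None` rather than `0` avoids ranking a model as worst on a benchmark it was never evaluated on.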

Benchmarks

Benchmark Name	Author	Link	Description
Chatbot Arena Elo	LMSYS	https://lmsys.org/blog/2023-05-03-arena/	"In this blog post, we introduce Chatbot Arena, an LLM benchmark platform featuring anonymous randomized battles in a crowdsourced manner. Chatbot Arena adopts the Elo rating system, which is a widely-used rating system in chess and other competitive games." (Source: https://lmsys.org/blog/2023-05-03-arena/)
HellaSwag	Zellers et al.	https://arxiv.org/abs/1905.07830v1	"HellaSwag is a challenge dataset for evaluating commonsense NLI that is specially hard for state-of-the-art models, though its questions are trivial for humans (>95% accuracy)." (Source: https://paperswithcode.com/dataset/hellaswag)
HumanEval	Chen et al.	https://arxiv.org/abs/2107.03374v2	"It used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions." (Source: https://paperswithcode.com/dataset/humaneval)
LAMBADA	Paperno et al.	https://arxiv.org/abs/1606.06031	"The LAMBADA evaluates the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local context, but must be able to keep track of information in the broader discourse." (Source: https://huggingface.co/datasets/lambada)
MMLU	Hendrycks et al.	https://github.com/hendrycks/test	"The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects makes the benchmark ideal for identifying a model’s blind spots." (Source: https://paperswithcode.com/dataset/mmlu)
TriviaQA	Joshi et al.	https://arxiv.org/abs/1705.03551v2	"We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions." (Source: https://arxiv.org/abs/1705.03551v2)
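
Two of the metrics above have compact definitions. The sketch below shows the standard Elo expected-score and update formulas, and the unbiased pass@k estimator described by Chen et al. The K-factor of 32 and the sample counts are illustrative assumptions, not the settings used by LMSYS or by any model's authors.

```python
from math import comb


def elo_expected(rating_a, rating_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k_factor=32.0):
    """Return updated (rating_a, rating_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a tie, 0.0 for a loss.
    k_factor=32 is an assumed default chosen for illustration only.
    """
    delta = k_factor * (score_a - elo_expected(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated per problem, c: samples that pass the tests,
    k: number of samples the metric is allowed to draw.
    """
    if n - c < k:
        # Every possible draw of k samples contains at least one pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 the estimator reduces to c / n, the fraction of generated samples that pass, which is the quantity reported in the pass@1 column.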

How to Contribute

We always welcome contributions! You can contribute in the following ways:

table work (don't forget the links):
- filling in missing entries
- adding a new model as a new row to the leaderboard. Please keep alphabetical order.
- adding a new benchmark as a new column in the leaderboard and adding it to the benchmarks table. Please keep alphabetical order.
code work:
- improving the existing code
- requesting and implementing new features

Future Ideas

add model year
add "export current view as .csv" button to Streamlit demo
(TBD) add model details:
- #params
- #tokens seen during training
- context window length
- architecture type (transformer-decoder, transformer-encoder, transformer-encoder-decoder, ...)
if model details are added, allow hiding them in the interactive Streamlit dashboard with a checkbox?
(TBD) improve the filtering in the Streamlit demo, maybe filter by value range?
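
Two of the ideas above, filtering by value range and exporting the current view as .csv, could be prototyped without any dashboard code. This is a minimal sketch, assuming rows are held as dicts with missing scores stored as `None`. The function names `filter_by_range` and `to_csv` and the sample rows are hypothetical and are not taken from the existing Streamlit app.

```python
import csv
import io


def filter_by_range(rows, column, low, high):
    """Keep rows whose value in `column` lies in [low, high].

    Rows with a missing value (None) in that column are dropped.
    """
    return [
        row for row in rows
        if row.get(column) is not None and low <= row[column] <= high
    ]


def to_csv(rows, columns):
    """Serialize the selected columns of `rows` to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        # Missing scores are written as empty cells.
        writer.writerow({c: "" if row.get(c) is None else row[c] for c in columns})
    return buffer.getvalue()


# Sample rows using Elo values from the leaderboard above.
rows = [
    {"Model Name": "koala-13b", "Chatbot Arena Elo": 1082},
    {"Model Name": "vicuna-13b", "Chatbot Arena Elo": 1169},
    {"Model Name": "bloom-176b", "Chatbot Arena Elo": None},
]

view = filter_by_range(rows, "Chatbot Arena Elo", 1000, 1100)
csv_text = to_csv(view, ["Model Name", "Chatbot Arena Elo"])
```

In a Streamlit app, the resulting text could be passed as the `data` argument of `st.download_button`; how that would fit into the existing dashboard is left open here.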

More Open LLMs

If you are interested in an overview of open LLMs for commercial use and fine-tuning, check out the open-llms repository.

Sources

The results of this leaderboard are collected from the individual papers and published results of the model authors. For each reported value, the source is added as a link.

Special thanks to the following pages:

Disclaimer

The information above may be wrong. If you want to use a published model commercially, please contact a lawyer.