Ludwig Stumpp committed on
Commit fe8088e
1 parent: ea40e33

Starting to add PaLM2 benchmark results

Files changed (1):
  1. README.md +51 -44
README.md CHANGED
@@ -8,50 +8,56 @@ https://llm-leaderboard.streamlit.app/
 
 ## Leaderboard
 
- | Model Name | Publisher | Commercial Use? | Chatbot Arena Elo | HellaSwag (few-shot) | HellaSwag (zero-shot) | HumanEval-Python (pass@1) | LAMBADA (zero-shot) | MMLU (zero-shot) | MMLU (few-shot) | TriviaQA (zero-shot) |
- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
- | [alpaca-13b](https://crfm.stanford.edu/2023/03/13/alpaca.html) | Stanford | no | [1008](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | | |
- | [bloom-176b](https://huggingface.co/bigscience/bloom) | BigScience | yes | | [0.744](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | | [0.155](https://huggingface.co/bigscience/bloom#results) | | [0.299](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | | |
- | [cerebras-gpt-7b](https://huggingface.co/cerebras/Cerebras-GPT-6.7B) | Cerebras | yes | | | [0.636](https://www.mosaicml.com/blog/mpt-7b) | | [0.636](https://www.mosaicml.com/blog/mpt-7b) | [0.259](https://www.mosaicml.com/blog/mpt-7b) | | [0.141](https://www.mosaicml.com/blog/mpt-7b) |
- | [cerebras-gpt-13b](https://huggingface.co/cerebras/Cerebras-GPT-13B) | Cerebras | yes | | | [0.635](https://www.mosaicml.com/blog/mpt-7b) | | [0.635](https://www.mosaicml.com/blog/mpt-7b) | [0.258](https://www.mosaicml.com/blog/mpt-7b) | | [0.146](https://www.mosaicml.com/blog/mpt-7b) |
- | [chatglm-6b](https://chatglm.cn/blog) | ChatGLM | yes | [985](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | | |
- | [chinchilla-70b](https://arxiv.org/abs/2203.15556v1) | DeepMind | no | | | [0.808](https://arxiv.org/abs/2203.15556v1) | | [0.774](https://arxiv.org/abs/2203.15556v1) | | [0.675](https://arxiv.org/abs/2203.15556v1) | |
- | [codex-12b / code-cushman-001](https://arxiv.org/abs/2107.03374) | OpenAI | yes | | | | [0.317](https://crfm.stanford.edu/helm/latest/?group=targeted_evaluations) | | | | |
- | [code-davinci-002](https://arxiv.org/abs/2207.10397v2) | OpenAI | yes | | | | [0.658](https://arxiv.org/abs/2207.10397v2) | | | | |
- | [codegen-16B-mono](https://huggingface.co/Salesforce/codegen-16B-mono) | Salesforce | yes | | | | [0.293](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
- | [codegen-16B-multi](https://huggingface.co/Salesforce/codegen-16B-multi) | Salesforce | yes | | | | [0.183](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
- | [codegx-13b](http://keg.cs.tsinghua.edu.cn/codegeex/) | Tsinghua University | no | | | | [0.229](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
- | [dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b) | Databricks | yes | [944](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | | |
- | [eleuther-pythia-7b](https://huggingface.co/EleutherAI/pythia-6.9b) | EleutherAI | yes | | | [0.667](https://www.mosaicml.com/blog/mpt-7b) | | [0.667](https://www.mosaicml.com/blog/mpt-7b) | [0.265](https://www.mosaicml.com/blog/mpt-7b) | | [0.198](https://www.mosaicml.com/blog/mpt-7b) |
- | [eleuther-pythia-12b](https://huggingface.co/EleutherAI/pythia-12b) | EleutherAI | yes | | | [0.704](https://www.mosaicml.com/blog/mpt-7b) | | [0.704](https://www.mosaicml.com/blog/mpt-7b) | [0.253](https://www.mosaicml.com/blog/mpt-7b) | | [0.233](https://www.mosaicml.com/blog/mpt-7b) |
- | [fastchat-t5-3b](https://huggingface.co/lmsys/fastchat-t5-3b-v1.0) | lmsys.org | yes | [951](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | | |
- | [gal-120b](https://arxiv.org/abs/2211.09085v1) | lmsys.org | no | | | | | | [0.526](https://paperswithcode.com/paper/galactica-a-large-language-model-for-science-1) | | |
- | [gpt-3-7b / curie](https://arxiv.org/abs/2005.14165) | OpenAI | yes | | [0.682](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | | | | | [0.243](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | |
- | [gpt-3-175b / davinci](https://arxiv.org/abs/2005.14165) | OpenAI | yes | | [0.793](https://arxiv.org/abs/2005.14165) | [0.789](https://arxiv.org/abs/2005.14165) | | | | [0.439](https://arxiv.org/abs/2005.14165) | |
- | [gpt-3.5-175b / text-davinci-003](https://arxiv.org/abs/2303.08774v3) | OpenAI | yes | | [0.822](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | | [0.481](https://arxiv.org/abs/2303.08774v3) | [0.762](https://arxiv.org/abs/2303.08774v3) | | [0.569](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | |
- | [gpt-3.5-175b / code-davinci-002](https://platform.openai.com/docs/model-index-for-researchers) | OpenAI | yes | | | | [0.463](https://crfm.stanford.edu/helm/latest/?group=targeted_evaluations) | | | | |
- | [gpt-4](https://arxiv.org/abs/2303.08774v3) | OpenAI | yes | | [0.953](https://arxiv.org/abs/2303.08774v3) | | [0.670](https://arxiv.org/abs/2303.08774v3) | | | [0.864](https://arxiv.org/abs/2303.08774v3) | |
- | [gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) | EleutherAI | yes | | [0.718](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | [0.719](https://www.mosaicml.com/blog/mpt-7b) | | [0.719](https://www.mosaicml.com/blog/mpt-7b) | [0.269](https://www.mosaicml.com/blog/mpt-7b) | [0.276](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | [0.347](https://www.mosaicml.com/blog/mpt-7b) |
- | [gpt-j-6b](https://huggingface.co/EleutherAI/gpt-j-6b) | EleutherAI | yes | | [0.663](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | [0.683](https://www.mosaicml.com/blog/mpt-7b) | | [0.683](https://www.mosaicml.com/blog/mpt-7b) | [0.261](https://www.mosaicml.com/blog/mpt-7b) | [0.249](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | [0.234](https://www.mosaicml.com/blog/mpt-7b) |
- | [koala-13b](https://bair.berkeley.edu/blog/2023/04/03/koala/) | Berkeley BAIR | no | [1082](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | | |
- | [llama-7b](https://arxiv.org/abs/2302.13971) | Meta AI | no | | | [0.738](https://www.mosaicml.com/blog/mpt-7b) | [0.105](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | [0.738](https://www.mosaicml.com/blog/mpt-7b) | [0.302](https://www.mosaicml.com/blog/mpt-7b) | | [0.443](https://www.mosaicml.com/blog/mpt-7b) |
- | [llama-13b](https://arxiv.org/abs/2302.13971) | Meta AI | no | [932](https://lmsys.org/blog/2023-05-03-arena/) | | [0.792](https://arxiv.org/abs/2302.13971) | [0.158](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
- | [llama-33b](https://arxiv.org/abs/2302.13971) | Meta AI | no | | | [0.828](https://arxiv.org/abs/2302.13971) | [0.217](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
- | [llama-65b](https://arxiv.org/abs/2302.13971) | Meta AI | no | | | [0.842](https://arxiv.org/abs/2302.13971) | [0.237](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | [0.634](https://arxiv.org/abs/2302.13971v1) | |
- | [mpt-7b](https://huggingface.co/mosaicml/mpt-7b) | MosaicML | yes | | | [0.761](https://www.mosaicml.com/blog/mpt-7b) | | [0.702](https://www.mosaicml.com/blog/mpt-7b) | [0.296](https://www.mosaicml.com/blog/mpt-7b) | | [0.343](https://www.mosaicml.com/blog/mpt-7b) |
- | [oasst-pythia-12b](https://huggingface.co/OpenAssistant/pythia-12b-pre-v8-12.5k-steps) | Open Assistant | yes | [1065](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | | |
- | [opt-7b](https://huggingface.co/facebook/opt-6.7b) | Meta AI | no | | | [0.677](https://www.mosaicml.com/blog/mpt-7b) | | [0.677](https://www.mosaicml.com/blog/mpt-7b) | [0.251](https://www.mosaicml.com/blog/mpt-7b) | | [0.227](https://www.mosaicml.com/blog/mpt-7b) |
- | [opt-13b](https://huggingface.co/facebook/opt-13b) | Meta AI | no | | | [0.692](https://www.mosaicml.com/blog/mpt-7b) | | [0.692](https://www.mosaicml.com/blog/mpt-7b) | [0.257](https://www.mosaicml.com/blog/mpt-7b) | | [0.282](https://www.mosaicml.com/blog/mpt-7b) |
- | [opt-66b](https://huggingface.co/facebook/opt-66b) | Meta AI | no | | [0.745](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | | | | | [0.276](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | |
- | [opt-175b](https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/) | Meta AI | no | | [0.791](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | | | | | [0.318](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | |
- | [palm-540b](https://arxiv.org/abs/2204.02311v5) | Google Research | no | | [0.838](https://arxiv.org/abs/2204.02311v5) | [0.834](https://arxiv.org/abs/2204.02311v5) | [0.262](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | [0.779](https://arxiv.org/abs/2204.02311v5) | | [0.693](https://arxiv.org/abs/2204.02311v5) | |
- | [replit-code-v1-3b](https://huggingface.co/replit/replit-code-v1-3b) | Replit | yes | | | | [0.219](https://twitter.com/amasad/status/1651019556423598081/photo/2) | | | | |
- | [stablelm-base-alpha-7b](https://huggingface.co/stabilityai/stablelm-base-alpha-7b) | Stability AI | yes | | | [0.533](https://www.mosaicml.com/blog/mpt-7b) | | [0.533](https://www.mosaicml.com/blog/mpt-7b) | [0.251](https://www.mosaicml.com/blog/mpt-7b) | | [0.049](https://www.mosaicml.com/blog/mpt-7b) |
- | [stablelm-tuned-alpha-7b](https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b) | Stability AI | no | [858](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | | |
- | [starcoder-base-16b](https://huggingface.co/bigcode/starcoderbase) | BigCode | yes | | | | [0.304](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
- | [starcoder-16b](https://huggingface.co/bigcode/starcoder) | BigCode | yes | | | | [0.336](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
- | [starcoder-16b (prompted)](https://huggingface.co/bigcode/starcoder) | BigCode | yes | | | | [0.408](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
- | [vicuna-13b](https://huggingface.co/lmsys/vicuna-13b-delta-v0) | lmsys.org | no | [1169](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | | |
+ | Model Name | Publisher | Commercial Use? | Chatbot Arena Elo | HellaSwag (few-shot) | HellaSwag (zero-shot) | HellaSwag (one-shot) | HumanEval-Python (pass@1) | LAMBADA (zero-shot) | LAMBADA (one-shot) | MMLU (zero-shot) | MMLU (few-shot) | TriviaQA (zero-shot) | TriviaQA (one-shot) | WinoGrande (one-shot) | WinoGrande (few-shot) |
+ | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+ | [alpaca-13b](https://crfm.stanford.edu/2023/03/13/alpaca.html) | Stanford | no | [1008](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | | | | | | | |
+ | [bloom-176b](https://huggingface.co/bigscience/bloom) | BigScience | yes | | [0.744](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | | | [0.155](https://huggingface.co/bigscience/bloom#results) | | | [0.299](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | | | | | |
+ | [cerebras-gpt-7b](https://huggingface.co/cerebras/Cerebras-GPT-6.7B) | Cerebras | yes | | | [0.636](https://www.mosaicml.com/blog/mpt-7b) | | | [0.636](https://www.mosaicml.com/blog/mpt-7b) | | [0.259](https://www.mosaicml.com/blog/mpt-7b) | | [0.141](https://www.mosaicml.com/blog/mpt-7b) | | | |
+ | [cerebras-gpt-13b](https://huggingface.co/cerebras/Cerebras-GPT-13B) | Cerebras | yes | | | [0.635](https://www.mosaicml.com/blog/mpt-7b) | | | [0.635](https://www.mosaicml.com/blog/mpt-7b) | | [0.258](https://www.mosaicml.com/blog/mpt-7b) | | [0.146](https://www.mosaicml.com/blog/mpt-7b) | | | |
+ | [chatglm-6b](https://chatglm.cn/blog) | ChatGLM | yes | [985](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | | | | | | | |
+ | [chinchilla-70b](https://arxiv.org/abs/2203.15556v1) | DeepMind | no | | | [0.808](https://arxiv.org/abs/2203.15556v1) | | | [0.774](https://arxiv.org/abs/2203.15556v1) | | | [0.675](https://arxiv.org/abs/2203.15556v1) | | | | |
+ | [codex-12b / code-cushman-001](https://arxiv.org/abs/2107.03374) | OpenAI | yes | | | | | [0.317](https://crfm.stanford.edu/helm/latest/?group=targeted_evaluations) | | | | | | | | |
+ | [code-davinci-002](https://arxiv.org/abs/2207.10397v2) | OpenAI | yes | | | | | [0.658](https://arxiv.org/abs/2207.10397v2) | | | | | | | | |
+ | [codegen-16B-mono](https://huggingface.co/Salesforce/codegen-16B-mono) | Salesforce | yes | | | | | [0.293](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | | | | | |
+ | [codegen-16B-multi](https://huggingface.co/Salesforce/codegen-16B-multi) | Salesforce | yes | | | | | [0.183](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | | | | | |
+ | [codegx-13b](http://keg.cs.tsinghua.edu.cn/codegeex/) | Tsinghua University | no | | | | | [0.229](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | | | | | |
+ | [dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b) | Databricks | yes | [944](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | | | | | | | |
+ | [eleuther-pythia-7b](https://huggingface.co/EleutherAI/pythia-6.9b) | EleutherAI | yes | | | [0.667](https://www.mosaicml.com/blog/mpt-7b) | | | [0.667](https://www.mosaicml.com/blog/mpt-7b) | | [0.265](https://www.mosaicml.com/blog/mpt-7b) | | [0.198](https://www.mosaicml.com/blog/mpt-7b) | | | |
+ | [eleuther-pythia-12b](https://huggingface.co/EleutherAI/pythia-12b) | EleutherAI | yes | | | [0.704](https://www.mosaicml.com/blog/mpt-7b) | | | [0.704](https://www.mosaicml.com/blog/mpt-7b) | | [0.253](https://www.mosaicml.com/blog/mpt-7b) | | [0.233](https://www.mosaicml.com/blog/mpt-7b) | | | |
+ | [fastchat-t5-3b](https://huggingface.co/lmsys/fastchat-t5-3b-v1.0) | lmsys.org | yes | [951](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | | | | | | | |
+ | [gal-120b](https://arxiv.org/abs/2211.09085v1) | lmsys.org | no | | | | | | | | [0.526](https://paperswithcode.com/paper/galactica-a-large-language-model-for-science-1) | | | | | |
+ | [gpt-3-7b / curie](https://arxiv.org/abs/2005.14165) | OpenAI | yes | | [0.682](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | | | | | | | [0.243](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | | | | |
+ | [gpt-3-175b / davinci](https://arxiv.org/abs/2005.14165) | OpenAI | yes | | [0.793](https://arxiv.org/abs/2005.14165) | [0.789](https://arxiv.org/abs/2005.14165) | | | | | | [0.439](https://arxiv.org/abs/2005.14165) | | | | |
+ | [gpt-3.5-175b / text-davinci-003](https://arxiv.org/abs/2303.08774v3) | OpenAI | yes | | [0.822](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | | | [0.481](https://arxiv.org/abs/2303.08774v3) | [0.762](https://arxiv.org/abs/2303.08774v3) | | | [0.569](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | | | | |
+ | [gpt-3.5-175b / code-davinci-002](https://platform.openai.com/docs/model-index-for-researchers) | OpenAI | yes | | | | | [0.463](https://crfm.stanford.edu/helm/latest/?group=targeted_evaluations) | | | | | | | | |
+ | [gpt-4](https://arxiv.org/abs/2303.08774v3) | OpenAI | yes | | [0.953](https://arxiv.org/abs/2303.08774v3) | | | [0.670](https://arxiv.org/abs/2303.08774v3) | | | | [0.864](https://arxiv.org/abs/2303.08774v3) | | | | |
+ | [gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) | EleutherAI | yes | | [0.718](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | [0.719](https://www.mosaicml.com/blog/mpt-7b) | | | [0.719](https://www.mosaicml.com/blog/mpt-7b) | | [0.269](https://www.mosaicml.com/blog/mpt-7b) | [0.276](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | [0.347](https://www.mosaicml.com/blog/mpt-7b) | | | |
+ | [gpt-j-6b](https://huggingface.co/EleutherAI/gpt-j-6b) | EleutherAI | yes | | [0.663](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | [0.683](https://www.mosaicml.com/blog/mpt-7b) | | | [0.683](https://www.mosaicml.com/blog/mpt-7b) | | [0.261](https://www.mosaicml.com/blog/mpt-7b) | [0.249](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | [0.234](https://www.mosaicml.com/blog/mpt-7b) | | | |
+ | [koala-13b](https://bair.berkeley.edu/blog/2023/04/03/koala/) | Berkeley BAIR | no | [1082](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | | | | | | | |
+ | [llama-7b](https://arxiv.org/abs/2302.13971) | Meta AI | no | | | [0.738](https://www.mosaicml.com/blog/mpt-7b) | | [0.105](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | [0.738](https://www.mosaicml.com/blog/mpt-7b) | | [0.302](https://www.mosaicml.com/blog/mpt-7b) | | [0.443](https://www.mosaicml.com/blog/mpt-7b) | | | |
+ | [llama-13b](https://arxiv.org/abs/2302.13971) | Meta AI | no | [932](https://lmsys.org/blog/2023-05-03-arena/) | | [0.792](https://arxiv.org/abs/2302.13971) | | [0.158](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | | | | | |
+ | [llama-33b](https://arxiv.org/abs/2302.13971) | Meta AI | no | | | [0.828](https://arxiv.org/abs/2302.13971) | | [0.217](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | | | | | |
+ | [llama-65b](https://arxiv.org/abs/2302.13971) | Meta AI | no | | | [0.842](https://arxiv.org/abs/2302.13971) | | [0.237](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | [0.634](https://arxiv.org/abs/2302.13971v1) | | | | |
+ | [mpt-7b](https://huggingface.co/mosaicml/mpt-7b) | MosaicML | yes | | | [0.761](https://www.mosaicml.com/blog/mpt-7b) | | | [0.702](https://www.mosaicml.com/blog/mpt-7b) | | [0.296](https://www.mosaicml.com/blog/mpt-7b) | | [0.343](https://www.mosaicml.com/blog/mpt-7b) | | | |
+ | [oasst-pythia-12b](https://huggingface.co/OpenAssistant/pythia-12b-pre-v8-12.5k-steps) | Open Assistant | yes | [1065](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | | | | | | | |
+ | [opt-7b](https://huggingface.co/facebook/opt-6.7b) | Meta AI | no | | | [0.677](https://www.mosaicml.com/blog/mpt-7b) | | | [0.677](https://www.mosaicml.com/blog/mpt-7b) | | [0.251](https://www.mosaicml.com/blog/mpt-7b) | | [0.227](https://www.mosaicml.com/blog/mpt-7b) | | | |
+ | [opt-13b](https://huggingface.co/facebook/opt-13b) | Meta AI | no | | | [0.692](https://www.mosaicml.com/blog/mpt-7b) | | | [0.692](https://www.mosaicml.com/blog/mpt-7b) | | [0.257](https://www.mosaicml.com/blog/mpt-7b) | | [0.282](https://www.mosaicml.com/blog/mpt-7b) | | | |
+ | [opt-66b](https://huggingface.co/facebook/opt-66b) | Meta AI | no | | [0.745](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | | | | | | | [0.276](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | | | | |
+ | [opt-175b](https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/) | Meta AI | no | | [0.791](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | | | | | | | [0.318](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) | | | | |
+ | [palm-540b](https://arxiv.org/abs/2204.02311v5) | Google Research | no | | [0.838](https://arxiv.org/abs/2204.02311v5) | [0.834](https://arxiv.org/abs/2204.02311v5) | [0.836](https://ai.google/static/documents/palm2techreport.pdf) | [0.262](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | [0.779](https://arxiv.org/abs/2204.02311v5) | [0.818](https://ai.google/static/documents/palm2techreport.pdf) | | [0.693](https://arxiv.org/abs/2204.02311v5) | | [0.814](https://ai.google/static/documents/palm2techreport.pdf) | [0.837](https://ai.google/static/documents/palm2techreport.pdf) | [0.851](https://arxiv.org/abs/2204.02311) |
+ | [palm-coder-540b](https://arxiv.org/abs/2204.02311) | Google Research | no | | | | | [0.359](https://ai.google/static/documents/palm2techreport.pdf) | | | | | | | | |
+ | [palm-2-s](https://ai.google/static/documents/palm2techreport.pdf) | Google Research | ? | | | | [0.820](https://ai.google/static/documents/palm2techreport.pdf) | | | [0.807](https://ai.google/static/documents/palm2techreport.pdf) | | | | [0.752](https://ai.google/static/documents/palm2techreport.pdf) | [0.779](https://ai.google/static/documents/palm2techreport.pdf) | |
+ | [palm-2-s*](https://ai.google/static/documents/palm2techreport.pdf) | Google Research | ? | | | | | [0.376](https://ai.google/static/documents/palm2techreport.pdf) | | | | | | | | |
+ | [palm-2-m](https://ai.google/static/documents/palm2techreport.pdf) | Google Research | ? | | | | [0.840](https://ai.google/static/documents/palm2techreport.pdf) | | | [0.837](https://ai.google/static/documents/palm2techreport.pdf) | | | | [0.817](https://ai.google/static/documents/palm2techreport.pdf) | [0.792](https://ai.google/static/documents/palm2techreport.pdf) | |
+ | [palm-2-l](https://ai.google/static/documents/palm2techreport.pdf) | Google Research | ? | | | | [0.868](https://ai.google/static/documents/palm2techreport.pdf) | | | [0.869](https://ai.google/static/documents/palm2techreport.pdf) | | | | [0.861](https://ai.google/static/documents/palm2techreport.pdf) | [0.830](https://ai.google/static/documents/palm2techreport.pdf) | |
+ | [palm-2-l-instruct](https://ai.google/static/documents/palm2techreport.pdf) | Google Research | ? | | | | | | | | | | | | | [0.909](https://ai.google/static/documents/palm2techreport.pdf) |
+ | [replit-code-v1-3b](https://huggingface.co/replit/replit-code-v1-3b) | Replit | yes | | | | | [0.219](https://twitter.com/amasad/status/1651019556423598081/photo/2) | | | | | | | | |
+ | [stablelm-base-alpha-7b](https://huggingface.co/stabilityai/stablelm-base-alpha-7b) | Stability AI | yes | | | [0.533](https://www.mosaicml.com/blog/mpt-7b) | | | [0.533](https://www.mosaicml.com/blog/mpt-7b) | | [0.251](https://www.mosaicml.com/blog/mpt-7b) | | [0.049](https://www.mosaicml.com/blog/mpt-7b) | | | |
+ | [stablelm-tuned-alpha-7b](https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b) | Stability AI | no | [858](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | | | | | | | |
+ | [starcoder-base-16b](https://huggingface.co/bigcode/starcoderbase) | BigCode | yes | | | | | [0.304](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | | | | | |
+ | [starcoder-16b](https://huggingface.co/bigcode/starcoder) | BigCode | yes | | | | | [0.336](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | | | | | |
+ | [starcoder-16b (prompted)](https://huggingface.co/bigcode/starcoder) | BigCode | yes | | | | | [0.408](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | | | | | |
+ | [vicuna-13b](https://huggingface.co/lmsys/vicuna-13b-delta-v0) | lmsys.org | no | [1169](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | | | | | | | |
 
 ## Benchmarks
 
@@ -63,6 +69,7 @@ https://llm-leaderboard.streamlit.app/
 | LAMBADA | Paperno et al. | https://arxiv.org/abs/1606.06031 | "The LAMBADA evaluates the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local context, but must be able to keep track of information in the broader discourse." (Source: https://huggingface.co/datasets/lambada) |
 | MMLU | Hendrycks et al. | https://github.com/hendrycks/test | "The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects makes the benchmark ideal for identifying a model’s blind spots." (Source: "https://paperswithcode.com/dataset/mmlu") |
 | TriviaQA | Joshi et al. | https://arxiv.org/abs/1705.03551v2 | "We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions." (Source: https://arxiv.org/abs/1705.03551v2) |
+ | WinoGrande | Sakaguchi et al. | https://arxiv.org/abs/1907.10641v2 | "A large-scale dataset of 44k [expert-crafted pronoun resolution] problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset." (Source: https://arxiv.org/abs/1907.10641v2) |
 
 ## How to Contribute
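The HumanEval-Python column in the table above reports pass@1. For contributors checking numbers, the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021) can be sketched as below; the function and variable names (`pass_at_k`, `n`, `c`, `k`) are illustrative, not taken from any particular evaluation harness:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate (Chen et al., 2021).

    n: samples generated per problem
    c: samples among them that pass the unit tests
    k: evaluation budget
    Returns 1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples of which 3 pass, pass@1 reduces to the plain pass rate 0.3.
print(pass_at_k(10, 3, 1))
```

A leaderboard entry's pass@1 is then the mean of this quantity over all 164 HumanEval problems.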
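The Chatbot Arena Elo column is computed by lmsys.org from pairwise human votes between anonymized models. As intuition for how such ratings move, here is the textbook Elo update; the K-factor of 32 and the function name are illustrative assumptions, not the arena's actual pipeline:

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo rating update after a head-to-head comparison.

    score_a is 1.0 if model A wins, 0.5 for a tie, 0.0 if it loses.
    Expected score: E_A = 1 / (1 + 10^((R_B - R_A) / 400)).
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two equally rated models; A wins, so A gains what B loses.
print(elo_update(1000.0, 1000.0, 1.0))
```

Repeated over many votes, higher-rated wins move ratings little while upsets move them a lot, which is why a ~100-point gap (e.g. vicuna-13b at 1169 vs. koala-13b at 1082) reflects a consistent win rate rather than a fixed score difference.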