Ludwig Stumpp committed
Commit e1aeb72
1 Parent(s): 5e1e4f6

Add HellaSwag Benchmark

Files changed (1)
  1. README.md +41 -40
README.md CHANGED
@@ -8,52 +8,53 @@ https://llm-leaderboard.streamlit.app/

 ## Leaderboard

- | Model Name | Commercial Use? | Chatbot Arena Elo | HumanEval-Python (pass@1) | LAMBADA (zero-shot) | MMLU (zero-shot) | MMLU (few-shot) | TriviaQA (zero-shot) |
- | -------------------------------------------------------------------------------------- | --------------- | ------------------------------------------------ | ------------------------------------------------------------------------------ | --------------------------------------------- | ---------------------------------------------------------------------------------------- | ------------------------------------------- | --------------------------------------------- |
- | [alpaca-13b](https://crfm.stanford.edu/2023/03/13/alpaca.html) | no | [1008](https://lmsys.org/blog/2023-05-03-arena/) | | | | | |
- | [bloom-176b](https://huggingface.co/bigscience/bloom) | yes | | [0.155](https://huggingface.co/bigscience/bloom#results) | | | | |
- | [cerebras-gpt-7b](https://huggingface.co/cerebras/Cerebras-GPT-6.7B) | yes | | | [0.636](https://www.mosaicml.com/blog/mpt-7b) | [0.259](https://www.mosaicml.com/blog/mpt-7b) | | [0.141](https://www.mosaicml.com/blog/mpt-7b) |
- | [cerebras-gpt-13b](https://huggingface.co/cerebras/Cerebras-GPT-13B) | yes | | | [0.635](https://www.mosaicml.com/blog/mpt-7b) | [0.258](https://www.mosaicml.com/blog/mpt-7b) | | [0.146](https://www.mosaicml.com/blog/mpt-7b) |
- | [chatglm-6b](https://chatglm.cn/blog) | yes | [985](https://lmsys.org/blog/2023-05-03-arena/) | | | | | |
- | [chinchilla-70b](https://arxiv.org/abs/2203.15556v1) | no | | | [0.774](https://arxiv.org/abs/2203.15556v1) | | [0.675](https://arxiv.org/abs/2203.15556v1) | |
- | [code-cushman-001](https://arxiv.org/abs/2107.03374) | no | | [0.335](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
- | [code-davinci-002](https://arxiv.org/abs/2207.10397v2) | yes | | [0.658](https://arxiv.org/abs/2207.10397v2) | | | | |
- | [codegen-16B-mono](https://huggingface.co/Salesforce/codegen-16B-mono) | yes | | [0.293](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
- | [codegen-16B-multi](https://huggingface.co/Salesforce/codegen-16B-multi) | yes | | [0.183](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
- | [codegx-13b](http://keg.cs.tsinghua.edu.cn/codegeex/) | no | | [0.229](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
- | [codex-12b](https://arxiv.org/abs/2107.03374v2) | no | | [0.288](https://arxiv.org/abs/2107.03374v2) | | | [0.685](https://arxiv.org/abs/2301.12652v2) | |
- | [dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b) | yes | [944](https://lmsys.org/blog/2023-05-03-arena/) | | | | | |
- | [eleuther-pythia-7b](https://huggingface.co/EleutherAI/pythia-6.9b) | yes | | | [0.667](https://www.mosaicml.com/blog/mpt-7b) | [0.265](https://www.mosaicml.com/blog/mpt-7b) | | [0.198](https://www.mosaicml.com/blog/mpt-7b) |
- | [eleuther-pythia-12b](https://huggingface.co/EleutherAI/pythia-12b) | yes | | | [0.704](https://www.mosaicml.com/blog/mpt-7b) | [0.253](https://www.mosaicml.com/blog/mpt-7b) | | [0.233](https://www.mosaicml.com/blog/mpt-7b) |
- | [fastchat-t5-3b](https://huggingface.co/lmsys/fastchat-t5-3b-v1.0) | yes | [951](https://lmsys.org/blog/2023-05-03-arena/) | | | | | |
- | [gal-120b](https://arxiv.org/abs/2211.09085v1) | no | | | | [0.526](https://paperswithcode.com/paper/galactica-a-large-language-model-for-science-1) | | |
- | [gpt-3-175b](https://arxiv.org/abs/2005.14165) | no | | | | | [0.439](https://arxiv.org/abs/2005.14165) | |
- | [gpt-3.5-175b](https://arxiv.org/abs/2303.08774v3) | yes | | [0.481](https://arxiv.org/abs/2303.08774v3) | [0.762](https://arxiv.org/abs/2303.08774v3) | | [0.700](https://arxiv.org/abs/2303.08774v3) | |
- | [gpt-4](https://arxiv.org/abs/2303.08774v3) | yes | | [0.670](https://arxiv.org/abs/2303.08774v3) | | | [0.864](https://arxiv.org/abs/2303.08774v3) | |
- | [gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) | yes | | | [0.719](https://www.mosaicml.com/blog/mpt-7b) | [0.269](https://www.mosaicml.com/blog/mpt-7b) | [0.336](https://arxiv.org/abs/2204.06745v1) | [0.347](https://www.mosaicml.com/blog/mpt-7b) |
- | [gpt-j-6b](https://huggingface.co/EleutherAI/gpt-j-6b) | yes | | | [0.683](https://www.mosaicml.com/blog/mpt-7b) | [0.261](https://www.mosaicml.com/blog/mpt-7b) | | [0.234](https://www.mosaicml.com/blog/mpt-7b) |
- | [koala-13b](https://bair.berkeley.edu/blog/2023/04/03/koala/) | no | [1082](https://lmsys.org/blog/2023-05-03-arena/) | | | | | |
- | [llama-7b](https://arxiv.org/abs/2302.13971) | no | | [0.105](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | [0.738](https://www.mosaicml.com/blog/mpt-7b) | [0.302](https://www.mosaicml.com/blog/mpt-7b) | | [0.443](https://www.mosaicml.com/blog/mpt-7b) |
- | [llama-13b](https://arxiv.org/abs/2302.13971) | no | [932](https://lmsys.org/blog/2023-05-03-arena/) | [0.158](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
- | [llama-33b](https://arxiv.org/abs/2302.13971) | no | | [0.217](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
- | [llama-65b](https://arxiv.org/abs/2302.13971) | no | | [0.237](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | [0.634](https://arxiv.org/abs/2302.13971v1) | |
- | [mpt-7b](https://huggingface.co/mosaicml/mpt-7b) | yes | | | [0.702](https://www.mosaicml.com/blog/mpt-7b) | [0.296](https://www.mosaicml.com/blog/mpt-7b) | | [0.343](https://www.mosaicml.com/blog/mpt-7b) |
- | [oasst-pythia-12b](https://huggingface.co/OpenAssistant/pythia-12b-pre-v8-12.5k-steps) | yes | [1065](https://lmsys.org/blog/2023-05-03-arena/) | | | | | |
- | [opt-7b](https://huggingface.co/facebook/opt-6.7b) | no | | | [0.677](https://www.mosaicml.com/blog/mpt-7b) | [0.251](https://www.mosaicml.com/blog/mpt-7b) | | [0.227](https://www.mosaicml.com/blog/mpt-7b) |
- | [opt-13b](https://huggingface.co/facebook/opt-13b) | no | | | [0.692](https://www.mosaicml.com/blog/mpt-7b) | [0.257](https://www.mosaicml.com/blog/mpt-7b) | | [0.282](https://www.mosaicml.com/blog/mpt-7b) |
- | [palm-540b](https://arxiv.org/abs/2204.02311v5) | no | | [0.262](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | [0.779](https://arxiv.org/abs/2204.02311v5) | | [0.693](https://arxiv.org/abs/2204.02311v5) | |
- | [stablelm-base-alpha-7b](https://huggingface.co/stabilityai/stablelm-base-alpha-7b) | yes | | | [0.533](https://www.mosaicml.com/blog/mpt-7b) | [0.251](https://www.mosaicml.com/blog/mpt-7b) | | [0.049](https://www.mosaicml.com/blog/mpt-7b) |
- | [stablelm-tuned-alpha-7b](https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b) | no | [858](https://lmsys.org/blog/2023-05-03-arena/) | | | | | |
- | [starcoder-base-16b](https://huggingface.co/bigcode/starcoderbase) | yes | | [0.304](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
- | [starcoder-16b](https://huggingface.co/bigcode/starcoder) | yes | | [0.336](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
- | [starcoder-16b (prompted)](https://huggingface.co/bigcode/starcoder) | yes | | [0.408](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
- | [vicuna-13b](https://huggingface.co/lmsys/vicuna-13b-delta-v0) | no | [1169](https://lmsys.org/blog/2023-05-03-arena/) | | | | | |
+ | Model Name | Commercial Use? | Chatbot Arena Elo | HellaSwag (zero-shot) | HumanEval-Python (pass@1) | LAMBADA (zero-shot) | MMLU (zero-shot) | MMLU (few-shot) | TriviaQA (zero-shot) |
+ | -------------------------------------------------------------------------------------- | --------------- | ------------------------------------------------ | --------------------------------------------- | ------------------------------------------------------------------------------- | --------------------------------------------- | ---------------------------------------------------------------------------------------- | ------------------------------------------- | --------------------------------------------- |
+ | [alpaca-13b](https://crfm.stanford.edu/2023/03/13/alpaca.html) | no | [1008](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | |
+ | [bloom-176b](https://huggingface.co/bigscience/bloom) | yes | | | [0.155](https://huggingface.co/bigscience/bloom#results) | | | | |
+ | [cerebras-gpt-7b](https://huggingface.co/cerebras/Cerebras-GPT-6.7B) | yes | | [0.636](https://www.mosaicml.com/blog/mpt-7b) | | [0.636](https://www.mosaicml.com/blog/mpt-7b) | [0.259](https://www.mosaicml.com/blog/mpt-7b) | | [0.141](https://www.mosaicml.com/blog/mpt-7b) |
+ | [cerebras-gpt-13b](https://huggingface.co/cerebras/Cerebras-GPT-13B) | yes | | [0.635](https://www.mosaicml.com/blog/mpt-7b) | | [0.635](https://www.mosaicml.com/blog/mpt-7b) | [0.258](https://www.mosaicml.com/blog/mpt-7b) | | [0.146](https://www.mosaicml.com/blog/mpt-7b) |
+ | [chatglm-6b](https://chatglm.cn/blog) | yes | [985](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | |
+ | [chinchilla-70b](https://arxiv.org/abs/2203.15556v1) | no | | [0.808](https://arxiv.org/abs/2203.15556v1) | | [0.774](https://arxiv.org/abs/2203.15556v1) | | [0.675](https://arxiv.org/abs/2203.15556v1) | |
+ | [code-cushman-001](https://arxiv.org/abs/2107.03374) | no | | | [0.335](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
+ | [code-davinci-002](https://arxiv.org/abs/2207.10397v2) | yes | | | [0.658](https://arxiv.org/abs/2207.10397v2) | | | | |
+ | [codegen-16B-mono](https://huggingface.co/Salesforce/codegen-16B-mono) | yes | | | [0.293](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
+ | [codegen-16B-multi](https://huggingface.co/Salesforce/codegen-16B-multi) | yes | | | [0.183](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
+ | [codegx-13b](http://keg.cs.tsinghua.edu.cn/codegeex/) | no | | | [0.229](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
+ | [codex-12b](https://arxiv.org/abs/2107.03374v2) | no | | | [0.288](https://arxiv.org/abs/2107.03374v2) | | | [0.685](https://arxiv.org/abs/2301.12652v2) | |
+ | [dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b) | yes | [944](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | |
+ | [eleuther-pythia-7b](https://huggingface.co/EleutherAI/pythia-6.9b) | yes | | [0.667](https://www.mosaicml.com/blog/mpt-7b) | | [0.667](https://www.mosaicml.com/blog/mpt-7b) | [0.265](https://www.mosaicml.com/blog/mpt-7b) | | [0.198](https://www.mosaicml.com/blog/mpt-7b) |
+ | [eleuther-pythia-12b](https://huggingface.co/EleutherAI/pythia-12b) | yes | | [0.704](https://www.mosaicml.com/blog/mpt-7b) | | [0.704](https://www.mosaicml.com/blog/mpt-7b) | [0.253](https://www.mosaicml.com/blog/mpt-7b) | | [0.233](https://www.mosaicml.com/blog/mpt-7b) |
+ | [fastchat-t5-3b](https://huggingface.co/lmsys/fastchat-t5-3b-v1.0) | yes | [951](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | |
+ | [gal-120b](https://arxiv.org/abs/2211.09085v1) | no | | | | | [0.526](https://paperswithcode.com/paper/galactica-a-large-language-model-for-science-1) | | |
+ | [gpt-3-175b](https://arxiv.org/abs/2005.14165) | no | | [0.789](https://arxiv.org/abs/2005.14165) | | | | [0.439](https://arxiv.org/abs/2005.14165) | |
+ | [gpt-3.5-175b](https://arxiv.org/abs/2303.08774v3) | yes | | | [0.481](https://arxiv.org/abs/2303.08774v3) | [0.762](https://arxiv.org/abs/2303.08774v3) | | [0.700](https://arxiv.org/abs/2303.08774v3) | |
+ | [gpt-4](https://arxiv.org/abs/2303.08774v3) | yes | | | [0.670](https://arxiv.org/abs/2303.08774v3) | | | [0.864](https://arxiv.org/abs/2303.08774v3) | |
+ | [gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) | yes | | [0.719](https://www.mosaicml.com/blog/mpt-7b) | | [0.719](https://www.mosaicml.com/blog/mpt-7b) | [0.269](https://www.mosaicml.com/blog/mpt-7b) | [0.336](https://arxiv.org/abs/2204.06745v1) | [0.347](https://www.mosaicml.com/blog/mpt-7b) |
+ | [gpt-j-6b](https://huggingface.co/EleutherAI/gpt-j-6b) | yes | | [0.683](https://www.mosaicml.com/blog/mpt-7b) | | [0.683](https://www.mosaicml.com/blog/mpt-7b) | [0.261](https://www.mosaicml.com/blog/mpt-7b) | | [0.234](https://www.mosaicml.com/blog/mpt-7b) |
+ | [koala-13b](https://bair.berkeley.edu/blog/2023/04/03/koala/) | no | [1082](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | |
+ | [llama-7b](https://arxiv.org/abs/2302.13971) | no | | [0.738](https://www.mosaicml.com/blog/mpt-7b) | [0.105](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | [0.738](https://www.mosaicml.com/blog/mpt-7b) | [0.302](https://www.mosaicml.com/blog/mpt-7b) | | [0.443](https://www.mosaicml.com/blog/mpt-7b) |
+ | [llama-13b](https://arxiv.org/abs/2302.13971) | no | [932](https://lmsys.org/blog/2023-05-03-arena/) | | [0.158](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
+ | [llama-33b](https://arxiv.org/abs/2302.13971) | no | | | [0.217](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
+ | [llama-65b](https://arxiv.org/abs/2302.13971) | no | | | [0.237](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | [0.634](https://arxiv.org/abs/2302.13971v1) | |
+ | [mpt-7b](https://huggingface.co/mosaicml/mpt-7b) | yes | | [0.761](https://www.mosaicml.com/blog/mpt-7b) | | [0.702](https://www.mosaicml.com/blog/mpt-7b) | [0.296](https://www.mosaicml.com/blog/mpt-7b) | | [0.343](https://www.mosaicml.com/blog/mpt-7b) |
+ | [oasst-pythia-12b](https://huggingface.co/OpenAssistant/pythia-12b-pre-v8-12.5k-steps) | yes | [1065](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | |
+ | [opt-7b](https://huggingface.co/facebook/opt-6.7b) | no | | [0.677](https://www.mosaicml.com/blog/mpt-7b) | | [0.677](https://www.mosaicml.com/blog/mpt-7b) | [0.251](https://www.mosaicml.com/blog/mpt-7b) | | [0.227](https://www.mosaicml.com/blog/mpt-7b) |
+ | [opt-13b](https://huggingface.co/facebook/opt-13b) | no | | [0.692](https://www.mosaicml.com/blog/mpt-7b) | | [0.692](https://www.mosaicml.com/blog/mpt-7b) | [0.257](https://www.mosaicml.com/blog/mpt-7b) | | [0.282](https://www.mosaicml.com/blog/mpt-7b) |
+ | [palm-540b](https://arxiv.org/abs/2204.02311v5) | no | | [0.834](https://arxiv.org/abs/2204.02311v5) | [0.262](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | [0.779](https://arxiv.org/abs/2204.02311v5) | | [0.693](https://arxiv.org/abs/2204.02311v5) | |
+ | [stablelm-base-alpha-7b](https://huggingface.co/stabilityai/stablelm-base-alpha-7b) | yes | | [0.533](https://www.mosaicml.com/blog/mpt-7b) | | [0.533](https://www.mosaicml.com/blog/mpt-7b) | [0.251](https://www.mosaicml.com/blog/mpt-7b) | | [0.049](https://www.mosaicml.com/blog/mpt-7b) |
+ | [stablelm-tuned-alpha-7b](https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b) | no | [858](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | |
+ | [starcoder-base-16b](https://huggingface.co/bigcode/starcoderbase) | yes | | | [0.304](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
+ | [starcoder-16b](https://huggingface.co/bigcode/starcoder) | yes | | | [0.336](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
+ | [starcoder-16b (prompted)](https://huggingface.co/bigcode/starcoder) | yes | | | [0.408](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
+ | [vicuna-13b](https://huggingface.co/lmsys/vicuna-13b-delta-v0) | no | [1169](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | |
 
  ## Benchmarks
 
  | Benchmark Name | Author | Link | Description |
  | ----------------- | ---------------- | ---------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
  | Chatbot Arena Elo | LMSYS | https://lmsys.org/blog/2023-05-03-arena/ | "In this blog post, we introduce Chatbot Arena, an LLM benchmark platform featuring anonymous randomized battles in a crowdsourced manner. Chatbot Arena adopts the Elo rating system, which is a widely-used rating system in chess and other competitive games." (Source: https://lmsys.org/blog/2023-05-03-arena/) |
+ | HellaSwag | Zellers et al. | https://arxiv.org/abs/1905.07830v1 | "HellaSwag is a challenge dataset for evaluating commonsense NLI that is specially hard for state-of-the-art models, though its questions are trivial for humans (>95% accuracy)." (Source: https://paperswithcode.com/dataset/hellaswag) |
 | HumanEval | Chen et al. | https://arxiv.org/abs/2107.03374v2 | "It is used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions." (Source: https://paperswithcode.com/dataset/humaneval) |
  | LAMBADA | Paperno et al. | https://arxiv.org/abs/1606.06031 | "The LAMBADA evaluates the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local context, but must be able to keep track of information in the broader discourse." (Source: https://huggingface.co/datasets/lambada) |
 | MMLU | Hendrycks et al. | https://github.com/hendrycks/test | "The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects makes the benchmark ideal for identifying a model’s blind spots." (Source: https://paperswithcode.com/dataset/mmlu) |
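
Chatbot Arena's Elo ratings come from pairwise battles: each crowdsourced comparison nudges both models' ratings according to how surprising the outcome was. Below is a minimal sketch of the standard Elo update; the K-factor and starting rating are conventional chess values used for illustration, not necessarily the exact constants LMSYS uses.

```python
# Standard Elo update, as used by pairwise "battle" leaderboards such as
# Chatbot Arena. K=32 and the 400-point scale are conventional assumptions.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one battle; score_a is 1 (A wins), 0, or 0.5."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b - k * (score_a - e_a)

# Two models start equal; A wins one battle:
print(elo_update(1000.0, 1000.0, 1.0))  # -> (1016.0, 984.0)
```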
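HellaSwag and zero-shot MMLU are typically scored as multiple choice: the model "answers" by assigning the highest likelihood to one of the candidate continuations. The sketch below shows the idea with Hugging Face `transformers`; `gpt2` is only a stand-in model, and real harnesses differ on details such as length normalization and prompt templates.

```python
# Hedged sketch of multiple-choice scoring (HellaSwag / zero-shot MMLU):
# pick the candidate continuation with the highest log-likelihood. Assumes
# the context tokenization is a prefix of the context+continuation
# tokenization (true for typical BPE tokenizers when the continuation
# starts with a space).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in for any causal LM
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log P(token | prefix) over the continuation tokens."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # predicts tokens 1..T-1
    cont_ids = full_ids[0, ctx_len:]
    positions = torch.arange(ctx_len - 1, full_ids.shape[1] - 1)
    return logprobs[positions, cont_ids].sum().item()

context = "A man is sitting on a roof. He"
choices = [" starts pulling up roofing on a roof.",      # HellaSwag-style
           " is using wrap to wrap a pair of skis.",     # candidate endings
           " is holding a rubik's cube.",
           " starts pulling up roofing materials."]
best = max(range(len(choices)), key=lambda i: continuation_logprob(context, choices[i]))
print("model picks choice", best)
```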
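HumanEval's pass@1 comes from functional correctness: sample n programs per problem, run the unit tests, count the c samples that pass, and apply the unbiased pass@k estimator from the HumanEval paper linked above. With a single sample per problem, pass@1 is just the fraction of problems solved.

```python
# Unbiased pass@k estimator from Chen et al., https://arxiv.org/abs/2107.03374:
# pass@k = E[ 1 - C(n - c, k) / C(n, k) ] over problems, computed as a
# numerically stable running product.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # every size-k subset must contain a passing sample
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# With n=1 sample per problem, pass@1 reduces to the raw pass rate:
per_problem_passes = [1, 0, 0, 1, 0]  # 1 = the sample passed its tests
print(np.mean([pass_at_k(1, c, 1) for c in per_problem_passes]))  # 0.4
```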
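LAMBADA zero-shot accuracy asks whether the model reproduces a passage's final word given all of the preceding text. A rough sketch follows, assuming greedy decoding and simple whitespace handling; real harnesses are fussier about detokenization and punctuation, and `gpt2` is again only a stand-in.

```python
# Hedged sketch of LAMBADA scoring: hide the passage's last word, greedily
# decode a few tokens, and require an exact match on the first decoded word.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def lambada_correct(passage: str) -> bool:
    context, target = passage.rsplit(" ", 1)  # split off the final word
    ids = tok(context, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=5, do_sample=False,
                             pad_token_id=tok.eos_token_id)
    first_word = tok.decode(out[0, ids.shape[1]:]).strip().split(" ")[0]
    return first_word == target

# Accuracy is the mean of lambada_correct over the test passages.
print(lambada_correct("After the rain stopped, she folded her umbrella"))
```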