Commit d313dbd
Parent(s): a33e66d
Small update and reorg of text (#198)
- update text (fe3693e0293139520a7de4623b55c39d36049df2)
- src/assets/text_content.py +50 -46
src/assets/text_content.py
CHANGED
@@ -59,18 +59,31 @@ CHANGELOG_TEXT = f"""
 TITLE = """<h1 align="center" id="space-title">🤗 Open LLM Leaderboard</h1>"""
 
 INTRODUCTION_TEXT = f"""
-📐 The 🤗 Open LLM Leaderboard aims to track, rank and evaluate LLMs and chatbots
+📐 The 🤗 Open LLM Leaderboard aims to track, rank and evaluate open LLMs and chatbots.
 
-🤗
+🤗 Submit a model for automated evaluation on the 🤗 GPU cluster on the "Submit" page!
 
-
+The leaderboard's backend runs the great [Eleuther AI Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) to compute numbers. Read more details and reproducibility on the "About" page!
+
+Other cool benchmarks for LLMs are developed at HuggingFace: 🙋🤖 [human and GPT4 evals](https://huggingface.co/spaces/HuggingFaceH4/human_eval_llm_leaderboard), 🖥️ [performance benchmarks](https://huggingface.co/spaces/optimum/llm-perf-leaderboard)
+
+And also in other labs, check out the [AlpacaEval Leaderboard](https://tatsu-lab.github.io/alpaca_eval/) and [MT Bench](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard), among other great resources.
 """
 
 LLM_BENCHMARKS_TEXT = f"""
 # Context
 With the plethora of large language models (LLMs) and chatbots being released week upon week, often with grandiose claims of their performance, it can be hard to filter out the genuine progress that is being made by the open-source community and which model is the current state of the art.
 
-
+## Icons
+{ModelType.PT.to_str(" : ")} model
+{ModelType.FT.to_str(" : ")} model
+{ModelType.IFT.to_str(" : ")} model
+{ModelType.RL.to_str(" : ")} model
+If there is no icon, we have not uploaded the information on the model yet; feel free to open an issue with the model information!
+
+## How it works
+
+📈 We evaluate models on 4 key benchmarks using the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank"> Eleuther AI Language Model Evaluation Harness </a>, a unified framework to test generative language models on a large number of different evaluation tasks.
 
 - <a href="https://arxiv.org/abs/1803.05457" target="_blank"> AI2 Reasoning Challenge </a> (25-shot) - a set of grade-school science questions.
 - <a href="https://arxiv.org/abs/1905.07830" target="_blank"> HellaSwag </a> (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
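The `{ModelType.*.to_str(" : ")}` interpolations in the Icons block above refer to an enum that is not part of this diff. Purely as a hypothetical sketch of a shape consistent with that usage (the real icons and labels may differ):

```python
# Hypothetical sketch only — the actual ModelType enum lives elsewhere in the
# repo and is not shown in this diff; icons and labels here are placeholders.
from enum import Enum

class ModelType(Enum):
    PT = ("🟢", "pretrained")
    FT = ("🔶", "fine-tuned")
    IFT = ("⭕", "instruction-tuned")
    RL = ("🟦", "RL-tuned")

    def to_str(self, separator: str = " ") -> str:
        emoji, label = self.value
        return f"{emoji}{separator}{label}"

# f'{ModelType.PT.to_str(" : ")} model' -> "🟢 : pretrained model"
```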
@@ -80,38 +93,13 @@ With the plethora of large language models (LLMs) and chatbots being released we
 For all these evaluations, a higher score is a better score.
 We chose these benchmarks as they test a variety of reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings.
 
-
-
-### 1) Make sure you can load your model and tokenizer using AutoClasses:
-```python
-from transformers import AutoConfig, AutoModel, AutoTokenizer
-config = AutoConfig.from_pretrained("your model name", revision=revision)
-model = AutoModel.from_pretrained("your model name", revision=revision)
-tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
-```
-If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded.
-
-Note: make sure your model is public!
-Note: if your model needs `use_remote_code=True`, we do not support this option yet but we are working on adding it, stay posted!
-
-### 2) Convert your model weights to [safetensors](https://huggingface.co/docs/safetensors/index)
-It's a new format for storing weights which is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!
-
-### 3) Make sure your model has an open license!
-This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model 🤗
-
-### 4) Fill up your model card
-When we add extra information about models to the leaderboard, it will be automatically taken from the model card
-
-# Reproducibility and details
-
-### Details and logs
+## Details and logs
 You can find:
 - detailed numerical results in the `results` Hugging Face dataset: https://huggingface.co/datasets/open-llm-leaderboard/results
 - details on the input/outputs for the models in the `details` Hugging Face dataset: https://huggingface.co/datasets/open-llm-leaderboard/details
 - community queries and running status in the `requests` Hugging Face dataset: https://huggingface.co/datasets/open-llm-leaderboard/requests
 
-
+## Reproducibility
 To reproduce our results, here are the commands you can run, using [this version](https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463) of the Eleuther AI Harness:
 `python main.py --model=hf-causal --model_args="pretrained=<your_model>,use_accelerate=True,revision=<your_model_revision>"`
 ` --tasks=<task_list> --num_fewshot=<n_few_shot> --batch_size=2 --output_path=<output_path>`
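For illustration, a filled-in run of the command above (every value is a placeholder, and `hellaswag` is assumed to be the harness's task id for the 10-shot HellaSwag setting):

`python main.py --model=hf-causal --model_args="pretrained=your-org/your-model,use_accelerate=True,revision=main" --tasks=hellaswag --num_fewshot=10 --batch_size=2 --output_path=./hellaswag_results.json`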
@@ -125,29 +113,45 @@ The tasks and few shots parameters are:
 - TruthfulQA: 0-shot, *truthfulqa-mc* (`mc2`)
 - MMLU: 5-shot, *hendrycksTest-abstract_algebra,hendrycksTest-anatomy,hendrycksTest-astronomy,hendrycksTest-business_ethics,hendrycksTest-clinical_knowledge,hendrycksTest-college_biology,hendrycksTest-college_chemistry,hendrycksTest-college_computer_science,hendrycksTest-college_mathematics,hendrycksTest-college_medicine,hendrycksTest-college_physics,hendrycksTest-computer_security,hendrycksTest-conceptual_physics,hendrycksTest-econometrics,hendrycksTest-electrical_engineering,hendrycksTest-elementary_mathematics,hendrycksTest-formal_logic,hendrycksTest-global_facts,hendrycksTest-high_school_biology,hendrycksTest-high_school_chemistry,hendrycksTest-high_school_computer_science,hendrycksTest-high_school_european_history,hendrycksTest-high_school_geography,hendrycksTest-high_school_government_and_politics,hendrycksTest-high_school_macroeconomics,hendrycksTest-high_school_mathematics,hendrycksTest-high_school_microeconomics,hendrycksTest-high_school_physics,hendrycksTest-high_school_psychology,hendrycksTest-high_school_statistics,hendrycksTest-high_school_us_history,hendrycksTest-high_school_world_history,hendrycksTest-human_aging,hendrycksTest-human_sexuality,hendrycksTest-international_law,hendrycksTest-jurisprudence,hendrycksTest-logical_fallacies,hendrycksTest-machine_learning,hendrycksTest-management,hendrycksTest-marketing,hendrycksTest-medical_genetics,hendrycksTest-miscellaneous,hendrycksTest-moral_disputes,hendrycksTest-moral_scenarios,hendrycksTest-nutrition,hendrycksTest-philosophy,hendrycksTest-prehistory,hendrycksTest-professional_accounting,hendrycksTest-professional_law,hendrycksTest-professional_medicine,hendrycksTest-professional_psychology,hendrycksTest-public_relations,hendrycksTest-security_studies,hendrycksTest-sociology,hendrycksTest-us_foreign_policy,hendrycksTest-virology,hendrycksTest-world_religions* (average of all the results `acc`)
 
-
+## Quantization
 To get more information about quantization, see:
 - 8 bits: [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration), [paper](https://arxiv.org/abs/2208.07339)
 - 4 bits: [blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes), [paper](https://arxiv.org/abs/2305.14314)
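For a concrete sense of what quantized loading looks like, a minimal sketch (not the leaderboard's own evaluation path; it assumes `transformers`, `accelerate` and `bitsandbytes` are installed, and `facebook/opt-350m` is an arbitrary public model used purely for illustration):

```python
# Minimal 8-bit loading sketch with bitsandbytes — illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-350m"  # arbitrary public example, not an endorsement
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # int8 weights via bitsandbytes (LLM.int8())
    device_map="auto",   # let accelerate place layers on available devices
)
```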
 
-
-
-
-
-
-
+"""
+
+EVALUATION_QUEUE_TEXT = f"""
+# Evaluation Queue for the 🤗 Open LLM Leaderboard
+
+Models added here will be automatically evaluated on the 🤗 cluster.
+
+## Some good practices before submitting a model
+
+### 1) Make sure you can load your model and tokenizer using AutoClasses:
+```python
+from transformers import AutoConfig, AutoModel, AutoTokenizer
+config = AutoConfig.from_pretrained("your model name", revision=revision)
+model = AutoModel.from_pretrained("your model name", revision=revision)
+tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
+```
+If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded.
 
+Note: make sure your model is public!
+Note: if your model needs `trust_remote_code=True`, we do not support this option yet but we are working on adding it, stay posted!
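Going one step beyond the loading check in step 1 can save a failed submission; a sketch, with the model id and revision as placeholders:

```python
# Smoke test: load the checkpoint and run one tiny forward pass.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-org/your-model", revision="main")
model = AutoModel.from_pretrained("your-org/your-model", revision="main")

# If this prints a shape, both the weights and the tokenizer load correctly.
inputs = tokenizer("Hello, leaderboard!", return_tensors="pt")
print(model(**inputs).last_hidden_state.shape)
```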
+
+### 2) Convert your model weights to [safetensors](https://huggingface.co/docs/safetensors/index)
+It's a new format for storing weights which is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!
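One way to do the conversion, as a rough sketch (an assumption, not the leaderboard's prescribed tooling; `your-org/your-model` and the output directory are placeholders):

```python
# Sketch: load existing PyTorch weights, re-save them in safetensors format.
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("your-org/your-model", revision="main")
tokenizer = AutoTokenizer.from_pretrained("your-org/your-model", revision="main")

# safe_serialization=True writes model.safetensors instead of pytorch_model.bin.
model.save_pretrained("converted-model", safe_serialization=True)
tokenizer.save_pretrained("converted-model")
```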
+
+### 3) Make sure your model has an open license!
+This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model 🤗
 
-
+### 4) Fill out your model card
+When we add extra information about models to the leaderboard, it will be automatically taken from the model card.
+
+## In case of model failure
 If your model is displayed in the `FAILED` category, its execution stopped.
 Make sure you have followed the above steps first.
 If everything is done, check you can launch the EleutherAI Harness on your model locally, using the above command without modifications (you can add `--limit` to limit the number of examples per task).
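For instance, a quick local check with capped examples (placeholders throughout; `truthfulqa-mc` and the 0-shot setting come from the task list above):

`python main.py --model=hf-causal --model_args="pretrained=your-org/your-model,use_accelerate=True,revision=main" --tasks=truthfulqa-mc --num_fewshot=0 --batch_size=2 --limit 10 --output_path=./debug_run.json`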
-
-"""
-
-EVALUATION_QUEUE_TEXT = f"""
-# Evaluation Queue for the 🤗 Open LLM Leaderboard
-These models will be automatically evaluated on the 🤗 cluster.
 """
 
 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
@@ -216,4 +220,4 @@ CITATION_BUTTON_TEXT = r"""
 eprint={2109.07958},
 archivePrefix={arXiv},
 primaryClass={cs.CL}
-}"""
+}"""