Ludwig Stumpp committed on
Commit
c0dd25e
1 Parent(s): 53be3b4

Adding MMLU dataset and removing source table

Files changed (2)
  1. README.md +46 -51
  2. streamlit_app.py +12 -19
README.md CHANGED
@@ -2,8 +2,6 @@

A joint community effort to create one central leaderboard for LLMs. Contributions and corrections welcome!

- Sources for the numbers are added as links to the table.
-
## Interactive Dashboard

https://llm-leaderboard.streamlit.app/
@@ -14,7 +12,7 @@ We are always happy for contributions! You can contribute by the following:

- table work (don't forget the links):
  - filling missing entries
- - adding a new model as a new row to the leaderboard and add the source of the evaluation to the sources table. Please keep alphabetic order.
+ - adding a new model as a new row to the leaderboard. Please keep alphabetic order.
  - adding a new benchmark as a new column in the leaderboard and add the benchmark to the benchmarks table. Please keep alphabetic order.
- code work:
  - improving the existing code
@@ -22,57 +20,54 @@ We are always happy for contributions! You can contribute by the following:

## Leaderboard

- | Model Name | Chatbot Arena Elo | HumanEval-Python (pass@1) | LAMBADA (zero-shot) | TriviaQA (zero-shot) |
- | --- | --- | --- | --- | --- |
- | [alpaca-13b](https://crfm.stanford.edu/2023/03/13/alpaca.html) | [1008](https://lmsys.org/blog/2023-05-03-arena/) | | | |
- | [cerebras-gpt-7b](https://huggingface.co/cerebras/Cerebras-GPT-6.7B) | | | [0.636](https://www.mosaicml.com/blog/mpt-7b) | [0.141](https://www.mosaicml.com/blog/mpt-7b) |
- | [cerebras-gpt-13b](https://huggingface.co/cerebras/Cerebras-GPT-13B) | | | [0.635](https://www.mosaicml.com/blog/mpt-7b) | [0.146](https://www.mosaicml.com/blog/mpt-7b) |
- | [chatglm-6b](https://chatglm.cn/blog) | [985](https://lmsys.org/blog/2023-05-03-arena/) | | | |
- | [chinchilla-70b](https://arxiv.org/abs/2203.15556v1) | | | [0.774](https://arxiv.org/abs/2203.15556v1) | |
- | [code-cushman-001](https://arxiv.org/abs/2107.03374) | | [33.5](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | |
- | [code-davinci-002](https://arxiv.org/abs/2207.10397v2) | | [65.8](https://arxiv.org/abs/2207.10397v2) | | |
- | [codegen-16B-mono](https://huggingface.co/Salesforce/codegen-16B-mono) | | [29.3](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | |
- | [codegen-16B-multi](https://huggingface.co/Salesforce/codegen-16B-multi) | | [18.3](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | |
- | [codegx-13b](http://keg.cs.tsinghua.edu.cn/codegeex/) | | [22.9](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | |
- | [codex-12b](https://arxiv.org/abs/2107.03374v2) | | [28.81](https://arxiv.org/abs/2107.03374v2) | | |
- | [dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b) | [944](https://lmsys.org/blog/2023-05-03-arena/) | | | |
- | [eleuther-pythia-7b](https://huggingface.co/EleutherAI/pythia-6.9b) | | | [0.667](https://www.mosaicml.com/blog/mpt-7b) | [0.198](https://www.mosaicml.com/blog/mpt-7b) |
- | [eleuther-pythia-12b](https://huggingface.co/EleutherAI/pythia-12b) | | | [0.704](https://www.mosaicml.com/blog/mpt-7b) | [0.233](https://www.mosaicml.com/blog/mpt-7b) |
- | [fastchat-t5-3b](https://huggingface.co/lmsys/fastchat-t5-3b-v1.0) | [951](https://lmsys.org/blog/2023-05-03-arena/) | | | |
- | [gpt-3.5-175b](https://arxiv.org/abs/2303.08774v3) | | [48.1](https://arxiv.org/abs/2303.08774v3) | [0.762](https://arxiv.org/abs/2303.08774v3) | |
- | [gpt-4](https://arxiv.org/abs/2303.08774v3) | | [67.0](https://arxiv.org/abs/2303.08774v3) | | |
- | [gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) | | | [0.719](https://www.mosaicml.com/blog/mpt-7b) | [0.347](https://www.mosaicml.com/blog/mpt-7b) |
- | [gptj-6b](https://huggingface.co/EleutherAI/gpt-j-6b) | | | [0.683](https://www.mosaicml.com/blog/mpt-7b) | [0.234](https://www.mosaicml.com/blog/mpt-7b) |
- | [koala-13b](https://bair.berkeley.edu/blog/2023/04/03/koala/) | [1082](https://lmsys.org/blog/2023-05-03-arena/) | | | |
- | [llama-7b](https://arxiv.org/abs/2302.13971) | | [10.5](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | [0.738](https://www.mosaicml.com/blog/mpt-7b) | [0.443](https://www.mosaicml.com/blog/mpt-7b) |
- | [llama-13b](https://arxiv.org/abs/2302.13971) | [932](https://lmsys.org/blog/2023-05-03-arena/) | [15.8](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | |
- | [llama-33b](https://arxiv.org/abs/2302.13971) | | [21.7](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | |
- | [llama-65b](https://arxiv.org/abs/2302.13971) | | [23.7](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | |
- | [mpt-7b](https://huggingface.co/mosaicml/mpt-7b) | | | [0.702](https://www.mosaicml.com/blog/mpt-7b) | [0.343](https://www.mosaicml.com/blog/mpt-7b) |
- | [oasst-pythia-12b](https://huggingface.co/OpenAssistant/pythia-12b-pre-v8-12.5k-steps) | [1065](https://lmsys.org/blog/2023-05-03-arena/) | | | |
- | [opt-7b](https://huggingface.co/facebook/opt-6.7b) | | | [0.677](https://www.mosaicml.com/blog/mpt-7b) | [0.227](https://www.mosaicml.com/blog/mpt-7b) |
- | [opt-13b](https://huggingface.co/facebook/opt-13b) | | | [0.692](https://www.mosaicml.com/blog/mpt-7b) | [0.282](https://www.mosaicml.com/blog/mpt-7b) |
- | [palm-540b](https://arxiv.org/abs/2204.02311v5) | | [26.2](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | [0.779](https://arxiv.org/abs/2204.02311v5) | |
- | [stablelm-base-alpha-7b](https://huggingface.co/stabilityai/stablelm-base-alpha-7b) | | | [0.533](https://www.mosaicml.com/blog/mpt-7b) | [0.049](https://www.mosaicml.com/blog/mpt-7b) |
- | [stablelm-tuned-alpha-7b](https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b) | [858](https://lmsys.org/blog/2023-05-03-arena/) | | | |
- | [starcoder-base-16B](https://huggingface.co/bigcode/starcoderbase) | | [30.4](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | |
- | [starcoder-16B](https://huggingface.co/bigcode/starcoder) | | [33.6](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | |
- | [starcoder-16B (prompted)](https://huggingface.co/bigcode/starcoder) | | [40.8](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | |
- | [vicuna-13b](https://huggingface.co/lmsys/vicuna-13b-delta-v0) | [1169](https://lmsys.org/blog/2023-05-03-arena/) | | | |
+ | Model Name | Chatbot Arena Elo | HumanEval-Python (pass@1) | LAMBADA (zero-shot) | MMLU (zero-shot) | TriviaQA (zero-shot) |
+ | --- | --- | --- | --- | --- | --- |
+ | [alpaca-13b](https://crfm.stanford.edu/2023/03/13/alpaca.html) | [1008](https://lmsys.org/blog/2023-05-03-arena/) | | | | |
+ | [cerebras-gpt-7b](https://huggingface.co/cerebras/Cerebras-GPT-6.7B) | | | [0.636](https://www.mosaicml.com/blog/mpt-7b) | 0.259 | [0.141](https://www.mosaicml.com/blog/mpt-7b) |
+ | [cerebras-gpt-13b](https://huggingface.co/cerebras/Cerebras-GPT-13B) | | | [0.635](https://www.mosaicml.com/blog/mpt-7b) | 0.258 | [0.146](https://www.mosaicml.com/blog/mpt-7b) |
+ | [chatglm-6b](https://chatglm.cn/blog) | [985](https://lmsys.org/blog/2023-05-03-arena/) | | | | |
+ | [chinchilla-70b](https://arxiv.org/abs/2203.15556v1) | | | [0.774](https://arxiv.org/abs/2203.15556v1) | | |
+ | [code-cushman-001](https://arxiv.org/abs/2107.03374) | | [33.5](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | |
+ | [code-davinci-002](https://arxiv.org/abs/2207.10397v2) | | [65.8](https://arxiv.org/abs/2207.10397v2) | | | |
+ | [codegen-16B-mono](https://huggingface.co/Salesforce/codegen-16B-mono) | | [29.3](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | |
+ | [codegen-16B-multi](https://huggingface.co/Salesforce/codegen-16B-multi) | | [18.3](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | |
+ | [codegx-13b](http://keg.cs.tsinghua.edu.cn/codegeex/) | | [22.9](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | |
+ | [codex-12b](https://arxiv.org/abs/2107.03374v2) | | [28.81](https://arxiv.org/abs/2107.03374v2) | | | |
+ | [dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b) | [944](https://lmsys.org/blog/2023-05-03-arena/) | | | | |
+ | [eleuther-pythia-7b](https://huggingface.co/EleutherAI/pythia-6.9b) | | | [0.667](https://www.mosaicml.com/blog/mpt-7b) | 0.265 | [0.198](https://www.mosaicml.com/blog/mpt-7b) |
+ | [eleuther-pythia-12b](https://huggingface.co/EleutherAI/pythia-12b) | | | [0.704](https://www.mosaicml.com/blog/mpt-7b) | 0.253 | [0.233](https://www.mosaicml.com/blog/mpt-7b) |
+ | [fastchat-t5-3b](https://huggingface.co/lmsys/fastchat-t5-3b-v1.0) | [951](https://lmsys.org/blog/2023-05-03-arena/) | | | | |
+ | [gpt-3.5-175b](https://arxiv.org/abs/2303.08774v3) | | [48.1](https://arxiv.org/abs/2303.08774v3) | [0.762](https://arxiv.org/abs/2303.08774v3) | | |
+ | [gpt-4](https://arxiv.org/abs/2303.08774v3) | | [67.0](https://arxiv.org/abs/2303.08774v3) | | | |
+ | [gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) | | | [0.719](https://www.mosaicml.com/blog/mpt-7b) | 0.269 | [0.347](https://www.mosaicml.com/blog/mpt-7b) |
+ | [gptj-6b](https://huggingface.co/EleutherAI/gpt-j-6b) | | | [0.683](https://www.mosaicml.com/blog/mpt-7b) | 0.261 | [0.234](https://www.mosaicml.com/blog/mpt-7b) |
+ | [koala-13b](https://bair.berkeley.edu/blog/2023/04/03/koala/) | [1082](https://lmsys.org/blog/2023-05-03-arena/) | | | | |
+ | [llama-7b](https://arxiv.org/abs/2302.13971) | | [10.5](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | [0.738](https://www.mosaicml.com/blog/mpt-7b) | 0.302 | [0.443](https://www.mosaicml.com/blog/mpt-7b) |
+ | [llama-13b](https://arxiv.org/abs/2302.13971) | [932](https://lmsys.org/blog/2023-05-03-arena/) | [15.8](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | |
+ | [llama-33b](https://arxiv.org/abs/2302.13971) | | [21.7](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | |
+ | [llama-65b](https://arxiv.org/abs/2302.13971) | | [23.7](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | |
+ | [mpt-7b](https://huggingface.co/mosaicml/mpt-7b) | | | [0.702](https://www.mosaicml.com/blog/mpt-7b) | 0.296 | [0.343](https://www.mosaicml.com/blog/mpt-7b) |
+ | [oasst-pythia-12b](https://huggingface.co/OpenAssistant/pythia-12b-pre-v8-12.5k-steps) | [1065](https://lmsys.org/blog/2023-05-03-arena/) | | | | |
+ | [opt-7b](https://huggingface.co/facebook/opt-6.7b) | | | [0.677](https://www.mosaicml.com/blog/mpt-7b) | 0.251 | [0.227](https://www.mosaicml.com/blog/mpt-7b) |
+ | [opt-13b](https://huggingface.co/facebook/opt-13b) | | | [0.692](https://www.mosaicml.com/blog/mpt-7b) | 0.257 | [0.282](https://www.mosaicml.com/blog/mpt-7b) |
+ | [palm-540b](https://arxiv.org/abs/2204.02311v5) | | [26.2](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | [0.779](https://arxiv.org/abs/2204.02311v5) | | |
+ | [stablelm-base-alpha-7b](https://huggingface.co/stabilityai/stablelm-base-alpha-7b) | | | [0.533](https://www.mosaicml.com/blog/mpt-7b) | 0.251 | [0.049](https://www.mosaicml.com/blog/mpt-7b) |
+ | [stablelm-tuned-alpha-7b](https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b) | [858](https://lmsys.org/blog/2023-05-03-arena/) | | | | |
+ | [starcoder-base-16B](https://huggingface.co/bigcode/starcoderbase) | | [30.4](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | |
+ | [starcoder-16B](https://huggingface.co/bigcode/starcoder) | | [33.6](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | |
+ | [starcoder-16B (prompted)](https://huggingface.co/bigcode/starcoder) | | [40.8](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | |
+ | [vicuna-13b](https://huggingface.co/lmsys/vicuna-13b-delta-v0) | [1169](https://lmsys.org/blog/2023-05-03-arena/) | | | | |

## Benchmarks

- | Benchmark Name | Author | Link | Description |
- | --- | --- | --- | --- |
- | Chatbot Arena Elo | LMSYS | https://lmsys.org/blog/2023-05-03-arena/ | "In this blog post, we introduce Chatbot Arena, an LLM benchmark platform featuring anonymous randomized battles in a crowdsourced manner. Chatbot Arena adopts the Elo rating system, which is a widely-used rating system in chess and other competitive games." (Source: https://lmsys.org/blog/2023-05-03-arena/) |
- | HumanEval | Chen et al. | https://arxiv.org/abs/2107.03374v2 | "It used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions." (Source: https://paperswithcode.com/dataset/humaneval) |
- | LAMBADA | Paperno et al. | https://arxiv.org/abs/1606.06031 | "The LAMBADA evaluates the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local context, but must be able to keep track of information in the broader discourse." (Source: https://huggingface.co/datasets/lambada) |
- | TriviaQA | Joshi et al. | https://arxiv.org/abs/1705.03551v2 | "We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions." (Source: https://arxiv.org/abs/1705.03551v2) |
+ | Benchmark Name | Author | Link | Description |
+ | --- | --- | --- | --- |
+ | Chatbot Arena Elo | LMSYS | https://lmsys.org/blog/2023-05-03-arena/ | "In this blog post, we introduce Chatbot Arena, an LLM benchmark platform featuring anonymous randomized battles in a crowdsourced manner. Chatbot Arena adopts the Elo rating system, which is a widely-used rating system in chess and other competitive games." (Source: https://lmsys.org/blog/2023-05-03-arena/) |
+ | HumanEval | Chen et al. | https://arxiv.org/abs/2107.03374v2 | "It used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions." (Source: https://paperswithcode.com/dataset/humaneval) |
+ | LAMBADA | Paperno et al. | https://arxiv.org/abs/1606.06031 | "The LAMBADA evaluates the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local context, but must be able to keep track of information in the broader discourse." (Source: https://huggingface.co/datasets/lambada) |
+ | MMLU | Hendrycks et al. | https://github.com/hendrycks/test | "The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects makes the benchmark ideal for identifying a model’s blind spots." (Source: https://paperswithcode.com/dataset/mmlu) |
+ | TriviaQA | Joshi et al. | https://arxiv.org/abs/1705.03551v2 | "We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions." (Source: https://arxiv.org/abs/1705.03551v2) |

## Sources

- | Author | Link |
- | --- | --- |
- | LMSYS | https://lmsys.org/blog/2023-05-03-arena/ |
- | MOSAICML | https://www.mosaicml.com/blog/mpt-7b |
- | BigCode | https://www.bigcode-project.org/ |
+ The results of this leaderboard are collected from the individual papers and published results of the model authors. For each reported value, the source is added as a link.
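The new Sources note above says that every reported value in the leaderboard links to the evaluation it was taken from. Purely as an illustration of that convention (this helper is not part of the repository, and the function and variable names below are made up), the value/source pairs can be read back out of the markdown table with a few lines of Python:

```python
import re

# Hypothetical helper, not repository code: pull (model, value, source_url)
# triples out of the leaderboard's markdown table, using the convention that
# each reported number is written as a markdown link to its source.
CELL_LINK = re.compile(r"\[([^\]]+)\]\((https?://[^)]+)\)")


def extract_sourced_values(markdown_table: str) -> list[tuple[str, str, str]]:
    """Return (model, value, source_url) triples from the leaderboard table."""
    rows = [line for line in markdown_table.splitlines() if line.lstrip().startswith("|")]
    triples = []
    for row in rows[2:]:  # skip the header row and the |---| separator row
        cells = [c.strip() for c in row.strip().strip("|").split("|")]
        model_match = CELL_LINK.search(cells[0])
        model = model_match.group(1) if model_match else cells[0]
        for value, url in CELL_LINK.findall(" | ".join(cells[1:])):
            triples.append((model, value, url))
    return triples
```

Cells that hold a bare number, such as the MMLU values added in this commit, simply yield no pair, which also makes it easy to spot entries that still need a source link.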
 
 
 
 
streamlit_app.py CHANGED
@@ -4,6 +4,8 @@ import re
import pandas as pd
import streamlit as st

+ GITHUB_URL = "https://github.com/LudwigStumpp/llm-leaderboard"
+

def extract_table_and_format_from_markdown_text(markdown_table: str) -> pd.DataFrame:
    """Extracts a table from a markdown text and formats it as a pandas DataFrame.
@@ -127,10 +129,8 @@ def setup_basic():
    st.title(title)

    st.markdown(
-         """
-         A joint community effort to create one central leaderboard for LLMs.
-         Visit [llm-leaderboard](https://github.com/LudwigStumpp/llm-leaderboard) to contribute.
-         """
+         "A joint community effort to create one central leaderboard for LLMs."
+         f" Visit [llm-leaderboard]({GITHUB_URL}) to contribute.",
    )

@@ -159,20 +159,13 @@ def setup_benchmarks(readme: str):
    st.markdown(" \n".join(text))


- def setup_sources(readme: str):
-     sources_table = extract_markdown_table_from_multiline(readme, table_headline="## Sources")
-     df_sources = extract_table_and_format_from_markdown_text(sources_table)
-
-     st.markdown("## Sources of Above Figures")
-
-     selected_source = st.selectbox("Select a source to learn more:", df_sources.index.unique())
-     df_selected = df_sources.loc[selected_source]
-     text = [
-         f"Author: {selected_source}",
-     ]
-     for key in df_selected.keys():
-         text.append(f"{key}: {df_selected[key]}")
-     st.markdown(" \n".join(text))
+ def setup_sources():
+     st.markdown("## Sources")
+     st.markdown(
+         "The results of this leaderboard are collected from the individual papers and published results of the model "
+         "authors. If you are interested in the sources of each individual reported model value, please visit the "
+         f"[llm-leaderboard]({GITHUB_URL}) repository."
+     )


def setup_footer():
@@ -192,7 +185,7 @@ def main():

    setup_leaderboard(readme)
    setup_benchmarks(readme)
-     setup_sources(readme)
+     setup_sources()
    setup_footer()
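As background for the changes above: the app works by cutting the tables out of README.md and turning them into pandas DataFrames (this is what the docstring of extract_table_and_format_from_markdown_text describes), which the setup_* functions then render with Streamlit. The snippet below is an independent, minimal sketch of that parsing step, not the repository's actual implementation:

```python
import pandas as pd


# Minimal sketch of markdown-table -> DataFrame parsing, assuming a
# well-formed GitHub-style table; the repository's real helper may differ.
def markdown_table_to_dataframe(markdown_table: str) -> pd.DataFrame:
    lines = [line.strip() for line in markdown_table.splitlines() if line.strip().startswith("|")]
    header = [c.strip() for c in lines[0].strip("|").split("|")]
    rows = [
        [c.strip() for c in line.strip("|").split("|")]
        for line in lines[2:]  # skip the |---| separator row
    ]
    # Index by the first column (e.g. "Model Name") so rows can be looked up by name.
    return pd.DataFrame(rows, columns=header).set_index(header[0])
```

Once a table is a DataFrame, the setup_* functions only need Streamlit calls such as st.markdown and st.selectbox (both visible in the diff above) to display or filter it.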