Ludwig Stumpp committed on
Commit 1d376a9
1 Parent(s): 2591e9a
Files changed (2):
  1. README.md +25 -24
  2. streamlit_app.py +25 -1
README.md CHANGED
@@ -10,7 +10,7 @@ https://llm-leaderboard.streamlit.app/
 
 We are always happy for contributions! You can contribute by the following:
 
- - table work:
   - filling missing entries
   - adding a new model as a new row to the leaderboard and add the source of the evaluation to the sources table. Please keep alphabetic order.
   - adding a new benchmark as a new column in the leaderboard and add the benchmark to the benchmarks table. Please keep alphabetic order.
@@ -20,35 +20,36 @@ We are always happy for contributions! You can contribute by the following:
 
 ## Leaderboard
 
- | Model Name | Chatbot Arena Elo | LAMBADA (zero-shot) | TriviaQA (zero-shot) |
- | ----------------------- | ----------------- | ------------------- | -------------------- |
- | alpaca-13b | 1008 | | |
- | cerebras-7b | | 0.636 | 0.141 |
- | cerebras-13b | | 0.635 | 0.146 |
- | chatglm-6b | 985 | | |
- | dolly-v2-12b | 944 | | |
- | fastchat-t5-3b | 951 | | |
- | gpt-neox-20b | | 0.719 | 0.347 |
- | gptj-6b | | 0.683 | 0.234 |
- | koala-13b | 1082 | | |
- | llama-7b | | 0.738 | 0.443 |
- | llama-13b | 932 | | |
- | mpt-7b | | 0.702 | 0.343 |
- | opt-7b | | 0.677 | 0.227 |
- | opt-13b | | 0.692 | 0.282 |
- | stablelm-base-alpha-7b | | 0.533 | 0.049 |
- | stablelm-tuned-alpha-7b | 858 | | |
- | vicuna-13b | 1169 | | |
- | oasst-pythia-7b | | 0.667 | 0.198 |
- | oasst-pythia-12b | 1065 | 0.704 | 0.233 |
 
 ## Benchmarks
 
 | Benchmark Name | Author | Link | Description |
 | ----------------- | -------------- | ---------------------------------------- | ----------- |
 | Chatbot Arena Elo | LMSYS | https://lmsys.org/blog/2023-05-03-arena/ | "In this blog post, we introduce Chatbot Arena, an LLM benchmark platform featuring anonymous randomized battles in a crowdsourced manner. Chatbot Arena adopts the Elo rating system, which is a widely-used rating system in chess and other competitive games." (Source: https://lmsys.org/blog/2023-05-03-arena/) |
- | LAMBADA | Paperno et al. | https://arxiv.org/abs/1606.06031" | "The LAMBADA evaluates the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local context, but must be able to keep track of information in the broader discourse." (Source: https://huggingface.co/datasets/lambada) |
- | TriviaQA | Joshi et al. | https://arxiv.org/abs/1705.03551v2" | "We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions." (Source: https://arxiv.org/abs/1705.03551v2) |
 
 ## Sources
 
 
 
 We are always happy for contributions! You can contribute by the following:
 
+ - table work (don't forget the links):
   - filling missing entries
   - adding a new model as a new row to the leaderboard and add the source of the evaluation to the sources table. Please keep alphabetic order.
   - adding a new benchmark as a new column in the leaderboard and add the benchmark to the benchmarks table. Please keep alphabetic order.
 
 ## Leaderboard
 
+ | Model Name | [Chatbot Arena Elo](https://lmsys.org/blog/2023-05-03-arena/) | [LAMBADA (zero-shot)](https://arxiv.org/abs/1606.06031) | [TriviaQA (zero-shot)](https://arxiv.org/abs/1705.03551v2 ) |
+ | -------------------------------------------------------------------------------------- | ------------------------------------------------------------- | ------------------------------------------------------- | ----------------------------------------------------------- |
+ | [alpaca-13b](https://crfm.stanford.edu/2023/03/13/alpaca.html) | [1008](https://lmsys.org/blog/2023-05-03-arena/) | | |
+ | [cerebras-gpt-7b](https://huggingface.co/cerebras/Cerebras-GPT-6.7B) | | [0.636](https://www.mosaicml.com/blog/mpt-7b) | [0.141](https://www.mosaicml.com/blog/mpt-7b) |
+ | [cerebras-gpt-13b](https://huggingface.co/cerebras/Cerebras-GPT-13B) | | [0.635](https://www.mosaicml.com/blog/mpt-7b) | [0.146](https://www.mosaicml.com/blog/mpt-7b) |
+ | [chatglm-6b](https://chatglm.cn/blog) | [985](https://lmsys.org/blog/2023-05-03-arena/) | | |
+ | [dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b) | [944](https://lmsys.org/blog/2023-05-03-arena/) | | |
+ | [eleuther-pythia-7b](https://huggingface.co/EleutherAI/pythia-6.9b) | | [0.667](https://www.mosaicml.com/blog/mpt-7b) | [0.198](https://www.mosaicml.com/blog/mpt-7b) |
+ | [eleuther-pythia-12b](https://huggingface.co/EleutherAI/pythia-12b) | | [0.704](https://www.mosaicml.com/blog/mpt-7b) | [0.233](https://www.mosaicml.com/blog/mpt-7b) |
+ | [fastchat-t5-3b](https://huggingface.co/lmsys/fastchat-t5-3b-v1.0) | [951](https://lmsys.org/blog/2023-05-03-arena/) | | |
+ | [gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) | | [0.719](https://www.mosaicml.com/blog/mpt-7b) | [0.347](https://www.mosaicml.com/blog/mpt-7b) |
+ | [gptj-6b](https://huggingface.co/EleutherAI/gpt-j-6b) | | [0.683](https://www.mosaicml.com/blog/mpt-7b) | [0.234](https://www.mosaicml.com/blog/mpt-7b) |
+ | [koala-13b](https://bair.berkeley.edu/blog/2023/04/03/koala/) | [1082](https://lmsys.org/blog/2023-05-03-arena/) | | |
+ | [llama-7b](https://arxiv.org/abs/2302.13971) | | [0.738](https://www.mosaicml.com/blog/mpt-7b) | [0.443](https://www.mosaicml.com/blog/mpt-7b) |
+ | [llama-13b](https://arxiv.org/abs/2302.13971) | [932](https://lmsys.org/blog/2023-05-03-arena/) | | |
+ | [mpt-7b](https://huggingface.co/mosaicml/mpt-7b) | | [0.702](https://www.mosaicml.com/blog/mpt-7b) | [0.343](https://www.mosaicml.com/blog/mpt-7b) |
+ | [oasst-pythia-12b](https://huggingface.co/OpenAssistant/pythia-12b-pre-v8-12.5k-steps) | [1065](https://lmsys.org/blog/2023-05-03-arena/) | | |
+ | [opt-7b](https://huggingface.co/facebook/opt-6.7b) | | [0.677](https://www.mosaicml.com/blog/mpt-7b) | [0.227](https://www.mosaicml.com/blog/mpt-7b) |
+ | [opt-13b](https://huggingface.co/facebook/opt-13b) | | [0.692](https://www.mosaicml.com/blog/mpt-7b) | [0.282](https://www.mosaicml.com/blog/mpt-7b) |
+ | [stablelm-base-alpha-7b](https://huggingface.co/stabilityai/stablelm-base-alpha-7b) | | [0.533](https://www.mosaicml.com/blog/mpt-7b) | [0.049](https://www.mosaicml.com/blog/mpt-7b) |
+ | [stablelm-tuned-alpha-7b](https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b) | [858](https://lmsys.org/blog/2023-05-03-arena/) | | |
+ | [vicuna-13b](https://huggingface.co/lmsys/vicuna-13b-delta-v0) | [1169](https://lmsys.org/blog/2023-05-03-arena/) | | |
 
 ## Benchmarks
 
 | Benchmark Name | Author | Link | Description |
 | ----------------- | -------------- | ---------------------------------------- | ----------- |
 | Chatbot Arena Elo | LMSYS | https://lmsys.org/blog/2023-05-03-arena/ | "In this blog post, we introduce Chatbot Arena, an LLM benchmark platform featuring anonymous randomized battles in a crowdsourced manner. Chatbot Arena adopts the Elo rating system, which is a widely-used rating system in chess and other competitive games." (Source: https://lmsys.org/blog/2023-05-03-arena/) |
+ | LAMBADA | Paperno et al. | https://arxiv.org/abs/1606.06031 | "The LAMBADA evaluates the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local context, but must be able to keep track of information in the broader discourse." (Source: https://huggingface.co/datasets/lambada) |
+ | TriviaQA | Joshi et al. | https://arxiv.org/abs/1705.03551v2 | "We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions." (Source: https://arxiv.org/abs/1705.03551v2) |
 
 ## Sources
 
streamlit_app.py CHANGED
@@ -1,6 +1,8 @@
 
 import pandas as pd
 import streamlit as st
- import io
 
 
 def extract_table_and_format_from_markdown_text(markdown_table: str) -> pd.DataFrame:
@@ -59,6 +61,27 @@ def extract_markdown_table_from_multiline(multiline: str, table_headline: str, n
     return "".join(table)
 
 
 def filter_dataframe(df: pd.DataFrame) -> pd.DataFrame:
     """
     Adds a UI on top of a dataframe to let viewers filter columns
@@ -112,6 +135,7 @@ def setup_basic():
 
 def setup_leaderboard(readme: str):
     leaderboard_table = extract_markdown_table_from_multiline(readme, table_headline="## Leaderboard")
     df_leaderboard = extract_table_and_format_from_markdown_text(leaderboard_table)
 
     st.markdown("## Leaderboard")
 
1
+ import io
2
+ import re
3
+
4
  import pandas as pd
5
  import streamlit as st
 
6
 
7
 
8
  def extract_table_and_format_from_markdown_text(markdown_table: str) -> pd.DataFrame:
 
61
  return "".join(table)
62
 
63
 
64
+ def remove_markdown_links(text: str) -> str:
65
+ """Modifies a markdown text to remove all markdown links.
66
+ Example: [DISPLAY](LINK) to DISPLAY
67
+ First find all markdown links with regex.
68
+ Then replace them with: $1
69
+ Args:
70
+ text (str): Markdown text containing markdown links
71
+ Returns:
72
+ str: Markdown text without markdown links.
73
+ """
74
+
75
+ # find all markdown links
76
+ markdown_links = re.findall(r"\[([^\]]+)\]\(([^)]+)\)", text)
77
+
78
+ # remove link keep display text
79
+ for display, link in markdown_links:
80
+ text = text.replace(f"[{display}]({link})", display)
81
+
82
+ return text
83
+
84
+
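As a quick sanity check, the new helper can be exercised on a single leaderboard row (the function is restated here so the snippet runs standalone, outside the app):

```python
import re

def remove_markdown_links(text: str) -> str:
    """Replace every [DISPLAY](LINK) markdown link with just DISPLAY."""
    for display, link in re.findall(r"\[([^\]]+)\]\(([^)]+)\)", text):
        text = text.replace(f"[{display}]({link})", display)
    return text

row = "| [mpt-7b](https://huggingface.co/mosaicml/mpt-7b) | [0.702](https://www.mosaicml.com/blog/mpt-7b) |"
print(remove_markdown_links(row))  # → | mpt-7b | 0.702 |
```

Note that `str.replace` substitutes every occurrence, so a link that appears in several cells (as the mosaicml source link does) is stripped everywhere in one pass.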
 def filter_dataframe(df: pd.DataFrame) -> pd.DataFrame:
     """
     Adds a UI on top of a dataframe to let viewers filter columns
 
 def setup_leaderboard(readme: str):
     leaderboard_table = extract_markdown_table_from_multiline(readme, table_headline="## Leaderboard")
+     leaderboard_table = remove_markdown_links(leaderboard_table)
     df_leaderboard = extract_table_and_format_from_markdown_text(leaderboard_table)
 
     st.markdown("## Leaderboard")
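Taken together, `setup_leaderboard` now extracts the table under `## Leaderboard`, strips the links, then parses the cells. A minimal end-to-end sketch of that flow, using a simplified stand-in for `extract_markdown_table_from_multiline` (the real helper's internals are not shown in this diff) and no Streamlit dependency:

```python
import re

def extract_table(readme: str, headline: str) -> str:
    """Simplified stand-in: grab the '|'-prefixed lines after the headline."""
    lines = readme.splitlines()
    start = lines.index(headline)
    return "\n".join(ln for ln in lines[start + 1:] if ln.startswith("|"))

def remove_markdown_links(text: str) -> str:
    for display, link in re.findall(r"\[([^\]]+)\]\(([^)]+)\)", text):
        text = text.replace(f"[{display}]({link})", display)
    return text

readme = """## Leaderboard

| Model Name | [Chatbot Arena Elo](https://lmsys.org/blog/2023-05-03-arena/) |
| ---------- | ------------------------------------------------------------- |
| [vicuna-13b](https://huggingface.co/lmsys/vicuna-13b-delta-v0) | [1169](https://lmsys.org/blog/2023-05-03-arena/) |
"""

table = remove_markdown_links(extract_table(readme, "## Leaderboard"))
# split each row into stripped cells (header, separator, data rows)
rows = [[c.strip() for c in ln.strip("|").split("|")] for ln in table.splitlines()]
print(rows[2])  # → ['vicuna-13b', '1169']
```

The order matters: links are removed before parsing, so the dataframe cells hold plain display text rather than raw markdown.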