Ludwig Stumpp committed on
Commit 908b597 • 1 Parent(s): 1df71ac

Switch back to markdown as easier diffable

.vscode/extensions.json CHANGED
@@ -1,6 +1,5 @@
 {
     "recommendations": [
-        "janisdd.vscode-edit-csv",
-        "mechatroner.rainbow-csv"
+        "takumii.markdowntable"
     ]
 }
README.md CHANGED
@@ -1,21 +1,55 @@
 # πŸ† llm-leaderboard
-A joint community effort to create one central leaderboard for LLMs. Contributions and corrections welcome!
-
-## Leaderboard
 
+A joint community effort to create one central leaderboard for LLMs. Contributions and corrections welcome!
 Visit the interactive leaderboard at https://llm-leaderboard.streamlit.app/.
 
-![Screenshot of streamlit application](media/streamlit_screenshot.jpg)
-
 ## How to Contribute
 
 We are always happy for contributions! You can contribute by the following:
 
 - table work:
   - filling missing entries
-  - adding a new model as a new row in `data/leaderboard.csv`, add the source of the evaluation to `data/sources.csv`
-  - adding a new benchmark as a new column in `data/leaderboard.csv` and add the benchmark to `data/benchmarks.csv`
+  - adding a new model as a new row to the leaderboard and add the source of the evaluation to the sources table
+  - adding a new benchmark as a new column in the leaderboard and add the benchmark to the benchmarks table
 - code work:
   - improving the existing code
   - requesting and implementing new features
 
+## Leaderboard
+
+| Model Name              | Chatbot Arena Elo | LAMBADA (zero-shot) | TriviaQA (zero-shot) |
+| ----------------------- | ----------------- | ------------------- | -------------------- |
+| alpaca-13b              | 1008              |                     |                      |
+| cerebras-7b             |                   | 0.636               | 0.141                |
+| cerebras-13b            |                   | 0.635               | 0.146                |
+| chatglm-6b              | 985               |                     |                      |
+| dolly-v2-12b            | 944               |                     |                      |
+| fastchat-t5-3b          | 951               |                     |                      |
+| gpt-neox-20b            |                   | 0.719               | 0.347                |
+| gptj-6b                 |                   | 0.683               | 0.234                |
+| koala-13b               | 1082              |                     |                      |
+| llama-7b                |                   | 0.738               | 0.443                |
+| llama-13b               | 932               |                     |                      |
+| mpt-7b                  |                   | 0.702               | 0.343                |
+| opt-7b                  |                   | 0.677               | 0.227                |
+| opt-13b                 |                   | 0.692               | 0.282                |
+| stablelm-base-alpha-7b  |                   | 0.533               | 0.049                |
+| stablelm-tuned-alpha-7b | 858               |                     |                      |
+| vicuna-13b              | 1169              |                     |                      |
+| oasst-pythia-7b         |                   | 0.667               | 0.198                |
+| oasst-pythia-12b        | 1065              | 0.704               | 0.233                |
+
+## Benchmarks
+
+| Benchmark Name    | Author         | Link                                     | Description |
+| ----------------- | -------------- | ---------------------------------------- | ----------- |
+| Chatbot Arena Elo | LMSYS          | https://lmsys.org/blog/2023-05-03-arena/ | "In this blog post, we introduce Chatbot Arena, an LLM benchmark platform featuring anonymous randomized battles in a crowdsourced manner. Chatbot Arena adopts the Elo rating system, which is a widely-used rating system in chess and other competitive games." (Source: https://lmsys.org/blog/2023-05-03-arena/) |
+| LAMBADA           | Paperno et al. | https://arxiv.org/abs/1606.06031"        | "The LAMBADA evaluates the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local context, but must be able to keep track of information in the broader discourse." (Source: https://huggingface.co/datasets/lambada) |
+| TriviaQA          | Joshi et al.   | https://arxiv.org/abs/1705.03551v2"      | "We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions." (Source: https://arxiv.org/abs/1705.03551v2) |
+
+## Sources
+
+| Author   | Link                                     |
+| -------- | ---------------------------------------- |
+| LMSYS    | https://lmsys.org/blog/2023-05-03-arena/ |
+| MOSAICML | https://www.mosaicml.com/blog/mpt-7b     |
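
For orientation when contributing: under the markdown-based workflow above, adding a model means appending one row to the Leaderboard table (for example a placeholder row such as | some-new-model-7b | | 0.700 | 0.300 |, where the name and scores are made up purely for illustration) and adding a matching row with the evaluation source to the Sources table.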
data/benchmarks.csv DELETED
@@ -1,4 +0,0 @@
-"Benchmark Name" ,"Author" ,"Link" ,"Description "
-"Chatbot Arena Elo" ,"LMSYS" ,"https://lmsys.org/blog/2023-05-03-arena/" ,"In this blog post, we introduce Chatbot Arena, an LLM benchmark platform featuring anonymous randomized battles in a crowdsourced manner. Chatbot Arena adopts the Elo rating system, which is a widely-used rating system in chess and other competitive games. (Source: https://lmsys.org/blog/2023-05-03-arena/)"
-"LAMBADA" ,"Paperno et al." ,"https://arxiv.org/abs/1606.06031" ,"The LAMBADA evaluates the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local context, but must be able to keep track of information in the broader discourse. (Source: https://huggingface.co/datasets/lambada)"
-"TriviaQA" ,"Joshi et al." ,"https://arxiv.org/abs/1705.03551v2" ,"We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions. (Source: https://arxiv.org/abs/1705.03551v2)"
data/leaderboard.csv DELETED
@@ -1,20 +0,0 @@
-Model Name ,Chatbot Arena Elo ,LAMBADA (zero-shot) ,TriviaQA (zero-shot)
-alpaca-13b , 1008 , ,
-cerebras-7b , , 0.636 , 0.141
-cerebras-13b , , 0.635 , 0.146
-chatglm-6b , 985 , ,
-dolly-v2-12b , 944 , ,
-fastchat-t5-3b , 951 , ,
-gpt-neox-20b , , 0.719 , 0.347
-gptj-6b , , 0.683 , 0.234
-koala-13b , 1082 , ,
-llama-7b , , 0.738 , 0.443
-llama-13b , 932 , ,
-mpt-7b , , 0.702 , 0.343
-opt-7b , , 0.677 , 0.227
-opt-13b , , 0.692 , 0.282
-stablelm-base-alpha-7b , , 0.533 , 0.049
-stablelm-tuned-alpha-7b , 858 , ,
-vicuna-13b , 1169 , ,
-oasst-pythia-7b , , 0.667 , 0.198
-oasst-pythia-12b , 1065 , 0.704 , 0.233
data/sources.csv DELETED
@@ -1,3 +0,0 @@
-Author ,Link
-LMSYS ,https://lmsys.org/blog/2023-05-03-arena/
-MOSAICML ,https://www.mosaicml.com/blog/mpt-7b
streamlit_app.py CHANGED
@@ -1,25 +1,62 @@
 import pandas as pd
-import requests
 import streamlit as st
+import io
 
-REPO_URL = "https://github.com/LudwigStumpp/llm-leaderboard"
-LEADERBOARD_PATH = "data/leaderboard.csv"
-BENCHMARKS_PATH = "data/benchmarks.csv"
-SOURCES_PATH = "data/sources.csv"
 
+def extract_table_and_format_from_markdown_text(markdown_table: str) -> pd.DataFrame:
+    """Extracts a table from a markdown text and formats it as a pandas DataFrame.
 
-def grab_file_from_repo(repo_url: str, filename: str) -> str:
-    """Grabs a file from a GitHub repository.
+    Args:
+        text (str): Markdown text containing a table.
+
+    Returns:
+        pd.DataFrame: Table as pandas DataFrame.
+    """
+    df = (
+        pd.read_table(io.StringIO(markdown_table), sep="|", header=0, index_col=1)
+        .dropna(axis=1, how="all")  # drop empty columns
+        .iloc[1:]  # drop first row which is the "----" separator of the original markdown table
+        .sort_index(ascending=True)
+        .replace(r"^\s*$", float("nan"), regex=True)
+        .astype(float, errors="ignore")
+    )
+
+    # remove whitespace from column names and index
+    df.columns = df.columns.str.strip()
+    df.index = df.index.str.strip()
+
+    return df
+
+
+def extract_markdown_table_from_multiline(multiline: str, table_headline: str, next_headline_start: str = "#") -> str:
+    """Extracts the markdown table from a multiline string.
 
     Args:
-        repo_url (str): URL of the GitHub repository.
-        filename (str): Name of the file to grab.
+        multiline (str): content of README.md file.
+        table_headline (str): Headline of the table in the README.md file.
+        next_headline_start (str, optional): Start of the next headline. Defaults to "#".
 
     Returns:
-        str: Content of the file.
+        str: Markdown table.
+
+    Raises:
+        ValueError: If the table could not be found.
     """
-    url = repo_url.replace("github.com", "raw.githubusercontent.com") + f"/main/{filename}"
-    return requests.get(url).text
+    # extract everything between the table headline and the next headline
+    table = []
+    start = False
+    for line in multiline.split("\n"):
+        if line.startswith(table_headline):
+            start = True
+        elif line.startswith(next_headline_start):
+            start = False
+        elif start:
+            table.append(line + "\n")
+
+    if len(table) == 0:
+        raise ValueError(f"Could not find table with headline '{table_headline}'")
+
+    return "".join(table)
 
 
 def filter_dataframe(df: pd.DataFrame) -> pd.DataFrame:
@@ -56,7 +93,7 @@ def filter_dataframe(df: pd.DataFrame) -> pd.DataFrame:
 
 
 def setup_basic():
-    title = "LLM-Leaderboard"
+    title = "πŸ† LLM-Leaderboard"
 
     st.set_page_config(
         page_title=title,
@@ -73,24 +110,22 @@ def setup_basic():
     )
 
 
-def setup_table():
-    df = pd.read_csv(LEADERBOARD_PATH, index_col=0)
-    df = df.sort_index(ascending=True)
-    df = df.replace(r"^\s*$", float("nan"), regex=True)
-    df = df.astype(float, errors="ignore")
+def setup_leaderboard(readme: str):
+    leaderboard_table = extract_markdown_table_from_multiline(readme, table_headline="## Leaderboard")
+    df_leaderboard = extract_table_and_format_from_markdown_text(leaderboard_table)
 
-    st.markdown("### Leaderboard")
-    st.dataframe(filter_dataframe(df))
+    st.markdown("## Leaderboard")
+    st.dataframe(filter_dataframe(df_leaderboard))
 
 
-def setup_benchmarks():
-    df = pd.read_csv(BENCHMARKS_PATH, index_col=0)
-    df = df.sort_index(ascending=True)
+def setup_benchmarks(readme: str):
    benchmarks_table = extract_markdown_table_from_multiline(readme, table_headline="## Benchmarks")
+    df_benchmarks = extract_table_and_format_from_markdown_text(benchmarks_table)
 
-    st.markdown("### Covered Benchmarks")
+    st.markdown("## Covered Benchmarks")
 
-    selected_benchmark = st.selectbox("Select a benchmark to learn more:", df.index.unique())
-    df_selected = df.loc[selected_benchmark]
+    selected_benchmark = st.selectbox("Select a benchmark to learn more:", df_benchmarks.index.unique())
+    df_selected = df_benchmarks.loc[selected_benchmark]
     text = [
        f"Name: {selected_benchmark} ",
    ]
@@ -99,14 +134,14 @@ def setup_benchmarks():
    st.markdown("\n".join(text))
 
 
-def setup_sources():
-    df = pd.read_csv(SOURCES_PATH, index_col=0)
-    df = df.sort_index(ascending=True)
+def setup_sources(readme: str):
+    sources_table = extract_markdown_table_from_multiline(readme, table_headline="## Sources")
+    df_sources = extract_table_and_format_from_markdown_text(sources_table)
 
-    st.markdown("### Sources of Above Figures")
+    st.markdown("## Sources of Above Figures")
 
-    selected_source = st.selectbox("Select a source to learn more:", df.index.unique())
-    df_selected = df.loc[selected_source]
+    selected_source = st.selectbox("Select a source to learn more:", df_sources.index.unique())
+    df_selected = df_sources.loc[selected_source]
    text = [
        f"Author: {selected_source} ",
    ]
@@ -126,9 +161,13 @@ def setup_footer():
 
 def main():
     setup_basic()
-    setup_table()
-    setup_benchmarks()
-    setup_sources()
+
+    with open("README.md", "r") as f:
+        readme = f.read()
+
+    setup_leaderboard(readme)
+    setup_benchmarks(readme)
+    setup_sources(readme)
     setup_footer()
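
To make the new README-parsing path easier to follow, here is a small self-contained sketch that mirrors (rather than imports) the two helpers added above; the README snippet, model names, and scores in it are placeholders, not real leaderboard data:

```python
import io

import pandas as pd

# Stand-in for README.md content (placeholder rows, not actual leaderboard figures).
readme = """
# llm-leaderboard

## Leaderboard

| Model Name | Chatbot Arena Elo | LAMBADA (zero-shot) |
| ---------- | ----------------- | ------------------- |
| model-a-7b | 1000              |                     |
| model-b-7b |                   | 0.700               |

## Benchmarks
"""

# Step 1: slice out the lines between "## Leaderboard" and the next "#" headline,
# mirroring extract_markdown_table_from_multiline.
table_lines, inside = [], False
for line in readme.split("\n"):
    if line.startswith("## Leaderboard"):
        inside = True
    elif line.startswith("#"):
        inside = False
    elif inside:
        table_lines.append(line + "\n")
markdown_table = "".join(table_lines)

# Step 2: parse the markdown table with pandas, mirroring
# extract_table_and_format_from_markdown_text.
df = (
    pd.read_table(io.StringIO(markdown_table), sep="|", header=0, index_col=1)
    .dropna(axis=1, how="all")  # the leading/trailing "|" become all-NaN columns
    .iloc[1:]                   # drop the "----" separator row
    .replace(r"^\s*$", float("nan"), regex=True)  # empty cells -> NaN
)
df.columns = df.columns.str.strip()
df.index = df.index.str.strip()

print(df)  # two rows indexed by model name, NaN where a score is missing
```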