caesar-one committed
Commit dfb91d7 • 1 Parent(s): 416947d

First commit!

Files changed (3)
  1. README.md +47 -5
  2. main.py +355 -0
  3. requirements.txt +2 -0
README.md CHANGED
@@ -1,13 +1,55 @@
  ---
- title: Italian Open Llm Leaderboard
- emoji: 💻
  colorFrom: red
  colorTo: green
  sdk: streamlit
  sdk_version: 1.34.0
- app_file: app.py
- pinned: false
  license: apache-2.0
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: Italian Open LLM Leaderboard
+ emoji: 🤌
  colorFrom: red
  colorTo: green
  sdk: streamlit
  sdk_version: 1.34.0
+ app_file: main.py
+ pinned: True
  license: apache-2.0
  ---

+ # 🏆 Italian LLM-Leaderboard
+
+ A leaderboard comparing large language models for Italian, covering ARC Challenge, HellaSwag, MMLU, and TruthfulQA.
+
+ ## Leaderboard
+ | Model Name | Year | Publisher | Num. Parameters | Open? | Model Type | Average | Average (Zero-shot) | Average (N-shot) | ARC Challenge (zero-shot) | ARC Challenge (25-shot) | HellaSwag (zero-shot) | HellaSwag (10-shot) | MMLU (zero-shot) | MMLU (5-shot) | TruthfulQA (zero-shot MC2) |
+ |--------------------------------------------------------------------------------------------|------|-------------------------------------------|-----------------|-------|---------------|---------|---------------------|------------------|---------------------------|-------------------------|-----------------------|---------------------|------------------|---------------|----------------------------|
+ | [DanteLLM](https://huggingface.co/rstless-research/DanteLLM-7B-Instruct-Italian-v0.1-GGUF) | 2023 | RSTLess (Sapienza University of Rome) | 7B | yes | Italian FT | 47.52 | 47.34 | 47.69 | 41.89 | 47.01 | 47.99 | 47.79 | 47.05 | 48.27 | 52.41 |
+ | [OpenDanteLLM](https://huggingface.co/rstless-research/) | 2023 | RSTLess (Sapienza University of Rome) | 7B | yes | Italian FT | 45.97 | 45.13 | 46.80 | 41.72 | 46.76 | 46.49 | 46.75 | 44.25 | 46.89 | 48.06 |
+ | [Mistral v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) | 2023 | Mistral AI | 7B | yes | English | 44.29 | 45.15 | 43.43 | 37.46 | 41.47 | 43.48 | 42.99 | 44.66 | 45.84 | 54.99 |
+ | [LLaMAntino](https://huggingface.co/swap-uniba/LLaMAntino-2-7b-hf-ITA) | 2024 | Bari University | 7B | yes | Italian FT | 41.66 | 40.86 | 42.46 | 38.22 | 41.72 | 46.30 | 46.91 | 33.89 | 38.74 | 45.03 |
+ | [Fauno2](https://huggingface.co/andreabac3/Fauno2-LLaMa2-7B) | 2023 | RSTLess (Sapienza University of Rome) | 7B | yes | Italian FT | 41.74 | 42.90 | 40.57 | 36.26 | 39.33 | 44.25 | 44.07 | 40.30 | 38.32 | 50.77 |
+ | [Fauno1](https://huggingface.co/andreabac3/Fauno2-LLaMa2-7B) | 2023 | RSTLess (Sapienza University of Rome) | 7B | yes | Italian FT | 36.91 | 37.20 | 36.61 | 33.10 | 36.52 | 43.13 | 42.86 | 28.79 | 30.45 | 43.78 |
+ | [Camoscio](https://huggingface.co/teelinsan/camoscio-7b-llama) | 2023 | Gladia (Sapienza University of Rome) | 7B | yes | Italian FT | 37.22 | 38.01 | 36.42 | 33.28 | 36.60 | 42.91 | 43.29 | 30.53 | 29.38 | 45.33 |
+ | [LLaMA2](https://huggingface.co/meta-llama/Llama-2-7b) | 2023 | Meta | 7B | yes | English | 39.50 | 39.14 | 39.86 | 33.28 | 37.71 | 44.31 | 43.97 | 34.12 | 37.91 | 44.83 |
+ | [BloomZ](https://huggingface.co/bigscience/bloomz-7b1) | 2022 | BigScience | 7B | yes | Multilingual | 33.97 | 36.01 | 31.93 | 27.30 | 28.24 | 34.83 | 35.88 | 36.40 | 31.67 | 45.52 |
+ | [iT5](https://huggingface.co/gsarti/it5-large) | 2022 | Groningen University | 738M | yes | Italian | 29.27 | 32.42 | 26.11 | 27.39 | 27.99 | 28.11 | 26.04 | 23.69 | 24.31 | 50.49 |
+ | [GePpeTto](https://huggingface.co/LorenzoDeMattei/GePpeTto) | 2020 | Pisa/Groningen University, FBK, Aptus.AI | 117M | yes | Italian | 27.86 | 30.89 | 24.82 | 24.15 | 25.08 | 26.34 | 24.99 | 22.87 | 24.39 | 50.20 |
+ | [mT5](https://huggingface.co/google/mt5-large) | 2020 | Google | 3.7B | yes | Multilingual | 29.00 | 30.99 | 27.01 | 25.94 | 27.56 | 26.96 | 27.86 | 25.56 | 25.60 | 45.50 |
+ | [Minerva 3B](https://huggingface.co/sapienzanlp/Minerva-3B-base-v1.0) | 2024 | SapienzaNLP (Sapienza University of Rome) | 3B | yes | Multilingual | 33.94 | 34.37 | 33.52 | 30.29 | 30.89 | 42.38 | 43.16 | 24.62 | 26.50 | 40.18 |
+ | [Minerva 1B](https://huggingface.co/sapienzanlp/Minerva-1B-base-v1.0) | 2024 | SapienzaNLP (Sapienza University of Rome) | 1B | yes | Multilingual | 29.78 | 31.46 | 28.09 | 24.32 | 25.25 | 34.01 | 34.07 | 24.69 | 24.94 | 42.84 |
+ | [Minerva 350M](https://huggingface.co/sapienzanlp/Minerva-350M-base-v1.0) | 2024 | SapienzaNLP (Sapienza University of Rome) | 350M | yes | Multilingual | 28.35 | 30.72 | 26 | 23.21 | 24.32 | 29.33 | 29.37 | 23.10 | 24.29 | 47.23 |
+
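As a sanity check, the aggregate columns appear to follow a simple scheme: "Average (Zero-shot)" is the mean of the four zero-shot scores, "Average (N-shot)" is the mean of the three few-shot scores, and "Average" is the mean of those two sub-averages. A minimal sketch using the Mistral v0.2 row (the scheme is inferred from the table above, not documented in this repo):

```python
# Recompute the aggregate columns from the per-benchmark scores (Mistral v0.2 row above).
zero_shot = [37.46, 43.48, 44.66, 54.99]  # ARC, HellaSwag, MMLU, TruthfulQA (zero-shot)
n_shot = [41.47, 42.99, 45.84]            # ARC 25-shot, HellaSwag 10-shot, MMLU 5-shot

avg_zero_shot = sum(zero_shot) / len(zero_shot)  # ~45.15 -> "Average (Zero-shot)"
avg_n_shot = sum(n_shot) / len(n_shot)           # ~43.43 -> "Average (N-shot)"
avg_overall = (avg_zero_shot + avg_n_shot) / 2   # ~44.29 -> "Average"

print(round(avg_zero_shot, 2), round(avg_n_shot, 2), round(avg_overall, 2))
```

The other rows reproduce the same way, up to ±0.01 from rounding of the published scores.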
+ ## Benchmarks
+
+ | Benchmark Name | Author | Link | Description |
+ | ----------------- |------------------| ---------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+ | ARC Challenge | Clark et al. | https://arxiv.org/abs/1803.05457 | "We present a new question set, text corpus, and baselines assembled to encourage AI research in advanced question answering. Together, these constitute the AI2 Reasoning Challenge (ARC), which requires far more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI. The ARC question set is partitioned into a Challenge Set and an Easy Set, where the Challenge Set contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. The dataset contains only natural, grade-school science questions (authored for human tests), and is the largest public-domain set of this kind (7,787 questions). We test several baselines on the Challenge Set, including leading neural models from the SQuAD and SNLI tasks, and find that none are able to significantly outperform a random baseline, reflecting the difficult nature of this task. We are also releasing the ARC Corpus, a corpus of 14M science sentences relevant to the task, and implementations of the three neural baseline models tested. Can your model perform better? We pose ARC as a challenge to the community." |
+ | HellaSwag | Zellers et al. | https://arxiv.org/abs/1905.07830v1 | "HellaSwag is a challenge dataset for evaluating commonsense NLI that is specially hard for state-of-the-art models, though its questions are trivial for humans (>95% accuracy)." (Source: https://paperswithcode.com/dataset/hellaswag) |
+ | MMLU | Hendrycks et al. | https://github.com/hendrycks/test | "The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects makes the benchmark ideal for identifying a modelโ€™s blind spots." (Source: "https://paperswithcode.com/dataset/mmlu") |
+ | TruthfulQA | Lin et al. | https://arxiv.org/abs/2109.07958 | "We propose a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. We crafted questions that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts. We tested GPT-3, GPT-Neo/J, GPT-2 and a T5-based model. The best model was truthful on 58% of questions, while human performance was 94%. Models generated many false answers that mimic popular misconceptions and have the potential to deceive humans. The largest models were generally the least truthful. This contrasts with other NLP tasks, where performance improves with model size. However, this result is expected if false answers are learned from the training distribution. We suggest that scaling up models alone is less promising for improving truthfulness than fine-tuning using training objectives other than imitation of text from the web." |
+
+
+ ## Authors
+ - [Andrea Bacciu](https://www.linkedin.com/in/andreabacciu/)* (Work done prior to joining Amazon)
+ - [Cesare Campagnano](https://www.linkedin.com/in/caesar-one/)*
+ - [Giovanni Trappolini](https://www.linkedin.com/in/giovanni-trappolini/)
+ - [Professor Fabrizio Silvestri](https://www.linkedin.com/in/fabrizio-silvestri-a6b0391/)
+
+ \* Equal contribution.
+
+ ### Ack
+ Special thanks to https://github.com/LudwigStumpp/llm-leaderboard for the initial inspiration and codebase.
+
main.py ADDED
@@ -0,0 +1,355 @@
+ import io
+ import re
+ from typing import Iterable, List
+
+ import pandas as pd
+ import streamlit as st
+ from pandas.api.types import is_bool_dtype, is_datetime64_any_dtype, is_numeric_dtype
+
+ GITHUB_URL = "https://github.com/RSTLess-research/"
+ NON_BENCHMARK_COLS = ["Open?", "Publisher"]
+
+
+ def extract_table_and_format_from_markdown_text(markdown_table: str) -> pd.DataFrame:
+     """Extracts a table from a markdown text and formats it as a pandas DataFrame.
+
+     Args:
+         markdown_table (str): Markdown text containing a table.
+
+     Returns:
+         pd.DataFrame: Table as pandas DataFrame.
+     """
+     df = (
+         pd.read_table(io.StringIO(markdown_table), sep="|", header=0, index_col=1)
+         .dropna(axis=1, how="all")  # drop empty columns
+         .iloc[1:]  # drop first row, which is the "----" separator of the original markdown table
+         .sort_index(ascending=True)
+         .apply(lambda x: x.str.strip() if x.dtype == "object" else x)
+         .replace("", float("NaN"))
+         .astype(float, errors="ignore")
+     )
+
+     # remove whitespace from column names and index
+     df.columns = df.columns.str.strip()
+     df.index = df.index.str.strip()
+     df.index.name = df.index.name.strip()
+
+     return df
+
+
+ def extract_markdown_table_from_multiline(multiline: str, table_headline: str, next_headline_start: str = "#") -> str:
+     """Extracts the markdown table from a multiline string.
+
+     Args:
+         multiline (str): Content of the README.md file.
+         table_headline (str): Headline of the table in the README.md file.
+         next_headline_start (str, optional): Start of the next headline. Defaults to "#".
+
+     Returns:
+         str: Markdown table.
+
+     Raises:
+         ValueError: If the table could not be found.
+     """
+     # extract everything between the table headline and the next headline
+     table = []
+     start = False
+     for line in multiline.split("\n"):
+         if line.startswith(table_headline):
+             start = True
+         elif line.startswith(next_headline_start):
+             start = False
+         elif start:
+             table.append(line + "\n")
+
+     if len(table) == 0:
+         raise ValueError(f"Could not find table with headline '{table_headline}'")
+
+     return "".join(table)
+
+
+ def remove_markdown_links(text: str) -> str:
+     """Modifies a markdown text to remove all markdown links.
+
+     Example: [DISPLAY](LINK) becomes DISPLAY.
+
+     Args:
+         text (str): Markdown text containing markdown links.
+
+     Returns:
+         str: Markdown text without markdown links.
+     """
+     # find all markdown links
+     markdown_links = re.findall(r"\[([^\]]+)\]\(([^)]+)\)", text)
+
+     # remove each link, keeping only the display text
+     for display, link in markdown_links:
+         text = text.replace(f"[{display}]({link})", display)
+
+     return text
+
+
+ def filter_dataframe_by_model_type(df: pd.DataFrame, model_type_column: str = 'Model Type', ignore_columns: List[str] = None) -> pd.DataFrame:
+     """
+     Filter the dataframe rows by model type and restrict the columns to a user selection.
+
+     This function provides a user interface to select model types and columns for filtering a DataFrame.
+     Model types are dynamically derived from the column specified as 'model_type_column'.
+
+     Args:
+         df (pd.DataFrame): Original dataframe.
+         model_type_column (str): Column name that contains model types for filtering.
+         ignore_columns (List[str], optional): Columns to hide from the column selection. Defaults to None.
+
+     Returns:
+         pd.DataFrame: Filtered dataframe.
+     """
+     df = df.copy()
+
+     if ignore_columns is None:
+         ignore_columns = []
+
+     # Streamlit UI container
+     modification_container = st.container()
+
+     with modification_container:
+         # Selection for model types
+         unique_model_types = sorted(df[model_type_column].unique())
+         selected_model_types = st.multiselect("Filter by model type:", unique_model_types)
+
+         # Filter dataframe by selected model types
+         if selected_model_types:
+             df = df[df[model_type_column].isin(selected_model_types)]
+
+         # Column selection excluding the model type column and any specified ignore columns
+         valid_columns = sorted(set(df.columns) - set(ignore_columns) - {model_type_column})
+         selected_columns = st.multiselect("Filter by columns:", valid_columns)
+
+         # Keep only the selected columns plus the model type column
+         if selected_columns:
+             df = pd.DataFrame(df[[model_type_column] + selected_columns])
+
+     return df
+
+
+ def filter_dataframe_by_row_and_columns(df: pd.DataFrame, ignore_columns: List[str] = None) -> pd.DataFrame:
+     """
+     Filter dataframe by the rows and columns to display.
+
+     This does not select based on the values in the dataframe, but rather on the index and columns.
+     Modified from https://blog.streamlit.io/auto-generate-a-dataframe-filtering-ui-in-streamlit-with-filter_dataframe/
+
+     Args:
+         df (pd.DataFrame): Original dataframe.
+         ignore_columns (List[str], optional): Columns to ignore. Defaults to None.
+
+     Returns:
+         pd.DataFrame: Filtered dataframe.
+     """
+     df = df.copy()
+
+     if ignore_columns is None:
+         ignore_columns = []
+
+     modification_container = st.container()
+
+     with modification_container:
+         to_filter_index = st.multiselect("Filter by model:", sorted(df.index))
+         if to_filter_index:
+             df = pd.DataFrame(df.loc[to_filter_index])
+
+         to_filter_columns = st.multiselect(
+             "Filter by benchmark:", sorted([c for c in df.columns if c not in ignore_columns])
+         )
+         if to_filter_columns:
+             df = pd.DataFrame(df[ignore_columns + to_filter_columns])
+
+     return df
+
+
+ def filter_dataframe_by_column_values(df: pd.DataFrame) -> pd.DataFrame:
+     """
+     Filter dataframe by the values in the dataframe.
+
+     Modified from https://blog.streamlit.io/auto-generate-a-dataframe-filtering-ui-in-streamlit-with-filter_dataframe/
+
+     Args:
+         df (pd.DataFrame): Original dataframe.
+
+     Returns:
+         pd.DataFrame: Filtered dataframe.
+     """
+     df = df.copy()
+
+     modification_container = st.container()
+
+     with modification_container:
+         to_filter_columns = st.multiselect("Filter results on:", df.columns)
+         left, right = st.columns((1, 20))
+
+         for column in to_filter_columns:
+             # boolean columns: single checkbox
+             if is_bool_dtype(df[column]):
+                 user_bool_input = right.checkbox(f"{column}", value=True)
+                 df = df[df[column] == user_bool_input]
+
+             # numeric columns: range slider
+             elif is_numeric_dtype(df[column]):
+                 _min = float(df[column].min())
+                 _max = float(df[column].max())
+
+                 if (_min != _max) and pd.notna(_min) and pd.notna(_max):
+                     step = 0.01
+                     user_num_input = right.slider(
+                         f"Values for {column}:",
+                         min_value=round(_min - step, 2),
+                         max_value=round(_max + step, 2),
+                         value=(_min, _max),
+                         step=step,
+                     )
+                     df = df[df[column].between(*user_num_input)]
+
+             # datetime columns: date range picker
+             elif is_datetime64_any_dtype(df[column]):
+                 user_date_input = right.date_input(
+                     f"Values for {column}:",
+                     value=(
+                         df[column].min(),
+                         df[column].max(),
+                     ),
+                 )
+                 if isinstance(user_date_input, Iterable) and len(user_date_input) == 2:
+                     user_date_input_datetime = tuple(map(pd.to_datetime, user_date_input))
+                     start_date, end_date = user_date_input_datetime
+                     df = df.loc[df[column].between(start_date, end_date)]
+
+             # everything else (strings / categories): multiselect over unique values
+             else:
+                 selected_values = right.multiselect(
+                     f"Values for {column}:",
+                     sorted(df[column].unique()),
+                 )
+
+                 if selected_values:
+                     df = df[df[column].isin(selected_values)]
+
+     return df
+
+
+ def setup_basic():
+     title = "🏆🇮🇹🤌 Italian LLM-Leaderboard"
+
+     st.set_page_config(
+         page_title=title,
+         page_icon="🏆🇮🇹🤌",
+         layout="wide",
+     )
+     st.title(title)
+
+     st.markdown(
+         "The Italian LLM Leaderboard, published along with the paper DanteLLM: Let's Push Italian LLM Research Forward! 🤌🇮🇹🏆 (LREC-COLING 2024).\n"
+     )
+
+
+ def setup_leaderboard(readme: str):
+     leaderboard_table = extract_markdown_table_from_multiline(readme, table_headline="## Leaderboard")
+     leaderboard_table = remove_markdown_links(leaderboard_table)
+     df_leaderboard = extract_table_and_format_from_markdown_text(leaderboard_table)
+     df_leaderboard["Open?"] = df_leaderboard["Open?"].map({"yes": 1, "no": 0}).astype(bool)
+
+     st.markdown("## Leaderboard")
+     modify = st.checkbox("Add filters")
+
+     if modify:
+         df_leaderboard = filter_dataframe_by_row_and_columns(df_leaderboard, ignore_columns=NON_BENCHMARK_COLS)
+         df_leaderboard = filter_dataframe_by_column_values(df_leaderboard)
+         df_leaderboard = filter_dataframe_by_model_type(df_leaderboard)
+
+     df_leaderboard = df_leaderboard.sort_values(by=['Average'], ascending=False)
+     df_leaderboard["Rank"] = df_leaderboard["Average"].rank(ascending=False)
+
+     # move 'Rank' to the first column
+     cols = ['Rank'] + [col for col in df_leaderboard.columns if col != 'Rank']
+     df_leaderboard = df_leaderboard[cols]
+
+     # debug: log the final column order to the server console
+     print(df_leaderboard.columns)
+
+     st.dataframe(df_leaderboard)
+
+     st.download_button(
+         "Download leaderboard as .html",
+         df_leaderboard.to_html().encode("utf-8"),
+         "leaderboard.html",
+         "text/html",
+         key="download-html",
+     )
+
+     st.download_button(
+         "Download leaderboard as .csv",
+         df_leaderboard.to_csv().encode("utf-8"),
+         "leaderboard.csv",
+         "text/csv",
+         key="download-csv",
+     )
+
+
+ def setup_benchmarks(readme: str):
+     benchmarks_table = extract_markdown_table_from_multiline(readme, table_headline="## Benchmarks")
+     df_benchmarks = extract_table_and_format_from_markdown_text(benchmarks_table)
+
+     st.markdown("## Covered Benchmarks")
+
+     selected_benchmark = st.selectbox("Select a benchmark to learn more:", df_benchmarks.index.unique())
+     df_selected = df_benchmarks.loc[selected_benchmark]
+     text = [
+         f"Name: {selected_benchmark}",
+     ]
+     for key in df_selected.keys():
+         text.append(f"{key}: {df_selected[key]} ")
+     st.markdown(" \n".join(text))
+
+
+ def setup_sources():
+     st.markdown("## Sources")
+     st.markdown(
+         "The results of this leaderboard are collected from the individual papers and published results of the model "
+         "authors. If you are interested in the sources of each individual reported model value, please visit the "
+         f"[llm-leaderboard]({GITHUB_URL}) repository."
+     )
+
+
+ def setup_disclaimer():
+     st.markdown("## Authors")
+     st.markdown(
+         """
+ - [Andrea Bacciu](https://www.linkedin.com/in/andreabacciu/) (Work done prior to joining Amazon)
+ - [Cesare Campagnano](https://www.linkedin.com/in/caesar-one/)
+ - [Giovanni Trappolini](https://www.linkedin.com/in/giovanni-trappolini/)
+ - [Professor Fabrizio Silvestri](https://www.linkedin.com/in/fabrizio-silvestri-a6b0391/)
+ """
+     )
+
+     st.markdown("## Ack")
+     st.markdown(
+         "Special thanks to [llm-leaderboard](https://github.com/LudwigStumpp/llm-leaderboard) for the initial inspiration and codebase."
+     )
+
+
+ def setup_footer():
+     st.markdown(
+         """
+ ---
+ Made with ❤️ by the awesome open-source Italian community 🤌🇮🇹.
+ """
+     )
+
+
+ def main():
+     setup_basic()
+
+     with open("README.md", "r") as f:
+         readme = f.read()
+
+     setup_leaderboard(readme)
+     setup_benchmarks(readme)
+     setup_sources()
+     setup_disclaimer()
+     setup_footer()
+
+
+ if __name__ == "__main__":
+     main()
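For reference, here is a minimal sketch (not part of this commit) of how the parsing helpers above fit together outside of Streamlit. The toy README string and model names are purely illustrative, and it assumes main.py is importable from the working directory:

```python
# Illustrative only: parse a small "## Leaderboard" section the same way
# setup_leaderboard() does, but without any Streamlit UI.
from main import (
    extract_markdown_table_from_multiline,
    extract_table_and_format_from_markdown_text,
    remove_markdown_links,
)

toy_readme = """
## Leaderboard
| Model Name | Average |
|------------|---------|
| [ModelA](https://example.com/a) | 47.52 |
| [ModelB](https://example.com/b) | 44.29 |

## Benchmarks
"""

table_md = extract_markdown_table_from_multiline(toy_readme, table_headline="## Leaderboard")
table_md = remove_markdown_links(table_md)  # "[ModelA](https://example.com/a)" -> "ModelA"
df = extract_table_and_format_from_markdown_text(table_md)
print(df)  # index "Model Name": ModelA, ModelB; one numeric "Average" column
```

The same three calls are what setup_leaderboard() runs on the real README.md before handing the DataFrame to the Streamlit widgets.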
requirements.txt ADDED
@@ -0,0 +1,2 @@
+ pandas~=2.0.1
+ streamlit~=1.22.0