caesar-one committed
Commit dfb91d7 • 1 Parent(s): 416947d

First commit!

Files changed (3)
  1. README.md +47 -5
  2. main.py +355 -0
  3. requirements.txt +2 -0
README.md CHANGED
@@ -1,13 +1,55 @@
  ---
- title: Italian Open Llm Leaderboard
- emoji: 💻
  colorFrom: red
  colorTo: green
  sdk: streamlit
  sdk_version: 1.34.0
- app_file: app.py
- pinned: false
  license: apache-2.0
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: Italian Open LLM Leaderboard
+ emoji: 🤌
  colorFrom: red
  colorTo: green
  sdk: streamlit
  sdk_version: 1.34.0
+ app_file: main.py
+ pinned: True
  license: apache-2.0
  ---

+ # 🏆 Italian LLM-Leaderboard
+
+ A leaderboard comparing large language models for Italian, covering ARC Challenge, HellaSwag, MMLU, and TruthfulQA.
+
+ ## Leaderboard
+ | Model Name | Year | Publisher | Num. Parameters | Open? | Model Type | Average | Average (Zero-shot) | Average (N-shot) | ARC Challenge (zero-shot) | ARC Challenge (25-shot) | HellaSwag (zero-shot) | HellaSwag (10-shot) | MMLU (zero-shot) | MMLU (5-shot) | TruthfulQA (zero-shot MC2) |
+ |--------------------------------------------------------------------------------------------|------|-------------------------------------------|-----------------|-------|---------------|---------|---------------------|------------------|---------------------------|-------------------------|-----------------------|---------------------|------------------|---------------|----------------------------|
+ | [DanteLLM](https://huggingface.co/rstless-research/DanteLLM-7B-Instruct-Italian-v0.1-GGUF) | 2023 | RSTLess (Sapienza University of Rome) | 7B | yes | Italian FT | 47.52 | 47.34 | 47.69 | 41.89 | 47.01 | 47.99 | 47.79 | 47.05 | 48.27 | 52.41 |
+ | [OpenDanteLLM](https://huggingface.co/rstless-research/) | 2023 | RSTLess (Sapienza University of Rome) | 7B | yes | Italian FT | 45.97 | 45.13 | 46.80 | 41.72 | 46.76 | 46.49 | 46.75 | 44.25 | 46.89 | 48.06 |
+ | [Mistral v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) | 2023 | Mistral AI | 7B | yes | English | 44.29 | 45.15 | 43.43 | 37.46 | 41.47 | 43.48 | 42.99 | 44.66 | 45.84 | 54.99 |
+ | [LLaMAntino](https://huggingface.co/swap-uniba/LLaMAntino-2-7b-hf-ITA) | 2024 | Bari University | 7B | yes | Italian FT | 41.66 | 40.86 | 42.46 | 38.22 | 41.72 | 46.30 | 46.91 | 33.89 | 38.74 | 45.03 |
+ | [Fauno2](https://huggingface.co/andreabac3/Fauno2-LLaMa2-7B) | 2023 | RSTLess (Sapienza University of Rome) | 7B | yes | Italian FT | 41.74 | 42.90 | 40.57 | 36.26 | 39.33 | 44.25 | 44.07 | 40.30 | 38.32 | 50.77 |
+ | [Fauno1](https://huggingface.co/andreabac3/Fauno2-LLaMa2-7B) | 2023 | RSTLess (Sapienza University of Rome) | 7B | yes | Italian FT | 36.91 | 37.20 | 36.61 | 33.10 | 36.52 | 43.13 | 42.86 | 28.79 | 30.45 | 43.78 |
+ | [Camoscio](https://huggingface.co/teelinsan/camoscio-7b-llama) | 2023 | Gladia (Sapienza University of Rome) | 7B | yes | Italian FT | 37.22 | 38.01 | 36.42 | 33.28 | 36.60 | 42.91 | 43.29 | 30.53 | 29.38 | 45.33 |
+ | [LLaMA2](https://huggingface.co/meta-llama/Llama-2-7b) | 2023 | Meta | 7B | yes | English | 39.50 | 39.14 | 39.86 | 33.28 | 37.71 | 44.31 | 43.97 | 34.12 | 37.91 | 44.83 |
+ | [BloomZ](https://huggingface.co/bigscience/bloomz-7b1) | 2022 | BigScience | 7B | yes | Multilingual | 33.97 | 36.01 | 31.93 | 27.30 | 28.24 | 34.83 | 35.88 | 36.40 | 31.67 | 45.52 |
+ | [iT5](https://huggingface.co/gsarti/it5-large) | 2022 | Groningen University | 738M | yes | Italian | 29.27 | 32.42 | 26.11 | 27.39 | 27.99 | 28.11 | 26.04 | 23.69 | 24.31 | 50.49 |
+ | [GePpeTto](https://huggingface.co/LorenzoDeMattei/GePpeTto) | 2020 | Pisa/Groningen University, FBK, Aptus.AI | 117M | yes | Italian | 27.86 | 30.89 | 24.82 | 24.15 | 25.08 | 26.34 | 24.99 | 22.87 | 24.39 | 50.20 |
+ | [mT5](https://huggingface.co/google/mt5-large) | 2020 | Google | 3.7B | yes | Multilingual | 29.00 | 30.99 | 27.01 | 25.94 | 27.56 | 26.96 | 27.86 | 25.56 | 25.60 | 45.50 |
+ | [Minerva 3B](https://huggingface.co/sapienzanlp/Minerva-3B-base-v1.0) | 2024 | SapienzaNLP (Sapienza University of Rome) | 3B | yes | Multilingual | 33.94 | 34.37 | 33.52 | 30.29 | 30.89 | 42.38 | 43.16 | 24.62 | 26.50 | 40.18 |
+ | [Minerva 1B](https://huggingface.co/sapienzanlp/Minerva-1B-base-v1.0) | 2024 | SapienzaNLP (Sapienza University of Rome) | 1B | yes | Multilingual | 29.78 | 31.46 | 28.09 | 24.32 | 25.25 | 34.01 | 34.07 | 24.69 | 24.94 | 42.84 |
+ | [Minerva 350M](https://huggingface.co/sapienzanlp/Minerva-350M-base-v1.0) | 2024 | SapienzaNLP (Sapienza University of Rome) | 350M | yes | Multilingual | 28.35 | 30.72 | 26 | 23.21 | 24.32 | 29.33 | 29.37 | 23.10 | 24.29 | 47.23 |
+
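As a sanity check, the aggregate columns appear to follow a simple scheme: "Average (Zero-shot)" is the mean of the four zero-shot scores, "Average (N-shot)" is the mean of the three few-shot scores, and "Average" is the mean of those two sub-averages. A minimal sketch using the Mistral v0.2 row (the scheme is inferred from the table above, not documented in this repo):

```python
# Recompute the aggregate columns from the per-benchmark scores (Mistral v0.2 row above).
zero_shot = [37.46, 43.48, 44.66, 54.99]  # ARC, HellaSwag, MMLU, TruthfulQA (zero-shot)
n_shot = [41.47, 42.99, 45.84]            # ARC 25-shot, HellaSwag 10-shot, MMLU 5-shot

avg_zero_shot = sum(zero_shot) / len(zero_shot)  # ~45.15 -> "Average (Zero-shot)"
avg_n_shot = sum(n_shot) / len(n_shot)           # ~43.43 -> "Average (N-shot)"
avg_overall = (avg_zero_shot + avg_n_shot) / 2   # ~44.29 -> "Average"

print(round(avg_zero_shot, 2), round(avg_n_shot, 2), round(avg_overall, 2))
```

The other rows reproduce the same way, up to ±0.01 from rounding of the published scores.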
+ ## Benchmarks
+
+ | Benchmark Name | Author | Link | Description |
+ | ----------------- |------------------| ---------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+ | ARC Challenge | Clark et al. | https://arxiv.org/abs/1803.05457 | "We present a new question set, text corpus, and baselines assembled to encourage AI research in advanced question answering. Together, these constitute the AI2 Reasoning Challenge (ARC), which requires far more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI. The ARC question set is partitioned into a Challenge Set and an Easy Set, where the Challenge Set contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. The dataset contains only natural, grade-school science questions (authored for human tests), and is the largest public-domain set of this kind (7,787 questions). We test several baselines on the Challenge Set, including leading neural models from the SQuAD and SNLI tasks, and find that none are able to significantly outperform a random baseline, reflecting the difficult nature of this task. We are also releasing the ARC Corpus, a corpus of 14M science sentences relevant to the task, and implementations of the three neural baseline models tested. Can your model perform better? We pose ARC as a challenge to the community." |
+ | HellaSwag | Zellers et al. | https://arxiv.org/abs/1905.07830v1 | "HellaSwag is a challenge dataset for evaluating commonsense NLI that is specially hard for state-of-the-art models, though its questions are trivial for humans (>95% accuracy)." (Source: https://paperswithcode.com/dataset/hellaswag) |
+ | MMLU | Hendrycks et al. | https://github.com/hendrycks/test | "The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects makes the benchmark ideal for identifying a modelโ€™s blind spots." (Source: "https://paperswithcode.com/dataset/mmlu") |
+ | TruthfulQA | Lin et al. | https://arxiv.org/abs/2109.07958 | "We propose a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. We crafted questions that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts. We tested GPT-3, GPT-Neo/J, GPT-2 and a T5-based model. The best model was truthful on 58% of questions, while human performance was 94%. Models generated many false answers that mimic popular misconceptions and have the potential to deceive humans. The largest models were generally the least truthful. This contrasts with other NLP tasks, where performance improves with model size. However, this result is expected if false answers are learned from the training distribution. We suggest that scaling up models alone is less promising for improving truthfulness than fine-tuning using training objectives other than imitation of text from the web." |
+
+
+ ## Authors
+ - [Andrea Bacciu](https://www.linkedin.com/in/andreabacciu/)* (Work done prior to joining Amazon)
+ - [Cesare Campagnano](https://www.linkedin.com/in/caesar-one/)*
+ - [Giovanni Trappolini](https://www.linkedin.com/in/giovanni-trappolini/)
+ - [Professor Fabrizio Silvestri](https://www.linkedin.com/in/fabrizio-silvestri-a6b0391/)
+
+ \* Equal contribution.
+
+ ### Ack
+ Special thanks to https://github.com/LudwigStumpp/llm-leaderboard for the initial inspiration and codebase.
+
main.py ADDED
@@ -0,0 +1,355 @@
+ import io
+ import re
+ from typing import Iterable, List
+
+ import pandas as pd
+ import streamlit as st
+ from pandas.api.types import is_bool_dtype, is_datetime64_any_dtype, is_numeric_dtype
+
+ GITHUB_URL = "https://github.com/RSTLess-research/"
+ NON_BENCHMARK_COLS = ["Open?", "Publisher"]
+
+
+ def extract_table_and_format_from_markdown_text(markdown_table: str) -> pd.DataFrame:
+     """Extracts a table from a markdown text and formats it as a pandas DataFrame.
+
+     Args:
+         markdown_table (str): Markdown text containing a table.
+
+     Returns:
+         pd.DataFrame: Table as pandas DataFrame.
+     """
+     df = (
+         pd.read_table(io.StringIO(markdown_table), sep="|", header=0, index_col=1)
+         .dropna(axis=1, how="all")  # drop empty columns
+         .iloc[1:]  # drop first row, which is the "----" separator of the original markdown table
+         .sort_index(ascending=True)
+         .apply(lambda x: x.str.strip() if x.dtype == "object" else x)
+         .replace("", float("NaN"))
+         .astype(float, errors="ignore")
+     )
+
+     # remove whitespace from column names and index
+     df.columns = df.columns.str.strip()
+     df.index = df.index.str.strip()
+     df.index.name = df.index.name.strip()
+
+     return df
+
+
+ def extract_markdown_table_from_multiline(multiline: str, table_headline: str, next_headline_start: str = "#") -> str:
+     """Extracts the markdown table from a multiline string.
+
+     Args:
+         multiline (str): Content of the README.md file.
+         table_headline (str): Headline of the table in the README.md file.
+         next_headline_start (str, optional): Start of the next headline. Defaults to "#".
+
+     Returns:
+         str: Markdown table.
+
+     Raises:
+         ValueError: If the table could not be found.
+     """
+     # extract everything between the table headline and the next headline
+     table = []
+     start = False
+     for line in multiline.split("\n"):
+         if line.startswith(table_headline):
+             start = True
+         elif line.startswith(next_headline_start):
+             start = False
+         elif start:
+             table.append(line + "\n")
+
+     if len(table) == 0:
+         raise ValueError(f"Could not find table with headline '{table_headline}'")
+
+     return "".join(table)
+
+
+ def remove_markdown_links(text: str) -> str:
+     """Modifies a markdown text to remove all markdown links.
+
+     Example: [DISPLAY](LINK) becomes DISPLAY.
+
+     Args:
+         text (str): Markdown text containing markdown links.
+
+     Returns:
+         str: Markdown text without markdown links.
+     """
+     # find all markdown links
+     markdown_links = re.findall(r"\[([^\]]+)\]\(([^)]+)\)", text)
+
+     # remove each link, keeping only the display text
+     for display, link in markdown_links:
+         text = text.replace(f"[{display}]({link})", display)
+
+     return text
+
+
+ def filter_dataframe_by_model_type(df: pd.DataFrame, model_type_column: str = 'Model Type', ignore_columns: List[str] = None) -> pd.DataFrame:
+     """
+     Filter the dataframe rows by model type and restrict the columns to a user selection.
+
+     This function provides a user interface to select model types and columns for filtering a DataFrame.
+     Model types are dynamically derived from the column specified as 'model_type_column'.
+
+     Args:
+         df (pd.DataFrame): Original dataframe.
+         model_type_column (str): Column name that contains model types for filtering.
+         ignore_columns (List[str], optional): Columns to hide from the column selection. Defaults to None.
+
+     Returns:
+         pd.DataFrame: Filtered dataframe.
+     """
+     df = df.copy()
+
+     if ignore_columns is None:
+         ignore_columns = []
+
+     # Streamlit UI container
+     modification_container = st.container()
+
+     with modification_container:
+         # Selection for model types
+         unique_model_types = sorted(df[model_type_column].unique())
+         selected_model_types = st.multiselect("Filter by model type:", unique_model_types)
+
+         # Filter dataframe by selected model types
+         if selected_model_types:
+             df = df[df[model_type_column].isin(selected_model_types)]
+
+         # Column selection excluding the model type column and any specified ignore columns
+         valid_columns = sorted(set(df.columns) - set(ignore_columns) - {model_type_column})
+         selected_columns = st.multiselect("Filter by columns:", valid_columns)
+
+         # Keep only the selected columns plus the model type column
+         if selected_columns:
+             df = pd.DataFrame(df[[model_type_column] + selected_columns])
+
+     return df
+
+
+ def filter_dataframe_by_row_and_columns(df: pd.DataFrame, ignore_columns: List[str] = None) -> pd.DataFrame:
+     """
+     Filter dataframe by the rows and columns to display.
+
+     This does not select based on the values in the dataframe, but rather on the index and columns.
+     Modified from https://blog.streamlit.io/auto-generate-a-dataframe-filtering-ui-in-streamlit-with-filter_dataframe/
+
+     Args:
+         df (pd.DataFrame): Original dataframe.
+         ignore_columns (List[str], optional): Columns to ignore. Defaults to None.
+
+     Returns:
+         pd.DataFrame: Filtered dataframe.
+     """
+     df = df.copy()
+
+     if ignore_columns is None:
+         ignore_columns = []
+
+     modification_container = st.container()
+
+     with modification_container:
+         to_filter_index = st.multiselect("Filter by model:", sorted(df.index))
+         if to_filter_index:
+             df = pd.DataFrame(df.loc[to_filter_index])
+
+         to_filter_columns = st.multiselect(
+             "Filter by benchmark:", sorted([c for c in df.columns if c not in ignore_columns])
+         )
+         if to_filter_columns:
+             df = pd.DataFrame(df[ignore_columns + to_filter_columns])
+
+     return df
+
+
+ def filter_dataframe_by_column_values(df: pd.DataFrame) -> pd.DataFrame:
+     """
+     Filter dataframe by the values in the dataframe.
+
+     Modified from https://blog.streamlit.io/auto-generate-a-dataframe-filtering-ui-in-streamlit-with-filter_dataframe/
+
+     Args:
+         df (pd.DataFrame): Original dataframe.
+
+     Returns:
+         pd.DataFrame: Filtered dataframe.
+     """
+     df = df.copy()
+
+     modification_container = st.container()
+
+     with modification_container:
+         to_filter_columns = st.multiselect("Filter results on:", df.columns)
+         left, right = st.columns((1, 20))
+
+         for column in to_filter_columns:
+             # boolean columns: single checkbox
+             if is_bool_dtype(df[column]):
+                 user_bool_input = right.checkbox(f"{column}", value=True)
+                 df = df[df[column] == user_bool_input]
+
+             # numeric columns: range slider
+             elif is_numeric_dtype(df[column]):
+                 _min = float(df[column].min())
+                 _max = float(df[column].max())
+
+                 if (_min != _max) and pd.notna(_min) and pd.notna(_max):
+                     step = 0.01
+                     user_num_input = right.slider(
+                         f"Values for {column}:",
+                         min_value=round(_min - step, 2),
+                         max_value=round(_max + step, 2),
+                         value=(_min, _max),
+                         step=step,
+                     )
+                     df = df[df[column].between(*user_num_input)]
+
+             # datetime columns: date range picker
+             elif is_datetime64_any_dtype(df[column]):
+                 user_date_input = right.date_input(
+                     f"Values for {column}:",
+                     value=(
+                         df[column].min(),
+                         df[column].max(),
+                     ),
+                 )
+                 if isinstance(user_date_input, Iterable) and len(user_date_input) == 2:
+                     user_date_input_datetime = tuple(map(pd.to_datetime, user_date_input))
+                     start_date, end_date = user_date_input_datetime
+                     df = df.loc[df[column].between(start_date, end_date)]
+
+             # everything else (strings / categories): multiselect over unique values
+             else:
+                 selected_values = right.multiselect(
+                     f"Values for {column}:",
+                     sorted(df[column].unique()),
+                 )
+
+                 if selected_values:
+                     df = df[df[column].isin(selected_values)]
+
+     return df
+
+
+ def setup_basic():
+     title = "🏆🇮🇹🤌 Italian LLM-Leaderboard"
+
+     st.set_page_config(
+         page_title=title,
+         page_icon="🏆🇮🇹🤌",
+         layout="wide",
+     )
+     st.title(title)
+
+     st.markdown(
+         "The Italian LLM Leaderboard, published along with the paper DanteLLM: Let's Push Italian LLM Research Forward! 🤌🇮🇹🏆 (LREC-COLING 2024).\n"
+     )
+
+
+ def setup_leaderboard(readme: str):
+     leaderboard_table = extract_markdown_table_from_multiline(readme, table_headline="## Leaderboard")
+     leaderboard_table = remove_markdown_links(leaderboard_table)
+     df_leaderboard = extract_table_and_format_from_markdown_text(leaderboard_table)
+     df_leaderboard["Open?"] = df_leaderboard["Open?"].map({"yes": 1, "no": 0}).astype(bool)
+
+     st.markdown("## Leaderboard")
+     modify = st.checkbox("Add filters")
+
+     if modify:
+         df_leaderboard = filter_dataframe_by_row_and_columns(df_leaderboard, ignore_columns=NON_BENCHMARK_COLS)
+         df_leaderboard = filter_dataframe_by_column_values(df_leaderboard)
+         df_leaderboard = filter_dataframe_by_model_type(df_leaderboard)
+
+     df_leaderboard = df_leaderboard.sort_values(by=['Average'], ascending=False)
+     df_leaderboard["Rank"] = df_leaderboard["Average"].rank(ascending=False)
+
+     # move 'Rank' to the first column
+     cols = ['Rank'] + [col for col in df_leaderboard.columns if col != 'Rank']
+     df_leaderboard = df_leaderboard[cols]
+
+     # debug: log the final column order to the server console
+     print(df_leaderboard.columns)
+
+     st.dataframe(df_leaderboard)
+
+     st.download_button(
+         "Download leaderboard as .html",
+         df_leaderboard.to_html().encode("utf-8"),
+         "leaderboard.html",
+         "text/html",
+         key="download-html",
+     )
+
+     st.download_button(
+         "Download leaderboard as .csv",
+         df_leaderboard.to_csv().encode("utf-8"),
+         "leaderboard.csv",
+         "text/csv",
+         key="download-csv",
+     )
+
+
+ def setup_benchmarks(readme: str):
+     benchmarks_table = extract_markdown_table_from_multiline(readme, table_headline="## Benchmarks")
+     df_benchmarks = extract_table_and_format_from_markdown_text(benchmarks_table)
+
+     st.markdown("## Covered Benchmarks")
+
+     selected_benchmark = st.selectbox("Select a benchmark to learn more:", df_benchmarks.index.unique())
+     df_selected = df_benchmarks.loc[selected_benchmark]
+     text = [
+         f"Name: {selected_benchmark}",
+     ]
+     for key in df_selected.keys():
+         text.append(f"{key}: {df_selected[key]} ")
+     st.markdown(" \n".join(text))
+
+
+ def setup_sources():
+     st.markdown("## Sources")
+     st.markdown(
+         "The results of this leaderboard are collected from the individual papers and published results of the model "
+         "authors. If you are interested in the sources of each individual reported model value, please visit the "
+         f"[llm-leaderboard]({GITHUB_URL}) repository."
+     )
+
+
+ def setup_disclaimer():
+     st.markdown("## Authors")
+     st.markdown(
+         """
+ - [Andrea Bacciu](https://www.linkedin.com/in/andreabacciu/) (Work done prior to joining Amazon)
+ - [Cesare Campagnano](https://www.linkedin.com/in/caesar-one/)
+ - [Giovanni Trappolini](https://www.linkedin.com/in/giovanni-trappolini/)
+ - [Professor Fabrizio Silvestri](https://www.linkedin.com/in/fabrizio-silvestri-a6b0391/)
+ """
+     )
+
+     st.markdown("## Ack")
+     st.markdown(
+         "Special thanks to [llm-leaderboard](https://github.com/LudwigStumpp/llm-leaderboard) for the initial inspiration and codebase."
+     )
+
+
+ def setup_footer():
+     st.markdown(
+         """
+ ---
+ Made with ❤️ by the awesome open-source Italian community 🤌🇮🇹.
+ """
+     )
+
+
+ def main():
+     setup_basic()
+
+     with open("README.md", "r") as f:
+         readme = f.read()
+
+     setup_leaderboard(readme)
+     setup_benchmarks(readme)
+     setup_sources()
+     setup_disclaimer()
+     setup_footer()
+
+
+ if __name__ == "__main__":
+     main()
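For reference, here is a minimal sketch (not part of this commit) of how the parsing helpers above fit together outside of Streamlit. The toy README string and model names are purely illustrative, and it assumes main.py is importable from the working directory:

```python
# Illustrative only: parse a small "## Leaderboard" section the same way
# setup_leaderboard() does, but without any Streamlit UI.
from main import (
    extract_markdown_table_from_multiline,
    extract_table_and_format_from_markdown_text,
    remove_markdown_links,
)

toy_readme = """
## Leaderboard
| Model Name | Average |
|------------|---------|
| [ModelA](https://example.com/a) | 47.52 |
| [ModelB](https://example.com/b) | 44.29 |

## Benchmarks
"""

table_md = extract_markdown_table_from_multiline(toy_readme, table_headline="## Leaderboard")
table_md = remove_markdown_links(table_md)  # "[ModelA](https://example.com/a)" -> "ModelA"
df = extract_table_and_format_from_markdown_text(table_md)
print(df)  # index "Model Name": ModelA, ModelB; one numeric "Average" column
```

The same three calls are what setup_leaderboard() runs on the real README.md before handing the DataFrame to the Streamlit widgets.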
requirements.txt ADDED
@@ -0,0 +1,2 @@
+ pandas~=2.0.1
+ streamlit~=1.22.0