---
title: Italian Open LLM Leaderboard
emoji: 🏆
colorFrom: red
colorTo: green
sdk: streamlit
sdk_version: 1.34.0
app_file: main.py
pinned: true
license: apache-2.0
---

πŸ† Italian LLM-Leaderboard

An open leaderboard for evaluating large language models in Italian.
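
The metadata block above configures this Space as a Streamlit app (`sdk: streamlit`, `sdk_version: 1.34.0`) with `main.py` as its entry point. The snippet below is only a rough, illustrative sketch of such a page, not the actual `main.py`; the data file name and column names are assumptions made for the example.

```python
# Minimal illustrative sketch of a Streamlit leaderboard page.
# NOTE: this is NOT the actual main.py; "leaderboard.csv" and the "Avg." column
# name are assumptions made for the example.
import pandas as pd
import streamlit as st

st.set_page_config(page_title="Italian Open LLM Leaderboard", page_icon="🏆")
st.title("🏆 Italian LLM-Leaderboard")

# Load the scores (hypothetical CSV with the same columns as the table below).
df = pd.read_csv("leaderboard.csv")

# Show the table sorted by the overall average, best models first.
st.dataframe(df.sort_values("Avg.", ascending=False), use_container_width=True)
```

Locally, an app like this would be started with `streamlit run main.py` after installing the pinned SDK version (`pip install streamlit==1.34.0 pandas`).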

## Leaderboard

| Model Name | Year | Publisher | # Params | Lang. | Avg. | Avg. (0-shot) | Avg. (N-shot) | MMLU (0-shot) | MMLU (5-shot) | ARC-C (0-shot) | ARC-C (25-shot) | HellaSwag (0-shot) | HellaSwag (10-shot) | TruthfulQA (0-shot) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DanteLLM | 2023 | RSTLess (Sapienza University of Rome) | 7B | Italian FT | 47.52 | 47.34 | 47.69 | 47.05 | 48.27 | 41.89 | 47.01 | 47.99 | 47.79 | 52.41 |
| OpenDanteLLM | 2023 | RSTLess (Sapienza University of Rome) | 7B | Italian FT | 45.97 | 45.13 | 46.80 | 44.25 | 46.89 | 41.72 | 46.76 | 46.49 | 46.75 | 48.06 |
| Mistral v0.2 | 2023 | Mistral AI | 7B | English | 44.29 | 45.15 | 43.43 | 44.66 | 45.84 | 37.46 | 41.47 | 43.48 | 42.99 | 54.99 |
| LLaMAntino | 2024 | Bari University | 7B | Italian FT | 41.66 | 40.86 | 42.46 | 33.89 | 38.74 | 38.22 | 41.72 | 46.30 | 46.91 | 45.03 |
| Fauno2 | 2023 | RSTLess (Sapienza University of Rome) | 7B | Italian FT | 41.74 | 42.90 | 40.57 | 40.30 | 38.32 | 36.26 | 39.33 | 44.25 | 44.07 | 50.77 |
| Fauno1 | 2023 | RSTLess (Sapienza University of Rome) | 7B | Italian FT | 36.91 | 37.20 | 36.61 | 28.79 | 30.45 | 33.10 | 36.52 | 43.13 | 42.86 | 43.78 |
| Camoscio | 2023 | Gladia (Sapienza University of Rome) | 7B | Italian FT | 37.22 | 38.01 | 36.42 | 30.53 | 29.38 | 33.28 | 36.60 | 42.91 | 43.29 | 45.33 |
| LLaMA2 | 2022 | Meta | 7B | English | 39.50 | 39.14 | 39.86 | 34.12 | 37.91 | 33.28 | 37.71 | 44.31 | 43.97 | 44.83 |
| BloomZ | 2022 | BigScience | 7B | Multilingual | 33.97 | 36.01 | 31.93 | 36.40 | 31.67 | 27.30 | 28.24 | 34.83 | 35.88 | 45.52 |
| iT5 | 2022 | Groningen University | 738M | Italian | 29.27 | 32.42 | 26.11 | 23.69 | 24.31 | 27.39 | 27.99 | 28.11 | 26.04 | 50.49 |
| GePpeTto | 2020 | Pisa/Groningen University, FBK, Aptus.AI | 117M | Italian | 27.86 | 30.89 | 24.82 | 22.87 | 24.39 | 24.15 | 25.08 | 26.34 | 24.99 | 50.20 |
| mT5 | 2020 | Google | 3.7B | Multilingual | 29.00 | 30.99 | 27.01 | 25.56 | 25.60 | 25.94 | 27.56 | 26.96 | 27.86 | 45.50 |
| Minerva 3B | 2024 | SapienzaNLP (Sapienza University of Rome) | 3B | Multilingual | 33.94 | 34.37 | 33.52 | 24.62 | 26.50 | 30.29 | 30.89 | 42.38 | 43.16 | 40.18 |
| Minerva 1B | 2024 | SapienzaNLP (Sapienza University of Rome) | 1B | Multilingual | 29.78 | 31.46 | 28.09 | 24.69 | 24.94 | 24.32 | 25.25 | 34.01 | 34.07 | 42.84 |
| Minerva 350M | 2024 | SapienzaNLP (Sapienza University of Rome) | 350M | Multilingual | 28.35 | 30.72 | 26.00 | 23.10 | 24.29 | 23.21 | 24.32 | 29.33 | 29.37 | 47.23 |
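
The three average columns are consistent with a simple aggregation of the per-benchmark scores: Avg. (0-shot) is the mean of the four 0-shot columns, Avg. (N-shot) is the mean of the three few-shot columns, and Avg. is the mean of those two sub-averages. The sketch below is an illustrative reconstruction, not the leaderboard's own scoring code, and it reproduces the published rows up to rounding of the displayed values.

```python
# Illustrative reconstruction of the average columns; not the leaderboard's
# actual scoring code, but it matches the rows above up to display rounding.

def leaderboard_averages(mmlu_0, mmlu_5, arc_0, arc_25, hs_0, hs_10, tqa_0):
    """Return (avg_0shot, avg_nshot, avg_overall) from per-benchmark accuracies."""
    avg_0shot = (mmlu_0 + arc_0 + hs_0 + tqa_0) / 4  # all four 0-shot columns
    avg_nshot = (mmlu_5 + arc_25 + hs_10) / 3        # all three few-shot columns
    avg_overall = (avg_0shot + avg_nshot) / 2        # mean of the two sub-averages
    return avg_0shot, avg_nshot, avg_overall

# Mistral v0.2 row from the table above: compare with 45.15, 43.43 and 44.29.
a0, an, overall = leaderboard_averages(44.66, 45.84, 37.46, 41.47, 43.48, 42.99, 54.99)
print(f"0-shot: {a0:.2f}  N-shot: {an:.2f}  overall: {overall:.2f}")
```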

## Benchmarks

| Benchmark Name | Author | Link | Description |
|---|---|---|---|
| ARC Challenge | Clark et al. | https://arxiv.org/abs/1803.05457 | "We present a new question set, text corpus, and baselines assembled to encourage AI research in advanced question answering. Together, these constitute the AI2 Reasoning Challenge (ARC), which requires far more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI. The ARC question set is partitioned into a Challenge Set and an Easy Set, where the Challenge Set contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. The dataset contains only natural, grade-school science questions (authored for human tests), and is the largest public-domain set of this kind (7,787 questions). We test several baselines on the Challenge Set, including leading neural models from the SQuAD and SNLI tasks, and find that none are able to significantly outperform a random baseline, reflecting the difficult nature of this task. We are also releasing the ARC Corpus, a corpus of 14M science sentences relevant to the task, and implementations of the three neural baseline models tested. Can your model perform better? We pose ARC as a challenge to the community." |
| HellaSwag | Zellers et al. | https://arxiv.org/abs/1905.07830v1 | "HellaSwag is a challenge dataset for evaluating commonsense NLI that is specially hard for state-of-the-art models, though its questions are trivial for humans (>95% accuracy)." (Source: https://paperswithcode.com/dataset/hellaswag) |
| MMLU | Hendrycks et al. | https://github.com/hendrycks/test | "The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects makes the benchmark ideal for identifying a model’s blind spots." (Source: https://paperswithcode.com/dataset/mmlu) |
| TruthfulQA | Lin et al. | https://arxiv.org/abs/2109.07958 | "We propose a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. We crafted questions that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts. We tested GPT-3, GPT-Neo/J, GPT-2 and a T5-based model. The best model was truthful on 58% of questions, while human performance was 94%. Models generated many false answers that mimic popular misconceptions and have the potential to deceive humans. The largest models were generally the least truthful. This contrasts with other NLP tasks, where performance improves with model size. However, this result is expected if false answers are learned from the training distribution. We suggest that scaling up models alone is less promising for improving truthfulness than fine-tuning using training objectives other than imitation of text from the web." |
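
The links above point to the original English releases of each benchmark. As a hedged, illustrative example only, one of them can be inspected with the `datasets` library; the Hub dataset ID `allenai/ai2_arc` and its field names are assumptions about the hosted copy, and the leaderboard itself is assumed to evaluate models on Italian-language versions of these tasks.

```python
# Hedged example: peek at one of the benchmarks listed above.
# Assumptions: the Hub dataset "allenai/ai2_arc" with config "ARC-Challenge"
# hosts the original English release; the leaderboard presumably uses
# Italian-language versions of these tasks for its scores.
from datasets import load_dataset

arc_challenge = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")

example = arc_challenge[0]
print(example["question"])                      # grade-school science question
print(example["choices"]["text"])               # multiple-choice options
print("correct answer:", example["answerKey"])  # e.g. "A", "B", "C" or "D"
```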

## Authors

* Equal contribution.

## Acknowledgments

Special thanks to https://github.com/LudwigStumpp/llm-leaderboard for the initial inspiration and codebase.