metadata

title: Italian Open LLM Leaderboard
emoji: 🏆
colorFrom: red
colorTo: green
sdk: streamlit
sdk_version: 1.34.0
app_file: main.py
pinned: true
license: apache-2.0

🏆 Italian LLM-Leaderboard

Italian leaderboard

Leaderboard

Model Name	Year	Publisher	# Params	Lang.	Avg.	Avg. (0-shot)	Avg. (N-shot)	MMLU (0-shot)	MMLU (5-shot)	ARC-C (0-shot)	ARC-C (25-shot)	HellaSwag (0-shot)	HellaSwag (10-shot)	TruthfulQA (0-shot)
DanteLLM	2023	RSTLess (Sapienza University of Rome)	7B	Italian FT	47.52	47.34	47.69	47.05	48.27	41.89	47.01	47.99	47.79	52.41
OpenDanteLLM	2023	RSTLess (Sapienza University of Rome)	7B	Italian FT	45.97	45.13	46.80	44.25	46.89	41.72	46.76	46.49	46.75	48.06
Mistral v0.2	2023	Mistral AI	7B	English	44.29	45.15	43.43	44.66	45.84	37.46	41.47	43.48	42.99	54.99
LLaMAntino	2024	Bari University	7B	Italian FT	41.66	40.86	42.46	33.89	38.74	38.22	41.72	46.30	46.91	45.03
Fauno2	2023	RSTLess (Sapienza University of Rome)	7B	Italian FT	41.74	42.90	40.57	40.30	38.32	36.26	39.33	44.25	44.07	50.77
Fauno1	2023	RSTLess (Sapienza University of Rome)	7B	Italian FT	36.91	37.20	36.61	28.79	30.45	33.10	36.52	43.13	42.86	43.78
Camoscio	2023	Gladia (Sapienza University of Rome)	7B	Italian FT	37.22	38.01	36.42	30.53	29.38	33.28	36.60	42.91	43.29	45.33
LLaMA2	2023	Meta	7B	English	39.50	39.14	39.86	34.12	37.91	33.28	37.71	44.31	43.97	44.83
BloomZ	2022	BigScience	7B	Multilingual	33.97	36.01	31.93	36.40	31.67	27.30	28.24	34.83	35.88	45.52
iT5	2022	Groningen University	738M	Italian	29.27	32.42	26.11	23.69	24.31	27.39	27.99	28.11	26.04	50.49
GePpeTto	2020	Pisa/Groningen University, FBK, Aptus.AI	117M	Italian	27.86	30.89	24.82	22.87	24.39	24.15	25.08	26.34	24.99	50.20
mT5	2020	Google	3.7B	Multilingual	29.00	30.99	27.01	25.56	25.60	25.94	27.56	26.96	27.86	45.50
Minerva 3B	2024	SapienzaNLP (Sapienza University of Rome)	3B	Multilingual	33.94	34.37	33.52	24.62	26.50	30.29	30.89	42.38	43.16	40.18
Minerva 1B	2024	SapienzaNLP (Sapienza University of Rome)	1B	Multilingual	29.78	31.46	28.09	24.69	24.94	24.32	25.25	34.01	34.07	42.84
Minerva 350M	2024	SapienzaNLP (Sapienza University of Rome)	350M	Multilingual	28.35	30.72	26.00	23.10	24.29	23.21	24.32	29.33	29.37	47.23
Modello Italia	2024	iGenius	9B	Italian	41.22	40.89	41.67	39.76	41.01	34.81	39.16	43.97	44.85	45.01
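The three average columns look like they are derived from the per-benchmark scores. The sketch below is an inference from the table values, not code taken from the leaderboard itself. It assumes that Avg. (0-shot) is the mean of the four 0-shot scores, that Avg. (N-shot) is the mean of the three few-shot scores, and that Avg. is the mean of those two averages. The function name `summarize` is hypothetical. For the DanteLLM row this reproduces the published values to within rounding.

```python
from statistics import mean

# Per-benchmark scores for the DanteLLM row, copied from the table above.
ZERO_SHOT = {"MMLU": 47.05, "ARC-C": 41.89, "HellaSwag": 47.99, "TruthfulQA": 52.41}
N_SHOT = {"MMLU (5-shot)": 48.27, "ARC-C (25-shot)": 47.01, "HellaSwag (10-shot)": 47.79}


def summarize(zero_shot, n_shot):
    """Return (avg, avg_zero_shot, avg_n_shot) under the inferred formula."""
    avg_zero = mean(zero_shot.values())
    avg_n = mean(n_shot.values())
    # Assumption: the overall score weights the two settings equally,
    # rather than averaging all seven scores together.
    return (avg_zero + avg_n) / 2, avg_zero, avg_n


avg, avg_zero, avg_n = summarize(ZERO_SHOT, N_SHOT)
print(f"Avg. {avg:.2f} | 0-shot {avg_zero:.2f} | N-shot {avg_n:.2f}")
```

Under this assumption each 0-shot benchmark contributes 1/8 of the overall score and each few-shot benchmark 1/6, so the overall average is not the plain mean of the seven scores.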

Benchmarks

Benchmark Name	Author	Link	Description
ARC Challenge	Clark et al.	https://arxiv.org/abs/1803.05457	"We present a new question set, text corpus, and baselines assembled to encourage AI research in advanced question answering. Together, these constitute the AI2 Reasoning Challenge (ARC), which requires far more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI. The ARC question set is partitioned into a Challenge Set and an Easy Set, where the Challenge Set contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. The dataset contains only natural, grade-school science questions (authored for human tests), and is the largest public-domain set of this kind (7,787 questions). We test several baselines on the Challenge Set, including leading neural models from the SQuAD and SNLI tasks, and find that none are able to significantly outperform a random baseline, reflecting the difficult nature of this task. We are also releasing the ARC Corpus, a corpus of 14M science sentences relevant to the task, and implementations of the three neural baseline models tested. Can your model perform better? We pose ARC as a challenge to the community."
HellaSwag	Zellers et al.	https://arxiv.org/abs/1905.07830v1	"HellaSwag is a challenge dataset for evaluating commonsense NLI that is specially hard for state-of-the-art models, though its questions are trivial for humans (>95% accuracy)." (Source: https://paperswithcode.com/dataset/hellaswag)
MMLU	Hendrycks et al.	https://github.com/hendrycks/test	"The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects makes the benchmark ideal for identifying a model’s blind spots." (Source: https://paperswithcode.com/dataset/mmlu)
TruthfulQA	Lin et al.	https://arxiv.org/abs/2109.07958	"We propose a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. We crafted questions that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts. We tested GPT-3, GPT-Neo/J, GPT-2 and a T5-based model. The best model was truthful on 58% of questions, while human performance was 94%. Models generated many false answers that mimic popular misconceptions and have the potential to deceive humans. The largest models were generally the least truthful. This contrasts with other NLP tasks, where performance improves with model size. However, this result is expected if false answers are learned from the training distribution. We suggest that scaling up models alone is less promising for improving truthfulness than fine-tuning using training objectives other than imitation of text from the web."

Authors

Andrea Bacciu* (Work done prior to joining Amazon)
Cesare Campagnano*
Giovanni Trappolini
Prof. Fabrizio Silvestri

* Equal contribution.

Acknowledgements

Special thanks to https://github.com/LudwigStumpp/llm-leaderboard for the initial inspiration and codebase.