from dataclasses import dataclass
from enum import Enum


@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    # task0 = Task("mmmlu", "acc", "MMMLU")
    # task1 = Task("mmlu", "acc", "MMLU")
    # task2 = Task("cmmlu", "acc", "CMMLU")
    mmmlu_ar = Task("mmmlu_ar", "acc", "AR")
    mmmlu_bn = Task("mmmlu_bn", "acc", "BN")
    mmmlu_de = Task("mmmlu_de", "acc", "DE")
    mmmlu_es = Task("mmmlu_es", "acc", "ES")
    mmmlu_fr = Task("mmmlu_fr", "acc", "FR")
    mmmlu_hi = Task("mmmlu_hi", "acc", "HI")
    mmmlu_id = Task("mmmlu_id", "acc", "ID")
    mmmlu_it = Task("mmmlu_it", "acc", "IT")
    mmmlu_ja = Task("mmmlu_ja", "acc", "JA")
    mmmlu_ko = Task("mmmlu_ko", "acc", "KO")
    mmmlu_pt = Task("mmmlu_pt", "acc", "PT")
    mmmlu_sw = Task("mmmlu_sw", "acc", "SW")
    mmmlu_yo = Task("mmmlu_yo", "acc", "YO")
    mmmlu_zh = Task("mmmlu_zh", "acc", "ZH")
NUM_FEWSHOT = 5  # Change to your few-shot setting
# ---------------------------------------------------
# Your leaderboard name
TITLE = """<img src="https://raw.githubusercontent.com/BobTsang1995/Multilingual-MMLU-Benchmark-Leaderboard/main/static/title/title.png" style="width:30%;display:block;margin-left:auto;margin-right:auto;border-radius:15px;">"""
# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
🌍 **Multilingual MMLU Benchmark Leaderboard:** This leaderboard is dedicated to evaluating and comparing the multilingual capabilities of large language models across different languages and cultures.
🔬 MMMLU Dataset: The dataset used for evaluation is the [OpenAI MMMLU Benchmark](https://huggingface.co/datasets/openai/MMMLU), which spans 57 categories, from elementary-level knowledge to advanced professional subjects such as law, physics, history, and computer science. MMMLU covers 14 languages: AR_XY (Arabic), BN_BD (Bengali), DE_DE (German), ES_LA (Spanish), FR_FR (French), HI_IN (Hindi), ID_ID (Indonesian), IT_IT (Italian), JA_JP (Japanese), KO_KR (Korean), PT_BR (Brazilian Portuguese), SW_KE (Swahili), YO_NG (Yoruba), ZH_CN (Simplified Chinese).
🎯 Our Goal is to raise awareness of the importance of improving LLM performance across languages, with a particular focus on cultural contexts. We strive to make LLMs more inclusive and effective for users worldwide.
"""
INTRODUCTION_TEXT_ZH = """
**Multilingual MMLU Benchmark Leaderboard:** This is an open evaluation leaderboard that assesses how open-source and closed-source language models perform on multilingual MMLU benchmarks, covering memorization, reasoning, and language ability. It brings together multiple MMLU datasets that were originally created in, or manually translated into, many languages, aiming to comprehensively evaluate the multilingual understanding of large language models.
"""
# Which evaluations are you running? how can people reproduce what you have?
# TODO: Update number of benchmarks
LLM_BENCHMARKS_TEXT = """
## 💡 About the "Multilingual MMLU Benchmark Leaderboard"
### Overview
The **Multilingual Massive Multitask Language Understanding (MMMLU)** benchmark is a comprehensive evaluation platform designed to assess the general knowledge capabilities of AI models across a wide range of domains. It includes a series of **Question Answering (QA)** tasks across **57 distinct domains**, ranging from elementary-level knowledge to advanced professional subjects such as law, physics, history, and computer science.
### Translation Effort
For this evaluation, we used the **OpenAI MMMLU dataset**, which has been extensively curated and tested for evaluating the multilingual understanding of AI models. The dataset includes 14 different languages and is specifically designed to assess how well AI models handle a wide range of general knowledge tasks across 57 domains.
The translation of the test set was performed by OpenAI, which ensures a high level of accuracy and reliability for evaluating multilingual models. By leveraging this pre-existing, professionally curated dataset, we can focus on model performance across multiple languages without the need for additional translations on our side.
### Commitment to Multilingual AI
By focusing on human-powered translations and publishing both the translated test sets and evaluation code, we aim to promote the development of AI models that can handle multilingual tasks with greater accuracy. This reflects our commitment to improving AI’s performance in underrepresented languages and making technology more inclusive and effective globally.
### Locales Covered
The MMMLU benchmark includes a test set translated into the following locales:
- **AR_XY**: Arabic
- **BN_BD**: Bengali
- **DE_DE**: German
- **ES_LA**: Spanish
- **FR_FR**: French
- **HI_IN**: Hindi
- **ID_ID**: Indonesian
- **IT_IT**: Italian
- **JA_JP**: Japanese
- **KO_KR**: Korean
- **PT_BR**: Brazilian Portuguese
- **SW_KE**: Swahili
- **YO_NG**: Yoruba
- **ZH_CN**: Simplified Chinese
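To inspect the underlying data yourself, the sketch below loads a single locale; it assumes the Hugging Face subset names match the locale codes listed above:
```python
from datasets import load_dataset

# MMMLU ships a test split per locale; "FR_FR" is used here only as an example.
mmmlu_fr = load_dataset("openai/MMMLU", "FR_FR", split="test")
print(len(mmmlu_fr))  # number of questions in the French test set
print(mmmlu_fr[0])    # one record with the question, options, and answer
```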
### Purpose
The MMMLU Leaderboard aims to provide a unified benchmark for comparing AI model performance across these multiple languages and diverse domains. With the inclusion of the **QA task** across **57 domains**, it evaluates how well models perform in answering general knowledge questions in multiple languages, ensuring a high standard of multilingual understanding and reasoning.
### Goals
Our primary goal is to provide a reliable comparison for AI models across different languages and domains, helping developers and researchers evaluate and improve their models’ multilingual capabilities. By emphasizing high-quality translations and including a broad range of topics, we strive to make AI models more robust and useful across diverse communities worldwide.
### 🤗 How it works
Submit a model for automated evaluation on our clusters via the "Submit here" tab!
### 📈 Tasks
We evaluate models on a variety of key benchmarks, with a focus on **Multilingual Massive Multitask Language Understanding (MMLU)** and its variants, including MMLU, C-MMLU, ArabicMMLU, KoreanMMLU, MalayMMLU, and others. These benchmarks assess general knowledge across a wide range of topics from 57 categories, such as law, physics, history, and computer science.
The evaluation is performed using the [OpenCompass framework](https://github.com/open-compass/opencompass), a unified platform for evaluating language models across multiple tasks. OpenCompass allows us to execute these evaluations efficiently and at scale, covering multiple languages and benchmarks.
For detailed information on the tasks, please refer to the "Tasks" tab in the OpenCompass framework.
Notes:
- The evaluations are all 5-shot.
- Results are aggregated by calculating the average of all the tasks for a given language.
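For concreteness, here is a minimal sketch of this aggregation, assuming a per-model results JSON keyed by task name (e.g. `mmmlu_fr`); the exact file layout used by the leaderboard backend may differ:
```python
import json
from collections import defaultdict

def language_averages(results_path):
    # Assumed layout: {"results": {"mmmlu_fr": {"acc": 0.71}, "mmmlu_de": {"acc": 0.69}, ...}}
    with open(results_path) as f:
        results = json.load(f)["results"]
    buckets = defaultdict(list)
    for task_name, metrics in results.items():
        language = task_name.rsplit("_", 1)[-1].upper()  # "mmmlu_fr" -> "FR"
        buckets[language].append(metrics["acc"])
    # Per-language score = mean accuracy over all tasks for that language.
    return {lang: sum(scores) / len(scores) for lang, scores in buckets.items()}
```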
### 🔎 Results
You can find:
- Detailed numerical results in the [results dataset](https://huggingface.co/datasets/StarscreamDeceptions/results)
- Community queries and running status in the [requests dataset](https://huggingface.co/datasets/StarscreamDeceptions/requests)
### ✅ Reproducibility
To reproduce the results, use [our fork of OpenCompass](https://github.com/BobTsang1995/opencompass), as not all of our PRs have been merged into the main repository yet. Because many open-source models cannot fully follow instructions on QA tasks, we post-process the outputs with Qwen2.5-7B-Instruct to extract the chosen answer (option A, B, C, or D) from each model's raw output. This is a relatively simple task, so the model's intended answer can generally be recovered.
```
git clone git@github.com:BobTsang1995/opencompass.git
cd opencompass
pip install -e .
pip install lmdeploy
python run.py --models lmdeploy_qwen2_7b_instruct --datasets mmmlu_gen_5_shot -a lmdeploy
```
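The extraction script itself is not bundled with this Space; the sketch below only illustrates the post-processing idea (the prompt wording, regex, and generation settings are assumptions, not the exact ones we use):
```python
import re
from transformers import pipeline

# Fallback judge used when the option letter cannot be read off the raw output directly.
extractor = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct", device_map="auto")

def extract_choice(model_output):
    # Naive fast path: take the first option letter appearing in the raw output.
    letters = re.findall("[ABCD]", model_output)
    if letters:
        return letters[0]
    # Otherwise ask the judge model to name the chosen option.
    prompt = (
        "Below is a model's answer to a multiple-choice question. "
        "Reply with only the letter of the option it chose (A, B, C, or D). Answer: "
        + model_output
    )
    reply = extractor(prompt, max_new_tokens=4, do_sample=False)[0]["generated_text"]
    found = re.findall("[ABCD]", reply[len(prompt):])
    return found[0] if found else ""
```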
## 🙌 Acknowledgements
This leaderboard was independently developed as a non-profit initiative with the support of several academic institutions, which provided valuable assistance to make this effort possible. We extend our heartfelt gratitude to these institutions for their support.
- [Technische Universität München (TUM)](https://www.tum.de/)
- [Tsinghua University](https://www.tsinghua.edu.cn/en/)
- [Universiteit van Amsterdam](https://uva.nl/)
- [Mohamed Bin Zayed University of Artificial Intelligence](https://mbzuai.ac.ae/)
- [University of Macau](https://www.um.edu.mo/)
- [Cardiff University](https://www.cardiff.ac.uk/)
- [Nara Institute of Science and Technology](https://www.naist.jp/en/)
- [Shanghai Jiao Tong University](https://en.sjtu.edu.cn/)
- [Dublin City University](https://www.dcu.ie/)
- [Université Grenoble Alpes](https://www.univ-grenoble-alpes.fr/)
- [Universidade de Coimbra](https://www.uc.pt/)
- [The Ohio State University](https://www.osu.edu/)
- [RMIT University](https://www.rmit.edu.au/)
The entities above are ordered chronologically by the date they joined the project. However, the logos in the footer are ordered by the number of datasets donated.
Thank you in particular to:
Yi Zhou (Cardiff University), Yusuke Sakai (Nara Institute of Science and Technology), Yongxin Zhou (Université Grenoble Alpes), Haonan Li (MBZUAI), Jiahui Geng (MBZUAI), Qing Li (MBZUAI), Wenxi Li (Tsinghua University / Shanghai Jiao Tong University), Yuanyu Lin (University of Macau), Andy Way (Dublin City University), Zhuang Li (RMIT University), Zhongwei Wan (The Ohio State University), Di Wu (University of Amsterdam), Wen Lai (Technical University of Munich, TUM)
For information about the dataset authors please check the corresponding Dataset Cards (linked in the "Tasks" tab) and papers (included in the "Citation" section below). We would like to specially thank the teams that created or open-sourced their datasets specifically for the leaderboard (in chronological order):
- [MMMLU](https://huggingface.co/datasets/openai/MMMLU): OpenAI
We also thank MacroPolo Team, Alibaba International Digital Commerce for sponsoring the inference GPUs.
## 🚀 Collaborate!
We would like to create a leaderboard as diverse as possible; please reach out if you would like us to include your evaluation dataset!
Comments and suggestions are more than welcome! Visit the [👏 Multilingual-MMLU-Benchmark-Leaderboard discussions](https://huggingface.co/spaces/StarscreamDeceptions/Multilingual-MMLU-Benchmark-Leaderboard/discussions) page, tell us what you think about the MMMLU Leaderboard and how we can improve it, or go ahead and open a PR!
Thank you very much! 💛
"""
LLM_BENCHMARKS_TEXT_ZH = """
## 💡 About the "Multilingual MMLU Benchmark Leaderboard"
- Press releases: [TBD - XXX](#), [TBD - XXX](#), [TBD - XXX](#), [TBD - XXX](#)
- YouTube: [TBD - XXX](#)
### Overview
The **Multilingual Massive Multitask Language Understanding (MMMLU)** benchmark is a comprehensive evaluation platform designed to assess the general knowledge capabilities of AI models across a wide range of domains. It includes a series of **Question Answering (QA)** tasks across **57 distinct domains**, ranging from elementary-level knowledge to advanced professional subjects such as law, physics, history, and computer science.
### Translation Effort
For this evaluation, we used the **OpenAI MMMLU dataset**, which has been extensively curated and tested for evaluating the multilingual understanding of AI models. The dataset includes 14 different languages and is specifically designed to assess how well AI models handle a wide range of general knowledge tasks across 57 domains.
The translation of the test set was performed by OpenAI, which ensures a high level of accuracy and reliability for evaluating multilingual models. By leveraging this pre-existing, professionally curated dataset, we can focus on model performance across multiple languages without the need for additional translations on our side.
### Commitment to Multilingual AI
By focusing on human-powered translations and publishing both the translated test sets and the evaluation code, we aim to promote the development of AI models that handle multilingual tasks with greater accuracy. This reflects our commitment to improving AI performance in underrepresented languages and to making technology more inclusive and effective globally.
### Locales Covered
The MMMLU benchmark includes a test set translated into the following locales:
- **AR_XY**: Arabic
- **BN_BD**: Bengali
- **DE_DE**: German
- **ES_LA**: Spanish
- **FR_FR**: French
- **HI_IN**: Hindi
- **ID_ID**: Indonesian
- **IT_IT**: Italian
- **JA_JP**: Japanese
- **KO_KR**: Korean
- **PT_BR**: Brazilian Portuguese
- **SW_KE**: Swahili
- **YO_NG**: Yoruba
- **ZH_CN**: Simplified Chinese
### Purpose
The MMMLU Leaderboard aims to provide a unified benchmark for comparing AI model performance across these languages and diverse domains. With the inclusion of the **QA task** across **57 domains**, it evaluates how well models answer general knowledge questions in multiple languages, ensuring a high standard of multilingual understanding and reasoning.
### Goals
Our primary goal is to provide a reliable comparison of AI models across different languages and domains, helping developers and researchers evaluate and improve their models' multilingual capabilities. By emphasizing high-quality translations and including a broad range of topics, we strive to make AI models more robust and useful across diverse communities worldwide.
### 🤗 How it works
Submit a model for automated evaluation via the "Submit here" tab!
### 📈 Tasks
We evaluate models on a variety of key benchmarks, with a focus on **Multilingual Massive Multitask Language Understanding (MMLU)** and its variants, including MMLU, C-MMLU, ArabicMMLU, KoreanMMLU, MalayMMLU, and others. These benchmarks assess general knowledge across 57 categories, such as law, physics, history, and computer science.
The evaluation is performed with the [OpenCompass framework](https://github.com/open-compass/opencompass), a unified platform for evaluating language models across multiple tasks. OpenCompass allows us to run these evaluations efficiently and at scale, covering multiple languages and benchmarks.
For detailed information on the tasks, please refer to the "Tasks" tab in the OpenCompass framework.
Notes:
- All evaluations are 5-shot.
- Results are normalized with the formula `normalized_value = (raw_value - random_baseline) / (max_value - random_baseline)`, where `random_baseline` is `0` for generative tasks and `1/n` for multiple-choice QA (with `n` the number of choices).
- Results are aggregated by averaging all the tasks for a given language.
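As a worked example of the normalization note above (hypothetical numbers):
```python
# Multiple-choice QA with n = 4 options and a raw accuracy of 0.62.
raw_value, max_value, n = 0.62, 1.0, 4
random_baseline = 1 / n  # 0.25: expected accuracy of random guessing
normalized_value = (raw_value - random_baseline) / (max_value - random_baseline)
print(round(normalized_value, 3))  # 0.493
```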
### 🔎 Results
You can find:
- Detailed numerical results in the [results dataset](link_to_results)
- Community queries and running status in the [requests dataset](link_to_requests)
### ✅ Reproducibility
To reproduce the results, you can use [our fork of lm_eval](#), as not all of our PRs have been merged into the main repository yet.
## 🙌 Acknowledgements
This leaderboard is part of the [#ProjectName](link_to_project) project, led by [OrganizationName](link_to_organization). We thank the following institutions for donating high-quality evaluation datasets:
- [Institution 1](link_to_institution_1)
- [Institution 2](link_to_institution_2)
- [Institution 3](link_to_institution_3)
- [Institution 4](link_to_institution_4)
- [Institution 5](link_to_institution_5)
- [Institution 6](link_to_institution_6)
- [Institution 7](link_to_institution_7)
- [Institution 8](link_to_institution_8)
- [Institution 9](link_to_institution_9)
The entities above are ordered chronologically by the date they joined the project, while the logos in the footer are ordered by the number of datasets donated.
Special thanks to:
- Task implementation: [Name 1], [Name 2], [Name 3], [Name 4], [Name 5], [Name 6], [Name 7], [Name 8], [Name 9], [Name 10]
- Leaderboard implementation: [Name 11], [Name 12]
- Model evaluation: [Name 13], [Name 14], [Name 15], [Name 16], [Name 17]
- Communication: [Name 18], [Name 19]
- Organization and collaboration leads: [Name 20], [Name 21], [Name 22], [Name 23], [Name 24], [Name 25], [Name 26], [Name 27], [Name 28], [Name 29], [Name 30]
For information about the dataset authors, please check the corresponding Dataset Cards (linked in the "Tasks" tab) and papers (included in the "Citation" section below). We would like to specially thank the teams that created or open-sourced their datasets specifically for the leaderboard (in chronological order):
- [Dataset 1 placeholder] and [Dataset 2 placeholder]: [team members placeholder]
- [Dataset 3 placeholder], [Dataset 4 placeholder] and [Dataset 5 placeholder]: [team members placeholder]
- [Dataset 6 placeholder]: [team members placeholder]
We also thank [Institution 1 placeholder], [Institution 2 placeholder], [Organization placeholder], [Person 1 placeholder], [Person 2 placeholder], and [Institution 3 placeholder] for sponsoring the inference GPUs.
## 🚀 Collaborate!
We would like to create a leaderboard as diverse as possible; please reach out if you would like us to include your evaluation dataset!
Comments and suggestions are more than welcome! Visit the [👏 Community](<Community Page Placeholder>) page, tell us what you think about the MMMLU Leaderboard and how we can improve it, or go ahead and open a PR!
Thank you very much! 💛
"""
EVALUATION_QUEUE_TEXT = """
## Some good practices before submitting a model
### 1) Make sure you can load your model and tokenizer using AutoClasses:
```python
from transformers import AutoConfig, AutoModel, AutoTokenizer
revision = "main"  # the branch or commit you are submitting (defaults to "main")
config = AutoConfig.from_pretrained("your model name", revision=revision)
model = AutoModel.from_pretrained("your model name", revision=revision)
tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
```
If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded.
Note: make sure your model is public!
Note: if your model needs `trust_remote_code=True`, we do not support this option yet, but we are working on adding it, stay posted!
### 2) Convert your model weights to [safetensors](https://huggingface.co/docs/safetensors/index)
It's a new format for storing weights which is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!
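If your checkpoint is still stored as `pytorch_model.bin`, a minimal conversion sketch looks like this (the model name and output directory are placeholders):
```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("your-org/your-model")
tokenizer = AutoTokenizer.from_pretrained("your-org/your-model")

# safe_serialization=True writes model.safetensors instead of pytorch_model.bin
model.save_pretrained("your-model-safetensors", safe_serialization=True)
tokenizer.save_pretrained("your-model-safetensors")
```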
### 3) Make sure your model has an open license!
This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model 🤗
### 4) Fill up your model card
When we add extra information about models to the leaderboard, it will be automatically taken from the model card.
## In case of model failure
If your model is displayed in the `FAILED` category, its execution stopped.
Make sure you have followed the above steps first.
If everything is done, check that you can launch the EleutherAI Harness on your model locally (you can add `--limit` to cap the number of examples per task).
"""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""@misc{multilingual-mmlu-benchmark-leaderboard-2024,
  author = {Yi Zhou and Yusuke Sakai and Yongxin Zhou and Haonan Li and Jiahui Geng and Qing Li and Wenxi Li and Yuanyu Lin and Andy Way and Zhuang Li and Zhongwei Wan and Di Wu and Wen Lai and Bo Zeng},
  title = {Multilingual MMLU Benchmark Leaderboard},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = "\url{https://huggingface.co/spaces/StarscreamDeceptions/Multilingual-MMLU-Benchmark-Leaderboard}"
}
@article{hendrycks2020measuring,
  title={Measuring massive multitask language understanding},
  author={Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob},
  journal={arXiv preprint arXiv:2009.03300},
  year={2020}
}
"""
EVALUATION_QUEUE_TEXT_ZH = """
## Some good practices before submitting a model
### 1) Make sure you can load your model and tokenizer using AutoClasses:
```python
from transformers import AutoConfig, AutoModel, AutoTokenizer
revision = "main"  # the branch or commit you are submitting (defaults to "main")
config = AutoConfig.from_pretrained("your model name", revision=revision)
model = AutoModel.from_pretrained("your model name", revision=revision)
tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
```
If this step fails, follow the error messages to debug your model before submitting it; your model was probably uploaded incorrectly.
Note: make sure your model is public!
Note: if your model needs `trust_remote_code=True`, we do not support this option yet, but we are working on adding it, stay posted!
### 2) Convert your model weights to [safetensors](https://huggingface.co/docs/safetensors/index)
It is a new format for storing weights which is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!
### 3) Make sure your model has an open license!
This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model 🤗
### 4) Fill up your model card
When we add extra information about models to the leaderboard, it will be automatically taken from the model card.
## In case of model failure
If your model is displayed in the `FAILED` category, its execution stopped. Make sure you have followed the above steps first. If everything is done, check that you can launch the EleutherAI Harness on your model locally (you can add `--limit` to cap the number of examples per task).
"""
# CITATION_BUTTON_LABEL = "复制以下代码引用这些结果"
# CITATION_BUTTON_TEXT = r"""
# """
LOGOS = [
# "logo/amsterdam-logo.png",
# "logo/cardiff-logo.png",
# "logo/coimbra-logo.png",
# "logo/dcu-logo.png",
# "logo/MBZU-logo.png",
# "logo/NAIST-logo.png",
# "logo/OSU-logo.png",
# "logo/rmit.png",
# "logo/sjtu-logo.png",
# "logo/tsinghua-logo.png",
# "logo/UGA-logo.png",
# "logo/um-logo.png"
"logo/all.png"
]