from dataclasses import dataclass
from enum import Enum

# One evaluation dimension: the internal metric key and its display column name
@dataclass
class EvalDimension:
    metric: str
    col_name: str
# Select your tasks here
# ---------------------------------------------------
class EvalDimensions(Enum):
    d0 = EvalDimension("speed", "Speed (words/sec)")
    d1 = EvalDimension("contamination_score", "Contamination Score")
    d2 = EvalDimension("paraphrasing", "Paraphrasing")
    d3 = EvalDimension("sentiment analysis", "Sentiment Analysis")
    d4 = EvalDimension("coding", "Coding")
    d5 = EvalDimension("function calling", "Function Calling")
    d6 = EvalDimension("rag qa", "RAG QA")
    d7 = EvalDimension("reading comprehension", "Reading Comprehension")
    d8 = EvalDimension("entity extraction", "Entity Extraction")
    d9 = EvalDimension("summarization", "Summarization")
    d10 = EvalDimension("long context", "Long Context")
    d11 = EvalDimension("mmlu", "MMLU")
    d12 = EvalDimension("arabic language & grammar", "Arabic Language & Grammar")
    d13 = EvalDimension("general knowledge", "General Knowledge")
    d14 = EvalDimension("translation (incl dialects)", "Translation (incl Dialects)")
    d15 = EvalDimension("trust & safety", "Trust & Safety")
    d16 = EvalDimension("writing (incl dialects)", "Writing (incl Dialects)")
    d17 = EvalDimension("dialect detection", "Dialect Detection")
    d18 = EvalDimension("reasoning & math", "Reasoning & Math")
    d19 = EvalDimension("diacritization", "Diacritization")
    d20 = EvalDimension("instruction following", "Instruction Following")
    d21 = EvalDimension("transliteration", "Transliteration")
    d22 = EvalDimension("structuring", "Structuring")
    d23 = EvalDimension("hallucination", "Hallucination")
NUM_FEWSHOT = 0  # Change to match your few-shot setting
# ---------------------------------------------------

# Your leaderboard name
TITLE = """<div><img class='abl_header_image' src='https://huggingface.co/spaces/silma-ai/Arabic-LLM-Broad-Leaderboard/resolve/main/src/images/abl_logo.png'></div>"""
# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
<h1 style='width: 100%;text-align: center;' id="space-title">Arabic Broad Leaderboard (ABL) - The first comprehensive leaderboard for Arabic LLMs</h1>
ABL, the official leaderboard of the <a href='https://huggingface.co/datasets/silma-ai/arabic-broad-benchmark' target='_blank'>Arabic Broad Benchmark (ABB)</a>,
is a next-generation leaderboard offering innovative visualizations, analytical capabilities, model skill breakdowns, speed comparisons, and contamination-detection mechanisms. ABL gives the community an unprecedented ability to study the capabilities of Arabic models and choose the right model for the right task. Find more details in the FAQ section.
"""
# Which evaluations are you running? How can people reproduce what you have?
LLM_BENCHMARKS_TEXT = """
# FAQ
---
## What is the Benchmark Score?
* The benchmark score is the average of all individual question scores.
* Each question is scored from 0 to 10 using a mix of LLM-as-judge scoring and fixed rules, depending on the question type.
* Please refer to the ABB page below for more information about the scoring rules and the dataset:
https://huggingface.co/datasets/silma-ai/arabic-broad-benchmark#scoring-rules
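As a rough sketch of the aggregation step (not the actual ABB scoring code; `question_scores` is made-up data):

```python
# Hypothetical per-question scores on the 0-10 scale described above.
question_scores = [10, 7.5, 0, 9, 10]

# The benchmark score is simply the mean of all question scores.
benchmark_score = sum(question_scores) / len(question_scores)
print(f"Benchmark score: {benchmark_score:.2f}")  # 7.30
```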
---
## What is the Contamination Score?
* The contamination score measures the probability that a model was trained on the ABB benchmarking data to boost its scores on ABL.
* After testing each model on ABL, we run our private algorithm to detect contamination and arrive at a score.
* Contaminated models show a red sign and a number above zero in the Contamination Score column.
* Any model showing signs of contamination will be removed from the leaderboard immediately.
---
## What is the Speed?
* Speed shows how fast the model was during testing, measured in words per second.
* It is calculated by dividing the total number of words the model generated during the test by the time (in seconds) it took the model to complete it, as sketched below.
* Please note that we use the same GPU (A100) and a batch size of 1 for all Hugging Face models to ensure a fair comparison. Models above 15B parameters are split across multiple GPUs.
* Each model should only be compared to other models in its size category.
* API or closed models can't be compared to open models, only to other API models, since they are not hosted on our infrastructure.
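An illustrative sketch of the metric (the numbers below are fabricated, and this is not the harness code):

```python
# Total words the model generated across the whole test, and the
# wall-clock time (in seconds) it took to finish; both values are made up.
words_generated = 18_432
total_seconds = 1_280.0

speed_wps = words_generated / total_seconds
print(f"{speed_wps:.1f} words/sec")  # 14.4
```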
---
## What does Size mean?
* Models below 3.5B parameters are considered Nano.
* Models between 3.5B and 10B parameters are considered Small.
* Models between 10B and 35B parameters are considered Medium.
* Models above 35B parameters are considered Large.
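These thresholds reduce to a simple bucketing rule. A hypothetical sketch (the exact handling of values falling on a boundary is an assumption):

```python
def size_category(params_in_billions: float) -> str:
    # Map a parameter count to the buckets listed above (illustrative only).
    if params_in_billions < 3.5:
        return "Nano"
    if params_in_billions < 10:
        return "Small"
    if params_in_billions < 35:
        return "Medium"
    return "Large"

print(size_category(7))   # Small
print(size_category(70))  # Large
```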
---
## What does Source mean?
* API: Closed models tested via an API.
* Hugging Face: Open models downloaded from Hugging Face and tested via the `transformers` library.
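For Hugging Face models, this means the evaluation runs against a locally loaded checkpoint. A minimal sketch of that setup (the model ID is a placeholder, and this is not the actual test harness):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "org/model-name"  # placeholder, not a real repository
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" requires the accelerate package
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```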
---
## How can I reproduce the results?
You can easily reproduce the results of any model by following the steps on the ABB page below:
https://huggingface.co/datasets/silma-ai/arabic-broad-benchmark#how-to-use-abb-to-benchmark-a-model
---
## I tested a model and got a slightly different score. Why is that?
* ABB partially depends on an external LLM-as-judge (GPT-4.1).
* LLMs are stochastic by nature and will not always produce the same scores on every run.
* That said, according to our testing, such variations are always within a +/-1% range.
---
## I have seen an answer which seems correct to me but is getting a zero score. Why is that?
* First, LLM scoring is not always consistent, and it occasionally gives a wrong score to an answer; based on our testing, this is very rare.
* Second, we also have fixed rules in place to penalize models; for example, when a model answers in another language, or in two languages at once, it receives a score of zero.
* In general, both the fixed rules and LLM inconsistencies apply to all models in the same way, which we consider fair.
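As a purely hypothetical illustration of how such a fixed rule could short-circuit the judge (ABB's real rules are documented on the ABB page; this heuristic is an assumption, not the actual implementation):

```python
def fixed_language_rule(answer: str, expect_arabic: bool = True):
    # Illustrative only: count characters in the Arabic Unicode block
    # (U+0600..U+06FF) as a stand-in for real language identification.
    arabic_chars = sum(1 for ch in answer if 0x0600 <= ord(ch) <= 0x06FF)
    if expect_arabic and arabic_chars == 0:
        return 0      # answered in another language -> hard zero
    return None       # no rule fired; the LLM judge scores the answer
```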
---
## Why am I not allowed to submit models with more than 15B parameters?
* Models above 15B parameters don't fit on a single GPU and require multiple GPUs, which we can't always provision in an automated manner.
* We also know that most community models are below 15B parameters.
* As an exception, we can accept requests from organizations on a case-by-case basis.
* Finally, we will always make sure to include larger models once they see high adoption in the community.
---
## How can I learn more about ABL and ABB?
Feel free to read through the following resources:
* **ABB page**: https://huggingface.co/datasets/silma-ai/arabic-broad-benchmark
* **ABL blog post**: Coming soon...
---
## How can I contact the benchmark maintainers?
You can contact us via benchmark@silma.ai
"""
EVALUATION_QUEUE_TEXT = """
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite the Leaderboard"
CITATION_BUTTON_TEXT = r"""
@misc{ABL,
  author = {SILMA.AI Team},
  title = {Arabic Broad Leaderboard},
  year = {2025},
  publisher = {SILMA.AI},
  howpublished = "{\url{https://huggingface.co/spaces/silma-ai/Arabic-LLM-Broad-Leaderboard}}"
}
"""
FOOTER_TEXT = """<div style='display:flex;justify-content:center;align-items:center;'>
<span style='font-size:36px;font-weight:bold;margin-right:20px;'>Sponsored By</span>
<a href='https://silma.ai/?ref=abl' target='_blank'>
<img style='height:60px' src='https://huggingface.co/spaces/silma-ai/Arabic-LLM-Broad-Leaderboard/resolve/main/src/images/silma-logo-wide.png'>
</a>
</div>"""