import urllib.request
import yaml
TITLE = """<h1 style="text-align:center;" id="space-title">Leaderboards on the Hub - Documentation</h1>"""
IMAGE = "![Leaderboards on the Hub](https://raw.githubusercontent.com/huggingface/blog/main/assets/leaderboards-on-the-hub/thumbnail.png)"
INDEX_PAGE = """
# Leaderboards on the Hub - Documentation
As the number of open and closed source machine learning models explodes, we wanted to make evaluation simpler.
This space contains documentation to
- easily explore interesting leaderboards to find the best model for your use case
- build your own to test specific capabilities which interest you and the community.
Have fun evaluating!
"""
INTRO_PAGE = """
# Introduction
## 🏅 What are leaderboards?
`Leaderboards` are rankings of machine learning artefacts (most frequently generative models, but also embeddings, classifiers, ...) based on their performance on given tasks across relevant modalities.
They are commonly used to find the best model for a specific use case.
For example, for Large Language Models, the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) allows you to find the best base pre-trained models in English, using a range of academic evaluations looking at language understanding, general knowledge, and math, and the [Chatbot Arena Leaderboard](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard) provides a ranking of the best chat models in English, thanks to user votes on chat capabilities.
So far on the Hub, we have leaderboards for text, image, video and audio generation, including specialised leaderboards for at least 10 natural (human) languages, and a number of capabilities such as math or code. We also have leaderboards evaluating more general aspects like energy performance or model safety.
Some leaderboards rank models according to human preferences, gathered through a voting system where people compare models and vote for the better one on a given task. These spaces are called `arenas`.
## ⚖️ How to use leaderboards properly
There are certain things to keep in mind when using a leaderboard.
### 1) Comparing apples to apples
Much like in sports, where we have weight categories to keep rankings fair, when evaluating model artefacts, you want to compare similar items.
For example, when comparing models, you want them to be
- in the same weight class (number of parameters): bigger models often have better performance than smaller ones, but they usually cost more to run and train (in money, time, and energy)
- at the same numerical precision: the lower the precision, the smaller and faster the model, but this can affect performance
- in the same category: pre-trained models are good generalist bases, whereas fine-tuned models are more specialised and perform better on specific tasks, and merged models tend to have benchmark scores that overstate their actual performance.
### 2) Comparing across a spectrum of tasks
Though good generalist machine learning models are becoming increasingly common, an LLM that plays chess well will not necessarily write good poetry. If you want to select the correct model for your use case, you need to look at its scores and performance across a range of leaderboards and tasks, before testing it yourself to make sure it fits your needs.
### 3) Being careful about evaluation limitations, especially for models
A number of evaluations are very easy to cheat on, accidentally or not: if a model has already seen the data used for testing, its performance will be artificially high, reflecting memorisation rather than any actual capability on the task. This mechanism is called `contamination`.
Evaluations of closed source models may also become inaccurate over time: since closed source models sit behind APIs, it is not possible to know how the model changes, or what is added or removed, over time (contrary to open source models, where relevant information is available). As such, you should not assume that a static evaluation of a closed source model at time t will still be valid some time later.
"""
# We extract the most recent blogs to display and embed them
blog_info = "https://raw.githubusercontent.com/huggingface/blog/main/_blog.yml"
with urllib.request.urlopen(blog_info) as f:
    blog_index = yaml.safe_load(f.read())

# Reverse so the most recent posts come first, then keep the 5 latest tagged "leaderboard"
recent_blogs = [post for post in blog_index[::-1] if "leaderboard" in post["tags"]][:5]
def return_blog_code(blogs_yaml):
    # Not used at the moment, but could be improved if we wanted the images too
    first_row = "|".join([f"[{blog['title']}](https://huggingface.co/blog/{blog['local']})" for blog in blogs_yaml])
    second_row = "|".join([":---:" for _ in blogs_yaml])
    third_row = "|".join([f" ![](https://huggingface.co{blog['thumbnail']}) " for blog in blogs_yaml])
    return "\n\n|" + first_row + "|\n|" + second_row + "|\n|" + third_row + "|\n\n"
def return_blog_list(blogs_yaml):
    # Returns a Markdown bullet list of links to the blog posts
    return "\n- ".join([" "] + [f"[{blog['title']}](https://huggingface.co/blog/{blog['local']})" for blog in blogs_yaml])
FINDING_PAGE = """
# Finding the best leaderboard for your use case
## ✨ Featured leaderboards
Since the end of 2023, we have worked with partners who have strong evaluation expertise to highlight their work in a blog series called [`Leaderboards on the Hub`](https://huggingface.co/blog?tag=leaderboard).
Here are the most recent blogs we wrote together:
""" + return_blog_list(recent_blogs) + """
This series is particularly useful for understanding the subtleties of evaluation across different modalities and topics, and we hope it will act as a knowledge base in the future.
## 🔍 Explore Spaces by yourself
On the Hub, `leaderboards` and `arenas` are hosted as Spaces, like machine learning demos.
You can search for the keywords `leaderboard` or `arena` in Space titles using the search bar [here](https://huggingface.co/spaces) (or [this link](https://huggingface.co/spaces?sort=trending&search=leaderboard)), search the full content of Spaces using "Full-text search", or look for Spaces with the correct metadata via the `leaderboard` tag [here](https://huggingface.co/spaces?filter=leaderboard).
We also try to maintain an [up to date collection](https://huggingface.co/collections/clefourrier/leaderboards-and-benchmarks-64f99d2e11e92ca5568a7cce) of leaderboards. If we missed your space, tag one of the members of the evaluation team in the space discussion!
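If you prefer to script this exploration, here is a minimal sketch using the `huggingface_hub` client (assuming you have it installed); it mirrors the title search and tag filter described above.
```python
from huggingface_hub import HfApi

api = HfApi()

# Spaces whose name or description mentions "leaderboard"
for space in api.list_spaces(search="leaderboard", limit=10):
    print(space.id)

# Spaces explicitly tagged `leaderboard` in their metadata
for space in api.list_spaces(filter="leaderboard", limit=10):
    print(space.id)
```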
"""
BUILDING_PAGE = """
# Building a leaderboard using a template
To build a leaderboard, the easiest way is to start from our demo templates [here](https://huggingface.co/demo-leaderboard-backend).
## 📏 Contents
Our demo leaderboard template contains four components: two Spaces and two datasets.
- The `frontend space` displays the results to users, contains explanations about evaluations, and optionally can accept model submissions.
- The `requests dataset` stores the submissions of users, and the status of model evaluations. It is updated by the frontend (at submission time) and the backend (at running time).
- The `results dataset` stores the results of the evaluations. It is updated by the backend when evaluations are finished, and pulled by the frontend for display.
- The `backend space` is optional: you can skip it if you run evaluations manually or on your own cluster. It looks at currently pending submissions and launches their evaluation using either the Eleuther AI Harness (`lm_eval`) or Hugging Face's `lighteval`, then updates the evaluation status and stores the results. If you use something more specific, you will need to edit it to fit your own evaluation suite.
## 🪛 Getting started
Copy the two Spaces and the two datasets to your org to get started with your own leaderboard; you can do this from the web UI, or programmatically as sketched below.
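Here is a minimal sketch using `huggingface_hub`; the org name is a placeholder, and the exact ids of the template Spaces are assumptions, so check them in the [demo org](https://huggingface.co/demo-leaderboard-backend) first.
```python
from huggingface_hub import create_repo, duplicate_space

MY_ORG = "my-org"  # placeholder: your own org or username

# Duplicate the frontend (and optionally the backend) Space into your org;
# the source ids below are assumptions, check the demo org for the real names
duplicate_space("demo-leaderboard-backend/leaderboard", to_id=f"{MY_ORG}/leaderboard")
duplicate_space("demo-leaderboard-backend/backend", to_id=f"{MY_ORG}/backend")

# Create the two datasets that will hold submissions and results
create_repo(f"{MY_ORG}/requests", repo_type="dataset")
create_repo(f"{MY_ORG}/results", repo_type="dataset")
```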
### Setting up the front end
To get started on your own front end leaderboard, you will need to edit 2 files (a sketch of both edits follows the list):
- `src/envs.py` to define your own environment variables (like the name of the org to which everything was copied)
- `src/about.py` with the tasks and the number of few-shot examples you want for your tasks
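Here is a rough sketch of what those edits could look like; the variable names follow the demo template but may differ slightly in your copy, so adapt them to the files you actually find.
```python
from collections import namedtuple
from enum import Enum

# `Task` is already defined in the template; we redeclare a stand-in here so the sketch is self-contained
Task = namedtuple("Task", ["benchmark", "metric", "col_name"])

# src/envs.py : point the leaderboard at your own org and datasets
OWNER = "my-org"  # placeholder: the org to which you copied the Spaces and datasets
QUEUE_REPO = f"{OWNER}/requests"
RESULTS_REPO = f"{OWNER}/results"

# src/about.py : declare the tasks displayed in the leaderboard
class Tasks(Enum):
    # Task(key of the task in the result files, key of the metric, column name to display)
    task0 = Task("task_name1", "metric_name", "My Task")

NUM_FEWSHOT = 0  # number of few-shot examples used for your tasks
```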
### Setting up fake results to initialize the leaderboard
Once this is done, you need to edit the "fake results" file to fit the format of your tasks: in the `results` sub-dictionary, replace `task_name1` and `metric_name` with the values you defined in `Tasks` above.
```
"results": {
    "task_name1": {
        "metric_name": 0
    }
}
```
At this point, you should already have some results displayed in the front end!
Any additional model you want to add will need one file in the `requests` dataset and one in the `results` dataset, following the same template as the files already present.
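For example, here is a minimal sketch of adding a result file programmatically with `huggingface_hub`; the org name, model id and file layout are placeholders, and you should mirror the fake files already present in the template rather than trust the exact fields below.
```python
import json

from huggingface_hub import HfApi

api = HfApi()
MY_ORG = "my-org"             # placeholder: same org as in src/envs.py
model_id = "my-org/my-model"  # placeholder: the model you want to add

# Mirror the structure of the fake result files shipped with the template
result = {
    "config": {"model_name": model_id, "model_dtype": "torch.float16"},
    "results": {"task_name1": {"metric_name": 0.5}},
}

with open("result.json", "w") as fp:
    json.dump(result, fp, indent=2)

api.upload_file(
    path_or_fileobj="result.json",
    path_in_repo=f"{model_id}/result.json",  # check how existing files are laid out in your results dataset
    repo_id=f"{MY_ORG}/results",
    repo_type="dataset",
)
```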
### Optional: Setting up the backend
If you plan on running your evaluations in Spaces, you then need to edit the backend so it runs the evaluations most relevant to you, in the way you want.
Depending on the evaluation suite you want to use, this is the part which is likely to take the most time.
However, this is optional if you only want to use the leaderboard to display results, or plan on running evaluations manually or on your own compute source.
## 🔧 Tips and tricks
Leaderboards set up in the above fashion are adjustable, from fully automated evaluations (a user submits a model, it is evaluated, etc.) to fully manual (every new evaluation is run with human control) to semi-automatic.
When running the backend in Spaces, you can either:
- upgrade your backend Space to the compute power level you require, and run your evaluations locally (using `lm_eval`, `lighteval`, or your own evaluation suite); this is the most general solution across evaluation types, but it will limit you in terms of model size, as you might not be able to fit the biggest models in the backend (see the sketch after this list)
- use a suite which does model inference through API calls, such as `lighteval` with `inference-endpoints`, which automatically spins up models from the Hub for evaluation, allowing you to scale your compute to the current model.
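As an illustration of the first option, here is a minimal sketch of running an evaluation with the Eleuther AI Harness from inside the backend; the model and task names are examples only.
```python
# pip install lm-eval
import lm_eval

output = lm_eval.simple_evaluate(
    model="hf",                    # run a transformers model locally
    model_args="pretrained=gpt2",  # any model id from the Hub
    tasks=["hellaswag"],           # one or more harness task names
    num_fewshot=0,
    batch_size=8,
)
print(output["results"])           # scores to write into your `results` dataset
```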
If you run evaluations on your own compute source, you can still reuse some of the files from the backend to pull and push the `results` and `requests` datasets.
Once your leaderboard is set up, don't forget to set its metadata so it gets indexed by our Leaderboard Finder. See "What do the tags mean?" in the [LeaderboardFinder](https://huggingface.co/spaces/leaderboards/LeaderboardFinder) space.
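If you prefer to do this programmatically, here is a minimal sketch using `huggingface_hub`; the Space id is a placeholder and the tag value shown is only an example, the authoritative list of tags lives in the LeaderboardFinder space.
```python
from huggingface_hub import metadata_update

metadata_update(
    repo_id="my-org/leaderboard",        # placeholder: your leaderboard Space
    metadata={"tags": ["leaderboard"]},  # add the category tags that apply to you
    repo_type="space",
    overwrite=True,                      # replaces the existing value of these keys
)
```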
"""
EXTRAS_PAGE = """
# Building features around your leaderboard
Several cool tools can be duplicated/extended for your leaderboard:
- If you want your leaderboard to push model results to model cards, you can duplicate this [great space](https://huggingface.co/spaces/Weyaxi/leaderboard-results-to-modelcard) and update it for your own leaderboard.
"""