---
title: LeaderboardFinder
emoji: 🐢
colorFrom: pink
colorTo: gray
sdk: gradio
sdk_version: 4.22.0
app_file: app.py
pinned: false
---
If you want your leaderboard to appear, feel free to add relevant information in its metadata, and it will be displayed here.
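
For instance, here is a minimal sketch of what a leaderboard Space's metadata could look like, assuming the category tags described below are listed under a `tags` field in the README front matter (the Space name and tag values here are purely illustrative):

```yaml
---
title: MyLeaderboard        # illustrative Space metadata, not a real leaderboard
emoji: 🏆
sdk: gradio
app_file: app.py
pinned: false
tags:                       # category tags picked from the lists below
  - submission:automatic
  - test:public
  - judge:auto
  - modality:text
  - eval:code
---
```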
# Categories

## Submission type

This category does not apply to arenas.

- `submission:automatic`: users can submit their models directly to the leaderboard, and evaluation runs automatically without human intervention
- `submission:semiautomatic`: the leaderboard requires model owners to run the evaluations on their side and submit the results
- `submission:manual`: the leaderboard requires the leaderboard owner to run evaluations for new submissions
- `submission:closed`: the leaderboard does not accept submissions at the moment

## Test set status

This category does not apply to arenas.

- `test:public`: all the test sets used are public, so the evaluations are fully reproducible
- `test:mix`: some test sets are public and some are private
- `test:private`: all the test sets used are private, so the evaluations are hard to game
- `test:rolling`: the test sets change regularly over time, and evaluation scores are refreshed accordingly

## Judges

- `judge:auto`: evaluations are run automatically, using an evaluation suite such as `lm_eval` or `lighteval`
- `judge:model`: evaluations are run using a model-as-a-judge approach to rate answers
- `judge:humans`: evaluations are done by humans rating answers - this is an arena
- `judge:vibe_check`: evaluations are done manually by a single human

## Modalities

Can be any (or several) of the following:

- `modality:text`
- `modality:image`
- `modality:video`
- `modality:audio`

A bit outside the usual modalities:

- `modality:tools`: requires added tool usage - mostly relevant for assistant models
- `modality:artefacts`: the leaderboard evaluates machine learning artefacts themselves, for example the quality of text embeddings.

## Evaluation categories

Can be any (or several) of the following:

- `eval:generation`: the evaluation specifically targets generation capabilities (image generation, text generation, ...)
- `eval:math`
- `eval:code`
- `eval:performance`: model performance (speed, energy consumption, ...)
- `eval:safety`: safety, toxicity, and bias evaluations

## Language

You can indicate the languages covered by your benchmark like so: `language:mylanguage`.
At the moment, we do not support language codes; please use the language name in English.
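
Putting it together, the tags for a hypothetical English- and French-language text arena could look like this (all values below are illustrative):

```yaml
tags:
  - judge:humans        # humans rate the answers, so this is an arena
  - modality:text
  - language:english    # language names written out in English, not language codes
  - language:french
```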