---
title: LeaderboardFinder
emoji: 🐢
colorFrom: pink
colorTo: gray
sdk: gradio
sdk_version: 4.22.0
app_file: app.py
pinned: false
---
If you want your leaderboard to appear, feel free to add relevant information in its metadata, and it will be displayed here.
# Categories
## Submission type
This category does not apply to arenas.
- `submission:automatic`: users can submit their models as such to the leaderboard, and evaluation is run automatically without human intervention
- `submission:semiautomatic`: the leaderboard requires model owners to run evaluations on their side and submit the results
- `submission:manual`: the leaderboard requires the leaderboard owner to run evaluations for new submissions
- `submission:closed`: the leaderboard does not accept submissions at the moment
## Test set status
This category does not apply to arenas.
- `test:public`: all the test sets used are public, so the evaluations are fully reproducible
- `test:mix`: some test sets are public and some are private
- `test:private`: all the test sets used are private, making the evaluations hard to game
- `test:rolling`: the test sets used change regularly over time, and evaluation scores are refreshed
## Judges
- `judge:auto`: evaluations are run automatically, using an evaluation suite such as `lm_eval` or `lighteval`
- `judge:model`: evaluations are run using a model-as-a-judge approach to rate answers
- `judge:humans`: evaluations are done by humans rating answers - this is an arena
- `judge:vibe_check`: evaluations are done manually by a single human
## Modalities
Can be any (or several) of the following list:
- `modality:text`
- `modality:image`
- `modality:video`
- `modality:audio`
A bit outside the usual modalities:
- `modality:tools`: requires added tool usage - mostly for assistant models
- `modality:artefacts`: the leaderboard evaluates machine learning artefacts themselves, for example the quality of text embeddings
## Evaluation categories
Can be any (or several) of the following list:
- `eval:generation`: the evaluation looks at generation capabilities specifically (can be image generation, text generation, ...)
- `eval:math`
- `eval:code`
- `eval:performance`: model performance (speed, energy consumption, ...)
- `eval:safety`: safety, toxicity, bias evaluations
## Language
You can indicate the languages covered by your benchmark like so: `language:mylanguage`.
At the moment we do not support language codes, so please use the language name in English.
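For illustration, here is a hypothetical Space card frontmatter combining tags from the categories above. This sketch assumes the tags go in the `tags` list of the README's YAML frontmatter; the title, emoji, and tag choices are placeholders, not requirements:

```yaml
---
title: MyLeaderboard
emoji: 🏆
colorFrom: blue
colorTo: gray
sdk: gradio
app_file: app.py
pinned: false
tags:
  - leaderboard
  - submission:automatic   # models are evaluated without human intervention
  - test:public            # all test sets are public and reproducible
  - judge:auto             # scored by an automatic evaluation suite
  - modality:text
  - eval:code
  - language:english
---
```

A leaderboard can carry several tags from the same category (for example, several `modality:` or `language:` tags) when more than one value applies.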