---
title: LeaderboardFinder
emoji: 🐢
colorFrom: pink
colorTo: gray
sdk: gradio
sdk_version: 4.22.0
app_file: app.py
pinned: false
---
If you want your leaderboard to appear, feel free to add relevant information in its metadata, and it will be displayed here.
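Hugging Face Space cards support a `tags` field in their README metadata; assuming that is where the category tags described below are meant to go, a minimal sketch of a leaderboard Space's card might look like this (the Space name and tag choices are purely illustrative):

```yaml
---
# Minimal sketch of a leaderboard Space card (hypothetical Space name).
# Assumption: the category tags described below are listed under `tags`.
title: MyLeaderboard
emoji: 🥇
sdk: gradio
app_file: app.py
pinned: false
tags:
  - submission:automatic   # see the categories below
  - test:public
---
```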
# Categories

## Submission type
This category does not apply to arenas.
- `submission:automatic`: users can submit their models as such to the leaderboard, and evaluation is run automatically without human intervention
- `submission:semiautomatic`: the leaderboard requires the model owner to run evaluations on their side and submit the results
- `submission:manual`: the leaderboard requires the leaderboard owner to run evaluations for new submissions
- `submission:closed`: the leaderboard does not accept submissions at the moment
## Test set status
This category does not apply to arenas.
- `test:public`: all the test sets used are public; the evaluations are completely reproducible
- `test:mix`: some test sets are public and some are private
- `test:private`: all the test sets used are private; the evaluations are hard to game
- `test:rolling`: the test sets used change regularly over time, and evaluation scores are refreshed
## Judges
- `judge:auto`: evaluations are run automatically, using an evaluation suite such as `lm_eval` or `lighteval`
- `judge:model`: evaluations are run using a model-as-a-judge approach to rate answers
- `judge:humans`: evaluations are done by humans rating answers - this is an arena
- `judge:vibe_check`: evaluations are done manually by one human
## Modalities
Can be any (or several) of the following:
- `modality:text`
- `modality:image`
- `modality:video`
- `modality:audio`

A bit outside of the usual modalities:
- `modality:tools`: requires added tool usage - mostly for assistant models
- `modality:artefacts`: the leaderboard concerns itself with machine learning artefacts themselves, for example, quality evaluation of text embeddings.
## Evaluation categories
Can be any (or several) of the following:
- `eval:generation`: the evaluation looks specifically at generation capabilities (can be image generation, text generation, ...)
- `eval:math`
- `eval:code`
- `eval:performance`: model performance (speed, energy consumption, ...)
- `eval:safety`: safety, toxicity, and bias evaluations
## Language
You can indicate the languages covered by your benchmark like so: `language:mylanguage`.
At the moment, we do not support language codes; please use the language name in English.
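Putting the categories together, a leaderboard's metadata might carry one tag from each of them. This is a sketch under the same assumption as above (category tags listed under the `tags` field); the Space name and the particular tag choices are illustrative only:

```yaml
---
title: MyCodeLeaderboard          # hypothetical example
emoji: 🧮
sdk: gradio
app_file: app.py
pinned: false
tags:
  - submission:semiautomatic      # owners run evaluations themselves and submit results
  - test:mix                      # a mix of public and private test sets
  - judge:auto                    # scored with an automatic evaluation suite
  - modality:text
  - eval:code
  - language:english              # language name in English, not a language code
---
```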