---
title: LeaderboardFinder
emoji: 🐢
colorFrom: pink
colorTo: gray
sdk: gradio
sdk_version: 4.22.0
app_file: app.py
pinned: false
---

If you want your leaderboard to appear, feel free to add relevant information in its metadata, and it will be displayed here.
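For example, here is a minimal sketch of what a leaderboard Space's metadata could look like once tags are added, assuming the tags go in the Space card's `tags` field; every value below is a placeholder, to be picked from the categories described in the rest of this page.

```yaml
# Hypothetical front matter for a leaderboard Space's README.md.
# All values are placeholders; choose the tags that match your leaderboard
# from the categories listed on this page.
---
title: My Leaderboard
emoji: 🏆
sdk: gradio
app_file: app.py
pinned: false
tags:
  - leaderboard            # assumption: generic tag marking the Space as a leaderboard
  - submission:automatic
  - test:public
  - judge:auto
  - modality:text
  - eval:code
  - language:english
---
```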

Categories

Submission type

This category does not apply to arenas.

  • submission:automatic: users can submit their models directly to the leaderboard, and evaluations run automatically without human intervention
  • submission:semiautomatic: the leaderboard requires model owners to run the evaluations on their side and submit the results
  • submission:manual: the leaderboard requires the leaderboard owner to run evaluations for new submissions
  • submission:closed: the leaderboard does not accept submissions at the moment

Test set status

This category does not apply to arenas.

  • test:public: all the test sets used are public, so the evaluations are fully reproducible
  • test:mix: some test sets are public and some are private
  • test:private: all the test sets used are private, so the evaluations are hard to game
  • test:rolling: the test sets change regularly over time, and evaluation scores are refreshed accordingly

Judges

  • judge:auto: evaluations are run automatically, using an evaluation suite such as lm_eval or lighteval
  • judge:model: evaluations are run using a model-as-a-judge approach to rate answers
  • judge:humans: evaluations are done by humans to rate answers - this is an arena
  • judge:vibe_check: evaluations are done manually by one human

Modalities

Can be any (or several) of the following:

  • modality:text
  • modality:image
  • modality:video
  • modality:audio

A bit outside of the usual modalities:

  • modality:tools: requires added tool usage - mostly for assistant models
  • modality:artefacts: the leaderboard evaluates machine learning artefacts themselves, for example, the quality of text embeddings

Evaluation categories

Can be any (or several) of the following:

  • eval:generation: the evaluation looks at generation capabilities specifically (can be image generation, text generation, ...)
  • eval:math
  • eval:code
  • eval:performance: model performance (speed, energy consumption, ...)
  • eval:safety: safety, toxicity, bias evaluations

Language

You can indicate the languages covered by your benchmark like so: language:mylanguage. At the moment, we do not support language codes; please use the language name in English.
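As a rough illustration of how these tags can support discovery, the sketch below queries the Hub for Spaces carrying a few of them via huggingface_hub's HfApi.list_spaces; the specific tags filtered on (including the generic leaderboard tag) are assumptions made for the example, not requirements stated on this page.

```python
from huggingface_hub import HfApi

api = HfApi()

# Hypothetical query: list Spaces carrying a few of the category tags
# described above. The chosen tag values are placeholders.
spaces = api.list_spaces(
    filter=["leaderboard", "modality:text", "language:english"],
    limit=10,
)

for space in spaces:
    # `tags` may be absent depending on what the endpoint returns.
    print(space.id, getattr(space, "tags", None))
```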