Metric List

Automatic metrics for multiple-choice tasks

These metrics use the log-likelihood of each possible target.

loglikelihood_acc: Fraction of instances where the choice with the highest logprob was correct. We recommend normalizing the logprob by length.
loglikelihood_f1: Corpus level F1 score of the multichoice selection
mcc: Matthews correlation coefficient (a measure of agreement between predicted and gold labels).
recall_at_k: Fraction of instances where the correct choice was ranked k-th or better by logprob
mrr: Mean reciprocal rank, a measure of the quality of a ranking of choices ordered by correctness/relevance
target_perplexity: Perplexity of the different choices available.
acc_golds_likelihood: Unlike the metrics above, this one checks whether the average logprob of a single target is above or below 0.5.
multi_f1_numeric: Loglikelihood F1 score for multiple gold targets.
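
As an illustration of how the ranking-based metrics above relate to each other, here is a minimal sketch in plain Python. It is not Lighteval's implementation: the function names, the per-sample granularity, and the choice of dividing by a length (in characters or tokens) are assumptions made for the example.

```python
def rank_choices(logprobs, lengths=None):
    """Return choice indices sorted from best to worst score.

    If `lengths` is given, each log-probability is divided by the length
    of its choice before ranking (assumed form of length normalization).
    """
    if lengths is None:
        scores = list(logprobs)
    else:
        scores = [lp / n for lp, n in zip(logprobs, lengths)]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def accuracy(ranking, gold):
    # 1.0 if the top-ranked choice is the gold one.
    return 1.0 if ranking[0] == gold else 0.0


def recall_at_k(ranking, gold, k):
    # 1.0 if the gold choice appears among the k best-ranked choices.
    return 1.0 if gold in ranking[:k] else 0.0


def reciprocal_rank(ranking, gold):
    # 1 / (1-based position of the gold choice); averaged over samples this gives MRR.
    return 1.0 / (ranking.index(gold) + 1)


# Hypothetical sample: three choices, the gold choice is index 1.
logprobs = [-4.0, -6.0, -9.0]
lengths = [2, 6, 3]
gold = 1

raw_ranking = rank_choices(logprobs)
normalized_ranking = rank_choices(logprobs, lengths)
```

In this made-up sample the longer gold choice loses on raw logprob but wins once scores are divided by length, which is the situation the length-normalization recommendation is meant to address.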

Automatic metrics for perplexity and language modeling

These metrics use the log-likelihood of the prompt.

word_perplexity: Perplexity (log probability of the input) weighted by the number of words of the sequence.
byte_perplexity: Perplexity (log probability of the input) weighted by the number of bytes of the sequence.
bits_per_byte: Average number of bits per byte according to model probabilities.
log_prob: Predicted output’s average log probability (input’s log prob for language modeling).
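
The three corpus-level quantities above differ only in what the summed log-likelihood is divided by. The sketch below shows one common way to compute them, assuming natural-log likelihoods, whitespace-separated words, and UTF-8 bytes. The function name and input format are hypothetical, and Lighteval's own aggregation may differ in detail.

```python
import math


def corpus_perplexities(samples):
    """Compute perplexity-style metrics over a corpus.

    `samples` is a list of (text, total_logprob) pairs, where total_logprob
    is the summed natural-log probability the model assigns to the text.
    """
    total_logprob = sum(lp for _, lp in samples)
    n_words = sum(len(text.split()) for text, _ in samples)
    n_bytes = sum(len(text.encode("utf-8")) for text, _ in samples)
    return {
        # exp of the negative average log-likelihood per word
        "word_perplexity": math.exp(-total_logprob / n_words),
        # exp of the negative average log-likelihood per byte
        "byte_perplexity": math.exp(-total_logprob / n_bytes),
        # negative average log-likelihood per byte, converted from nats to bits
        "bits_per_byte": -total_logprob / (n_bytes * math.log(2)),
    }


# Hypothetical corpus of two short texts.
scores = corpus_perplexities([("a b", -2.0), ("c d", -2.0)])
```

Under these definitions, bits_per_byte is the base-2 logarithm of byte_perplexity, so the two carry the same information on different scales.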

Automatic metrics for generative tasks

These metrics need the model to generate an output. They are therefore slower.

Base:
- exact_match: Fraction of instances where the prediction matches the gold. Several variations can be made through parametrization:
  - string normalization before comparison (whitespace, articles, capitalization, …)
  - comparing the full string, or only subsets (prefix, suffix, …)
- maj_at_k: Model majority vote. Samples k generations from the model and takes the most frequent one as the prediction.
- f1_score: Average F1 score in terms of word overlap between the model output and gold (normalization optional).
- f1_score_macro: Corpus level macro F1 score.
- f1_score_micro: Corpus level micro F1 score.
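
To make the base metrics concrete, here is a small sketch of normalized exact match, majority voting, and word-overlap F1 for a single sample. The normalization rules (lowercasing, removing punctuation and English articles, collapsing whitespace) are one assumed configuration, and all function names are hypothetical rather than Lighteval's API.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold, normalized=True):
    # Compare full strings, optionally after normalization.
    if normalized:
        prediction, gold = normalize(prediction), normalize(gold)
    return 1.0 if prediction == gold else 0.0


def majority_vote(generations):
    """Return the most frequent normalized generation (first seen wins ties)."""
    counts = Counter(normalize(g) for g in generations)
    return counts.most_common(1)[0][0]


def token_f1(prediction, gold):
    """F1 over the multiset of normalized words shared by prediction and gold."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

A maj_at_k style score can then be obtained by passing the result of majority_vote over k sampled generations to exact_match against the gold answer.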
Summarization:
- rouge: Average ROUGE score (Lin, 2004).
- rouge1: Average ROUGE score (Lin, 2004) based on 1-gram overlap.
- rouge2: Average ROUGE score (Lin, 2004) based on 2-gram overlap.
- rougeL: Average ROUGE score (Lin, 2004) based on longest common subsequence overlap.
- rougeLsum: Average ROUGE score (Lin, 2004) based on longest common subsequence overlap, computed at the summary level (the text is split into sentences on newlines).
- rouge_t5 (BigBench): Corpus level ROUGE score for all available ROUGE metrics.
- faithfulness: Faithfulness scores based on the SummaC method of Laban et al. (2022).
- extractiveness: Reports the following, based on Grusky et al. (2018):
  - summarization_coverage: Extent to which the model-generated summaries are extractive fragments from the source document,
  - summarization_density: Extent to which the model-generated summaries are extractive summaries based on the source document,
  - summarization_compression: Extent to which the model-generated summaries are compressed relative to the source document.
- bert_score: Reports the average BERTScore precision, recall, and f1 score (Zhang et al., 2020) between model generation and gold summary.
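
The n-gram and longest-common-subsequence variants of ROUGE listed above can be sketched as follows. This is a simplified illustration that assumes lowercased whitespace tokenization and reports only the F-measure. It omits the stemming and tokenization details of real ROUGE implementations, and the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    # Multiset of contiguous n-grams in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f_measure(overlap, pred_total, ref_total):
    if overlap == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n=1):
    """F-measure of clipped n-gram overlap between prediction and reference."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())
    return f_measure(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """F-measure based on the longest common subsequence of tokens."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    return f_measure(lcs_length(pred, ref), len(pred), len(ref))
```

For reported results, an established ROUGE implementation should be preferred, since small tokenization differences change the scores.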
Translation:
- bleu: Corpus level BLEU score (Papineni et al., 2002) - uses the sacrebleu implementation.
- bleu_1: Average sample BLEU score (Papineni et al., 2002) based on 1-gram overlap - uses the nltk implementation.
- bleu_4: Average sample BLEU score (Papineni et al., 2002) based on 4-gram overlap - uses the nltk implementation.
- chrf: Character n-gram matches f-score.
- ter: Translation edit/error rate.
Copyright:
- copyright: Reports:
  - longest_common_prefix_length: Average length of longest common prefix between model generation and reference,
  - edit_distance: Average Levenshtein edit distance between model generation and reference,
  - edit_similarity: Average Levenshtein edit similarity (normalized by the length of longer sequence) between model generation and reference.
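
The three copyright quantities above can be illustrated for a single generation and reference pair as below. The sketch works on characters and defines similarity as one minus the distance divided by the longer length. Whether the actual metric operates on characters or tokens is not stated here, so treat that choice, along with the function names, as assumptions.

```python
def longest_common_prefix_length(a, b):
    """Number of leading positions where both sequences agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """1 - distance / length of the longer sequence (1.0 for two empty inputs)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

Because the functions only rely on indexing and equality, passing lists of tokens instead of strings gives token-level versions of the same quantities.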
Math:
- Both exact_match and maj_at_k can be used to evaluate mathematics tasks, with math-specific normalization that removes and filters LaTeX.

LLM-as-Judge

llm_judge_gpt3p5: Can be used for any generative task; the model is scored by a GPT-3.5 model using the OpenAI API.
llm_judge_llama_3_405b: Can be used for any generative task; the model is scored by a Llama 3 405B model using the Hugging Face API.
llm_judge_multi_turn_gpt3p5: Can be used for any generative task; the model is scored by a GPT-3.5 model using the OpenAI API. It is intended for multi-turn tasks like mt-bench.
llm_judge_multi_turn_llama_3_405b: Can be used for any generative task; the model is scored by a Llama 3 405B model using the Hugging Face API. It is intended for multi-turn tasks like mt-bench.
