Lighteval documentation

Metrics


Metric

class lighteval.metrics.Metric

( metric_name: str higher_is_better: bool category: SamplingMethod sample_level_fn: lighteval.metrics.metrics_sample.SampleLevelComputation | lighteval.metrics.sample_preparator.Preparator corpus_level_fn: typing.Union[lighteval.metrics.metrics_corpus.CorpusLevelComputation, typing.Callable] batched_compute: bool = False )

CorpusLevelMetric

class lighteval.metrics.utils.metric_utils.CorpusLevelMetric

( metric_name: str higher_is_better: bool category: SamplingMethod sample_level_fn: lighteval.metrics.metrics_sample.SampleLevelComputation | lighteval.metrics.sample_preparator.Preparator corpus_level_fn: typing.Union[lighteval.metrics.metrics_corpus.CorpusLevelComputation, typing.Callable] batched_compute: bool = False )

Metric computed over the whole corpus, with computations happening at the aggregation phase.

SampleLevelMetric

class lighteval.metrics.utils.metric_utils.SampleLevelMetric

( metric_name: str higher_is_better: bool category: SamplingMethod sample_level_fn: lighteval.metrics.metrics_sample.SampleLevelComputation | lighteval.metrics.sample_preparator.Preparator corpus_level_fn: typing.Union[lighteval.metrics.metrics_corpus.CorpusLevelComputation, typing.Callable] batched_compute: bool = False )

Metric computed per sample, then aggregated over the corpus
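The two-phase flow described above (score each sample, then aggregate over the corpus) can be sketched in plain Python. The function names `sample_level_fn` and `corpus_level_fn` only echo the constructor parameters; the bodies are hypothetical stand-ins, not Lighteval's own computations.

```python
from statistics import mean


def sample_level_fn(gold: str, pred: str) -> float:
    """Hypothetical per-sample score: 1.0 on a stripped exact match, else 0.0."""
    return float(gold.strip() == pred.strip())


def corpus_level_fn(scores: list[float]) -> float:
    """Hypothetical aggregation: plain mean of the per-sample scores."""
    return mean(scores)


# (gold, prediction) pairs standing in for an evaluated corpus.
samples = [("Paris", "Paris"), ("4", "5"), ("blue", " blue ")]

per_sample = [sample_level_fn(gold, pred) for gold, pred in samples]
corpus_score = corpus_level_fn(per_sample)
```

A corpus-level metric would instead keep the raw per-sample items and do all of the scoring inside the aggregation step.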

MetricGrouping

class lighteval.metrics.utils.metric_utils.MetricGrouping

( metric_name: list higher_is_better: dict category: SamplingMethod sample_level_fn: lighteval.metrics.metrics_sample.SampleLevelComputation | lighteval.metrics.sample_preparator.Preparator corpus_level_fn: dict batched_compute: bool = False )

Some metrics are cheaper to compute together. For example, if several metrics share the same costly preprocessing step, it makes more sense to run that step once.
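As an illustration of why grouping helps, here is a small sketch in which one shared preprocessing step feeds two scores that are returned together in a dict keyed by metric name. The function `grouped_sample_fn` and the metric names `exact` and `token_overlap` are invented for this example and are not part of Lighteval.

```python
def grouped_sample_fn(gold: str, pred: str) -> dict[str, float]:
    # Shared (potentially costly) preprocessing, done once for all metrics.
    gold_tokens = gold.lower().split()
    pred_tokens = pred.lower().split()

    # Both scores below reuse the same token lists.
    shared = len(set(gold_tokens) & set(pred_tokens))
    return {
        "exact": float(gold_tokens == pred_tokens),
        "token_overlap": shared / max(len(set(gold_tokens)), 1),
    }


scores = grouped_sample_fn("the cat sat", "The cat slept")
```

Returning one dict per sample lines up with the dict-valued `higher_is_better` and `corpus_level_fn` parameters in the signature above, where each metric name can carry its own direction and aggregation.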

CorpusLevelMetricGrouping

class lighteval.metrics.utils.metric_utils.CorpusLevelMetricGrouping

( metric_name: list higher_is_better: dict category: SamplingMethod sample_level_fn: lighteval.metrics.metrics_sample.SampleLevelComputation | lighteval.metrics.sample_preparator.Preparator corpus_level_fn: dict batched_compute: bool = False )

MetricGrouping computed over the whole corpus, with computations happening at the aggregation phase.

SampleLevelMetricGrouping

class lighteval.metrics.utils.metric_utils.SampleLevelMetricGrouping

( metric_name: list higher_is_better: dict category: SamplingMethod sample_level_fn: lighteval.metrics.metrics_sample.SampleLevelComputation | lighteval.metrics.sample_preparator.Preparator corpus_level_fn: dict batched_compute: bool = False )

MetricGrouping computed per sample, then aggregated over the corpus.

Corpus Metrics

CorpusLevelF1Score

class lighteval.metrics.metrics_corpus.CorpusLevelF1Score

( average: str num_classes: int = 2 )

compute_corpus

( items: list )

Computes the metric score over all generated items in the corpus, using the scikit-learn implementation.

CorpusLevelPerplexityMetric

class lighteval.metrics.metrics_corpus.CorpusLevelPerplexityMetric

( metric_type: str )

compute_corpus

( items: list )

Computes the metric score over all the corpus generated items.

CorpusLevelTranslationMetric

class lighteval.metrics.metrics_corpus.CorpusLevelTranslationMetric

( metric_type: str lang: typing.Literal['zh', 'ja', 'ko', ''] = '' )

compute_corpus

( items: list )

Computes the metric score over all the corpus generated items, by using the sacrebleu implementation.

MatthewsCorrCoef

class lighteval.metrics.metrics_corpus.MatthewsCorrCoef

( )

compute_corpus

( items: list ) → float

Parameters

items (list[dict]) — List of GenerativeCorpusMetricInput

Returns

float

Score

Computes the Matthews Correlation Coefficient using scikit-learn.
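For reference, the binary form of the Matthews Correlation Coefficient can be written out by hand. This is a standalone sketch for 0/1 labels only; the class above delegates to scikit-learn, and the helper name `matthews_corrcoef_binary` is made up for this example.

```python
import math


def matthews_corrcoef_binary(golds: list[int], preds: list[int]) -> float:
    """MCC for binary labels (0/1), computed from the confusion matrix."""
    pairs = list(zip(golds, preds))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention used here: return 0.0 when the denominator vanishes.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


mcc = matthews_corrcoef_binary([1, 1, 0, 0], [1, 0, 0, 0])
```

The score ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction).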

Sample Metrics

ExactMatches

class lighteval.metrics.metrics_sample.ExactMatches

( aggregation_function: typing.Callable[[list[float]], float] = max normalize_gold: typing.Optional[typing.Callable[[str], str]] = None normalize_pred: typing.Optional[typing.Callable[[str], str]] = None strip_strings: bool = False type_exact_match: str = 'full' )

compute

( doc: Doc model_response: ModelResponse **kwargs ) → float

Parameters

doc (Doc) — The document containing gold references.
model_response (ModelResponse) — The model’s response containing predictions.
**kwargs — Additional keyword arguments.

Returns

float

Aggregated score over the current sample’s items.

Computes the metric over a list of golds and predictions for one single sample.

compute_one_item

( gold: str pred: str ) → float

Parameters

gold (str) — One of the possible references
pred (str) — One of the possible predictions

Returns

float

The exact match score. Will be 1 for a match, 0 otherwise.

Compares two strings only.
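A rough sketch of how a per-pair comparison and the `aggregation_function` might fit together for the `'full'` match type. It assumes every gold is compared with every prediction before aggregating; the helper names are hypothetical, and normalizers are omitted.

```python
from typing import Callable


def exact_match_one(gold: str, pred: str, strip_strings: bool = False) -> float:
    """Compare two strings only: 1.0 for a match, 0.0 otherwise."""
    if strip_strings:
        gold, pred = gold.strip(), pred.strip()
    return 1.0 if gold == pred else 0.0


def exact_match_sample(
    golds: list[str],
    preds: list[str],
    aggregation_function: Callable[[list[float]], float] = max,
    strip_strings: bool = False,
) -> float:
    """Score every (gold, prediction) pair, then aggregate within the sample."""
    results = [exact_match_one(g, p, strip_strings) for g in golds for p in preds]
    return aggregation_function(results)


loose = exact_match_sample(["Paris", "paris"], ["paris "], strip_strings=True)
strict = exact_match_sample(["Paris", "paris"], ["paris "])
```

With the default `max` aggregation, a sample counts as correct as soon as any prediction matches any gold reference.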

F1_score

class lighteval.metrics.metrics_sample.F1_score

( aggregation_function: typing.Callable[[list[float]], float] = max normalize_gold: typing.Optional[typing.Callable[[str], str]] = None normalize_pred: typing.Optional[typing.Callable[[str], str]] = None strip_strings: bool = False )

compute

( doc: Doc model_response: ModelResponse **kwargs ) → float

Parameters

doc (Doc) — The document containing gold references.
model_response (ModelResponse) — The model’s response containing predictions.
**kwargs — Additional keyword arguments.

Returns

float

Aggregated score over the current sample’s items.

Computes the metric over a list of golds and predictions for one single sample.

compute_one_item

( gold: str pred: str ) → float

Parameters

gold (str) — One of the possible references
pred (str) — One of the possible predictions

Returns

float

The f1 score over the bag of words, computed using nltk.

Compares two strings only.
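To make the word-overlap F1 concrete, here is a standard-library approximation. It assumes whitespace tokenization and treats each side as a set of unique words; the class itself relies on nltk, so details may differ. The name `word_overlap_f1` is invented for this example.

```python
def word_overlap_f1(gold: str, pred: str) -> float:
    """F1 between the sets of whitespace-separated words of gold and pred."""
    gold_words = set(gold.split())
    pred_words = set(pred.split())
    shared = len(gold_words & pred_words)
    if shared == 0:
        return 0.0

    precision = shared / len(pred_words)  # share of predicted words that are correct
    recall = shared / len(gold_words)     # share of gold words that were recovered
    return 2 * precision * recall / (precision + recall)


f1 = word_overlap_f1("the cat sat on the mat", "the cat sat")
```

Here the prediction contains only gold words (precision 1.0) but covers 3 of the 5 unique gold words (recall 0.6), which gives an F1 of 0.75.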

LoglikelihoodAcc

class lighteval.metrics.metrics_sample.LoglikelihoodAcc

( logprob_normalization: lighteval.metrics.normalizations.LogProbCharNorm | lighteval.metrics.normalizations.LogProbTokenNorm | lighteval.metrics.normalizations.LogProbPMINorm | None = None )

compute

( doc: Doc model_response: ModelResponse **kwargs ) → int

Parameters

doc (Doc) — The document containing choices and gold indices.
model_response (ModelResponse) — The model’s response containing logprobs.
**kwargs — Additional keyword arguments.

Returns

int

The eval score: 1 if the best log-prob choice is in gold, 0 otherwise.

Computes the log likelihood accuracy: is the choice with the highest logprob in choices_logprob present in the gold_ixs?
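The check described above can be sketched as an argmax over per-choice log probabilities. The optional character-length normalization shown here (dividing each log probability by the length of the choice text) is an assumption about what a character norm does, and the function name is hypothetical.

```python
from typing import Optional


def loglikelihood_acc(
    choices_logprob: list[float],
    gold_ixs: list[int],
    choices: Optional[list[str]] = None,
) -> int:
    """Return 1 if the highest-scoring choice is a gold choice, else 0."""
    if choices is not None:
        # Assumed character normalization: log probability per character.
        choices_logprob = [lp / len(c) for lp, c in zip(choices_logprob, choices)]
    best = max(range(len(choices_logprob)), key=lambda i: choices_logprob[i])
    return int(best in gold_ixs)


raw = loglikelihood_acc([-3.0, -7.0], gold_ixs=[1])
normed = loglikelihood_acc([-3.0, -7.0], gold_ixs=[1], choices=["yes", "absolutely not"])
```

Without normalization the short choice wins (-3.0 against -7.0); per character, the longer gold choice scores -0.5 against -1.0 and is selected instead.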

NormalizedMultiChoiceProbability

class lighteval.metrics.metrics_sample.NormalizedMultiChoiceProbability

( log_prob_normalization: lighteval.metrics.normalizations.LogProbCharNorm | lighteval.metrics.normalizations.LogProbTokenNorm | lighteval.metrics.normalizations.LogProbPMINorm | None = None aggregation_function: typing.Callable[[numpy.ndarray], float] = numpy.max )

compute

( doc: Doc model_response: ModelResponse **kwargs ) → float

Parameters

doc (Doc) — The document containing choices and gold indices.
model_response (ModelResponse) — The model’s response containing logprobs.
**kwargs — Additional keyword arguments.

Returns

float

The probability of the best log-prob choice being a gold choice.

Computes the log likelihood probability: chance of choosing the best choice.
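One plausible reading of this computation, sketched with the standard library: exponentiate the per-choice log probabilities, renormalize them so they sum to 1 over the choices, and aggregate the probabilities of the gold choices. The function name and the exact normalization are assumptions, not the library's code.

```python
import math
from typing import Callable


def normalized_choice_probability(
    choices_logprob: list[float],
    gold_ixs: list[int],
    aggregation_function: Callable[[list[float]], float] = max,
) -> float:
    """Probability mass of the gold choice(s) after renormalizing over choices."""
    probs = [math.exp(lp) for lp in choices_logprob]
    total = sum(probs)
    return aggregation_function([probs[i] / total for i in gold_ixs])


logprobs = [math.log(0.2), math.log(0.1), math.log(0.1)]
prob = normalized_choice_probability(logprobs, gold_ixs=[0])
```

The raw probabilities 0.2, 0.1, and 0.1 renormalize to 0.5, 0.25, and 0.25, so the gold choice at index 0 receives 0.5.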

Probability

class lighteval.metrics.metrics_sample.Probability

( normalization: lighteval.metrics.normalizations.LogProbTokenNorm | None = None aggregation_function: typing.Callable[[numpy.ndarray], float] = numpy.max )

compute

( doc: Doc model_response: ModelResponse **kwargs ) → float

Parameters

doc (Doc) — The document containing choices and gold indices.
model_response (ModelResponse) — The model’s response containing logprobs.
**kwargs — Additional keyword arguments.

Returns

float

The probability of the best log-prob choice being a gold choice.

Computes the log likelihood probability: chance of choosing the best choice.

Recall

class lighteval.metrics.metrics_sample.Recall

( k: int )

compute

( doc: Doc model_response: ModelResponse **kwargs ) → int

Parameters

doc (Doc) — The document containing choices and gold indices.
model_response (ModelResponse) — The model’s response containing logprobs.
**kwargs — Additional keyword arguments.

Returns

int

Score: 1 if one of the top k predicted choices was correct, 0 otherwise.

Computes the recall at the requested depth level: looks at the k best predicted choices (those with the highest log probabilities) and checks whether a gold choice is among them.
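A small sketch of this check, assuming the inputs are a list of per-choice log probabilities and a list of gold indices. The function name `recall_at_k` is hypothetical.

```python
def recall_at_k(choices_logprob: list[float], gold_ixs: list[int], k: int) -> int:
    """Return 1 if any gold index is among the k highest-logprob choices."""
    ranked = sorted(
        range(len(choices_logprob)),
        key=lambda i: choices_logprob[i],
        reverse=True,
    )
    return int(any(i in gold_ixs for i in ranked[:k]))


logprobs = [-2.0, -0.5, -1.0, -3.0]  # choice 1 ranks first, then 2, 0, 3
at_2 = recall_at_k(logprobs, gold_ixs=[0], k=2)
at_3 = recall_at_k(logprobs, gold_ixs=[0], k=3)
```

The gold choice (index 0) is ranked third, so it is missed at depth 2 and found at depth 3.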

MRR

class lighteval.metrics.metrics_sample.MRR

( length_normalization: bool = False )

compute

( doc: Doc model_response: ModelResponse **kwargs ) → float

Parameters

model_response (ModelResponse) — The model’s response containing logprobs.
doc (Doc) — The document containing choices and gold indices.
**kwargs — Additional keyword arguments.

Returns

float

MRR score.

Mean reciprocal rank. Measures the quality of a ranking of choices (ordered by correctness).
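For a single sample, the quantity behind MRR is the reciprocal of the rank of the best-placed gold choice; the mean is then taken over samples. A minimal sketch, with a hypothetical function name and without length normalization:

```python
def reciprocal_rank(choices_logprob: list[float], gold_ixs: list[int]) -> float:
    """1 / rank of the best-ranked gold choice (rank 1 = highest logprob)."""
    ranked = sorted(
        range(len(choices_logprob)),
        key=lambda i: choices_logprob[i],
        reverse=True,
    )
    for rank, idx in enumerate(ranked, start=1):
        if idx in gold_ixs:
            return 1.0 / rank
    return 0.0  # convention chosen here when no gold index is present


logprobs = [-2.0, -0.5, -1.0, -3.0]  # ranking of choices: 1, 2, 0, 3
rr = reciprocal_rank(logprobs, gold_ixs=[0])
```

The gold choice sits at rank 3, so the sample contributes 1/3 to the mean.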

ROUGE

class lighteval.metrics.metrics_sample.ROUGE

( methods: str | list[str] multiple_golds: bool = False bootstrap: bool = False normalize_gold: typing.Optional[typing.Callable] = None normalize_pred: typing.Optional[typing.Callable] = None aggregation_function: typing.Optional[typing.Callable] = None tokenizer: object = None )

compute

( doc: Doc model_response: ModelResponse **kwargs ) → float or dict

Parameters

doc (Doc) — The document containing gold references.
model_response (ModelResponse) — The model’s response containing predictions.
**kwargs — Additional keyword arguments.

Returns

float or dict

Aggregated score over the current sample’s items. If several rouge functions have been selected, returns a dict which maps name and scores.

Computes the metric(s) over a list of golds and predictions for one single sample.

BertScore

class lighteval.metrics.metrics_sample.BertScore

( normalize_gold: typing.Optional[typing.Callable] = None normalize_pred: typing.Optional[typing.Callable] = None )

compute

( doc: Doc model_response: ModelResponse **kwargs ) → dict

Parameters

doc (Doc) — The document containing gold references.
model_response (ModelResponse) — The model’s response containing predictions.
**kwargs — Additional keyword arguments.

Returns

dict

Scores over the current sample’s items.

Computes the precision, recall and F1 score using the BERT scorer.

Extractiveness

class lighteval.metrics.metrics_sample.Extractiveness

( normalize_input: callable = remove_braces normalize_pred: callable = remove_braces_and_strip input_column: str = 'text' language: typing.Literal['en', 'de', 'fr', 'it'] = 'en' )

compute

( doc: Doc model_response: ModelResponse **kwargs ) → dict[str, float]

Parameters

doc (Doc) — The document containing input text.
model_response (ModelResponse) — The model’s response containing predictions.
**kwargs — Additional keyword arguments.

Returns

dict[str, float]

The extractiveness scores.

Compute the extractiveness of the predictions.

This method calculates coverage, density, and compression scores for a single prediction against the input text.

Faithfulness

class lighteval.metrics.metrics_sample.Faithfulness

( normalize_input: typing.Callable = remove_braces normalize_pred: typing.Callable = remove_braces_and_strip input_column: str = 'text' )

compute

( doc: Doc model_response: ModelResponse **kwargs ) → dict[str, float]

Parameters

doc (Doc) — The document containing input text.
model_response (ModelResponse) — The model’s response containing predictions.
**kwargs — Additional keyword arguments.

Returns

dict[str, float]

The faithfulness scores.

Compute the faithfulness of the predictions.

The SummaCZS (Summary Consistency Zero-Shot) model is used with configurable granularity and model variation.

BLEURT

class lighteval.metrics.metrics_sample.BLEURT

( )

compute

( doc: Doc model_response: ModelResponse **kwargs ) → float

Parameters

doc (Doc) — The document containing gold references.
model_response (ModelResponse) — The model’s response containing predictions.
**kwargs — Additional keyword arguments.

Returns

float

Score over the current sample’s items.

Uses the stored BLEURT scorer to compute the score on the current sample.

BLEU

class lighteval.metrics.metrics_sample.BLEU

( n_gram: int )

compute

( doc: Doc model_response: ModelResponse **kwargs ) → float

Parameters

doc (Doc) — The document containing gold references.
model_response (ModelResponse) — The model’s response containing predictions.
**kwargs — Additional keyword arguments.

Returns

float

Score over the current sample’s items.

Computes the sentence-level BLEU between the golds and each prediction, then takes the average.

StringDistance

class lighteval.metrics.metrics_sample.StringDistance

( metric_types: list[str] | str strip_prediction: bool = True )

compute

( doc: Doc model_response: ModelResponse **kwargs ) → dict

Parameters

doc (Doc) — The document containing gold references.
model_response (ModelResponse) — The model’s response containing predictions.
**kwargs — Additional keyword arguments.

Returns

dict

The different scores computed

Computes all the requested metrics on the golds and prediction.

edit_similarity

( s1 s2 ) → float

Returns

float

Edit similarity score between 0 and 1

Compute the edit similarity between two lists of strings.

Edit similarity is also used in the paper Lee, Katherine, et al. “Deduplicating training data makes language models better.” arXiv preprint arXiv:2107.06499 (2021).
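One common definition of edit similarity is 1 minus the Levenshtein distance divided by the length of the longer sequence. The sketch below follows that definition over arbitrary sequences (token lists or strings); it is an assumption about the formula rather than a copy of the library's code, and returning 0.0 for two empty inputs is a choice made here.

```python
from typing import Sequence


def edit_distance(s1: Sequence, s2: Sequence) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        curr = [i]
        for j, b in enumerate(s2, start=1):
            curr.append(
                min(
                    prev[j] + 1,             # delete a
                    curr[j - 1] + 1,         # insert b
                    prev[j - 1] + (a != b),  # substitute (free if equal)
                )
            )
        prev = curr
    return prev[-1]


def edit_similarity(s1: Sequence, s2: Sequence) -> float:
    """Similarity in [0, 1]: 1 means identical, 0 means nothing in common."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 0.0
    return 1.0 - edit_distance(s1, s2) / longest


sim = edit_similarity(["the", "cat", "sat"], ["the", "dog", "sat"])
```

One substitution over three tokens gives a similarity of 2/3.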

longest_common_prefix_length

( s1: ndarray s2: ndarray )

Compute the length of the longest common prefix.
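A plain-Python sketch of the same idea. The documented signature takes `ndarray` inputs, while this hypothetical version accepts any two sequences and compares them element by element.

```python
from typing import Sequence


def longest_common_prefix_length(s1: Sequence, s2: Sequence) -> int:
    """Count leading positions at which both sequences hold equal elements."""
    length = 0
    for a, b in zip(s1, s2):
        if a != b:
            break
        length += 1
    return length


prefix_len = longest_common_prefix_length([1, 2, 3, 4], [1, 2, 9, 4])
```

The sequences agree on their first two elements and diverge at the third, so the prefix length is 2 even though the fourth elements match again.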

Metrics allowing sampling

PassAtK

class lighteval.metrics.metrics_sample.PassAtK

( k: int | None = None n: int | None = None **kwargs )

compute

( doc: Doc model_response: ModelResponse **kwargs ) → float

Parameters

doc (Doc) — The document containing gold references.
model_response (ModelResponse) — The model’s response containing predictions.
**kwargs — Additional keyword arguments.

Returns

float

Aggregated score over the current sample’s items.

Computes the metric over a list of golds and predictions for one single item with possibly many samples. It applies normalisation (if needed) to model prediction and gold, computes their per prediction score, then aggregates the scores over the samples using a pass@k.

pass_at_k

( all_scores: list )

Computes pass@k using the algorithm from https://arxiv.org/pdf/2107.03374
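The referenced paper estimates pass@k from n samples of which c are correct as 1 - C(n - c, k) / C(n, k). The sketch below implements that estimator directly from the counts. Note that the method above takes a list of scores rather than `(n, c, k)`, so this signature is an illustration only.

```python
from math import comb


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: chance that at least one of k samples,
    drawn without replacement from n samples with c correct, is correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


p1 = pass_at_k_estimate(n=10, c=3, k=1)
p5 = pass_at_k_estimate(n=10, c=3, k=5)
```

With 3 correct samples out of 10, pass@1 is 0.3 and pass@5 is 1 - 21/252, roughly 0.917.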

MajAtN

class lighteval.metrics.metrics_sample.MajAtN

( n: int | None = None **kwargs )

compute

( doc: Doc model_response: ModelResponse **kwargs ) → float

Parameters

doc (Doc) — The document containing gold references.
model_response (ModelResponse) — The model’s response containing predictions.
**kwargs — Additional keyword arguments.

Returns

float

Aggregated score over the current sample’s items.

Computes the metric over a list of golds and predictions for one single sample. It applies normalisation (if needed) to model prediction and gold, and takes the most frequent answer of all the available ones, then compares it to the gold.
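Majority voting over n sampled answers can be sketched as follows. The normalizer defaults to `str.strip` purely for illustration, a single gold string is assumed, and the function name is hypothetical.

```python
from collections import Counter
from typing import Callable


def maj_at_n(
    gold: str,
    preds: list[str],
    n: int,
    normalize: Callable[[str], str] = str.strip,
) -> float:
    """Score 1.0 if the most frequent of the first n answers equals the gold."""
    votes = Counter(normalize(p) for p in preds[:n])
    majority_answer, _ = votes.most_common(1)[0]
    return float(majority_answer == normalize(gold))


score = maj_at_n("42", ["42", " 42", "41", "42 ", "40"], n=5)
```

After normalization, "42" collects three of the five votes, so the sample is scored as correct.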

AvgAtN

class lighteval.metrics.metrics_sample.AvgAtN

( n: int | None = None **kwargs )

compute

( doc: Doc model_response: ModelResponse **kwargs ) → float

Parameters

model_response (ModelResponse) — The model’s response containing predictions.
doc (Doc) — The document containing gold references.
**kwargs — Additional keyword arguments.

Returns

float

Aggregated score over the current sample’s items.

Computes the metric over a list of golds and predictions for one single sample. It applies normalisation (if needed) to model prediction and gold, scores each of the n sampled predictions against the gold, then averages the scores.

LLM-as-a-Judge

JudgeLM

class lighteval.metrics.utils.llm_as_judge.JudgeLM

( model: str templates: typing.Callable process_judge_response: typing.Callable judge_backend: typing.Literal['litellm', 'openai', 'transformers', 'tgi', 'vllm', 'inference-providers'] url: str | None = None api_key: str | None = None max_tokens: int | None = None response_format: BaseModel = None hf_provider: typing.Optional[typing.Literal['black-forest-labs', 'cerebras', 'cohere', 'fal-ai', 'fireworks-ai', 'inference-providers', 'hyperbolic', 'nebius', 'novita', 'openai', 'replicate', 'sambanova', 'together']] = None backend_options: dict | None = None )

Parameters

model (str) — The name of the model.
templates (Callable) — A function that takes the question, options, answer, and gold, and returns the judge prompt.
process_judge_response (Callable) — A function for processing the judge’s response.
judge_backend (Literal[“litellm”, “openai”, “transformers”, “tgi”, “vllm”, “inference-providers”]) — The backend for the judge.
url (str | None) — The URL for the OpenAI API.
api_key (str | None) — The API key for the OpenAI API (either OpenAI or HF key).
max_tokens (int) — The maximum number of tokens to generate. Defaults to 512.
response_format (BaseModel | None) — The format of the response from the API, used for the OpenAI and TGI backend.
hf_provider (Literal[“black-forest-labs”, “cerebras”, “cohere”, “fal-ai”, “fireworks-ai”, “inference-providers”, “hyperbolic”, “nebius”, “novita”, “openai”, “replicate”, “sambanova”, “together”] | None) — The HuggingFace provider when using the inference-providers backend.
backend_options (dict | None) — Options for the backend. Currently only supported for litellm.

A class representing a judge that evaluates answers using the chosen backend.

Methods:
evaluate_answer: Evaluates an answer using the OpenAI API or Transformers library.
lazy_load_client: Lazy loads the OpenAI client or Transformers pipeline.
call_api: Calls the API to get the judge’s response.
call_transformers: Calls the Transformers pipeline to get the judge’s response.
call_vllm: Calls the VLLM pipeline to get the judge’s response.

dict_of_lists_to_list_of_dicts

( dict_of_lists )

Parameters

dict_of_lists — A dictionary where each value is a list. All lists are expected to have the same length.

Transform a dictionary of lists into a list of dictionaries.

Each dictionary in the output list will contain one element from each list in the input dictionary, with the same keys as the input dictionary.

Example:

>>> dict_of_lists_to_list_of_dicts({'k': [1, 2, 3], 'k2': ['a', 'b', 'c']})
[{'k': 1, 'k2': 'a'}, {'k': 2, 'k2': 'b'}, {'k': 3, 'k2': 'c'}]
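One way to implement this transformation is with `zip`. This is a standalone sketch that reproduces the documented behaviour, not necessarily how the method is written in the library.

```python
def dict_of_lists_to_list_of_dicts(dict_of_lists: dict) -> list[dict]:
    """Turn {'k': [1, 2], 'k2': ['a', 'b']} into one dict per position."""
    keys = list(dict_of_lists)
    # zip(*values) walks all lists in lockstep, yielding one tuple per row.
    return [dict(zip(keys, row)) for row in zip(*dict_of_lists.values())]


rows = dict_of_lists_to_list_of_dicts({"k": [1, 2, 3], "k2": ["a", "b", "c"]})
```

Because `zip` stops at the shortest input, lists of unequal length would be truncated silently, which is why the parameter description expects all lists to have the same length.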

evaluate_answer

( question: str answer: str options: list[str] | None = None gold: str | None = None )

Parameters

question (str) — The prompt asked to the evaluated model.
answer (str) — Answer given by the evaluated model.
options (list[str] | None) — Optional list of answer options.
gold (str | None) — Optional reference answer.

Evaluates an answer using either Transformers or OpenAI API.

JudgeLLM

class lighteval.metrics.metrics_sample.JudgeLLM

( judge_model_name: str template: typing.Callable process_judge_response: typing.Callable judge_backend: typing.Literal['litellm', 'openai', 'transformers', 'vllm', 'tgi', 'inference-providers'] short_judge_name: str | None = None response_format: pydantic.main.BaseModel | None = None url: str | None = None hf_provider: str | None = None max_tokens: int | None = None backend_options: dict | None = None )

JudgeLLMMTBench

class lighteval.metrics.metrics_sample.JudgeLLMMTBench

( judge_model_name: str template: typing.Callable process_judge_response: typing.Callable judge_backend: typing.Literal['litellm', 'openai', 'transformers', 'vllm', 'tgi', 'inference-providers'] short_judge_name: str | None = None response_format: pydantic.main.BaseModel | None = None url: str | None = None hf_provider: str | None = None max_tokens: int | None = None backend_options: dict | None = None )

compute

( model_response: list doc: list **kwargs )

Computes the score of a generative task using an LLM as a judge. The generative task can be multi-turn, with at most 2 turns; in that case, scores are returned for turns 1 and 2. Also returns user_prompt and judgement, which are later ignored by the aggregator.

JudgeLLMMixEval

class lighteval.metrics.metrics_sample.JudgeLLMMixEval

( judge_model_name: str template: typing.Callable process_judge_response: typing.Callable judge_backend: typing.Literal['litellm', 'openai', 'transformers', 'vllm', 'tgi', 'inference-providers'] short_judge_name: str | None = None response_format: pydantic.main.BaseModel | None = None url: str | None = None hf_provider: str | None = None max_tokens: int | None = None backend_options: dict | None = None )

compute

( responses: list docs: list **kwargs )

Computes the score of a generative task using an LLM as a judge. The generative task can be multi-turn, with at most 2 turns; in that case, scores are returned for turns 1 and 2. Also returns user_prompt and judgement, which are later ignored by the aggregator.
