Lighteval documentation
Metrics
Metric
class lighteval.metrics.Metric
< source >( metric_name: str higher_is_better: bool category: SamplingMethod sample_level_fn: typing.Callable corpus_level_fn: typing.Callable batched_compute: bool = False )
CorpusLevelMetric
class lighteval.metrics.utils.metric_utils.CorpusLevelMetric
< source >( metric_name: str higher_is_better: bool category: SamplingMethod sample_level_fn: typing.Callable corpus_level_fn: typing.Callable batched_compute: bool = False )
Metric computed over the whole corpus, with computations happening at the aggregation phase.
SampleLevelMetric
class lighteval.metrics.utils.metric_utils.SampleLevelMetric
< source >( metric_name: str higher_is_better: bool category: SamplingMethod sample_level_fn: typing.Callable corpus_level_fn: typing.Callable batched_compute: bool = False )
Metric computed per sample, then aggregated over the corpus
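For illustration, a minimal sketch of a custom sample-level metric built from this class. The import path of SamplingMethod, its GENERATIVE member, and the model_response.text attribute are assumptions; check them against your lighteval version.

```python
import numpy as np

from lighteval.metrics.utils.metric_utils import SampleLevelMetric
from lighteval.tasks.requests import SamplingMethod  # assumed import path


def prediction_is_nonempty(doc, model_response, **kwargs) -> float:
    # Toy sample-level function: 1.0 if the model produced any non-blank text.
    # Treating `model_response.text` as a list of generations is an assumption.
    texts = getattr(model_response, "text", []) or []
    return float(any(t.strip() for t in texts))


non_empty_rate = SampleLevelMetric(
    metric_name="non_empty_rate",
    higher_is_better=True,
    category=SamplingMethod.GENERATIVE,  # assumed enum member
    sample_level_fn=prediction_is_nonempty,
    corpus_level_fn=np.mean,  # average the per-sample scores over the corpus
    batched_compute=False,
)
```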
MetricGrouping
class lighteval.metrics.utils.metric_utils.MetricGrouping
< source >( metric_name: list higher_is_better: dict category: SamplingMethod sample_level_fn: typing.Callable corpus_level_fn: dict batched_compute: bool = False )
Some metrics are more efficient to compute together. For example, if a costly preprocessing step is shared by several metrics, it makes more sense to run it once for the whole group.
CorpusLevelMetricGrouping
class lighteval.metrics.utils.metric_utils.CorpusLevelMetricGrouping
< source >( metric_name: list higher_is_better: dict category: SamplingMethod sample_level_fn: typing.Callable corpus_level_fn: dict batched_compute: bool = False )
MetricGrouping computed over the whole corpus, with computations happening at the aggregation phase.
SampleLevelMetricGrouping
class lighteval.metrics.utils.metric_utils.SampleLevelMetricGrouping
< source >( metric_name: list higher_is_better: dict category: SamplingMethod sample_level_fn: typing.Callable corpus_level_fn: dict batched_compute: bool = False )
MetricGrouping computed per sample, then aggregated over the corpus.
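As a sketch of the shared-preprocessing idea described above: one sample-level function produces a dict of scores, and each score gets its own corpus-level aggregator. The SamplingMethod import and the model_response.text attribute are assumptions.

```python
import numpy as np

from lighteval.metrics.utils.metric_utils import SampleLevelMetricGrouping
from lighteval.tasks.requests import SamplingMethod  # assumed import path


def length_stats(doc, model_response, **kwargs) -> dict:
    # The shared preprocessing (here, a cheap tokenisation) is done once and
    # feeds both scores. `model_response.text[0]` is an assumed attribute.
    tokens = model_response.text[0].split()
    return {
        "pred_len": float(len(tokens)),
        "pred_unique_ratio": len(set(tokens)) / max(len(tokens), 1),
    }


length_grouping = SampleLevelMetricGrouping(
    metric_name=["pred_len", "pred_unique_ratio"],
    higher_is_better={"pred_len": False, "pred_unique_ratio": True},
    category=SamplingMethod.GENERATIVE,  # assumed enum member
    sample_level_fn=length_stats,
    corpus_level_fn={"pred_len": np.mean, "pred_unique_ratio": np.mean},
    batched_compute=False,
)
```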
Corpus Metrics
CorpusLevelF1Score
class lighteval.metrics.metrics_corpus.CorpusLevelF1Score
< source >( average: str num_classes: int = 2 )
Computes the metric score over all generated items of the corpus, using the scikit-learn implementation.
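A rough illustration of the aggregation step with scikit-learn's f1_score; the exact item structure lighteval passes in is not shown here.

```python
from sklearn.metrics import f1_score

# Per-sample (gold, prediction) class indices collected during evaluation,
# then scored once over the whole corpus.
golds = [1, 0, 1, 1, 0]
preds = [1, 0, 0, 1, 0]

print(f1_score(golds, preds, average="macro"))
```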
CorpusLevelPerplexityMetric
Computes the metric score over all generated items of the corpus.
CorpusLevelTranslationMetric
class lighteval.metrics.metrics_corpus.CorpusLevelTranslationMetric
< source >( metric_type: str lang: typing.Literal['zh', 'ja', 'ko', ''] = '' )
Computes the metric score over all generated items of the corpus, using the sacrebleu implementation.
matthews_corrcoef
lighteval.metrics.metrics_corpus.matthews_corrcoef
< source >( items: list ) → float
Computes the Matthews correlation coefficient using the scikit-learn implementation.
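For reference, the underlying scikit-learn call on toy gold/prediction classes (the item structure lighteval passes in is not shown here).

```python
from sklearn.metrics import matthews_corrcoef

golds = [1, 1, 0, 1, 0, 0]
preds = [1, 0, 0, 1, 0, 1]

# Ranges from -1 to 1; 1 means perfect agreement, 0 is chance level.
print(matthews_corrcoef(golds, preds))
```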
Sample Metrics
ExactMatches
class lighteval.metrics.metrics_sample.ExactMatches
< source >( aggregation_function: typing.Callable[[list[float]], float] = max normalize_gold: typing.Optional[typing.Callable[[str], str]] = None normalize_pred: typing.Optional[typing.Callable[[str], str]] = None strip_strings: bool = False type_exact_match: str = 'full' )
compute
< source >( doc: Doc model_response: ModelResponse **kwargs ) → float
Computes the metric over a list of golds and predictions for a single sample.
compute_one_item
< source >( gold: str pred: str ) → float
Compares two strings only.
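A small usage sketch. The accepted values of type_exact_match other than 'full' are not listed on this page, and the expected outputs assume compute_one_item applies the lowercase/strip normalization configured below.

```python
from lighteval.metrics.metrics_sample import ExactMatches

em = ExactMatches(
    normalize_gold=str.lower,
    normalize_pred=str.lower,
    strip_strings=True,
    type_exact_match="full",
)

print(em.compute_one_item(gold="Paris", pred="  paris "))  # 1.0 after normalization
print(em.compute_one_item(gold="Paris", pred="London"))    # 0.0
```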
F1_score
class lighteval.metrics.metrics_sample.F1_score
< source >( aggregation_function: typing.Callable[[list[float]], float] = max normalize_gold: typing.Optional[typing.Callable[[str], str]] = None normalize_pred: typing.Optional[typing.Callable[[str], str]] = None strip_strings: bool = False )
compute
< source >( doc: Doc model_response: ModelResponse **kwargs ) → float
Computes the metric over a list of golds and predictions for a single sample.
compute_one_item
< source >( gold: str pred: str ) → float
Compares two strings only.
LoglikelihoodAcc
class lighteval.metrics.metrics_sample.LoglikelihoodAcc
< source >( logprob_normalization: lighteval.metrics.normalizations.LogProbCharNorm | lighteval.metrics.normalizations.LogProbTokenNorm | lighteval.metrics.normalizations.LogProbPMINorm | None = None )
compute
< source >( doc: Doc model_response: ModelResponse **kwargs ) → int
Parameters
- gold_ixs (list[int]) — All the gold choices indices
- choices_logprob (list[float]) — Summed log-probabilities of all the possible choices for the model, ordered as the choices.
- unconditioned_logprob (list[float] | None) — Unconditioned log-probabilities for PMI normalization, ordered as the choices.
- choices_tokens (list[list[int]] | None) — Tokenized choices for token normalization, ordered as the choices.
- formatted_doc (Doc) — Original document for the sample. Used to get the original choices’ length for possible normalization
Returns
int
The eval score: 1 if the best log-prob choice is in gold, 0 otherwise.
Computes the log likelihood accuracy: is the choice with the highest logprob in choices_logprob present in the gold_ixs?
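The decision rule, written standalone (the actual metric reads these values from the Doc and ModelResponse and may first apply one of the normalizations listed above):

```python
import numpy as np

def loglikelihood_acc(gold_ixs: list, choices_logprob: list) -> int:
    # 1 if the highest-logprob choice is one of the gold choices, else 0.
    best_choice = int(np.argmax(choices_logprob))
    return int(best_choice in gold_ixs)

# Choice 2 has the highest summed logprob and is gold -> score 1.
print(loglikelihood_acc(gold_ixs=[2], choices_logprob=[-4.2, -3.9, -1.5, -6.0]))
```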
NormalizedMultiChoiceProbability
class lighteval.metrics.metrics_sample.NormalizedMultiChoiceProbability
< source >( log_prob_normalization: lighteval.metrics.normalizations.LogProbCharNorm | lighteval.metrics.normalizations.LogProbTokenNorm | lighteval.metrics.normalizations.LogProbPMINorm | None = None aggregation_function: typing.Callable[[numpy.ndarray], float] = max )
compute
< source >( doc: Doc model_response: ModelResponse **kwargs ) → float
Parameters
- gold_ixs (list[int]) — All the gold choices indices
- choices_logprob (list[float]) — Summed log-probabilities of all the possible choices for the model, ordered as the choices.
- unconditioned_logprob (list[float] | None) — Unconditioned log-probabilities for PMI normalization, ordered as the choices.
- choices_tokens (list[list[int]] | None) — Tokenized choices for token normalization, ordered as the choices.
- formatted_doc (Doc) — Original document for the sample. Used to get the original choices’ length for possible normalization
Returns
float
The probability of the best log-prob choice being a gold choice.
Computes the log likelihood probability: chance of choosing the best choice.
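A standalone sketch of the probability computation: softmax the per-choice logprobs, then aggregate the probabilities of the gold choices (max by default). The real metric may apply one of the logprob normalizations first.

```python
import numpy as np

def normalized_multi_choice_prob(gold_ixs, choices_logprob, aggregation=max):
    logprobs = np.array(choices_logprob)
    probs = np.exp(logprobs - logprobs.max())
    probs /= probs.sum()                      # renormalise over the choices
    return float(aggregation(probs[gold_ixs]))

# Gold choices 0 and 2; choice 2 carries ~0.55 of the probability mass.
print(normalized_multi_choice_prob([0, 2], [-1.0, -2.0, -0.5]))
```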
Probability
class lighteval.metrics.metrics_sample.Probability
< source >( normalization: lighteval.metrics.normalizations.LogProbTokenNorm | None = None aggregation_function: typing.Callable[[numpy.ndarray], float] = max )
compute
< source >( doc: Doc model_response: ModelResponse **kwargs ) → float
Parameters
- gold_ixs (list[int]) — All the gold choices indices
- choices_logprob (list[float]) — Summed log-probabilities of all the possible choices for the model, ordered as the choices.
- unconditioned_logprob (list[float] | None) — Unconditioned log-probabilities for PMI normalization, ordered as the choices.
- choices_tokens (list[list[int]] | None) — Tokenized choices for token normalization, ordered as the choices.
- reference_texts (list[str] | None) — Reference texts for token normalization, ordered as the choices.
Returns
float
The probability of the best log-prob choice being a gold choice.
Computes the log likelihood probability: chance of choosing the best choice.
Recall
compute
< source >( doc: Doc model_response: ModelResponse **kwargs ) → int
Computes the recall at the requested depth level: looks at the n best predicted choices (those with the highest log probabilities) and checks whether an actual gold is among them.
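Standalone sketch of recall at depth k for one sample:

```python
import numpy as np

def recall_at_k(gold_ixs: list, choices_logprob: list, k: int) -> int:
    # 1 if any gold choice is among the k highest-logprob choices.
    top_k = np.argsort(choices_logprob)[::-1][:k]
    return int(any(g in top_k for g in gold_ixs))

print(recall_at_k(gold_ixs=[3], choices_logprob=[-1.0, -0.2, -2.5, -0.4], k=2))  # 1
```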
MRR
compute
< source >( model_response: ModelResponse doc: Doc **kwargs ) → float
Parameters
- gold_ixs (list[int]) — All the gold choices indices
- choices_logprob (list[float]) — Summed log-probabilities of all the possible choices for the model, ordered as the choices.
- formatted_doc (Doc) — Original document for the sample. Used to get the original choices’ length for possible normalization
Returns
float
MRR score.
Mean reciprocal rank. Measures the quality of a ranking of choices (ordered by correctness).
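Standalone sketch of the reciprocal-rank computation for one sample:

```python
import numpy as np

def mrr(gold_ixs: list, choices_logprob: list) -> float:
    # Rank the choices by decreasing logprob and return 1 / rank of the
    # best-ranked gold choice (rank 1 = top of the list).
    ranking = list(np.argsort(choices_logprob)[::-1])
    best_gold_rank = min(ranking.index(g) for g in gold_ixs) + 1
    return 1.0 / best_gold_rank

print(mrr(gold_ixs=[2], choices_logprob=[-0.1, -3.0, -0.7]))  # gold ranked 2nd -> 0.5
```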
ROUGE
class lighteval.metrics.metrics_sample.ROUGE
< source >( methods: str | list[str] multiple_golds: bool = False bootstrap: bool = False normalize_gold: typing.Optional[typing.Callable] = None normalize_pred: typing.Optional[typing.Callable] = None aggregation_function: typing.Optional[typing.Callable] = None tokenizer: object = None )
compute
< source >( doc: Doc model_response: ModelResponse **kwargs ) → float or dict
Computes the metric(s) over a list of golds and predictions for a single sample.
BertScore
class lighteval.metrics.metrics_sample.BertScore
< source >( normalize_gold: typing.Optional[typing.Callable] = None normalize_pred: typing.Optional[typing.Callable] = None )
compute
< source >( doc: Doc model_response: ModelResponse **kwargs ) → dict
Computes the precision, recall, and F1 score using the BERT scorer.
Extractiveness
class lighteval.metrics.metrics_sample.Extractiveness
< source >( normalize_input: callable = remove_braces normalize_pred: callable = remove_braces_and_strip input_column: str = 'text' )
compute
< source >( doc: Doc model_response: ModelResponse **kwargs ) → dict[str, float]
Compute the extractiveness of the predictions.
This method calculates coverage, density, and compression scores for a single prediction against the input text.
Faithfulness
class lighteval.metrics.metrics_sample.Faithfulness
< source >( normalize_input: typing.Callable = remove_braces normalize_pred: typing.Callable = remove_braces_and_strip input_column: str = 'text' )
compute
< source >( doc: Doc model_response: ModelResponse **kwargs ) → dict[str, float]
Compute the faithfulness of the predictions.
The SummaCZS (Summary Content Zero-Shot) model is used with configurable granularity and model variation.
BLEURT
compute
< source >( doc: Doc model_response: ModelResponse **kwargs ) → float
Uses the stored BLEURT scorer to compute the score on the current sample.
BLEU
compute
< source >( doc: Doc model_response: ModelResponse **kwargs ) → float
Computes the sentence-level BLEU between the golds and each prediction, then takes the average.
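For illustration only, the same idea with sacrebleu's sentence-level scorer; the class above may use a different BLEU backend internally.

```python
import numpy as np
import sacrebleu

golds = ["the cat sat on the mat"]
predictions = ["the cat is on the mat", "a cat sat on a mat"]

# Score each prediction against the golds, then average over predictions.
scores = [sacrebleu.sentence_bleu(pred, golds).score for pred in predictions]
print(np.mean(scores))
```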
StringDistance
class lighteval.metrics.metrics_sample.StringDistance
< source >( metric_types: list[str] | str strip_prediction: bool = True )
compute
< source >( doc: Doc model_response: ModelResponse **kwargs ) → dict
Computes all the requested metrics on the golds and prediction.
Helper computations include the edit similarity between two lists of strings (edit similarity is also used in Lee, Katherine, et al. "Deduplicating training data makes language models better." arXiv preprint arXiv:2107.06499 (2021)) and the length of the longest common prefix.
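Standalone sketches of those two helper computations; the class exposes them through compute via the requested metric_types, and the function names below are illustrative rather than the class's actual method names.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def edit_similarity(a: str, b: str) -> float:
    # 1 minus the normalised edit distance, as in the deduplication paper cited above.
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)

def longest_common_prefix_length(a: str, b: str) -> int:
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

print(edit_similarity("kitten", "sitting"))                    # ~0.571
print(longest_common_prefix_length("lighteval", "lightning"))  # 5
```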
JudgeLLM
class lighteval.metrics.metrics_sample.JudgeLLM
< source >( judge_model_name: str template: typing.Callable process_judge_response: typing.Callable judge_backend: typing.Literal['litellm', 'openai', 'transformers', 'vllm', 'tgi', 'inference-providers'] short_judge_name: str | None = None response_format: pydantic.main.BaseModel | None = None url: str | None = None hf_provider: str | None = None max_tokens: int | None = None )
JudgeLLMMTBench
class lighteval.metrics.metrics_sample.JudgeLLMMTBench
< source >( judge_model_name: str template: typing.Callable process_judge_response: typing.Callable judge_backend: typing.Literal['litellm', 'openai', 'transformers', 'vllm', 'tgi', 'inference-providers'] short_judge_name: str | None = None response_format: pydantic.main.BaseModel | None = None url: str | None = None hf_provider: str | None = None max_tokens: int | None = None )
Computes the score of a generative task using an LLM as a judge. The generative task can be multi-turn with at most 2 turns; in that case, scores are returned for turn 1 and turn 2. Also returns the user_prompt and judgement, which are ignored later by the aggregator.
JudgeLLMMixEval
class lighteval.metrics.metrics_sample.JudgeLLMMixEval
< source >( judge_model_name: str template: typing.Callable process_judge_response: typing.Callable judge_backend: typing.Literal['litellm', 'openai', 'transformers', 'vllm', 'tgi', 'inference-providers'] short_judge_name: str | None = None response_format: pydantic.main.BaseModel | None = None url: str | None = None hf_provider: str | None = None max_tokens: int | None = None )
Computes the score of a generative task using an LLM as a judge. The generative task can be multi-turn with at most 2 turns; in that case, scores are returned for turn 1 and turn 2. Also returns the user_prompt and judgement, which are ignored later by the aggregator.
MajAtK
class lighteval.metrics.metrics_sample.MajAtK
< source >( k: int normalize_gold: typing.Optional[typing.Callable] = None normalize_pred: typing.Optional[typing.Callable] = None strip_strings: bool = False type_exact_match: str = 'full' )
compute
< source >( model_response: ModelResponse docs: Doc **kwargs ) → float
Computes the metric over a list of golds and predictions for a single sample. It applies normalization (if needed) to the model predictions and the gold, takes the most frequent answer among all available predictions, then compares it to the gold.
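A standalone sketch of the majority-vote-then-match logic, with a simple lowercase/strip normalization standing in for the configurable one:

```python
from collections import Counter

def maj_at_k(gold: str, predictions: list) -> float:
    # Normalise the k generations, take the most frequent answer, and
    # exact-match it against the (normalised) gold.
    def normalise(s: str) -> str:
        return s.strip().lower()

    majority, _ = Counter(normalise(p) for p in predictions).most_common(1)[0]
    return float(majority == normalise(gold))

print(maj_at_k("42", [" 42", "41", "42 ", "42", "7"]))  # majority is "42" -> 1.0
```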
LLM-as-a-Judge
JudgeLM
class lighteval.metrics.llm_as_judge.JudgeLM
< source >( model: str templates: typing.Callable process_judge_response: typing.Callable judge_backend: typing.Literal['litellm', 'openai', 'transformers', 'tgi', 'vllm', 'inference-providers'] url: str | None = None api_key: str | None = None max_tokens: int = 512 response_format: BaseModel = None hf_provider: typing.Optional[typing.Literal['black-forest-labs', 'cerebras', 'cohere', 'fal-ai', 'fireworks-ai', 'inference-providers', 'hyperbolic', 'nebius', 'novita', 'openai', 'replicate', 'sambanova', 'together']] = None )
Parameters
- model (str) — The name of the model.
- templates (Callable) — A function taking into account the question, options, answer, and gold and returning the judge prompt.
- process_judge_response (Callable) — A function for processing the judge’s response.
- judge_backend (Literal[“openai”, “transformers”, “tgi”, “vllm”]) — The backend for the judge.
- url (str | None) — The URL for the OpenAI API.
- api_key (str | None) — The API key for the OpenAI API (either OpenAI or HF key).
Attributes
- model (str) — The name of the model.
- template (Callable) — A function taking into account the question, options, answer, and gold and returning the judge prompt.
- API_MAX_RETRY (int) — The maximum number of retries for the API.
- API_RETRY_SLEEP (int) — The time to sleep between retries.
- client (OpenAI | None) — The OpenAI client.
- pipe (LLM | AutoModel | None) — The Transformers or vllm pipeline.
- process_judge_response (Callable) — A function for processing the judge’s response.
- url (str | None) — The URL for the OpenAI API.
- api_key (str | None) — The API key for the OpenAI API (either OpenAI or HF key).
- backend (Literal[“openai”, “transformers”, “tgi”, “vllm”]) — The backend for the judge.
A class representing a judge for evaluating answers using either the OpenAI or Transformers library.
Methods:
- evaluate_answer: Evaluates an answer using the OpenAI API or Transformers library.
- lazy_load_client: Lazy loads the OpenAI client or Transformers pipeline.
- call_api: Calls the API to get the judge’s response.
- call_transformers: Calls the Transformers pipeline to get the judge’s response.
- call_vllm: Calls the vLLM pipeline to get the judge’s response.
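A minimal usage sketch, assuming an OpenAI-compatible backend. The judge model name, the chat-message prompt format returned by the template, and the numeric parsing of the verdict are illustrative choices, and the structure of evaluate_answer's return value is not shown on this page.

```python
import re

from lighteval.metrics.llm_as_judge import JudgeLM

def build_prompt(question, options, answer, gold):
    # Assumed format: a list of chat messages; adapt to the backend you use.
    return [
        {"role": "system", "content": "Rate the answer from 1 to 10. Reply with the number only."},
        {"role": "user", "content": f"Question: {question}\nReference: {gold}\nAnswer: {answer}"},
    ]

def parse_verdict(response: str) -> int:
    # Pull the first integer out of the judge's reply; 0 if parsing fails.
    match = re.search(r"\d+", response)
    return int(match.group()) if match else 0

judge = JudgeLM(
    model="gpt-4o-mini",                 # illustrative judge model
    templates=build_prompt,
    process_judge_response=parse_verdict,
    judge_backend="openai",
)

result = judge.evaluate_answer(
    question="What is the capital of France?",
    answer="Paris",
    gold="Paris",
)
```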
dict_of_lists_to_list_of_dicts
< source >( dict_of_lists )
Transform a dictionary of lists into a list of dictionaries.
Each dictionary in the output list will contain one element from each list in the input dictionary, with the same keys as the input dictionary.
Example:
dict_of_lists_to_list_of_dicts({'k': [1, 2, 3], 'k2': ['a', 'b', 'c']}) returns [{'k': 1, 'k2': 'a'}, {'k': 2, 'k2': 'b'}, {'k': 3, 'k2': 'c'}]
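A behaviourally equivalent standalone sketch of the transformation:

```python
def dict_of_lists_to_list_of_dicts(dict_of_lists):
    keys = list(dict_of_lists)
    return [dict(zip(keys, values)) for values in zip(*dict_of_lists.values())]

print(dict_of_lists_to_list_of_dicts({"k": [1, 2, 3], "k2": ["a", "b", "c"]}))
# [{'k': 1, 'k2': 'a'}, {'k': 2, 'k2': 'b'}, {'k': 3, 'k2': 'c'}]
```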
evaluate_answer
< source >( question: str answer: str options: list[str] | None = None gold: str | None = None )
Evaluates an answer using either the Transformers library or the OpenAI API.