Evaluator

The evaluator classes for automatic evaluation.

Evaluator classes

The main entry point for using the evaluator:

evaluate.evaluator

( task: str = None ) → Evaluator

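Parameters

  • task (str, defaults to None) -- The name of the task for which an evaluator is built. The task names documented on this page are image-classification, question-answering, text-classification (alias sentiment-analysis) and token-classification.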

Returns

Evaluator

An evaluator suitable for the task.

Utility factory method to build an Evaluator. Evaluators encapsulate a task and a default metric name. They leverage pipeline functionality from transformers to simplify the evaluation of multiple combinations of models, datasets and metrics for a given task.

Examples:

>>> from evaluate import evaluator
>>> # Sentiment analysis evaluator
>>> evaluator("sentiment-analysis")
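The factory behaviour described above can be pictured as a lookup from a task name to an evaluator class. The sketch below is a toy model and not the library's code: ToyEvaluator, REGISTRY, ALIASES and toy_evaluator are hypothetical names, and the default metric names are assumptions chosen for illustration.

```python
class ToyEvaluator:
    """Stand-in for an evaluator: stores a task and a default metric name."""

    def __init__(self, task, default_metric_name=None):
        self.task = task
        self.default_metric_name = default_metric_name


class ToyTextClassificationEvaluator(ToyEvaluator):
    pass


class ToyQuestionAnsweringEvaluator(ToyEvaluator):
    pass


# Hypothetical registry: task name -> (class, assumed default metric).
REGISTRY = {
    "text-classification": (ToyTextClassificationEvaluator, "accuracy"),
    "question-answering": (ToyQuestionAnsweringEvaluator, "squad"),
}
# Aliases resolve to a canonical task name before the lookup.
ALIASES = {"sentiment-analysis": "text-classification"}


def toy_evaluator(task):
    """Build the toy evaluator registered for `task` or one of its aliases."""
    canonical = ALIASES.get(task, task)
    if canonical not in REGISTRY:
        known = sorted(REGISTRY) + sorted(ALIASES)
        raise KeyError(f"Unknown task {task!r}; available: {known}")
    cls, metric_name = REGISTRY[canonical]
    return cls(task=canonical, default_metric_name=metric_name)


sentiment = toy_evaluator("sentiment-analysis")
```

Resolving the alias first means both names yield the same class, which mirrors how "sentiment-analysis" is documented here as an alias of text-classification.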

The base class for all evaluator classes:

class evaluate.Evaluator

( task: str default_metric_name: str = None )

Base class implementing evaluator operations. All evaluators inherit from Evaluator; refer to this class for the methods shared across the task-specific evaluators.

compute_metric

( metric: EvaluationModule metric_inputs: typing.Dict strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 random_state: typing.Optional[int] = None )

Compute and return metrics.
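To make the "bootstrap" strategy concrete, here is a minimal sketch of a percentile bootstrap confidence interval for accuracy, written with the standard library only. It is an illustration under stated assumptions, not the code behind compute_metric: accuracy and bootstrap_interval are hypothetical helpers, and the library's own bootstrap routine may use a different interval method, so the numbers need not match.

```python
import random
from statistics import mean


def accuracy(predictions, references):
    """Fraction of positions where prediction and reference agree."""
    return mean(1.0 if p == r else 0.0 for p, r in zip(predictions, references))


def bootstrap_interval(predictions, references, confidence_level=0.95,
                       n_resamples=9999, random_state=None):
    """Percentile bootstrap interval for accuracy (illustrative helper)."""
    rng = random.Random(random_state)
    n = len(predictions)
    scores = []
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(accuracy([predictions[i] for i in idx],
                               [references[i] for i in idx]))
    scores.sort()
    alpha = (1.0 - confidence_level) / 2.0
    low = scores[int(alpha * (n_resamples - 1))]
    high = scores[int((1.0 - alpha) * (n_resamples - 1))]
    return low, high


predictions = [1, 0, 1, 1, 0, 1, 0, 0]
references = [1, 0, 0, 1, 0, 1, 1, 0]
score = accuracy(predictions, references)
low, high = bootstrap_interval(predictions, references,
                               n_resamples=500, random_state=0)
```

Fixing random_state makes the resampling reproducible, which is why the parameter is useful for debugging.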

predictions_processor

( *args **kwargs )

A core method of the Evaluator class, which processes the pipeline outputs for compatibility with the metric.
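As a rough illustration of what "processing the pipeline outputs" can mean for a classification task, the sketch below turns a list of pipeline-style dictionaries into the keyword arguments a metric expects. The helper name process_predictions and the exact output shape (one dict with "label" and "score" per example) are assumptions made for this example, not a description of the library internals.

```python
def process_predictions(pipeline_outputs, label_mapping=None):
    """Turn classification pipeline outputs into metric keyword arguments."""
    predictions = []
    for output in pipeline_outputs:
        label = output["label"]
        # Map the model's label name to the dataset's label value
        # when a mapping is given; otherwise keep the label as is.
        predictions.append(label_mapping[label] if label_mapping else label)
    return {"predictions": predictions}


outputs = [
    {"label": "LABEL_1", "score": 0.98},
    {"label": "LABEL_0", "score": 0.71},
]
metric_inputs = process_predictions(
    outputs, label_mapping={"LABEL_0": 0, "LABEL_1": 1}
)
```

This is also where a label_mapping argument earns its place: the model's label names and the dataset's label values rarely agree out of the box.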

prepare_data

( data: typing.Union[str, datasets.arrow_dataset.Dataset] input_column: str label_column: str ) → dict

Parameters

  • data (str or Dataset, defaults to None) -- Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
  • input_column (str, defaults to "text") — the name of the column containing the text feature in the dataset specified by data.
  • label_column (str, defaults to "label") — the name of the column containing the labels in the dataset specified by data.

Returns

dict

Metric inputs.

list

Pipeline inputs.

Prepare data.
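The split into metric inputs and pipeline inputs can be sketched with plain Python containers. In this hypothetical version a dict of columns stands in for a Dataset and a loader callback stands in for loading a dataset by name; neither is the library's actual mechanism.

```python
def prepare_data(data, input_column="text", label_column="label", loader=None):
    """Split a column-oriented dataset into metric inputs and pipeline inputs."""
    # A string is treated as a dataset name and resolved through `loader`
    # (a hypothetical callback used only in this sketch).
    if isinstance(data, str):
        if loader is None:
            raise ValueError("a loader is required to resolve a dataset name")
        data = loader(data)
    for column in (input_column, label_column):
        if column not in data:
            raise ValueError(f"column {column!r} not found in the dataset")
    metric_inputs = {"references": list(data[label_column])}
    pipeline_inputs = list(data[input_column])
    return metric_inputs, pipeline_inputs


toy = {"text": ["great film", "dull plot"], "label": [1, 0]}
metric_inputs, pipeline_inputs = prepare_data(toy)
by_name, _ = prepare_data("toy", loader=lambda name: toy)
```

Checking the column names up front gives a clear error before any model is run.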

prepare_metric

( metric: typing.Union[str, evaluate.module.EvaluationModule] )

Parameters

  • metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.

Prepare metric.

prepare_pipeline

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] tokenizer: typing.Union[ForwardRef('PreTrainedTokenizerBase'), ForwardRef('FeatureExtractionMixin')] = None feature_extractor: typing.Union[ForwardRef('PreTrainedTokenizerBase'), ForwardRef('FeatureExtractionMixin')] = None device: int = None )

Parameters

  • model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task. If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
  • preprocessor (PreTrainedTokenizerBase or FeatureExtractionMixin, optional, defaults to None) — Argument can be used to overwrite a default preprocessor if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.

Prepare pipeline.
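The device argument in the signature above follows the convention spelled out in the compute() parameter lists on this page: -1 selects the CPU, a CUDA device ordinal selects that GPU, and None means the device is inferred. A minimal sketch of that rule, assuming a hypothetical resolve_device helper and a cuda_available flag supplied by the caller:

```python
def resolve_device(device=None, cuda_available=False):
    """Return -1 for CPU or a CUDA device ordinal (illustrative helper)."""
    if device is None:
        # Inferred: the first CUDA device if one is available, CPU otherwise.
        return 0 if cuda_available else -1
    if device < -1:
        raise ValueError("device must be -1 (CPU) or a CUDA device ordinal")
    return device


on_gpu_machine = resolve_device(None, cuda_available=True)
on_cpu_machine = resolve_device(None, cuda_available=False)
forced_cpu = resolve_device(-1, cuda_available=True)
```

Passing -1 explicitly is the way to keep evaluation on the CPU even when a GPU is present.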

The task-specific evaluators

ImageClassificationEvaluator

class evaluate.ImageClassificationEvaluator

( task = 'image-classification' default_metric_name = None )

Image classification evaluator. This image classification evaluator can currently be loaded from evaluator() using the default task name image-classification. Methods in this class assume a data format compatible with the ImageClassificationPipeline.

compute

( input_column: str = 'image' *args **kwargs )

Parameters

  • model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case image-classification). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
  • data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
  • metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
  • feature_extractor (str or FeatureExtractionMixin, optional, defaults to None) — Argument can be used to overwrite a default feature extractor if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
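  • strategy (Literal["simple", "bootstrap"], defaults to "simple") — specifies the evaluation strategy. Possible values are:
      • "simple": the metric is evaluated and the scores are returned.
      • "bootstrap": in addition to the metric scores, a confidence interval is computed for each returned metric key, using scipy's bootstrap method.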
  • confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
  • n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
  • device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 uses the CPU; a positive integer runs the model on the associated CUDA device ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
  • random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
  • input_column (str, defaults to "image") — the name of the column containing the images as PIL ImageFile in the dataset specified by data.
  • label_column (str, defaults to "label") — the name of the column containing the labels in the dataset specified by data.
  • label_mapping (Dict[str, Number], optional, defaults to None) — We want to map class labels defined by the model in the pipeline to values consistent with those defined in the label_column of the data dataset.

Compute the metric for a given pipeline and dataset combination.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("image-classification")
>>> data = load_dataset("beans", split="test[:40]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="nateraw/vit-base-beans",
...     data=data,
...     label_column="labels",
...     metric="accuracy",
...     label_mapping={'angular_leaf_spot': 0, 'bean_rust': 1, 'healthy': 2},
...     strategy="bootstrap"
... )

QuestionAnsweringEvaluator

class evaluate.QuestionAnsweringEvaluator

( task = 'question-answering' default_metric_name = None )

Question answering evaluator. This evaluator handles extractive question answering, where the answer to the question is extracted from a context.

This question answering evaluator can currently be loaded from evaluator() using the default task name question-answering.

Methods in this class assume a data format compatible with the QuestionAnsweringPipeline.

compute

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None question_column: str = 'question' context_column: str = 'context' id_column: str = 'id' label_column: str = 'answers' squad_v2_format: typing.Optional[bool] = None )

Parameters

  • model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case question-answering). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
  • data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
  • metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
  • tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
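  • strategy (Literal["simple", "bootstrap"], defaults to "simple") — specifies the evaluation strategy. Possible values are:
      • "simple": the metric is evaluated and the scores are returned.
      • "bootstrap": in addition to the metric scores, a confidence interval is computed for each returned metric key, using scipy's bootstrap method.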
  • confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
  • n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
  • device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 uses the CPU; a positive integer runs the model on the associated CUDA device ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
  • random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
  • question_column (str, defaults to "question") — the name of the column containing the question in the dataset specified by data.
  • context_column (str, defaults to "context") — the name of the column containing the context in the dataset specified by data.
  • id_column (str, defaults to "id") — the name of the column containing the identification field of the question and answer pair in the dataset specified by data.
  • label_column (str, defaults to "answers") — the name of the column containing the answers in the dataset specified by data.
  • squad_v2_format (bool, optional, defaults to None) — whether the dataset follows the format of squad_v2 dataset where a question may have no answer in the context. If this parameter is not provided, the format will be automatically inferred.

Compute the metric for a given pipeline and dataset combination.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="sshleifer/tiny-distilbert-base-cased-distilled-squad",
...     data=data,
...     metric="squad",
... )

Datasets in which the answer may be missing from the context, such as SQuAD v2, are also supported. In this case, it is safer to pass squad_v2_format=True to the compute() call.

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad_v2", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="mrm8488/bert-tiny-finetuned-squadv2",
...     data=data,
...     metric="squad_v2",
...     squad_v2_format=True,
... )
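One plausible reason that passing squad_v2_format=True explicitly is safer: if the format is inferred from the data, a small slice may happen to contain only answerable questions and look like SQuAD v1. The sketch below shows such an inference rule on plain dictionaries. The rule itself and the helper name looks_like_squad_v2 are assumptions for illustration, not a statement of how the library decides.

```python
def looks_like_squad_v2(rows, label_column="answers"):
    """True if at least one row has no answer text (assumed inference rule)."""
    return any(len(row[label_column]["text"]) == 0 for row in rows)


# A SQuAD-style row: every question has at least one answer span.
answerable = {"id": "a1", "answers": {"text": ["Paris"], "answer_start": [12]}}
# A SQuAD v2-style row: the question has no answer in the context.
unanswerable = {"id": "a2", "answers": {"text": [], "answer_start": []}}

full_dataset = [answerable, unanswerable]
small_slice = full_dataset[:1]  # happens to miss the unanswerable row

inferred_full = looks_like_squad_v2(full_dataset)
inferred_slice = looks_like_squad_v2(small_slice)
```

Here the one-row slice is classified as v1-style even though it comes from a v2-style dataset, which is the situation the explicit flag avoids.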

TextClassificationEvaluator

class evaluate.TextClassificationEvaluator

( task = 'text-classification' default_metric_name = None )

Text classification evaluator. This text classification evaluator can currently be loaded from evaluator() using the default task name text-classification or with a "sentiment-analysis" alias. Methods in this class assume a data format compatible with the TextClassificationPipeline - a single textual feature as input and a categorical label as output.

compute

( *args **kwargs )

Parameters

  • model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case text-classification or its alias, sentiment-analysis). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
  • data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
  • metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
  • tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
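  • strategy (Literal["simple", "bootstrap"], defaults to "simple") — specifies the evaluation strategy. Possible values are:
      • "simple": the metric is evaluated and the scores are returned.
      • "bootstrap": in addition to the metric scores, a confidence interval is computed for each returned metric key, using scipy's bootstrap method.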
  • confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
  • n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
  • device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 uses the CPU; a positive integer runs the model on the associated CUDA device ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
  • random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
  • input_column (str, defaults to "text") — the name of the column containing the text feature in the dataset specified by data.
  • label_column (str, defaults to "label") — the name of the column containing the labels in the dataset specified by data.
  • label_mapping (Dict[str, Number], optional, defaults to None) — We want to map class labels defined by the model in the pipeline to values consistent with those defined in the label_column of the data dataset.

Compute the metric for a given pipeline and dataset combination.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("text-classification")
>>> data = load_dataset("imdb", split="test[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli",
...     data=data,
...     metric="accuracy",
...     label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0},
...     strategy="bootstrap",
...     n_resamples=10,
...     random_state=0
... )

TokenClassificationEvaluator

class evaluate.TokenClassificationEvaluator

( task = 'token-classification' default_metric_name = None )

Token classification evaluator.

This token classification evaluator can currently be loaded from evaluator() using the default task name token-classification.

Methods in this class assume a data format compatible with the TokenClassificationPipeline.

compute

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None metric: typing.Union[str, ForwardRef('EvaluationModule')] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: typing.Optional[int] = None random_state: typing.Optional[int] = None input_column: str = 'tokens' label_column: str = 'ner_tags' join_by: typing.Optional[str] = ' ' )

Parameters

  • model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case token-classification). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
  • data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
  • metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
  • tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
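  • strategy (Literal["simple", "bootstrap"], defaults to "simple") — specifies the evaluation strategy. Possible values are:
      • "simple": the metric is evaluated and the scores are returned.
      • "bootstrap": in addition to the metric scores, a confidence interval is computed for each returned metric key, using scipy's bootstrap method.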
  • confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
  • n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
  • device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 uses the CPU; a positive integer runs the model on the associated CUDA device ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
  • random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
  • input_column (str, defaults to "tokens") — the name of the column containing the tokens feature in the dataset specified by data.
  • label_column (str, defaults to "ner_tags") — the name of the column containing the labels in the dataset specified by data.
  • join_by (str, optional, defaults to " ") — This evaluator supports datasets whose input column is a list of words. This parameter specifies how to join words to generate a string input. This is especially useful for languages that do not separate words by a space.

Compute the metric for a given pipeline and dataset combination.

The dataset input and label columns are expected to be formatted as a list of words and a list of labels respectively, following the conll2003 dataset. Datasets whose inputs are single strings and whose labels are lists of offsets are not supported.
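To show what joining a list of words with join_by involves, the sketch below builds the string input and records the character span of each word, which is the kind of bookkeeping needed to relate predictions on the joined string back to the original words. The helper join_words is hypothetical and only illustrates the idea.

```python
def join_words(words, join_by=" "):
    """Join words into one string and record each word's character span."""
    spans = []
    cursor = 0
    for word in words:
        spans.append((cursor, cursor + len(word)))  # end index is exclusive
        cursor += len(word) + len(join_by)
    return join_by.join(words), spans


# Space-separated language: the default separator.
text, spans = join_words(["New", "York", "is", "big"])

# Language written without spaces: join with an empty string.
text_ja, spans_ja = join_words(["東京", "は", "都市"], join_by="")
```

With an empty join_by the spans are contiguous, so every character of the joined string still belongs to exactly one word.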

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("token-classification")
>>> data = load_dataset("conll2003", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="elastic/distilbert-base-uncased-finetuned-conll03-english",
...     data=data,
...     metric="seqeval",
... )

For example, the following dataset format is accepted by the evaluator:

from datasets import ClassLabel, Dataset, Features, Sequence, Value

# Each row holds a list of words and a parallel list of integer tag ids.
dataset = Dataset.from_dict(
    mapping={
        "tokens": [["New", "York", "is", "a", "city", "and", "Felix", "a", "person", "."]],
        "ner_tags": [[1, 2, 0, 0, 0, 0, 3, 0, 0, 0]],
    },
    features=Features({
        "tokens": Sequence(feature=Value(dtype="string")),
        "ner_tags": Sequence(feature=ClassLabel(names=["O", "B-LOC", "I-LOC", "B-PER", "I-PER"])),
    }),
)

By contrast, the following dataset format is not accepted by the evaluator:

from datasets import Dataset, Features, Sequence, Value

# Each row holds one string plus character offsets for the labelled spans.
dataset = Dataset.from_dict(
    mapping={
        "tokens": ["New York is a city and Felix a person."],
        "starts": [[0, 23]],
        "ends": [[7, 27]],
        "ner_tags": [["LOC", "PER"]],
    },
    features=Features({
        "tokens": Value(dtype="string"),
        "starts": Sequence(feature=Value(dtype="int32")),
        "ends": Sequence(feature=Value(dtype="int32")),
        "ner_tags": Sequence(feature=Value(dtype="string")),
    }),
)
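A dataset in the offset-based format can be converted into the word-list format before evaluation. The following is a minimal sketch of one way to do that, under stated assumptions: the text is split on whitespace, the end offsets are treated as inclusive (matching the values 7 and 27 above), and tags follow an IOB scheme. The helper spans_to_word_tags is hypothetical. A word with punctuation attached to it, such as "Felix," would fall outside its span and be tagged "O", so real data may need a finer tokenizer.

```python
import re


def spans_to_word_tags(text, starts, ends, labels):
    """Convert inclusive character spans into whitespace tokens with IOB tags."""
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        word_start, word_end = match.start(), match.end() - 1  # inclusive end
        tag = "O"
        for start, end, label in zip(starts, ends, labels):
            if start <= word_start and word_end <= end:
                # The first word of an entity gets B-, later words get I-.
                tag = ("B-" if word_start == start else "I-") + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


tokens, tags = spans_to_word_tags(
    "New York is a city and Felix a person.",
    starts=[0, 23],
    ends=[7, 27],
    labels=["LOC", "PER"],
)
```

The string tags can then be encoded with a ClassLabel feature to obtain the integer ner_tags column shown in the accepted format.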