Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

Evaluator

The evaluator classes for automatic evaluation.

Evaluator classes

The main entry point for using the evaluator:

evaluate.evaluator

< >

( task: str = None ) → Evaluator

Parameters

  • task (str) — The task defining which evaluator will be returned. Currently accepted tasks are:

Returns

Evaluator

An evaluator suitable for the task.

Utility factory method to build an Evaluator.

Evaluators encapsulate a task and a default metric name. They leverate pipeline functionalify from transformers to simplify the evaluation of multiple combinations of models, datasets and metrics for a given task.

Examples:

>>> from evaluation import evaluator

>>> # Sentiment analysis evaluator
>>> evaluator("sentiment-analysis")

The base class for all evaluator classes:

class evaluate.Evaluator

< >

( task: str default_metric_name: str = None )

The Evaluator class is the class from which all evaluators inherit. Refer to this class for methods shared across different evaluators.

Base class implementing evaluator operations.

compute

< >

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 **compute_parameters: typing.Dict )

A core method of the Evaluator class, computes the metric value for a pipeline and dataset compatible with the task specified by the Evaluator.

The class for text classification evaluation:

class evaluate.TextClassificationEvaluator

< >

( task = 'text-classification' default_metric_name = None )

Text classification evaluator.

This text classification evaluator can currently be loaded from evaluator() using the default task name text-classification or with a "sentiment-analysis" alias.

Methods in this class assume a data format compatible with the TextClassificationPipeline - a single textual feature as input and a categorical label as output.

compute

< >

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 random_state: typing.Optional[int] = None input_column: str = 'text' label_column: str = 'label' label_mapping: typing.Union[typing.Dict[str, numbers.Number], NoneType] = None )

Parameters

  • model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, —
  • defaults to None) — If the argument in not specified, we initialize the default pipeline for the task (in this case text-classification or its alias - sentiment-analysis). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
  • data (str or Dataset, defaults to None) -- Specifies the dataset we will run evaluation on. If it is of type str`, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
  • metric (str or EvaluationModule, defaults to None” — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric. tokenizer — (str or PreTrainedTokenizer, optional, defaults to None): Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument. strategy — (Literal["simple", "bootstrap"], defaults to “simple”): specifies the evaluation strategy. Possible values are:

  • confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
  • n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
  • random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
  • input_column (str, defaults to "text") — the name of the column containing the text feature in the dataset specified by data.
  • label_column (str, defaults to "label") — the name of the column containing the labels in the dataset specified by data.
  • label_mapping (Dict[str, Number], optional, defaults to None) — We want to map class labels defined by the model in the pipeline to values consistent with those defined in the label_column of the data dataset.

Compute the metric for a given pipeline and dataset combination.

Examples:

>>> from evaluation import evaluator
>>> from datasets import Dataset, load_dataset

>>> e = evaluator("text-classification")
>>> data =  Dataset.from_dict(load_dataset("imdb")["test"][:2])

>>> results = e.compute(
>>>     model_or_pipeline="huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli",
>>>     data=data,
>>>     metric="accuracy",
>>>     input_column="text",
>>>     label_column="label",
>>>     label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0},
>>>     strategy="bootstrap",
>>>     n_resamples=10,
>>>     random_state=0
>>> )