Evaluator
The evaluator classes for automatic evaluation.
Evaluator classes
The main entry point for using the evaluator:
evaluate.evaluator
< source >( task: str = None ) → Evaluator
Parameters
- task (`str`) — The task defining which evaluator will be returned. Currently accepted tasks are:
  - `"text-classification"` (alias `"sentiment-analysis"` available): will return a TextClassificationEvaluator.
Returns
An evaluator suitable for the task.
Utility factory method to build an Evaluator.
Evaluators encapsulate a task and a default metric name. They leverage pipeline functionality from transformers to simplify the evaluation of multiple combinations of models, datasets, and metrics for a given task.
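For example, a minimal sketch of using the factory (assuming transformers and its pipeline dependencies are installed):

>>> from evaluate import evaluator
>>> task_evaluator = evaluator("text-classification")
>>> type(task_evaluator).__name__
'TextClassificationEvaluator'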
The base class for all evaluator classes:
The Evaluator class is the class from which all evaluators inherit. Refer to this class for methods shared across different evaluators.
class evaluate.Evaluator
Base class implementing evaluator operations.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 **compute_parameters: typing.Dict )
A core method of the Evaluator class, computes the metric value for a pipeline and dataset compatible with the task specified by the Evaluator.
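As a hedged sketch of this shared interface, compute() also accepts a pre-initialized transformers pipeline and a pre-loaded Dataset instead of names; the checkpoint and its "NEGATIVE"/"POSITIVE" label names below are illustrative assumptions, and label_mapping is a task-specific argument documented under TextClassificationEvaluator below:

>>> from datasets import load_dataset
>>> from transformers import pipeline
>>> from evaluate import evaluator
>>> pipe = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")  # assumed checkpoint
>>> data = load_dataset("imdb", split="test").shuffle(seed=42).select(range(100))  # small pre-loaded subset
>>> task_evaluator = evaluator("text-classification")
>>> results = task_evaluator.compute(
...     model_or_pipeline=pipe,  # pre-initialized pipeline, so tokenizer is ignored
...     data=data,               # pre-loaded dataset
...     metric="accuracy",
...     label_mapping={"NEGATIVE": 0, "POSITIVE": 1},  # assumed pipeline label names mapped to dataset label ids
... )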
The class for text classification evaluation:
class evaluate.TextClassificationEvaluator
< source >( task = 'text-classification' default_metric_name = None )
Text classification evaluator.
This text classification evaluator can currently be loaded from evaluator() using the default task name `text-classification` or with a `"sentiment-analysis"` alias.
Methods in this class assume a data format compatible with the TextClassificationPipeline: a single textual feature as input and a categorical label as output.
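For instance, a minimal dataset in this format (a sketch using the default column names `"text"` and `"label"`, which can be overridden via `input_column` and `label_column`) could be built as:

>>> from datasets import Dataset
>>> data = Dataset.from_dict({
...     "text": ["I loved this movie.", "Utterly boring and far too long."],
...     "label": [1, 0],
... })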
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 random_state: typing.Optional[int] = None input_column: str = 'text' label_column: str = 'label' label_mapping: typing.Union[typing.Dict[str, numbers.Number], NoneType] = None )
Parameters
- model_or_pipeline (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) — If the argument is not specified, we initialize the default pipeline for the task (in this case `text-classification` or its alias `sentiment-analysis`). If the argument is of the type `str` or is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (`str` or `Dataset`, defaults to `None`) — Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- metric (`str` or `EvaluationModule`, defaults to `None`) — Specifies the metric we use in the evaluator. If it is of type `str`, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (`str` or `PreTrainedTokenizer`, optional, defaults to `None`) — Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore this argument.
- strategy (`Literal["simple", "bootstrap"]`, defaults to `"simple"`) — Specifies the evaluation strategy. Possible values are:
  - `"simple"` - we evaluate the metric and return the scores.
  - `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (`float`, defaults to `0.95`) — The `confidence_level` value passed to `bootstrap` if the `"bootstrap"` strategy is chosen.
- n_resamples (`int`, defaults to `9999`) — The `n_resamples` value passed to `bootstrap` if the `"bootstrap"` strategy is chosen.
- random_state (`int`, optional, defaults to `None`) — The `random_state` value passed to `bootstrap` if the `"bootstrap"` strategy is chosen. Useful for debugging.
- input_column (`str`, defaults to `"text"`) — The name of the column containing the text feature in the dataset specified by `data`.
- label_column (`str`, defaults to `"label"`) — The name of the column containing the labels in the dataset specified by `data`.
- label_mapping (`Dict[str, Number]`, optional, defaults to `None`) — Used to map the class labels defined by the model in the pipeline to values consistent with those defined in the `label_column` of the `data` dataset.
Compute the metric for a given pipeline and dataset combination.
Examples:
>>> from evaluate import evaluator
>>> from datasets import Dataset, load_dataset
>>> e = evaluator("text-classification")
>>> data = Dataset.from_dict(load_dataset("imdb")["test"][:2])
>>> results = e.compute(
...     model_or_pipeline="huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli",
...     data=data,
...     metric="accuracy",
...     input_column="text",
...     label_column="label",
...     label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0},
...     strategy="bootstrap",
...     n_resamples=10,
...     random_state=0
... )
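With `strategy="bootstrap"`, the returned dictionary typically augments each metric key with bootstrap statistics such as a confidence interval at the requested `confidence_level`, computed via scipy's bootstrap; with `"simple"`, only the raw metric scores are returned.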