Evaluator

The evaluator classes for automatic evaluation.

Evaluator classes

The main entry point for using the evaluator:

evaluate.evaluator

( task: str = None ) → Evaluator

Parameters

task (str) — The task defining which evaluator will be returned. Currently accepted tasks are:
- "image-classification": will return an ImageClassificationEvaluator.
- "question-answering": will return a QuestionAnsweringEvaluator.
- "text-classification" (alias "sentiment-analysis" available): will return a TextClassificationEvaluator.
- "token-classification": will return a TokenClassificationEvaluator.
- "text-generation": will return a TextGenerationEvaluator.
- "text2text-generation": will return a Text2TextGenerationEvaluator.
- "summarization": will return a SummarizationEvaluator.

Returns

Evaluator

An evaluator suitable for the task.

Utility factory method to build an Evaluator. Evaluators encapsulate a task and a default metric name. They leverage pipeline functionality from transformers to simplify the evaluation of multiple combinations of models, datasets and metrics for a given task.

Examples:

>>> from evaluate import evaluator
>>> # Sentiment analysis evaluator
>>> evaluator("sentiment-analysis")
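The factory behaviour described above (a task name, possibly an alias, selects an evaluator class) can be pictured with a small registry. This is only a sketch and not the library's implementation: the class names ending in Sketch, TASK_REGISTRY, build_evaluator and the default metric names are all hypothetical.

```python
# Hypothetical stand-ins for the real evaluator classes.
class TextClassificationSketch:
    def __init__(self, task="text-classification", default_metric_name=None):
        self.task = task
        # "accuracy" is an assumed default, chosen only for illustration.
        self.default_metric_name = default_metric_name or "accuracy"


class QuestionAnsweringSketch:
    def __init__(self, task="question-answering", default_metric_name=None):
        self.task = task
        # "squad" is an assumed default, chosen only for illustration.
        self.default_metric_name = default_metric_name or "squad"


# Task name -> evaluator class; the alias points at the same class.
TASK_REGISTRY = {
    "text-classification": TextClassificationSketch,
    "sentiment-analysis": TextClassificationSketch,
    "question-answering": QuestionAnsweringSketch,
}


def build_evaluator(task):
    """Return an evaluator instance for `task`, or raise on unknown names."""
    try:
        evaluator_cls = TASK_REGISTRY[task]
    except KeyError:
        known = ", ".join(sorted(TASK_REGISTRY))
        raise KeyError(f"Unknown task {task!r}. Available tasks: {known}") from None
    return evaluator_cls()


sentiment = build_evaluator("sentiment-analysis")
print(type(sentiment).__name__, sentiment.default_metric_name)
```

Keeping the alias as a second key in the same mapping means both names resolve to the same class without any special casing.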

The base class for all evaluator classes:

class evaluate.Evaluator

( task: str default_metric_name: str = None )

Base class implementing evaluator operations. All evaluators inherit from it; refer to this class for the methods shared across evaluators.

check_required_columns

( data: typing.Union[str, datasets.arrow_dataset.Dataset] columns_names: typing.Dict[str, str] )

Parameters

data (str or Dataset) — Specifies the dataset we will run evaluation on.
columns_names (Dict[str, str]) — Mapping of the column names to check in the dataset. The keys are the arguments to the evaluate.EvaluationModule.compute() method, while the values are the column names to check.

Ensure the columns required for the evaluation are present in the dataset.

Example:

>>> from datasets import load_dataset
>>> from evaluate import evaluator
>>> data = load_dataset("rotten_tomatoes", split="train")
>>> evaluator("text-classification").check_required_columns(data, {"input_column": "text", "label_column": "label"})
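The check itself amounts to comparing the requested column names against the columns the dataset exposes. A minimal sketch, assuming the dataset's columns are available as a plain list; check_required_columns_sketch and the error wording are hypothetical, not the library's code.

```python
def check_required_columns_sketch(available_columns, columns_names):
    """Raise ValueError if a required column is missing.

    available_columns: the column names present in the dataset.
    columns_names: maps an argument name (e.g. "input_column") to the
    dataset column it should point at.
    """
    for argument_name, column in columns_names.items():
        if column not in available_columns:
            raise ValueError(
                f"Invalid `{argument_name}` {column!r}; "
                f"available columns: {sorted(available_columns)}"
            )


dataset_columns = ["text", "label"]
# Passes silently because both columns exist.
check_required_columns_sketch(
    dataset_columns, {"input_column": "text", "label_column": "label"}
)
```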

compute_metric

( metric: EvaluationModule metric_inputs: typing.Dict strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 random_state: typing.Optional[int] = None )

Compute and return metrics.
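To make the difference between the two strategies concrete, the sketch below computes a score once ("simple") and then resamples the examples with replacement to get a confidence interval ("bootstrap"). It uses a plain percentile bootstrap from the standard library rather than scipy's implementation, and the helper names accuracy and bootstrap_interval are hypothetical.

```python
import random


def accuracy(predictions, references):
    """Fraction of positions where prediction and reference agree."""
    return sum(p == r for p, r in zip(predictions, references)) / len(references)


def bootstrap_interval(predictions, references, confidence_level=0.95,
                       n_resamples=1000, random_state=None):
    """Percentile bootstrap: resample example indices with replacement."""
    rng = random.Random(random_state)
    n = len(references)
    scores = []
    for _ in range(n_resamples):
        indices = [rng.randrange(n) for _ in range(n)]
        scores.append(accuracy([predictions[i] for i in indices],
                               [references[i] for i in indices]))
    scores.sort()
    alpha = (1.0 - confidence_level) / 2.0
    low = scores[int(alpha * (n_resamples - 1))]
    high = scores[int((1.0 - alpha) * (n_resamples - 1))]
    return low, high


predictions = [1, 0, 1, 1, 0, 1, 0, 0]
references = [1, 0, 0, 1, 0, 1, 1, 0]
score = accuracy(predictions, references)  # the "simple" strategy
low, high = bootstrap_interval(predictions, references, random_state=0)
print({"accuracy": score, "confidence_interval": (low, high)})
```

Fixing random_state makes the interval reproducible, which is why the parameter is described as useful for debugging.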

get_dataset_split

( data subset = None split = None ) → split

Parameters

data (str) — Name of dataset.
subset (str) — Name of config for datasets with multiple configurations (e.g. ‘glue/cola’).
split (str, defaults to None) — Split to use.

Returns

split

The name of the split to use, as a str.

Infers which split to use if None is given.

Example:

>>> from evaluate import evaluator
>>> evaluator("text-classification").get_dataset_split(data="rotten_tomatoes")
WARNING:evaluate.evaluator.base:Dataset split not defined! Automatically evaluating with split: TEST
'test'
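One way to picture the inference is a fixed preference order over the splits a dataset offers. The order used below is an assumption; the warning in the example above only shows that "test" was chosen for rotten_tomatoes. PREFERRED_SPLITS and infer_split are hypothetical names.

```python
import warnings

# Assumed preference order, not confirmed by the text above.
PREFERRED_SPLITS = ("test", "validation", "train")


def infer_split(available_splits, split=None):
    """Return `split` if given, else the first preferred split that exists."""
    if split is not None:
        return split
    for candidate in PREFERRED_SPLITS:
        if candidate in available_splits:
            warnings.warn(f"Dataset split not defined! Using split: {candidate}")
            return candidate
    raise ValueError(f"No usable split among {list(available_splits)}")


print(infer_split(["train", "validation", "test"]))
print(infer_split(["train", "validation"]))
print(infer_split(["train"], split="train[:10]"))
```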

load_data

( data: typing.Union[str, datasets.arrow_dataset.Dataset] subset: str = None split: str = None ) → data (Dataset)

Parameters

data (Dataset or str, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (str, defaults to None) — Specifies dataset subset to be passed to name in load_dataset. To be used with datasets with several configurations (e.g. glue/sst2).
split (str, defaults to None) — User-defined dataset split by name (e.g. train, validation, test). Supports slice-split (test[:n]). If not defined and data is a str type, will automatically select the best one via choose_split().

Returns

data (Dataset)

Loaded dataset which will be used for evaluation.

Load dataset with given subset and split.

Example:

>>> from evaluate import evaluator
>>> evaluator("text-classification").load_data(data="rotten_tomatoes", split="train")
Dataset({
    features: ['text', 'label'],
    num_rows: 8530
})

predictions_processor

( *args **kwargs )

A core method of the Evaluator class, which processes the pipeline outputs for compatibility with the metric.
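For a classification task, "processing the pipeline outputs" could look like the sketch below: each output dictionary carries a "label" string, which is mapped to the numeric value the metric expects. The function name process_predictions and the sample outputs are hypothetical; each evaluator defines its own processing.

```python
def process_predictions(pipeline_outputs, label_mapping=None):
    """Turn classification-style pipeline outputs into metric inputs."""
    labels = [output["label"] for output in pipeline_outputs]
    if label_mapping is not None:
        # Map label strings such as "LABEL_1" to the dataset's numeric labels.
        labels = [label_mapping[label] for label in labels]
    return {"predictions": labels}


outputs = [
    {"label": "LABEL_1", "score": 0.98},
    {"label": "LABEL_0", "score": 0.71},
]
print(process_predictions(outputs, {"LABEL_0": 0.0, "LABEL_1": 1.0}))
# prints {'predictions': [1.0, 0.0]}
```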

prepare_data

( data: Dataset input_column: str label_column: str *args **kwargs ) → dict

Parameters

data (Dataset) — Specifies the dataset we will run evaluation on.
input_column (str, defaults to "text") — The name of the column containing the text feature in the dataset specified by data.
second_input_column (str, optional) — The name of the column containing the second text feature if there is one. Otherwise, set to None.
label_column (str, defaults to "label") — The name of the column containing the labels in the dataset specified by data.

Returns

dict

The metric inputs. A list containing the pipeline inputs is returned together with it.

Prepare data.

Example:

>>> from evaluate import evaluator
>>> from datasets import load_dataset

>>> ds = load_dataset("rotten_tomatoes", split="train")
>>> evaluator("text-classification").prepare_data(ds, input_column="text", second_input_column=None, label_column="label")
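The split into metric inputs and pipeline inputs can be sketched with a plain dictionary of columns standing in for the dataset. The function name prepare_data_sketch and the "text" / "text_pair" layout used for sentence pairs are assumptions made for illustration.

```python
def prepare_data_sketch(rows, input_column="text", second_input_column=None,
                        label_column="label"):
    """Split `rows` (a dict of column name -> list) into two parts."""
    # The labels go to the metric as references.
    metric_inputs = {"references": rows[label_column]}
    if second_input_column is None:
        pipeline_inputs = list(rows[input_column])
    else:
        # Sentence-pair tasks: pair up the two text columns row by row.
        pipeline_inputs = [
            {"text": first, "text_pair": second}
            for first, second in zip(rows[input_column], rows[second_input_column])
        ]
    return metric_inputs, pipeline_inputs


rows = {"text": ["great film", "dull plot"], "label": [1, 0]}
metric_inputs, pipeline_inputs = prepare_data_sketch(rows)
print(metric_inputs)
print(pipeline_inputs)
```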

prepare_metric

( metric: typing.Union[str, evaluate.module.EvaluationModule] )

Parameters

metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.

Prepare metric.

Example:

>>> from evaluate import evaluator
>>> evaluator("text-classification").prepare_metric("accuracy")

prepare_pipeline

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] tokenizer: typing.Union[ForwardRef('PreTrainedTokenizerBase'), ForwardRef('FeatureExtractionMixin')] = None feature_extractor: typing.Union[ForwardRef('PreTrainedTokenizerBase'), ForwardRef('FeatureExtractionMixin')] = None device: int = None )

Parameters

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task. If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
preprocessor (PreTrainedTokenizerBase or FeatureExtractionMixin, optional, defaults to None) — Argument can be used to overwrite a default preprocessor if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.

Prepare pipeline.

Example:

>>> from evaluate import evaluator
>>> evaluator("text-classification").prepare_pipeline(model_or_pipeline="distilbert-base-uncased")
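The cases described for model_or_pipeline boil down to a type dispatch. The sketch below covers None, a model name and an already-callable pipeline; model instances are left out to keep it dependency-free. resolve_pipeline, build_pipeline, fake_build and the "default-model" name are all hypothetical.

```python
def resolve_pipeline(model_or_pipeline, build_pipeline, default_model="default-model"):
    """Pick a callable pipeline following the documented cases.

    build_pipeline is a hypothetical factory: model name -> callable.
    """
    if model_or_pipeline is None:
        # Nothing specified: fall back to the task's default pipeline.
        return build_pipeline(default_model)
    if isinstance(model_or_pipeline, str):
        # A model name: build a new pipeline around it.
        return build_pipeline(model_or_pipeline)
    if callable(model_or_pipeline):
        # Already a pipeline (or any callable): use it unchanged.
        return model_or_pipeline
    raise TypeError(f"Unsupported type: {type(model_or_pipeline).__name__}")


def fake_build(model_name):
    # Stand-in pipeline: tags every input with the model name.
    return lambda texts: [{"label": model_name} for _ in texts]


pipe = resolve_pipeline("my-model", fake_build)
print(pipe(["a", "b"]))
```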

The task specific evaluators

ImageClassificationEvaluator

class evaluate.ImageClassificationEvaluator

( task = 'image-classification' default_metric_name = None )

Image classification evaluator. This image classification evaluator can currently be loaded from evaluator() using the default task name image-classification. Methods in this class assume a data format compatible with the ImageClassificationPipeline.

compute

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: typing.Optional[str] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None feature_extractor: typing.Union[str, ForwardRef('FeatureExtractionMixin'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None input_column: str = 'image' label_column: str = 'label' label_mapping: typing.Optional[typing.Dict[str, numbers.Number]] = None )

Parameters

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case image-classification). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (str, defaults to None) — Defines which dataset subset to load. If None is passed the default subset is loaded.
split (str, defaults to None) — Defines which dataset split to load. If None is passed, infers based on the choose_split function.
metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
strategy (Literal["simple", "bootstrap"], defaults to “simple”) — specifies the evaluation strategy. Possible values are:
- "simple" - we evaluate the metric and return the scores.
- "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy’s bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.

Compute the metric for a given pipeline and dataset combination.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("image-classification")
>>> data = load_dataset("beans", split="test[:40]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="nateraw/vit-base-beans",
...     data=data,
...     label_column="labels",
...     metric="accuracy",
...     label_mapping={'angular_leaf_spot': 0, 'bean_rust': 1, 'healthy': 2},
...     strategy="bootstrap"
... )
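The rules given for the device parameter can be restated as a small function. This is a sketch of the documented behaviour only: resolve_device is a hypothetical helper, and whether CUDA is available is passed in as a flag instead of being detected.

```python
def resolve_device(device=None, cuda_available=False):
    """Return "cpu" or "cuda:<id>" following the documented rules."""
    if device is None:
        # Inferred: CUDA:0 if available, CPU otherwise.
        return "cuda:0" if cuda_available else "cpu"
    if device == -1:
        return "cpu"
    if device >= 0:
        return f"cuda:{device}"
    raise ValueError(f"Invalid device ordinal: {device}")


print(resolve_device(None, cuda_available=True))
print(resolve_device(-1, cuda_available=True))
print(resolve_device(1, cuda_available=True))
```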

QuestionAnsweringEvaluator

class evaluate.QuestionAnsweringEvaluator

( task = 'question-answering' default_metric_name = None )

Question answering evaluator. This evaluator handles extractive question answering, where the answer to the question is extracted from a context.

This question answering evaluator can currently be loaded from evaluator() using the default task name question-answering.

Methods in this class assume a data format compatible with the QuestionAnsweringPipeline.

compute

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: typing.Optional[str] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None question_column: str = 'question' context_column: str = 'context' id_column: str = 'id' label_column: str = 'answers' squad_v2_format: typing.Optional[bool] = None )

Parameters

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case question-answering). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (str, defaults to None) — Defines which dataset subset to load. If None is passed the default subset is loaded.
split (str, defaults to None) — Defines which dataset split to load. If None is passed, infers based on the choose_split function.
metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
strategy (Literal["simple", "bootstrap"], defaults to “simple”) — specifies the evaluation strategy. Possible values are:
- "simple" - we evaluate the metric and return the scores.
- "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy’s bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.

Compute the metric for a given pipeline and dataset combination.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="sshleifer/tiny-distilbert-base-cased-distilled-squad",
...     data=data,
...     metric="squad",
... )

Datasets where the answer may be missing from the context, such as the SQuAD v2 dataset, are also supported. In this case, it is safer to pass squad_v2_format=True to the compute() call.

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad_v2", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="mrm8488/bert-tiny-finetuned-squadv2",
...     data=data,
...     metric="squad_v2",
...     squad_v2_format=True,
... )
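What distinguishes SQuAD v2-style data is that some examples have an empty answer list. The sketch below checks an answers column for such examples as one way to decide whether squad_v2_format=True applies. It is not the library's own inference logic; has_unanswerable_questions and the sample records are hypothetical.

```python
def has_unanswerable_questions(answers_column):
    """True if any example has no answer text (SQuAD v2-style data)."""
    return any(len(answers["text"]) == 0 for answers in answers_column)


squad_v1_answers = [{"text": ["Denver Broncos"], "answer_start": [177]}]
squad_v2_answers = [
    {"text": ["France"], "answer_start": [159]},
    {"text": [], "answer_start": []},  # question with no answer in the context
]

print(has_unanswerable_questions(squad_v1_answers))
print(has_unanswerable_questions(squad_v2_answers))
```

A check like this can miss unanswerable questions when only a small slice of the data is inspected, which is one reason passing the flag explicitly is the safer option.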

TextClassificationEvaluator

class evaluate.TextClassificationEvaluator

( task = 'text-classification' default_metric_name = None )

Text classification evaluator. This text classification evaluator can currently be loaded from evaluator() using the default task name text-classification or with a "sentiment-analysis" alias. Methods in this class assume a data format compatible with the TextClassificationPipeline - a single textual feature as input and a categorical label as output.

compute

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: typing.Optional[str] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None feature_extractor: typing.Union[str, ForwardRef('FeatureExtractionMixin'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None input_column: str = 'text' second_input_column: typing.Optional[str] = None label_column: str = 'label' label_mapping: typing.Optional[typing.Dict[str, numbers.Number]] = None )

Parameters

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case text-classification or its alias - sentiment-analysis). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (str, defaults to None) — Defines which dataset subset to load. If None is passed the default subset is loaded.
split (str, defaults to None) — Defines which dataset split to load. If None is passed, infers based on the choose_split function.
metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
strategy (Literal["simple", "bootstrap"], defaults to “simple”) — specifies the evaluation strategy. Possible values are:
- "simple" - we evaluate the metric and return the scores.
- "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy’s bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.

Compute the metric for a given pipeline and dataset combination.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("text-classification")
>>> data = load_dataset("imdb", split="test[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli",
...     data=data,
...     metric="accuracy",
...     label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0},
...     strategy="bootstrap",
...     n_resamples=10,
...     random_state=0
... )

TokenClassificationEvaluator

class evaluate.TokenClassificationEvaluator

( task = 'token-classification' default_metric_name = None )

Token classification evaluator.

This token classification evaluator can currently be loaded from evaluator() using the default task name token-classification.

Methods in this class assume a data format compatible with the TokenClassificationPipeline.

compute

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: str = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: typing.Optional[int] = None random_state: typing.Optional[int] = None input_column: str = 'tokens' label_column: str = 'ner_tags' join_by: typing.Optional[str] = ' ' )

Parameters

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case token-classification). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (str, defaults to None) — Defines which dataset subset to load. If None is passed the default subset is loaded.
split (str, defaults to None) — Defines which dataset split to load. If None is passed, infers based on the choose_split function.
metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
strategy (Literal["simple", "bootstrap"], defaults to “simple”) — specifies the evaluation strategy. Possible values are:
- "simple" - we evaluate the metric and return the scores.
- "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy’s bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.

Compute the metric for a given pipeline and dataset combination.

The dataset input and label columns are expected to be formatted as a list of words and a list of labels respectively, following the conll2003 dataset. Datasets whose inputs are single strings and whose labels are lists of offsets are not supported.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("token-classification")
>>> data = load_dataset("conll2003", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="elastic/distilbert-base-uncased-finetuned-conll03-english",
...     data=data,
...     metric="seqeval",
... )

For example, the following dataset format is accepted by the evaluator:

from datasets import ClassLabel, Dataset, Features, Sequence, Value

# One example: a list of words and one integer tag per word.
dataset = Dataset.from_dict(
    mapping={
        "tokens": [["New", "York", "is", "a", "city", "and", "Felix", "a", "person", "."]],
        "ner_tags": [[1, 2, 0, 0, 0, 0, 3, 0, 0, 0]],
    },
    features=Features({
        "tokens": Sequence(feature=Value(dtype="string")),
        "ner_tags": Sequence(feature=ClassLabel(names=["O", "B-LOC", "I-LOC", "B-PER", "I-PER"])),
    }),
)

For example, the following dataset format is not accepted by the evaluator:

from datasets import Dataset, Features, Sequence, Value

# One example: the input is a single string and the labels are character offsets.
dataset = Dataset.from_dict(
    mapping={
        "tokens": ["New York is a city and Felix a person."],
        "starts": [[0, 23]],
        "ends": [[7, 27]],
        "ner_tags": [["LOC", "PER"]],
    },
    features=Features({
        "tokens": Value(dtype="string"),
        "starts": Sequence(feature=Value(dtype="int32")),
        "ends": Sequence(feature=Value(dtype="int32")),
        "ner_tags": Sequence(feature=Value(dtype="string")),
    }),
)
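With the word-list format, the text fed to the pipeline can be rebuilt by joining the words, and the integer tags can be turned back into label names for the metric. A minimal sketch, assuming join_by is used as the separator between words; to_pipeline_input and to_label_names are hypothetical helpers.

```python
def to_pipeline_input(tokens, join_by=" "):
    """Join a list of words into the single string a pipeline consumes."""
    return join_by.join(tokens)


def to_label_names(tag_ids, names):
    """Map integer class ids to their label names."""
    return [names[tag_id] for tag_id in tag_ids]


tokens = ["New", "York", "is", "a", "city", "and", "Felix", "a", "person", "."]
ner_tags = [1, 2, 0, 0, 0, 0, 3, 0, 0, 0]
names = ["O", "B-LOC", "I-LOC", "B-PER", "I-PER"]

print(to_pipeline_input(tokens))
print(to_label_names(ner_tags, names))
```

Because every word keeps its own tag, predictions can be aligned with references word by word, which the offset-based format does not allow directly.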

TextGenerationEvaluator

class evaluate.TextGenerationEvaluator

( task = 'text-generation' default_metric_name = None predictions_prefix: str = 'generated' )

Text generation evaluator. This text generation evaluator can currently be loaded from evaluator() using the default task name text-generation. Methods in this class assume a data format compatible with the TextGenerationPipeline.

compute

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: typing.Optional[str] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None feature_extractor: typing.Union[str, ForwardRef('FeatureExtractionMixin'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None input_column: str = 'text' label_column: str = 'label' label_mapping: typing.Optional[typing.Dict[str, numbers.Number]] = None )

Text2TextGenerationEvaluator

class evaluate.Text2TextGenerationEvaluator

( task = 'text2text-generation' default_metric_name = None )

Text2Text generation evaluator. This Text2Text generation evaluator can currently be loaded from evaluator() using the default task name text2text-generation. Methods in this class assume a data format compatible with the Text2TextGenerationPipeline.

compute

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: typing.Optional[str] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None input_column: str = 'text' label_column: str = 'label' generation_kwargs: dict = None )

Parameters

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case text2text-generation). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (str, defaults to None) — Defines which dataset subset to load. If None is passed the default subset is loaded.
split (str, defaults to None) — Defines which dataset split to load. If None is passed, infers based on the choose_split function.
metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
strategy (Literal["simple", "bootstrap"], defaults to “simple”) — specifies the evaluation strategy. Possible values are:
- "simple" - we evaluate the metric and return the scores.
- "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy’s bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
input_column (str, defaults to "text") — the name of the column containing the input text in the dataset specified by data.
label_column (str, defaults to "label") — the name of the column containing the labels in the dataset specified by data.
generation_kwargs (Dict, optional, defaults to None) — The generation kwargs are passed to the pipeline and set the text generation strategy.

Compute the metric for a given pipeline and dataset combination.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("text2text-generation")
>>> data = load_dataset("cnn_dailymail", "3.0.0", split="validation[:40]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="facebook/bart-large-cnn",
...     data=data,
...     input_column="article",
...     label_column="highlights",
...     metric="rouge",
... )
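The generation_kwargs parameter is described as being passed through to the pipeline. The sketch below shows that forwarding with a stand-in callable in place of a real model: run_generation and fake_summarizer are hypothetical, and max_new_tokens is used only as an example of a generation keyword argument.

```python
def run_generation(pipe, inputs, generation_kwargs=None):
    """Call `pipe` on `inputs`, forwarding any generation keyword arguments."""
    return pipe(inputs, **(generation_kwargs or {}))


def fake_summarizer(texts, max_new_tokens=5):
    # Stand-in "model": keep the first max_new_tokens words of each text.
    return [{"generated_text": " ".join(text.split()[:max_new_tokens])} for text in texts]


articles = ["one two three four five six seven"]
print(run_generation(fake_summarizer, articles))
print(run_generation(fake_summarizer, articles, {"max_new_tokens": 3}))
```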

SummarizationEvaluator

class evaluate.SummarizationEvaluator

( task = 'summarization' default_metric_name = None )

Text summarization evaluator. This text summarization evaluator can currently be loaded from evaluator() using the default task name summarization. Methods in this class assume a data format compatible with the SummarizationPipeline.

compute

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: typing.Optional[str] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None input_column: str = 'text' label_column: str = 'label' generation_kwargs: dict = None )

Parameters

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case summarization). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (str, defaults to None) — Defines which dataset subset to load. If None is passed the default subset is loaded.
split (str, defaults to None) — Defines which dataset split to load. If None is passed, infers based on the choose_split function.
metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
strategy (Literal["simple", "bootstrap"], defaults to “simple”) — specifies the evaluation strategy. Possible values are:
- "simple" - we evaluate the metric and return the scores.
- "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy’s bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
input_column (str, defaults to "text") — the name of the column containing the input text in the dataset specified by data.
label_column (str, defaults to "label") — the name of the column containing the labels in the dataset specified by data.
generation_kwargs (Dict, optional, defaults to None) — The generation kwargs are passed to the pipeline and set the text generation strategy.

Compute the metric for a given pipeline and dataset combination.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("summarization")
>>> data = load_dataset("cnn_dailymail", "3.0.0", split="validation[:40]")
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="facebook/bart-large-cnn",
>>>     data=data,
>>>     input_column="article",
>>>     label_column="highlights",
>>> )
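
To give an intuition for what strategy="bootstrap" adds on top of the plain score, here is a minimal sketch of a percentile bootstrap confidence interval in pure Python. It is not the evaluator's implementation: the evaluator delegates to scipy's bootstrap function, which may use a different interval method, and `bootstrap_ci` and the sample scores below are made up for illustration.

```python
import random
from statistics import mean
from typing import List, Optional, Tuple


def bootstrap_ci(
    scores: List[float],
    confidence_level: float = 0.95,
    n_resamples: int = 9999,
    random_state: Optional[int] = None,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(random_state)  # a fixed seed gives reproducible intervals
    n = len(scores)
    # Resample with replacement and record the statistic of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    alpha = (1.0 - confidence_level) / 2.0
    low = means[int(alpha * n_resamples)]
    high = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return low, high


# Made-up per-example scores, standing in for per-summary metric values.
scores = [0.31, 0.42, 0.38, 0.27, 0.45, 0.36]
low, high = bootstrap_ci(scores, n_resamples=999, random_state=0)
print(f"mean={mean(scores):.3f}, CI=({low:.3f}, {high:.3f})")
```

Passing the same random_state twice yields the same interval, which is why the parameter is described as useful for debugging.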

TranslationEvaluator

class evaluate.TranslationEvaluator

< source >

( task = 'translation' default_metric_name = None )

Translation evaluator. This translation evaluator can currently be loaded from evaluator() using the default task name translation. Methods in this class assume a data format compatible with the TranslationPipeline.

compute

< source >

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: typing.Optional[str] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None input_column: str = 'text' label_column: str = 'label' generation_kwargs: dict = None )

Parameters

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case translation). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (str, defaults to None) — Defines which dataset subset to load. If None is passed the default subset is loaded.
split (str, defaults to None) — Defines which dataset split to load. If None is passed, infers based on the choose_split function.
metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
strategy (Literal["simple", "bootstrap"], defaults to “simple”) — specifies the evaluation strategy. Possible values are:
- "simple" - we evaluate the metric and return the scores.
- "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy’s bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
input_column (str, defaults to "text") — the name of the column containing the input text in the dataset specified by data.
label_column (str, defaults to "label") — the name of the column containing the labels in the dataset specified by data.
generation_kwargs (Dict, optional, defaults to None) — The generation kwargs are passed to the pipeline and set the text generation strategy.

Compute the metric for a given pipeline and dataset combination.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("translation")
>>> data = load_dataset("wmt19", "fr-de", split="validation[:40]")
>>> data = data.map(lambda x: {"text": x["translation"]["de"], "label": x["translation"]["fr"]})
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="Helsinki-NLP/opus-mt-de-fr",
>>>     data=data,
>>> )
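
The data.map call above flattens the nested translation field into the default text and label columns. Its effect on a single record can be sketched without downloading the dataset. The record below is invented, and `to_text_and_label` is a hypothetical name for the lambda.

```python
from typing import Dict


def to_text_and_label(example: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Same body as the lambda passed to data.map in the example above."""
    return {
        "text": example["translation"]["de"],   # source sentence fed to the model
        "label": example["translation"]["fr"],  # reference translation
    }


# Invented record in the nested layout of the "translation" column.
record = {"translation": {"de": "Guten Morgen.", "fr": "Bonjour."}}
print(to_text_and_label(record))
# {'text': 'Guten Morgen.', 'label': 'Bonjour.'}
```

Because the resulting columns are named text and label, the compute call does not need explicit input_column or label_column arguments.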

AutomaticSpeechRecognitionEvaluator

class evaluate.AutomaticSpeechRecognitionEvaluator

< source >

( task = 'automatic-speech-recognition' default_metric_name = None )

Automatic speech recognition evaluator. This automatic speech recognition evaluator can currently be loaded from evaluator() using the default task name automatic-speech-recognition. Methods in this class assume a data format compatible with the AutomaticSpeechRecognitionPipeline.

compute

< source >

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: typing.Optional[str] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None input_column: str = 'path' label_column: str = 'sentence' generation_kwargs: dict = None )

Parameters

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case automatic-speech-recognition). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (str, defaults to None) — Defines which dataset subset to load. If None is passed the default subset is loaded.
split (str, defaults to None) — Defines which dataset split to load. If None is passed, infers based on the choose_split function.
metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
strategy (Literal["simple", "bootstrap"], defaults to “simple”) — specifies the evaluation strategy. Possible values are:
- "simple" - we evaluate the metric and return the scores.
- "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy’s bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
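random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
input_column (str, defaults to "path") — the name of the column containing the input audio path in the dataset specified by data.
label_column (str, defaults to "sentence") — the name of the column containing the labels in the dataset specified by data.
generation_kwargs (Dict, optional, defaults to None) — The generation kwargs are passed to the pipeline and set the text generation strategy.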

Compute the metric for a given pipeline and dataset combination.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("automatic-speech-recognition")
>>> data = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="validation[:40]")
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="openai/whisper-tiny.en",
>>>     data=data,
>>>     input_column="path",
>>>     label_column="sentence",
>>>     metric="wer",
>>> )
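
For orientation, the word error rate requested with metric="wer" is the word-level edit distance between prediction and reference, divided by the number of reference words. A minimal pure-Python sketch for a single pair follows. `word_error_rate` is a hypothetical helper, and the actual metric may normalize text or aggregate over the dataset differently.

```python
from typing import List


def word_error_rate(reference: str, prediction: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumes a non-empty reference.
    """
    ref: List[str] = reference.split()
    hyp: List[str] = prediction.split()
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One word of the six-word reference is missing from the prediction.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```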

AudioClassificationEvaluator

class evaluate.AudioClassificationEvaluator

< source >

( task = 'audio-classification' default_metric_name = None )

Audio classification evaluator. This audio classification evaluator can currently be loaded from evaluator() using the default task name audio-classification. Methods in this class assume a data format compatible with the transformers.AudioClassificationPipeline.

compute

< source >

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: typing.Optional[str] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None feature_extractor: typing.Union[str, ForwardRef('FeatureExtractionMixin'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None input_column: str = 'file' label_column: str = 'label' label_mapping: typing.Optional[typing.Dict[str, numbers.Number]] = None )

Parameters

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case audio-classification). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (str, defaults to None) — Defines which dataset subset to load. If None is passed the default subset is loaded.
split (str, defaults to None) — Defines which dataset split to load. If None is passed, infers based on the choose_split function.
metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
strategy (Literal["simple", "bootstrap"], defaults to “simple”) — specifies the evaluation strategy. Possible values are:
- "simple" - we evaluate the metric and return the scores.
- "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy’s bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
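random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
feature_extractor (str or FeatureExtractionMixin, optional, defaults to None) — Argument can be used to overwrite a default feature extractor if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
input_column (str, defaults to "file") — the name of the column containing either the audio files or the raw waveforms (as numpy arrays) in the dataset specified by data.
label_column (str, defaults to "label") — the name of the column containing the labels in the dataset specified by data.
label_mapping (Dict[str, Number], optional, defaults to None) — maps the class labels returned by the pipeline to the values used in label_column.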

Compute the metric for a given pipeline and dataset combination.

Examples:

Remember that, in order to process audio files, you need ffmpeg installed (https://ffmpeg.org/download.html).

>>> from evaluate import evaluator
>>> from datasets import load_dataset

>>> task_evaluator = evaluator("audio-classification")
>>> data = load_dataset("superb", 'ks', split="test[:40]")
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="superb/wav2vec2-base-superb-ks",
>>>     data=data,
>>>     label_column="label",
>>>     input_column="file",
>>>     metric="accuracy",
>>>     label_mapping={"yes": 0, "no": 1, "up": 2, "down": 3}
>>> )

The evaluator also supports raw audio data in the form of a numpy array. Be aware, however, that accessing the audio column automatically decodes and resamples the audio files, which can be slow for large datasets.

>>> from evaluate import evaluator
>>> from datasets import load_dataset

>>> task_evaluator = evaluator("audio-classification")
>>> data = load_dataset("superb", 'ks', split="test[:40]")
>>> data = data.map(lambda example: {"audio": example["audio"]["array"]})
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="superb/wav2vec2-base-superb-ks",
>>>     data=data,
>>>     label_column="label",
>>>     input_column="audio",
>>>     metric="accuracy",
>>>     label_mapping={"yes": 0, "no": 1, "up": 2, "down": 3}
>>> )
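
The role of label_mapping can be sketched as follows: the pipeline returns string labels with scores, while the dataset stores numeric class ids, so the predicted string has to be translated before the metric compares the two. The sketch follows the Dict[str, Number] type given in the signature. It is an illustration under assumed names (`map_predictions`, the made-up pipeline outputs), not the evaluator's own code.

```python
from typing import Any, Dict, List, Optional


def map_predictions(
    pipeline_outputs: List[List[Dict[str, Any]]],
    label_mapping: Optional[Dict[str, int]] = None,
) -> List[Any]:
    """Pick the top-scoring label per example and map it to a dataset label id."""
    top_labels = [
        max(candidates, key=lambda c: c["score"])["label"]
        for candidates in pipeline_outputs
    ]
    if label_mapping is None:
        # Without a mapping, the pipeline's string labels are returned unchanged.
        return top_labels
    return [label_mapping[label] for label in top_labels]


# Made-up outputs: one list of {"label", "score"} candidates per audio clip.
outputs = [
    [{"label": "yes", "score": 0.91}, {"label": "no", "score": 0.09}],
    [{"label": "down", "score": 0.64}, {"label": "up", "score": 0.36}],
]
mapping = {"yes": 0, "no": 1, "up": 2, "down": 3}
print(map_predictions(outputs, mapping))  # [0, 3]
```

In this sketch, a predicted label that is missing from the mapping raises a KeyError, so the mapping should cover every label the model can return.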
