Evaluator
The evaluator classes for automatic evaluation.
Evaluator classes
The main entry point for using the evaluator:
evaluate.evaluator
< source >( task: str = None ) → Evaluator
Parameters
- task (str) — The task defining which evaluator will be returned. Currently accepted tasks are:
  - "image-classification": will return an ImageClassificationEvaluator.
  - "question-answering": will return a QuestionAnsweringEvaluator.
  - "text-classification" (alias "sentiment-analysis" available): will return a TextClassificationEvaluator.
  - "token-classification": will return a TokenClassificationEvaluator.
Returns
An evaluator suitable for the task.
Utility factory method to build an Evaluator.
Evaluators encapsulate a task and a default metric name. They leverage pipeline functionality from transformers to simplify the evaluation of multiple combinations of models, datasets and metrics for a given task.
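The task-to-evaluator dispatch described above can be pictured with a small lookup table. This is only a sketch: the class names are taken from this page, while the dictionaries and the resolve_evaluator_name helper are hypothetical and are not the library's internal registry.

```python
# Hypothetical registry for illustration; only the task and class names
# come from this page.
TASK_TO_EVALUATOR = {
    "image-classification": "ImageClassificationEvaluator",
    "question-answering": "QuestionAnsweringEvaluator",
    "text-classification": "TextClassificationEvaluator",
    "token-classification": "TokenClassificationEvaluator",
}
# "sentiment-analysis" is documented as an alias of "text-classification".
ALIASES = {"sentiment-analysis": "text-classification"}


def resolve_evaluator_name(task: str) -> str:
    """Return the evaluator class name for a task, following aliases."""
    canonical = ALIASES.get(task, task)
    if canonical not in TASK_TO_EVALUATOR:
        raise KeyError(f"Unknown task {task!r}; expected one of {sorted(TASK_TO_EVALUATOR)}")
    return TASK_TO_EVALUATOR[canonical]
```

Resolving the alias first keeps a single entry per evaluator, so adding a task only touches one table.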
The base class for all evaluator classes:
The Evaluator class is the class from which all evaluators inherit. Refer to this class for methods shared across different evaluators. Base class implementing evaluator operations.
check_required_columns
< source >( data: typing.Union[str, datasets.arrow_dataset.Dataset] columns_names: typing.Dict[str, str] )
Parameters
- data (str or Dataset) — Specifies the dataset we will run evaluation on.
- columns_names (Dict[str, str]) — Mapping of the column names to check in the dataset. The keys are the arguments to the evaluate.EvaluationModule.compute() method, while the values are the column names to check.
Ensure the columns required for the evaluation are present in the dataset.
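A minimal sketch of such a check, assuming the dataset's column names are available as a plain list. The function name check_columns_sketch and the error wording are illustrative and not the library's implementation.

```python
from typing import Dict, List


def check_columns_sketch(dataset_columns: List[str], columns_names: Dict[str, str]) -> None:
    """Raise ValueError if a requested column is absent from the dataset.

    `columns_names` maps an argument name (for example "label_column") to the
    dataset column that should provide it.
    """
    for argument, column in columns_names.items():
        if column not in dataset_columns:
            raise ValueError(
                f"Invalid `{argument}` {column!r}: not found in dataset columns {dataset_columns}."
            )


# Passes silently because both columns exist.
check_columns_sketch(["text", "label"], {"input_column": "text", "label_column": "label"})
```

Failing early here gives a clearer error than letting the pipeline or metric fail on a missing key later.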
compute_metric
< source >( metric: EvaluationModule metric_inputs: typing.Dict strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 random_state: typing.Optional[int] = None )
Compute and return metrics.
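To make the strategy, confidence_level, n_resamples and random_state arguments concrete, here is a pure-Python sketch of a percentile bootstrap over per-example scores. It is an illustration only: the evaluator delegates to scipy.stats.bootstrap, whose default method is "BCa" rather than the plain percentile method used here, and the bootstrap_ci helper is hypothetical.

```python
import random
from statistics import mean
from typing import Callable, Optional, Sequence, Tuple


def bootstrap_ci(
    values: Sequence[float],
    statistic: Callable[[Sequence[float]], float] = mean,
    confidence_level: float = 0.95,
    n_resamples: int = 999,
    random_state: Optional[int] = None,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for `statistic(values)`."""
    rng = random.Random(random_state)  # seeding makes the interval reproducible
    n = len(values)
    # Resample with replacement and recompute the statistic each time.
    resampled = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence_level) / 2.0
    low = resampled[int(alpha * (n_resamples - 1))]
    high = resampled[int((1.0 - alpha) * (n_resamples - 1))]
    return low, high


# Per-example correctness (1 = correct); the statistic is then the accuracy.
per_example = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
low, high = bootstrap_ci(per_example, random_state=0)
```

Fixing random_state yields the same interval on every run, which is why the parameter is described as useful for debugging.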
get_dataset_split
< source >( data subset = None split = None ) → split
Infers which split to use if None is given.
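One way to picture the inference is a fixed preference order over the splits a dataset offers. The order below and the pick_split helper are assumptions made for this sketch; the library's choose_split may use a different order.

```python
from typing import Optional, Sequence

# Assumed preference order for this sketch only.
PREFERRED_SPLITS = ("test", "validation", "train")


def pick_split(available: Sequence[str], split: Optional[str] = None) -> str:
    """Return the user-provided split, or the first preferred split available."""
    if split is not None:
        return split  # an explicit choice, including slices such as "test[:10]"
    for candidate in PREFERRED_SPLITS:
        if candidate in available:
            return candidate
    raise ValueError("No known split found; pass an explicit `split` value.")
```

For example, pick_split(["train", "validation"]) selects "validation" under this assumed order, while an explicit split is always returned unchanged.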
load_data
< source >( data: typing.Union[str, datasets.arrow_dataset.Dataset] subset: str = None split: str = None ) → data (Dataset)
Parameters
- data (Dataset or str, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None) — Specifies dataset subset to be passed to name in load_dataset. To be used with datasets with several configurations (e.g. glue/sst2).
- split (str, defaults to None) — User-defined dataset split by name (e.g. train, validation, test). Supports slice-split (test[:n]). If not defined and data is a str type, will automatically select the best one via choose_split().
Returns
data (Dataset)
Loaded dataset which will be used for evaluation.
Load dataset with given subset and split.
A core method of the Evaluator class, which processes the pipeline outputs for compatibility with the metric.
prepare_data
< source >( data: Dataset input_column: str label_column: str *args **kwargs ) → dict
Parameters
- data (Dataset) — Specifies the dataset we will run evaluation on.
- input_column (str, defaults to "text") — The name of the column containing the text feature in the dataset specified by data.
- second_input_column (str, optional) — The name of the column containing the second text feature if there is one. Otherwise, set to None.
- label_column (str, defaults to "label") — The name of the column containing the labels in the dataset specified by data.
Returns
dict: metric inputs.
list: pipeline inputs.
Prepare data.
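The two return values can be illustrated with a dataset represented as a plain dictionary of columns. This is a sketch under that assumption; prepare_data_sketch and the "references" key are illustrative and do not reproduce the library's code.

```python
from typing import Dict, List, Tuple


def prepare_data_sketch(
    data: Dict[str, list], input_column: str = "text", label_column: str = "label"
) -> Tuple[Dict[str, list], List]:
    """Split a column-oriented dataset into metric inputs and pipeline inputs."""
    metric_inputs = {"references": data[label_column]}  # ground truth for the metric
    pipeline_inputs = data[input_column]  # raw features fed to the pipeline
    return metric_inputs, pipeline_inputs


toy = {"text": ["great film", "dull plot"], "label": [1, 0]}
metric_inputs, pipeline_inputs = prepare_data_sketch(toy)
```

Keeping the references apart from the pipeline inputs means the model never sees the labels, and the metric can later be called with the references plus the processed predictions.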
prepare_metric
< source >( metric: typing.Union[str, evaluate.module.EvaluationModule] )
Parameters
- metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
Prepare metric.
prepare_pipeline
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] tokenizer: typing.Union[ForwardRef('PreTrainedTokenizerBase'), ForwardRef('FeatureExtractionMixin')] = None feature_extractor: typing.Union[ForwardRef('PreTrainedTokenizerBase'), ForwardRef('FeatureExtractionMixin')] = None device: int = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task. If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- preprocessor (PreTrainedTokenizerBase or FeatureExtractionMixin, optional, defaults to None) — Argument can be used to overwrite a default preprocessor if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
Prepare pipeline.
The task specific evaluators
ImageClassificationEvaluator
class evaluate.ImageClassificationEvaluator
< source >( task = 'image-classification' default_metric_name = None )
Image classification evaluator.
This image classification evaluator can currently be loaded from evaluator() using the default task name image-classification.
Methods in this class assume a data format compatible with the ImageClassificationPipeline.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: typing.Optional[str] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None feature_extractor: typing.Union[str, ForwardRef('FeatureExtractionMixin'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None input_column: str = 'image' label_column: str = 'label' label_mapping: typing.Optional[typing.Dict[str, numbers.Number]] = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case image-classification). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None) — Defines which dataset subset to load. If None is passed the default subset is loaded.
- split (str, defaults to None) — Defines which dataset split to load. If None is passed, infers based on the choose_split function.
- metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
- strategy (Literal["simple", "bootstrap"], defaults to "simple") — Specifies the evaluation strategy. Possible values are:
  - "simple" - we evaluate the metric and return the scores.
  - "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
- confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
- device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
- random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
Compute the metric for a given pipeline and dataset combination.
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("image-classification")
>>> data = load_dataset("beans", split="test[:40]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="nateraw/vit-base-beans",
...     data=data,
...     label_column="labels",
...     metric="accuracy",
...     label_mapping={'angular_leaf_spot': 0, 'bean_rust': 1, 'healthy': 2},
...     strategy="bootstrap"
... )
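The role of label_mapping in the example above can be sketched without any model: the pipeline returns string labels, while the dataset stores integer class ids, so predictions are translated before the metric is computed. The sketch assumes each image yields a list of candidate dicts with "label" and "score" keys and that the highest score wins; the helper names are hypothetical.

```python
from typing import Dict, List


def predictions_to_ids(
    pipeline_outputs: List[List[Dict]], label_mapping: Dict[str, int]
) -> List[int]:
    """Pick the top-scoring label per example and map it to its integer id."""
    ids = []
    for candidates in pipeline_outputs:
        best = max(candidates, key=lambda c: c["score"])
        ids.append(label_mapping[best["label"]])
    return ids


def accuracy(predictions: List[int], references: List[int]) -> float:
    """Fraction of predictions equal to the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


label_mapping = {"angular_leaf_spot": 0, "bean_rust": 1, "healthy": 2}
outputs = [
    [{"label": "angular_leaf_spot", "score": 0.9}, {"label": "healthy", "score": 0.1}],
    [{"label": "bean_rust", "score": 0.6}, {"label": "healthy", "score": 0.4}],
    [{"label": "angular_leaf_spot", "score": 0.7}, {"label": "healthy", "score": 0.3}],
]
predicted_ids = predictions_to_ids(outputs, label_mapping)
score = accuracy(predicted_ids, [0, 1, 2])
```

A label missing from label_mapping raises KeyError in this sketch, which surfaces a mismatch between model labels and dataset labels immediately.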
QuestionAnsweringEvaluator
class evaluate.QuestionAnsweringEvaluator
< source >( task = 'question-answering' default_metric_name = None )
Question answering evaluator. This evaluator handles extractive question answering, where the answer to the question is extracted from a context.
This question answering evaluator can currently be loaded from evaluator() using the default task name question-answering.
Methods in this class assume a data format compatible with the QuestionAnsweringPipeline.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: typing.Optional[str] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None question_column: str = 'question' context_column: str = 'context' id_column: str = 'id' label_column: str = 'answers' squad_v2_format: typing.Optional[bool] = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case question-answering). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None) — Defines which dataset subset to load. If None is passed the default subset is loaded.
- split (str, defaults to None) — Defines which dataset split to load. If None is passed, infers based on the choose_split function.
- metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
- strategy (Literal["simple", "bootstrap"], defaults to "simple") — Specifies the evaluation strategy. Possible values are:
  - "simple" - we evaluate the metric and return the scores.
  - "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
- confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
- device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
- random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
Compute the metric for a given pipeline and dataset combination.
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="sshleifer/tiny-distilbert-base-cased-distilled-squad",
...     data=data,
...     metric="squad",
... )
Datasets where the answer may be missing from the context are supported, for example the SQuAD v2 dataset. In this case, it is safer to pass squad_v2_format=True to the compute() call.
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad_v2", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="mrm8488/bert-tiny-finetuned-squadv2",
...     data=data,
...     metric="squad_v2",
...     squad_v2_format=True,
... )
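The difference between the two formats shows up in the predictions handed to the metric: squad expects an id and a prediction_text per example, and squad_v2 additionally expects a no_answer_probability. The sketch below builds such entries from question-answering pipeline outputs. The to_squad_predictions helper is hypothetical, and treating an empty answer as probability 1.0 is an assumption of this sketch, not necessarily what the evaluator does.

```python
from typing import Dict, List


def to_squad_predictions(
    pipeline_outputs: List[Dict], ids: List[str], squad_v2_format: bool = False
) -> List[Dict]:
    """Convert question-answering pipeline outputs into metric predictions."""
    predictions = []
    for example_id, output in zip(ids, pipeline_outputs):
        entry = {"id": example_id, "prediction_text": output["answer"]}
        if squad_v2_format:
            # Assumption for this sketch: an empty answer means "no answer".
            entry["no_answer_probability"] = 1.0 if output["answer"] == "" else 0.0
        predictions.append(entry)
    return predictions


outputs = [{"answer": "Paris", "score": 0.9}, {"answer": "", "score": 0.4}]
v1 = to_squad_predictions(outputs, ["q1", "q2"])
v2 = to_squad_predictions(outputs, ["q1", "q2"], squad_v2_format=True)
```

Passing squad_v2_format explicitly avoids relying on format detection when a dataset contains unanswerable questions.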
TextClassificationEvaluator
class evaluate.TextClassificationEvaluator
< source >( task = 'text-classification' default_metric_name = None )
Text classification evaluator.
This text classification evaluator can currently be loaded from evaluator() using the default task name text-classification or with a "sentiment-analysis" alias.
Methods in this class assume a data format compatible with the TextClassificationPipeline - a single textual feature as input and a categorical label as output.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: typing.Optional[str] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None feature_extractor: typing.Union[str, ForwardRef('FeatureExtractionMixin'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None input_column: str = 'text' second_input_column: typing.Optional[str] = None label_column: str = 'label' label_mapping: typing.Optional[typing.Dict[str, numbers.Number]] = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case text-classification or its alias, sentiment-analysis). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None) — Defines which dataset subset to load. If None is passed the default subset is loaded.
- split (str, defaults to None) — Defines which dataset split to load. If None is passed, infers based on the choose_split function.
- metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
- strategy (Literal["simple", "bootstrap"], defaults to "simple") — Specifies the evaluation strategy. Possible values are:
  - "simple" - we evaluate the metric and return the scores.
  - "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
- confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
- device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
- random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
Compute the metric for a given pipeline and dataset combination.
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("text-classification")
>>> data = load_dataset("imdb", split="test[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli",
...     data=data,
...     metric="accuracy",
...     label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0},
...     strategy="bootstrap",
...     n_resamples=10,
...     random_state=0
... )
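The device rule documented for compute() (an explicit value is used as given, and None falls back to CUDA:0 when available and to CPU otherwise) can be written as a tiny function. The resolve_device helper and its cuda_available flag are hypothetical; the flag is passed in explicitly so the sketch does not depend on any deep learning framework.

```python
from typing import Optional


def resolve_device(device: Optional[int], cuda_available: bool) -> int:
    """Return the device ordinal to use: -1 for CPU, otherwise a CUDA device id."""
    if device is not None:
        return device  # explicit user choice always wins
    return 0 if cuda_available else -1  # infer: first GPU if present, else CPU
```

For example, resolve_device(None, cuda_available=False) returns -1, and resolve_device(1, cuda_available=True) keeps the requested device 1.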
TokenClassificationEvaluator
class evaluate.TokenClassificationEvaluator
< source >( task = 'token-classification' default_metric_name = None )
Token classification evaluator.
This token classification evaluator can currently be loaded from evaluator() using the default task name token-classification.
Methods in this class assume a data format compatible with the TokenClassificationPipeline.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: str = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: typing.Optional[int] = None random_state: typing.Optional[int] = None input_column: str = 'tokens' label_column: str = 'ner_tags' join_by: typing.Optional[str] = ' ' )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case token-classification). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None) — Defines which dataset subset to load. If None is passed the default subset is loaded.
- split (str, defaults to None) — Defines which dataset split to load. If None is passed, infers based on the choose_split function.
- metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
- strategy (Literal["simple", "bootstrap"], defaults to "simple") — Specifies the evaluation strategy. Possible values are:
  - "simple" - we evaluate the metric and return the scores.
  - "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
- confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
- device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
- random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
Compute the metric for a given pipeline and dataset combination.
The dataset input and label columns are expected to be formatted as a list of words and a list of labels respectively, following the conll2003 dataset. Datasets whose inputs are single strings and whose labels are lists of offsets are not supported.
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("token-classification")
>>> data = load_dataset("conll2003", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="elastic/distilbert-base-uncased-finetuned-conll03-english",
...     data=data,
...     metric="seqeval",
... )
For example, the following dataset format is accepted by the evaluator:
from datasets import ClassLabel, Dataset, Features, Sequence, Value

dataset = Dataset.from_dict(
    mapping={
        "tokens": [["New", "York", "is", "a", "city", "and", "Felix", "a", "person", "."]],
        "ner_tags": [[1, 2, 0, 0, 0, 0, 3, 0, 0, 0]],
    },
    features=Features({
        "tokens": Sequence(feature=Value(dtype="string")),
        "ner_tags": Sequence(feature=ClassLabel(names=["O", "B-LOC", "I-LOC", "B-PER", "I-PER"])),
    }),
)
In contrast, the following dataset format is not accepted by the evaluator:
from datasets import Dataset, Features, Sequence, Value

dataset = Dataset.from_dict(
    mapping={
        # One plain string per example, matching the Value("string") feature below.
        "tokens": ["New York is a city and Felix a person."],
        "starts": [[0, 23]],
        "ends": [[7, 27]],
        "ner_tags": [["LOC", "PER"]],
    },
    features=Features({
        "tokens": Value(dtype="string"),
        "starts": Sequence(feature=Value(dtype="int32")),
        "ends": Sequence(feature=Value(dtype="int32")),
        "ner_tags": Sequence(feature=Value(dtype="string")),
    }),
)
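The word-list format matters because the words are joined into a single string (see the join_by argument) before being passed to the pipeline, and predictions then have to be aligned back to words through character offsets. A minimal sketch of that offset bookkeeping follows; the words_to_offsets helper and the inclusive end offsets are assumptions of this sketch.

```python
from typing import List, Tuple


def words_to_offsets(words: List[str], join_by: str = " ") -> List[Tuple[int, int]]:
    """Return (start, end) character offsets, end inclusive, of each word
    inside the string obtained with `join_by.join(words)`."""
    offsets = []
    start = 0
    for word in words:
        end = start + len(word) - 1
        offsets.append((start, end))
        start = end + len(join_by) + 1  # skip the separator before the next word
    return offsets


words = ["New", "York", "is"]
text = " ".join(words)
offsets = words_to_offsets(words)  # [(0, 2), (4, 7), (9, 10)]
```

Each offset pair slices its word back out of the joined text, which is what lets one label be assigned per original word.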
TextGenerationEvaluator
class evaluate.TextGenerationEvaluator
< source >( task = 'text-generation' default_metric_name = None predictions_prefix: str = 'generated' )
Text generation evaluator.
This text generation evaluator can currently be loaded from evaluator() using the default task name text-generation.
Methods in this class assume a data format compatible with the TextGenerationPipeline.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: typing.Optional[str] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None feature_extractor: typing.Union[str, ForwardRef('FeatureExtractionMixin'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None input_column: str = 'text' label_column: str = 'label' label_mapping: typing.Optional[typing.Dict[str, numbers.Number]] = None )
Text2TextGenerationEvaluator
class evaluate.Text2TextGenerationEvaluator
< source >( task = 'text2text-generation' default_metric_name = None )
Text2Text generation evaluator.
This Text2Text generation evaluator can currently be loaded from evaluator() using the default task name text2text-generation.
Methods in this class assume a data format compatible with the Text2TextGenerationPipeline.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: typing.Optional[str] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None input_column: str = 'text' label_column: str = 'label' generation_kwargs: dict = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case text2text-generation). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None) — Defines which dataset subset to load. If None is passed the default subset is loaded.
- split (str, defaults to None) — Defines which dataset split to load. If None is passed, infers based on the choose_split function.
- metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
- strategy (Literal["simple", "bootstrap"], defaults to "simple") — Specifies the evaluation strategy. Possible values are:
  - "simple" - we evaluate the metric and return the scores.
  - "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
- confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
- device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
- random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
- input_column (str, defaults to "text") — The name of the column containing the input text in the dataset specified by data.
- label_column (str, defaults to "label") — The name of the column containing the labels in the dataset specified by data.
- generation_kwargs (Dict, optional, defaults to None) — The generation kwargs are passed to the pipeline and set the text generation strategy.
Compute the metric for a given pipeline and dataset combination.
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("text2text-generation")
>>> data = load_dataset("cnn_dailymail", "3.0.0", split="validation[:40]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="facebook/bart-large-cnn",
...     data=data,
...     input_column="article",
...     label_column="highlights",
...     metric="rouge",
... )
SummarizationEvaluator
class evaluate.SummarizationEvaluator
< source >( task = 'summarization' default_metric_name = None )
Text summarization evaluator.
This text summarization evaluator can currently be loaded from evaluator() using the default task name summarization.
Methods in this class assume a data format compatible with the SummarizationPipeline.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: typing.Optional[str] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None input_column: str = 'text' label_column: str = 'label' generation_kwargs: dict = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case summarization). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None) — Defines which dataset subset to load. If None is passed the default subset is loaded.
- split (str, defaults to None) — Defines which dataset split to load. If None is passed, infers based on the choose_split function.
- metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
- strategy (Literal["simple", "bootstrap"], defaults to "simple") — Specifies the evaluation strategy. Possible values are:
  - "simple" - we evaluate the metric and return the scores.
  - "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
- confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
- device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
- random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
- input_column (str, defaults to "text") — The name of the column containing the input text in the dataset specified by data.
- label_column (str, defaults to "label") — The name of the column containing the labels in the dataset specified by data.
- generation_kwargs (Dict, optional, defaults to None) — The generation kwargs are passed to the pipeline and set the text generation strategy.
Compute the metric for a given pipeline and dataset combination.
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("summarization")
>>> data = load_dataset("cnn_dailymail", "3.0.0", split="validation[:40]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="facebook/bart-large-cnn",
...     data=data,
...     input_column="article",
...     label_column="highlights",
... )
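The "bootstrap" strategy described in the parameters attaches a confidence interval to each metric value by resampling. The evaluator delegates this to scipy's bootstrap; the block below is not that implementation. It is a simplified percentile bootstrap written with the standard library only, meant to show what confidence_level, n_resamples and random_state control. The function name percentile_bootstrap_ci and the sample scores are hypothetical.

```python
import random
import statistics


def percentile_bootstrap_ci(scores, confidence_level=0.95, n_resamples=9999, random_state=None):
    """Percentile bootstrap CI for the mean of per-example scores (illustrative only)."""
    rng = random.Random(random_state)  # a fixed seed makes the interval reproducible
    n = len(scores)
    # Resample with replacement and record the statistic for each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence_level
    low_idx = int((alpha / 2) * (n_resamples - 1))
    high_idx = int((1 - alpha / 2) * (n_resamples - 1))
    return means[low_idx], means[high_idx]


# Hypothetical per-example scores, e.g. 1.0 for a correct prediction and 0.0 otherwise.
scores = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]
low, high = percentile_bootstrap_ci(scores, n_resamples=999, random_state=0)
print(f"mean={statistics.fmean(scores):.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

Raising n_resamples tightens the estimate of the interval endpoints at the cost of more computation, and passing the same random_state twice yields the same interval.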
TranslationEvaluator
Translation evaluator.
This translation generation evaluator can currently be loaded from evaluator() using the default task name translation.
Methods in this class assume a data format compatible with the TranslationPipeline.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: typing.Optional[str] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None input_column: str = 'text' label_column: str = 'label' generation_kwargs: dict = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case translation). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None) — Defines which dataset subset to load. If None is passed, the default subset is loaded.
- split (str, defaults to None) — Defines which dataset split to load. If None is passed, the split is inferred by the choose_split function.
- metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Can be used to override the default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, this argument is ignored.
- strategy (Literal["simple", "bootstrap"], defaults to "simple") — Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 uses the CPU; a non-negative integer runs the model on the CUDA device with that ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
- random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
- input_column (str, defaults to "text") — The name of the column containing the input text in the dataset specified by data.
- label_column (str, defaults to "label") — The name of the column containing the labels in the dataset specified by data.
- generation_kwargs (Dict, optional, defaults to None) — The generation kwargs are passed to the pipeline and set the text generation strategy.
Compute the metric for a given pipeline and dataset combination.
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("translation")
>>> data = load_dataset("wmt19", "fr-de", split="validation[:40]")
>>> data = data.map(lambda x: {"text": x["translation"]["de"], "label": x["translation"]["fr"]})
>>> results = task_evaluator.compute(
...     model_or_pipeline="Helsinki-NLP/opus-mt-de-fr",
...     data=data,
... )
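The map call in the example above reshapes each nested translation record into the flat text and label columns that compute reads by default. The following plain-Python sketch shows that reshaping on a single hand-written record; the helper name to_text_label and the sample sentences are hypothetical and only mimic the nested layout used in the example.

```python
def to_text_label(example, source_lang="de", target_lang="fr"):
    """Flatten a nested translation record into the default evaluator columns."""
    pair = example["translation"]
    return {"text": pair[source_lang], "label": pair[target_lang]}


# Hand-written record mimicking the nested layout used in the example above.
record = {"translation": {"de": "Guten Morgen.", "fr": "Bonjour."}}
flat = to_text_label(record)
print(flat)
```

A function of this shape could be passed to the dataset's map method in place of the lambda, which keeps the source and target languages in one visible place.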
AutomaticSpeechRecognitionEvaluator
class evaluate.AutomaticSpeechRecognitionEvaluator
< source >( task = 'automatic-speech-recognition' default_metric_name = None )
Automatic speech recognition evaluator.
This automatic speech recognition evaluator can currently be loaded from evaluator() using the default task name automatic-speech-recognition.
Methods in this class assume a data format compatible with the AutomaticSpeechRecognitionPipeline.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: typing.Optional[str] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None input_column: str = 'path' label_column: str = 'sentence' generation_kwargs: dict = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case automatic-speech-recognition). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None) — Defines which dataset subset to load. If None is passed, the default subset is loaded.
- split (str, defaults to None) — Defines which dataset split to load. If None is passed, the split is inferred by the choose_split function.
- metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Can be used to override the default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, this argument is ignored.
- strategy (Literal["simple", "bootstrap"], defaults to "simple") — Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 uses the CPU; a non-negative integer runs the model on the CUDA device with that ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
- random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
Compute the metric for a given pipeline and dataset combination.
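The device parameter follows a small set of rules: None means "infer", -1 means CPU, and other ordinals name a CUDA device. A minimal sketch of those rules is shown below. The helper resolve_device and its cuda_available flag are hypothetical and stand in for whatever detection the pipeline performs; treating 0 as CUDA:0 is an assumption based on the inference rule in the parameter description.

```python
from typing import Optional


def resolve_device(device: Optional[int], cuda_available: bool) -> str:
    """Translate a device ordinal into a readable device name (illustrative only)."""
    if device is None:
        # Inferred: prefer the first CUDA device when one is available.
        return "cuda:0" if cuda_available else "cpu"
    if device == -1:
        return "cpu"
    if device >= 0:
        # Assumption: any non-negative ordinal names a CUDA device.
        return f"cuda:{device}"
    raise ValueError(f"Unsupported device ordinal: {device}")


for requested in (None, -1, 0, 1):
    print(requested, "->", resolve_device(requested, cuda_available=True))
```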
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("automatic-speech-recognition")
>>> data = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="validation[:40]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="openai/whisper-tiny.en",
...     data=data,
...     input_column="path",
...     label_column="sentence",
...     metric="wer",
... )
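The example above requests the "wer" metric. Word error rate is the word-level edit distance between a reference transcript and a prediction, divided by the number of reference words. The sketch below computes it for a single pair in plain Python. It is an illustration of the definition rather than the metric module's implementation: the function name word_error_rate is hypothetical, and no text normalization (casing, punctuation) is applied.

```python
def word_error_rate(reference: str, prediction: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), prediction.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because the denominator is the reference length, a prediction with many inserted words can score above 1.0.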
AudioClassificationEvaluator
class evaluate.AudioClassificationEvaluator
< source >( task = 'audio-classification' default_metric_name = None )
Audio classification evaluator.
This audio classification evaluator can currently be loaded from evaluator() using the default task name audio-classification.
Methods in this class assume a data format compatible with the transformers.AudioClassificationPipeline.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: typing.Optional[str] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None feature_extractor: typing.Union[str, ForwardRef('FeatureExtractionMixin'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None input_column: str = 'file' label_column: str = 'label' label_mapping: typing.Optional[typing.Dict[str, numbers.Number]] = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case audio-classification). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None) — Defines which dataset subset to load. If None is passed, the default subset is loaded.
- split (str, defaults to None) — Defines which dataset split to load. If None is passed, the split is inferred by the choose_split function.
- metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Can be used to override the default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, this argument is ignored.
- strategy (Literal["simple", "bootstrap"], defaults to "simple") — Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 uses the CPU; a non-negative integer runs the model on the CUDA device with that ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
- random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
Compute the metric for a given pipeline and dataset combination.
Examples:
Remember that processing audio files requires ffmpeg to be installed (https://ffmpeg.org/download.html).
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("audio-classification")
>>> data = load_dataset("superb", 'ks', split="test[:40]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="superb/wav2vec2-base-superb-ks",
...     data=data,
...     label_column="label",
...     input_column="file",
...     metric="accuracy",
...     label_mapping={"yes": 0, "no": 1, "up": 2, "down": 3},
... )
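The compute signature annotates label_mapping as a dictionary from label strings to numbers. One plausible reading, sketched here under that assumption, is that the model's predicted label names are translated into the numeric ids stored in the dataset's label column before the metric is computed. The helpers map_predictions and accuracy, and the shape of the pipeline outputs (a list of scored candidates per clip), are hypothetical and chosen only for illustration.

```python
def map_predictions(pipeline_outputs, label_mapping=None):
    """Pick the top-scoring label per example and translate it with label_mapping."""
    top_labels = [
        max(candidates, key=lambda c: c["score"])["label"]
        for candidates in pipeline_outputs
    ]
    if label_mapping is None:
        return top_labels
    return [label_mapping[label] for label in top_labels]


def accuracy(predictions, references):
    """Fraction of predictions equal to their reference."""
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)


# Hypothetical pipeline outputs: one list of scored candidates per audio clip.
outputs = [
    [{"label": "yes", "score": 0.91}, {"label": "no", "score": 0.09}],
    [{"label": "up", "score": 0.40}, {"label": "down", "score": 0.60}],
    [{"label": "no", "score": 0.75}, {"label": "yes", "score": 0.25}],
]
references = [0, 3, 0]  # integer class ids as stored in the dataset's label column
predictions = map_predictions(outputs, {"yes": 0, "no": 1, "up": 2, "down": 3})
print(predictions, accuracy(predictions, references))
```

Without a mapping, the string predictions would never equal the integer references, so the accuracy would be zero even for a perfect model.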
The evaluator also supports raw audio data in the form of a numpy array. Be aware, however, that accessing the audio column automatically decodes and resamples the audio files, which can be slow for large datasets.
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("audio-classification")
>>> data = load_dataset("superb", 'ks', split="test[:40]")
>>> data = data.map(lambda example: {"audio": example["audio"]["array"]})
>>> results = task_evaluator.compute(
...     model_or_pipeline="superb/wav2vec2-base-superb-ks",
...     data=data,
...     label_column="label",
...     input_column="audio",
...     metric="accuracy",
...     label_mapping={"yes": 0, "no": 1, "up": 2, "down": 3},
... )