Evaluator
The evaluator classes for automatic evaluation.
Evaluator classes
The main entry point for using the evaluator:
evaluate.evaluator
< source >( task: str = None ) → Evaluator
Parameters
- task (`str`) — The task defining which evaluator will be returned. Currently accepted tasks are:
  - `"text-classification"` (alias `"sentiment-analysis"` available): will return a TextClassificationEvaluator.
Returns
An evaluator suitable for the task.
Utility factory method to build an Evaluator.
Evaluators encapsulate a task and a default metric name. They leverage pipeline functionality from transformers to simplify the evaluation of multiple combinations of models, datasets, and metrics for a given task.
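For example, a minimal sketch of using the factory (assuming transformers and its pipeline dependencies are installed):

>>> from evaluate import evaluator
>>> task_evaluator = evaluator("text-classification")
>>> type(task_evaluator).__name__
'TextClassificationEvaluator'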
The base class for all evaluator classes:
The Evaluator class is the class from which all evaluators inherit. Refer to this class for methods shared across different evaluators.
class evaluate.Evaluator
Base class implementing evaluator operations.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 **compute_parameters: typing.Dict )
A core method of the Evaluator class, computes the metric value for a pipeline and dataset compatible with the task specified by the Evaluator.
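As a hedged sketch of this shared interface, compute() also accepts a pre-initialized transformers pipeline and a pre-loaded Dataset instead of names; the checkpoint and its "NEGATIVE"/"POSITIVE" label names below are illustrative assumptions, and label_mapping is a task-specific argument documented under TextClassificationEvaluator below:

>>> from datasets import load_dataset
>>> from transformers import pipeline
>>> from evaluate import evaluator
>>> pipe = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")  # assumed checkpoint
>>> data = load_dataset("imdb", split="test").shuffle(seed=42).select(range(100))  # small pre-loaded subset
>>> task_evaluator = evaluator("text-classification")
>>> results = task_evaluator.compute(
...     model_or_pipeline=pipe,  # pre-initialized pipeline, so tokenizer is ignored
...     data=data,               # pre-loaded dataset
...     metric="accuracy",
...     label_mapping={"NEGATIVE": 0, "POSITIVE": 1},  # assumed pipeline label names mapped to dataset label ids
... )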
The class for text classification evaluation:
class evaluate.TextClassificationEvaluator
< source >( task = 'text-classification' default_metric_name = None )
Text classification evaluator.
This text classification evaluator can currently be loaded from evaluator() using the default task name `text-classification` or with a `"sentiment-analysis"` alias.
Methods in this class assume a data format compatible with the TextClassificationPipeline: a single textual feature as input and a categorical label as output.
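For instance, a minimal dataset in this format (a sketch using the default column names `"text"` and `"label"`, which can be overridden via `input_column` and `label_column`) could be built as:

>>> from datasets import Dataset
>>> data = Dataset.from_dict({
...     "text": ["I loved this movie.", "Utterly boring and far too long."],
...     "label": [1, 0],
... })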
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 random_state: typing.Optional[int] = None input_column: str = 'text' label_column: str = 'label' label_mapping: typing.Union[typing.Dict[str, numbers.Number], NoneType] = None )
Parameters
- model_or_pipeline (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) — If the argument is not specified, we initialize the default pipeline for the task (in this case `text-classification` or its alias `sentiment-analysis`). If the argument is of the type `str` or is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (`str` or `Dataset`, defaults to `None`) — Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- metric (`str` or `EvaluationModule`, defaults to `None`) — Specifies the metric we use in the evaluator. If it is of type `str`, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (`str` or `PreTrainedTokenizer`, optional, defaults to `None`) — Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore this argument.
- strategy (`Literal["simple", "bootstrap"]`, defaults to `"simple"`) — Specifies the evaluation strategy. Possible values are:
  - `"simple"` - we evaluate the metric and return the scores.
  - `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (`float`, defaults to `0.95`) — The `confidence_level` value passed to `bootstrap` if the `"bootstrap"` strategy is chosen.
- n_resamples (`int`, defaults to `9999`) — The `n_resamples` value passed to `bootstrap` if the `"bootstrap"` strategy is chosen.
- random_state (`int`, optional, defaults to `None`) — The `random_state` value passed to `bootstrap` if the `"bootstrap"` strategy is chosen. Useful for debugging.
- input_column (`str`, defaults to `"text"`) — The name of the column containing the text feature in the dataset specified by `data`.
- label_column (`str`, defaults to `"label"`) — The name of the column containing the labels in the dataset specified by `data`.
- label_mapping (`Dict[str, Number]`, optional, defaults to `None`) — Used to map the class labels defined by the model in the pipeline to values consistent with those defined in the `label_column` of the `data` dataset.
Compute the metric for a given pipeline and dataset combination.
Examples:
>>> from evaluate import evaluator
>>> from datasets import Dataset, load_dataset
>>> e = evaluator("text-classification")
>>> data = Dataset.from_dict(load_dataset("imdb")["test"][:2])
>>> results = e.compute(
...     model_or_pipeline="huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli",
...     data=data,
...     metric="accuracy",
...     input_column="text",
...     label_column="label",
...     label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0},
...     strategy="bootstrap",
...     n_resamples=10,
...     random_state=0
... )
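With `strategy="bootstrap"`, the returned dictionary typically augments each metric key with bootstrap statistics such as a confidence interval at the requested `confidence_level`, computed via scipy's bootstrap; with `"simple"`, only the raw metric scores are returned.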