Evaluate predictions

🤗 Datasets provides various common and NLP-specific metrics for you to measure your model's performance. In this final part of the tutorials, you will load a metric and use it to evaluate your model's predictions.

You can see what metrics are available with datasets.list_metrics():

>>> from datasets import list_metrics
>>> metrics_list = list_metrics()
>>> len(metrics_list)
28
>>> print(metrics_list)
['accuracy', 'bertscore', 'bleu', 'bleurt', 'cer', 'comet', 'coval', 'cuad', 'f1', 'gleu', 'glue', 'indic_glue', 'matthews_correlation', 'meteor', 'pearsonr', 'precision', 'recall', 'rouge', 'sacrebleu', 'sari', 'seqeval', 'spearmanr', 'squad', 'squad_v2', 'super_glue', 'wer', 'wiki_split', 'xnli']
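
Since datasets.list_metrics() returns a plain Python list, you can also check whether a particular metric is available before loading it:

>>> 'glue' in metrics_list
True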

Load metric

It is very easy to load a metric with 🤗 Datasets; in fact, it is very similar to loading a dataset! Load a metric from the Hub with datasets.load_metric():

>>> from datasets import load_metric
>>> metric = load_metric('glue', 'mrpc')

This will load the metric associated with the MRPC dataset from the GLUE benchmark.

Select a configuration

If you are using a benchmark dataset, you need to select the metric associated with the configuration you are working with. Select a metric configuration by providing its name:

>>> metric = load_metric('glue', 'mrpc')
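
To evaluate a different task from the same benchmark, pass that task's configuration name instead. For example, a quick sketch with the SST-2 configuration, one of the GLUE subsets shown in the usage examples later in this section:

>>> metric = load_metric('glue', 'sst2')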

Metrics object

Before you begin using a datasets.Metric object, you should get to know it a little better. As with a dataset, you can return some basic information about a metric. For example, use datasets.Metric.inputs_description to get more information about a metric's expected input format and some usage examples:

>>> print(metric.inputs_description)
Compute GLUE evaluation metric associated to each GLUE dataset.
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
    >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0}
    >>> glue_metric = datasets.load_metric('glue', 'mrpc')  # 'mrpc' or 'qqp'
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0, 'f1': 1.0}

Notice that for the MRPC configuration, the metric expects the predictions and references to be zeros and ones. For a complete list of attributes you can return with your metric, take a look at datasets.MetricInfo.
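
You can also inspect the expected input types directly. A minimal sketch using datasets.Metric.features, which describes the predictions and references columns (the exact repr shown here is illustrative):

>>> metric.features
{'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}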

Compute metric

Once you have loaded a metric, you are ready to use it to evaluate a model's predictions. Provide the model predictions and references to datasets.Metric.compute():

>>> model_predictions = model(model_inputs)  # placeholder: however your model produces predictions
>>> final_score = metric.compute(predictions=model_predictions, references=gold_references)
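
If you generate predictions batch by batch, you can accumulate them with datasets.Metric.add_batch and call datasets.Metric.compute once at the end. A minimal sketch, assuming a hypothetical eval_dataloader that yields batches with 'predictions' and 'labels' keys:

>>> metric = load_metric('glue', 'mrpc')
>>> for batch in eval_dataloader:  # hypothetical loader, not part of 🤗 Datasets
...     metric.add_batch(predictions=batch['predictions'], references=batch['labels'])
>>> final_score = metric.compute()  # scores everything accumulated with add_batch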

What’s next?

Congratulations, you have completed your first 🤗 Datasets tutorial!

Over the course of these tutorials, you learned the basic steps of using 🤗 Datasets. You loaded a dataset from the Hub and learned how to access the information stored inside it. Next, you tokenized the dataset into sequences of integers and formatted it for use with PyTorch or TensorFlow. Finally, you loaded a metric to evaluate your model's predictions. This is all you need to get started with 🤗 Datasets!

Now that you have a solid grasp of what 🤗 Datasets can do, you can begin formulating your own questions about how to use it with your dataset. Take a look at our How-to guides for practical help with common use cases, or read our Conceptual guides to deepen your understanding of 🤗 Datasets.