Using the evaluator with custom pipelines

The evaluator is designed to work with transformers pipelines out-of-the-box. However, in many cases you might have a model or pipeline that’s not part of the transformers ecosystem. You can still use the evaluator to easily compute metrics for it. In this guide we show how to do this for a Scikit-Learn pipeline and a Spacy pipeline. Let’s start with the Scikit-Learn case.

Scikit-Learn

First we need to train a model. We’ll train a simple text classifier on the IMDb dataset, so let’s start by downloading the dataset:

from datasets import load_dataset

ds = load_dataset("imdb")

Then we can build a simple TF-IDF preprocessor and Naive Bayes classifier wrapped in a Pipeline:

from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

text_clf = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', MultinomialNB()),
])

text_clf.fit(ds["train"]["text"], ds["train"]["label"])

Following the convention of the TextClassificationPipeline in transformers, our pipeline should be callable and return a list of dictionaries. In addition, we use the task attribute to check if the pipeline is compatible with the evaluator. We can write a small wrapper class for that purpose:

class ScikitEvalPipeline:
    def __init__(self, pipeline):
        self.pipeline = pipeline
        self.task = "text-classification"

    def __call__(self, input_texts, **kwargs):
        return [{"label": p} for p in self.pipeline.predict(input_texts)]

pipe = ScikitEvalPipeline(text_clf)
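
Calling the wrapper on a few strings confirms that it returns the expected list of dictionaries (the example texts are again made up):

# Each prediction is a dictionary with a "label" key, following the TextClassificationPipeline convention
print(pipe(["I loved every minute of it.", "Terrible acting and a boring plot."]))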

We can now pass this pipeline to the evaluator:

from evaluate import evaluator

task_evaluator = evaluator("text-classification")
task_evaluator.compute(pipe, ds["test"], "accuracy")

>>> {'accuracy': 0.82956}
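
The arguments can also be passed by keyword, and you can evaluate several metrics in a single pass by handing compute a combined metric instead of a string. A minimal sketch, assuming evaluate.combine is available in your installed version:

import evaluate

# Compute accuracy and F1 together over the test split
clf_metrics = evaluate.combine(["accuracy", "f1"])
task_evaluator.compute(
    model_or_pipeline=pipe,
    data=ds["test"],
    metric=clf_metrics,
)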

Implementing that simple wrapper is all that’s needed to use any model from any framework with the evaluator. In the __call__ you can implement all logic necessary for efficient forward passes through your model.
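
If your pipeline returns string labels rather than the integer ids used by the dataset, the text-classification evaluator can map them for you via the label_mapping argument. The label names below are hypothetical and only meant for illustration:

# Suppose __call__ returned {"label": "negative"} / {"label": "positive"} instead of 0/1
task_evaluator.compute(
    pipe,
    ds["test"],
    "accuracy",
    label_mapping={"negative": 0, "positive": 1},
)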

Spacy

We’ll use the polarity feature of the spacytextblob project to get a simple sentiment analyzer. First you’ll need to install the project and download the resources:

pip install spacytextblob
python -m textblob.download_corpora
python -m spacy download en_core_web_sm

Then we can simply load the nlp pipeline and add the spacytextblob pipeline:

import spacy

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')
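
You can check that the component was added by listing the pipeline’s components:

# 'spacytextblob' should now show up among the pipeline components
print(nlp.pipe_names)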

This snippet shows how we can use the polarity feature added with spacytextblob to get the sentiment of a text:

texts = ["This movie is horrible", "This movie is awesome"]
results = nlp.pipe(texts)

for txt, res in zip(texts, results):
    print(f"{txt} | Polarity: {res._.blob.polarity}")

Now we can wrap it in a simple wrapper class like in the Scikit-Learn example before. It just has to return a list of dictionaries with the predicted labels. If the polarity is at least 0 we predict positive sentiment, and negative otherwise:

class SpacyEvalPipeline:
    def __init__(self, nlp):
        self.nlp = nlp
        self.task = "text-classification"

    def __call__(self, input_texts, **kwargs):
        results = []
        for p in self.nlp.pipe(input_texts):
            if p._.blob.polarity >= 0:
                results.append({"label": 1})
            else:
                results.append({"label": 0})
        return results

pipe = SpacyEvalPipeline(nlp)

That class is compatible with the evaluator and we can use the same evaluator instance as in the previous example along with the IMDb test set:

task_evaluator.compute(pipe, ds["test"], "accuracy")
>>> {'accuracy': 0.6914}

This will take a little longer than the Scikit-Learn example, but after roughly 10-15 minutes you will have the evaluation results!
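
If you only need a rough estimate, you can evaluate on a random subset of the 25,000 test reviews first; a quick sketch using the standard datasets methods:

# Evaluate on 2,000 randomly selected test examples for a faster (but noisier) estimate
subset = ds["test"].shuffle(seed=42).select(range(2000))
task_evaluator.compute(pipe, subset, "accuracy")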
