Suggestions for Data Annotation with SetFit in Zero-shot Text Classification

Authored by: David Berenstein and Sara Han Díaz

Suggestions are a wonderful way to make things easier and faster for your annotation team. These preselected options make the labeling process more efficient, as annotators only need to correct the suggestions instead of labeling from scratch. In this example, we will demonstrate how to implement a zero-shot approach using SetFit to get initial suggestions for a dataset in Argilla that combines two text classification tasks: one using a LabelQuestion and one using a MultiLabelQuestion.

Argilla is an open-source data curation platform designed to enhance the development of both small and large language models (LLMs). With Argilla, everyone can build robust language models through faster data curation using both human and machine feedback. It provides support for each step in the MLOps cycle, from data labeling to model monitoring.

Feedback is a crucial part of the data curation process, and Argilla also provides a way to manage and visualize it so that the curated data can later be used to improve a language model. In this tutorial, we will show a real example of how to make our annotators’ job easier by providing them with suggestions. To achieve this, you will learn how to train zero-shot sentiment and topic classifiers using SetFit and then use them to suggest labels for the dataset.

In this tutorial, we will follow these steps:

  • Create a dataset in Argilla.
  • Train the zero-shot classifiers using SetFit.
  • Get suggestions for the dataset using the trained classifiers.
  • Visualize the suggestions in Argilla.

Let’s get started!

Setup

For this tutorial, you will need to have an Argilla server running. If you don’t have one already, check out our Quickstart or Installation pages. Once you do, complete the following steps:

  1. Install the Argilla client and the required third-party libraries using pip:
!pip install argilla setfit
  2. Make the necessary imports:
import argilla as rg
from datasets import load_dataset
from setfit import get_templated_dataset
from setfit import SetFitModel, SetFitTrainer
  3. If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to initialize the Argilla client with the URL and API_KEY:
# Replace api_url with your HF Spaces URL if using Spaces
# Replace api_key if you configured a custom API key
rg.init(api_url="http://localhost:6900", api_key="admin.apikey", workspace="admin")

If you’re running a private Hugging Face Space, you will also need to set the HF_TOKEN as follows:

# # Set the HF_TOKEN environment variable
# import os
# os.environ['HF_TOKEN'] = "your-hf-token"

# # Replace api_url with your HF Spaces URL
# # Replace api_key if you configured a custom API key
# rg.init(
#     api_url="https://[your-owner-name]-[your_space_name].hf.space",
#     api_key="admin.apikey",
#     workspace="admin",
#     extra_headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}"},
# )

Configure the dataset

In this example, we will load the banking77 dataset, a popular open-source dataset that contains customer requests in the banking domain.

data = load_dataset("PolyAI/banking77", split="test")
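
If you want a quick look at the data before configuring anything, you can inspect a record and the candidate labels. A minimal sketch (the printed values are illustrative):

# Peek at one request and the first few of the 77 intent labels
print(data[0]["text"])
print(data.info.features["label"].names[:5])
# e.g. ['activate_my_card', 'age_limit', 'apple_pay_or_google_pay', 'atm_support', 'automatic_top_up']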

Argilla works with the FeedbackDataset, which easily enables you to create a dataset and manage the data and feedback. The FeedbackDataset first has to be configured by indicating its two main components (although more can be added): the fields, where the data to be annotated will be shown, and the questions for the annotators. For more information about the FeedbackDataset and its optional components, check the Argilla documentation and our end-to-end tutorials.

You can also create one straight away using the default Templates.

In this case, we will configure a custom dataset with two different questions so that we can work with two text classification tasks at the same time. We will load the original labels of this dataset to make a multi-label classification of the topics mentioned in the request and we will also set up a question to classify the sentiment of the request as either “positive”, “neutral” or “negative”.

dataset = rg.FeedbackDataset(
    fields=[rg.TextField(name="text")],
    questions=[
        rg.MultiLabelQuestion(
            name="topics",
            title="Select the topic(s) of the request",
            labels=data.info.features["label"].names,  # these are the original labels present in the dataset
            visible_labels=10,
        ),
        rg.LabelQuestion(
            name="sentiment", title="What is the sentiment of the message?", labels=["positive", "neutral", "negative"]
        ),
    ],
)

Train the models

Now we will use the data we loaded and the labels and questions we configured for our dataset to train a zero-shot text classification model for each of the questions in our dataset. As mentioned in previous sections, we will use the SetFit framework for few-shot fine-tuning of Sentence Transformers for both classifiers. In addition, the model we will use is all-MiniLM-L6-v2, a sentence embedding model fine-tuned on a dataset of 1B sentence pairs using a contrastive objective.

def train_model(question_name, template, multi_label=False):
    # build a training dataset that uses the labels of a specific question in our Argilla dataset
    train_dataset = get_templated_dataset(
        candidate_labels=dataset.question_by_name(question_name).labels,
        sample_size=8,
        template=template,
        multi_label=multi_label,
    )

    # train a model using the training dataset we just built
    if multi_label:
        model = SetFitModel.from_pretrained("all-MiniLM-L6-v2", multi_target_strategy="one-vs-rest")
    else:
        model = SetFitModel.from_pretrained("all-MiniLM-L6-v2")

    trainer = SetFitTrainer(model=model, train_dataset=train_dataset)
    trainer.train()
    return model

topic_model = train_model(question_name="topics", template="The customer request is about {}", multi_label=True)
sentiment_model = train_model(question_name="sentiment", template="This message is {}", multi_label=False)
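
To see what get_templated_dataset actually builds: it creates synthetic training examples by filling the template with each candidate label. A minimal sketch for the sentiment question (the printed example is illustrative):

# Build the templated training set for the sentiment labels
preview = get_templated_dataset(
    candidate_labels=["positive", "neutral", "negative"],
    sample_size=8,
    template="This message is {}",
)
print(preview[0])
# e.g. {'text': 'This message is positive', 'label': 0}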

Make predictions

Once the training step is over, we can make predictions over our data.

def get_predictions(texts, model, question_name):
    # predict the probability of each candidate label for the given texts
    probas = model.predict_proba(texts, as_numpy=True)
    # retrieve the label names from the corresponding question in our dataset
    labels = dataset.question_by_name(question_name).labels
    # pair each label with its predicted score
    for pred in probas:
        yield [{"label": label, "score": score} for label, score in zip(labels, pred)]

data = data.map(
    lambda batch: {
        "topics": list(get_predictions(batch["text"], topic_model, "topics")),
        "sentiment": list(get_predictions(batch["text"], sentiment_model, "sentiment")),
    },
    batched=True,
)
data.to_pandas().head()
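
Each record now stores, for each question, a list of label-score dictionaries. A quick sanity check on one prediction (the scores shown are illustrative):

# Show the sentiment scores for the first record, highest first
print(sorted(data[0]["sentiment"], key=lambda d: d["score"], reverse=True))
# e.g. [{'label': 'neutral', 'score': 0.62}, {'label': 'negative', 'score': 0.25}, {'label': 'positive', 'score': 0.13}]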

Build records and push

With the data and the predictions we have produced, we can now build records (each of the data items that will be annotated by the annotation team) that include the suggestions from our models. In the case of the LabelQuestion, we will use the label that received the highest probability score, and for the MultiLabelQuestion, we will include all labels with a score above a certain threshold. In this case, we decided to go for 2/len(labels), but you can experiment with your data and decide to go for a more restrictive or more lenient threshold.

Note that more lenient thresholds (close or equal to 1/len(labels)) will suggest more labels, while more restrictive thresholds (two to three times that value) will select fewer (or no) labels.
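
As a concrete reference, banking77 has 77 topic labels, so the threshold used here is quite small:

# With 77 labels, the threshold works out to 2 / 77 ≈ 0.026
labels = dataset.question_by_name("topics").labels
print(len(labels), 2 / len(labels))  # 77 0.025974025974025976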

def add_suggestions(record):
    suggestions = []

    # get label with max score for sentiment question
    sentiment = max(record["sentiment"], key=lambda x: x["score"])["label"]
    suggestions.append({"question_name": "sentiment", "value": sentiment})

    # get all labels above a threshold for topics questions
    threshold = 2 / len(dataset.question_by_name("topics").labels)
    topics = [label["label"] for label in record["topics"] if label["score"] >= threshold]
    # apply the suggestion only if at least one label was over the threshold
    if topics:
        suggestions.append({"question_name": "topics", "value": topics})
    return suggestions

records = [rg.FeedbackRecord(fields={"text": record["text"]}, suggestions=add_suggestions(record)) for record in data]
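
Before adding the records to the dataset, it can be worth checking that the suggestions were attached as expected. A minimal check (the output will vary with your trained models):

# Inspect the text and suggestions of the first record
print(records[0].fields["text"])
print(records[0].suggestions)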

Once we are happy with the result, we can add the records to the dataset that we configured above. Finally, to visualize it and start annotating, you need to push it to Argilla. This means adding your dataset to the running Argilla server and making it available for the annotators.

dataset.add_records(records)
dataset.push_to_argilla("setfit_tutorial", workspace="admin")

This is how the UI will look with the suggestions from our models:

[Screenshot: Feedback Task dataset with suggestions made using SetFit]

Optionally, you can also save your FeedbackDataset to the Hugging Face Hub and load it from there later. Refer to the documentation for more information on how to do this.

# Push to the Hugging Face Hub
dataset.push_to_huggingface("argilla/my-dataset")

# Load a public dataset
dataset = rg.FeedbackDataset.from_huggingface("argilla/my-dataset")

Conclusion

In this tutorial, we have covered how to add suggestions to a Feedback Task dataset using a zero-shot approach with the SetFit library. This will improve the efficiency of the labeling process by lowering the number of decisions and edits that the annotation team must make.

To learn more about SetFit, check out these links:

  • SetFit repository on GitHub: https://github.com/huggingface/setfit
  • SetFit documentation: https://huggingface.co/docs/setfit
