Open-Source AI Cookbook documentation

Annotate text data using Active Learning with Cleanlab


Authored by: Aravind Putrevu

In this notebook, I highlight the use of active learning to improve a fine-tuned Hugging Face Transformer for text classification, while keeping the total number of collected labels from human annotators low. When resource constraints prevent you from acquiring labels for the entirety of your data, active learning aims to save both time and money by selecting which examples data annotators should spend their effort labeling.

What is Active Learning?

Active Learning helps prioritize what data to label in order to maximize the performance of a supervised machine learning model trained on the labeled data. This process usually happens iteratively: at each round, active learning tells us which examples we should collect additional annotations for to improve our current model the most under a limited labeling budget. ActiveLab is an active learning algorithm that is particularly useful when the labels coming from human annotators are noisy and we must decide whether to collect one more annotation for a previously annotated example (whose label seems suspect) or for a not-yet-annotated example. After collecting these new annotations for a batch of data to increase our training dataset, we re-train our model and evaluate its test accuracy.


In this notebook, I consider a binary text classification task: predicting whether a specific phrase is polite or impolite.

In the experiments reported at the end of this notebook, active learning with ActiveLab clearly outperforms random selection when collecting additional annotations for Transformer models. The resulting models have roughly 50% lower error rates, regardless of the total labeling budget.

The rest of this notebook walks through the open-source code you can use to achieve these results.

Setting up the environment

!pip install datasets==2.9.0 transformers==4.25.1 scikit-learn==1.1.2 matplotlib==3.5.3 cleanlab
import pandas as pd

pd.set_option("max_colwidth", None)
import numpy as np
import random
import transformers
import datasets
import matplotlib.pyplot as plt

from cleanlab.multiannotator import (
    get_majority_vote_label,
    get_active_learning_scores,
    get_label_quality_multiannotator,
)
from transformers import AutoTokenizer, AutoModel
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from datasets import load_dataset, Dataset, DatasetDict, ClassLabel
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from scipy.special import softmax
from datetime import datetime

Collecting and Organizing Data

Here we download the data that we need for this notebook.

dataset = load_dataset('Cleanlab/stanford-politeness')

!wget -nc -O 'extra_annotations.npy' ''

X_labeled_full = load_dataset("Cleanlab/stanford-politeness", split="labeled")
X_unlabeled = load_dataset("Cleanlab/stanford-politeness", split="unlabeled")
test = load_dataset("Cleanlab/stanford-politeness", split="test")

extra_annotations = np.load("extra_annotations.npy",allow_pickle=True).item()

Classifying the Politeness of Text

We use the Stanford Politeness Corpus as our dataset.

It is structured as a binary text classification task, to classify whether each phrase is polite or impolite. Human annotators are given a selected text phrase and they provide an (imperfect) annotation regarding its politeness: 0 for impolite and 1 for polite.

We train a Transformer classifier on the annotated data and measure model accuracy over a set of held-out test examples. I feel confident about the ground truth labels of these test examples because each one is derived from a consensus among 5 annotators.

As for the training data, we have:

  • X_labeled_full: our initial training set, a small set of 100 text examples with 2 annotations per example.
  • X_unlabeled: a large set of 1900 unlabeled text examples we can consider having annotators label.
  • extra_annotations: a pool of additional annotations we pull from when an annotation is requested for an example.
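To make the shape of these objects concrete, here is a minimal sketch with invented ids, annotator names, and labels. It assumes, as the helper code later in this notebook suggests, that the labeled set is a pandas DataFrame indexed by example id with one column per annotator, and that extra_annotations maps each example id to a dict of annotator to label. The names toy_labeled and toy_extra_annotations are hypothetical and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: ids, annotator names, and labels are invented for illustration.
toy_labeled = pd.DataFrame(
    {
        "text": ["Could you please help?", "Why wasn't it moved?"],
        "annotator_a": [1.0, 0.0],
        "annotator_b": [1.0, np.nan],  # NaN: this annotator has not labeled the example
        "annotator_c": [np.nan, 0.0],
    },
    index=[101, 102],
)

# Assumed layout of the annotation pool: example id -> {annotator: label}.
toy_extra_annotations = {
    101: {"annotator_c": 1},
    102: {"annotator_b": 1},
    203: {"annotator_a": 0, "annotator_d": 0},  # id 203 is not yet in the labeled set
}

# Dropping the text column leaves just the annotation matrix.
toy_multiannotator_labels = toy_labeled.drop("text", axis=1)
annotations_per_example = toy_multiannotator_labels.notna().sum(axis=1)
print(annotations_per_example.tolist())
```

Each row keeps NaN for annotators who have not labeled that example, so counting non-missing values per row gives the number of annotations collected so far.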

Visualize Data

# Multi-annotated Data
# Unlabeled Data
# extra_annotations contains the annotations that we will use when an additional annotation is requested.

# Random sample of extra_annotations to see format.
{k: extra_annotations[k] for k in random.sample(list(extra_annotations.keys()), 5)}

View Some Examples From Test Set

num_to_label = {0: "Impolite", 1: "Polite"}
for i in range(2):
    print(f"{num_to_label[i]} examples:")
    subset = test[test.label == i][["text"]].sample(n=3, random_state=2)
    print(subset)

Impolite examples:

120 And wasting our time as well. I can only repeat: why don’t you do constructive work by adding contents about your beloved Makedonia?
150 Rather than tell me how wrong I was to close certain afd’s maybe your time would be better spent dealing with the current afd backlog <url>. If my decisions were so wrong why haven’t you re-opened them?
326 This was supposed to have been moved to <url> per the CFD. Why wasn’t it moved?

Polite examples:

498 Hi there, I’ve raised the possibility of unprotecting the tamazepam page <url>. What are your thoughts?
132 Due to certain Edits the page alignment has changed. Could you please help?
131 I’m glad you’re pleased with the general appearance. Before I label all the streets, is the text size, font style, etc OK?

Helper Methods

The following section contains all of the helper methods needed for this notebook.

get_idx_to_label is designed for use in active learning scenarios, particularly when dealing with a mixture of labeled and unlabeled data. Its primary goal is to determine which examples (from both labeled and unlabeled datasets) should be selected for additional annotations based on their active learning scores.

# Helper method to get indices of examples with the lowest active learning score to collect more labels for.
def get_idx_to_label(
    X_labeled_full,
    X_unlabeled,
    extra_annotations,
    batch_size_to_label,
    active_learning_scores,
    active_learning_scores_unlabeled=None,
):
    if active_learning_scores_unlabeled is None:
        active_learning_scores_unlabeled = np.array([])

    to_label_idx = []
    to_label_idx_unlabeled = []

    num_labeled = len(active_learning_scores)
    active_learning_scores_combined = np.concatenate((active_learning_scores, active_learning_scores_unlabeled))
    to_label_idx_combined = np.argsort(active_learning_scores_combined)

    # We want to collect the n=batch_size best examples to collect another annotation for.
    i = 0
    while (len(to_label_idx) + len(to_label_idx_unlabeled)) < batch_size_to_label:
        idx = to_label_idx_combined[i]
        # We know this is an already annotated example.
        if idx < num_labeled:
            text_id = X_labeled_full.iloc[idx].name
            # Make sure we have an annotation left to collect.
            if text_id in extra_annotations and extra_annotations[text_id]:
                to_label_idx.append(idx)
        # We know this is an example that is currently not annotated.
        else:
            # Subtract off offset to get back original index.
            idx -= num_labeled
            text_id = X_unlabeled.iloc[idx].name
            # Make sure we have an annotation left to collect.
            if text_id in extra_annotations and extra_annotations[text_id]:
                to_label_idx_unlabeled.append(idx)
        i += 1

    to_label_idx = np.array(to_label_idx)
    to_label_idx_unlabeled = np.array(to_label_idx_unlabeled)
    return to_label_idx, to_label_idx_unlabeled
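The core of this helper is the mapping from a position in the combined score array back to either the labeled set or the unlabeled pool. The following is a minimal sketch of that step alone, using invented scores. It deliberately leaves out the check that an annotation is still available in extra_annotations.

```python
import numpy as np

# Hypothetical scores: lower means another annotation is expected to be more informative.
scores_labeled = np.array([0.9, 0.2, 0.7])
scores_unlabeled = np.array([0.4, 0.1])
batch_size = 3

num_labeled = len(scores_labeled)
combined = np.concatenate((scores_labeled, scores_unlabeled))
order = np.argsort(combined)[:batch_size]  # positions of the lowest combined scores

# Positions below num_labeled refer to the labeled set;
# the rest are shifted back to positions within the unlabeled pool.
picked_labeled = [int(i) for i in order if i < num_labeled]
picked_unlabeled = [int(i - num_labeled) for i in order if i >= num_labeled]
print(picked_labeled, picked_unlabeled)
```

Because both score arrays are concatenated before sorting, a suspect already-labeled example and a never-labeled example compete for the same labeling budget.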

get_idx_to_label_random is designed for an active learning context where the selection of data points for additional annotation is done randomly rather than based on a model’s uncertainty or learning scores. This approach might be used as a baseline to compare against more sophisticated active learning strategies or in scenarios where it’s unclear how to score examples.

# Helper method to get indices of random examples to collect more labels for.
def get_idx_to_label_random(X_labeled_full, X_unlabeled, extra_annotations, batch_size_to_label):
    to_label_idx = []
    to_label_idx_unlabeled = []

    # Generate list of indices for both sets of examples.
    labeled_idx = [(x, "labeled") for x in range(len(X_labeled_full))]
    unlabeled_idx = []
    if X_unlabeled is not None:
        unlabeled_idx = [(x, "unlabeled") for x in range(len(X_unlabeled))]
    combined_idx = labeled_idx + unlabeled_idx

    # We want to collect the n=batch_size random examples to collect another annotation for.
    while (len(to_label_idx) + len(to_label_idx_unlabeled)) < batch_size_to_label:
        # Random choice from indices.
        # We time-seed to ensure randomness.
        random.seed(datetime.now().timestamp())
        choice = random.choice(combined_idx)
        idx, which_subset = choice
        # We know this is an already annotated example.
        if which_subset == "labeled":
            text_id = X_labeled_full.iloc[idx].name
            # Make sure we have an annotation left to collect.
            if text_id in extra_annotations and extra_annotations[text_id]:
                to_label_idx.append(idx)
        # We know this is an example that is currently not annotated.
        else:
            text_id = X_unlabeled.iloc[idx].name
            # Make sure we have an annotation left to collect.
            if text_id in extra_annotations and extra_annotations[text_id]:
                to_label_idx_unlabeled.append(idx)
        # Remove the chosen index so the same example is not drawn twice.
        combined_idx.remove(choice)

    to_label_idx = np.array(to_label_idx)
    to_label_idx_unlabeled = np.array(to_label_idx_unlabeled)
    return to_label_idx, to_label_idx_unlabeled

Below are some utility methods that compute the standard deviation of accuracies, select an annotator who has previously annotated other examples, compute evaluation metrics, and tokenize text examples.

# Helper method to compute std dev across 2D array of accuracies.
def compute_std_dev(accuracy):
    def compute_std_dev_ind(accs):
        mean = np.mean(accs)
        std_dev = np.std(accs)
        return np.array([mean - std_dev, mean + std_dev])

    std_dev = np.apply_along_axis(compute_std_dev_ind, 0, accuracy)
    return std_dev
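As a quick illustration of what this helper returns, the sketch below computes the same mean ± one standard deviation band in vectorized form on invented accuracies. It is an equivalent formulation for illustration and not the notebook's own code.

```python
import numpy as np

# Hypothetical accuracies: 3 replicate runs (rows) by 2 rounds (columns).
toy_acc = np.array([[0.80, 0.90], [0.90, 0.90], [0.85, 0.90]])

mean = toy_acc.mean(axis=0)
std = toy_acc.std(axis=0)  # population std dev, matching np.std's default
band = np.stack([mean - std, mean + std])  # row 0: lower bound, row 1: upper bound
print(band.shape)
```

Row 0 and row 1 of the result are the lower and upper edges of the shaded region drawn in the final plot, with one column per round.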

# Helper method to select which annotator we should collect another annotation from.
def choose_existing(annotators, existing_annotators):
    for annotator in annotators:
        # If we find one that has already given an annotation, we return it.
        if annotator in existing_annotators:
            return annotator
    # If we don't find an existing, just return a random one.
    choice = random.choice(list(annotators.keys()))
    return choice

# Helper method for Trainer.
def compute_metrics(p):
    logits, labels = p
    pred = np.argmax(logits, axis=1)
    pred_probs = softmax(logits, axis=1)
    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    return {"logits": logits, "pred_probs": pred_probs, "accuracy": accuracy}
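To show what compute_metrics does with the raw model outputs, here is a small sketch on invented logits. It writes the softmax out by hand with NumPy instead of calling scipy.special.softmax, purely to keep the example dependency-free.

```python
import numpy as np

# Hypothetical logits for 2 examples and 2 classes (0 = impolite, 1 = polite).
logits = np.array([[-1.0, 2.0], [0.5, -0.5]])

# Numerically stable softmax along the class axis.
shifted = logits - logits.max(axis=1, keepdims=True)
pred_probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Class predictions are the argmax of the logits (equivalently, of the probabilities).
pred = np.argmax(logits, axis=1)

labels = np.array([1, 1])  # invented ground truth
accuracy = float((pred == labels).mean())
print(pred.tolist(), accuracy)
```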

# Helper method to tokenize text.
def tokenize_function(examples):
    model_name = "distilbert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Helper method to tokenize given dataset.
def tokenize_data(data):
    dataset = Dataset.from_dict({"label": data["label"], "text": data["text"].values})
    tokenized_dataset = dataset.map(tokenize_function, batched=True)
    tokenized_dataset = tokenized_dataset.cast_column("label", ClassLabel(names=["0", "1"]))
    return tokenized_dataset

The get_trainer function sets up a Trainer for a text classification task using DistilBERT, a distilled version of the BERT model that is lighter and faster.

# Helper method to initiate a new Trainer with given train and test sets.
def get_trainer(train_set, test_set):

    # Model params.
    model_name = "distilbert-base-uncased"
    model_folder = "model_training"
    max_training_steps = 300
    num_classes = 2

    # Set training args.
    # We time-seed to ensure randomness between different benchmarking runs.
    training_args = TrainingArguments(
        max_steps=max_training_steps, output_dir=model_folder, seed=int(datetime.now().timestamp())
    )

    # Tokenize train/test set.
    train_tokenized_dataset = tokenize_data(train_set)
    test_tokenized_dataset = tokenize_data(test_set)

    # Initiate a pre-trained model.
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_classes)
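    trainer = Trainer(
        model=model,
        args=training_args,
        compute_metrics=compute_metrics,
        train_dataset=train_tokenized_dataset,
        eval_dataset=test_tokenized_dataset,
    )
    return trainer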

The get_pred_probs function uses cross-validation to compute out-of-sample predicted probabilities for a given dataset, with additional handling for unlabeled data.

# Helper method to manually compute cross-validated predicted probabilities needed for ActiveLab.
def get_pred_probs(X, X_unlabeled):
    """Uses cross-validation to obtain out-of-sample predicted probabilities
    for given dataset"""

    # Generate cross-val splits.
    n_splits = 3
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True)
    skf_splits = [[train_index, test_index] for train_index, test_index in skf.split(X=X["text"], y=X["label"])]

    # Initiate empty array to store pred_probs.
    num_examples, num_classes = len(X), len(X.label.value_counts())
    pred_probs = np.full((num_examples, num_classes), np.nan)
    pred_probs_unlabeled = None

    # If we use up all examples from the initial unlabeled pool, X_unlabeled will be None.
    if X_unlabeled is not None:
        pred_probs_unlabeled = np.full((n_splits, len(X_unlabeled), num_classes), np.nan)

    # Iterate through cross-validation folds.
    for split_num, split in enumerate(skf_splits):
        train_index, test_index = split

        train_set = X.iloc[train_index]
        test_set = X.iloc[test_index]

        # Get trainer with train/test subsets.
        trainer = get_trainer(train_set, test_set)
        # Fine-tune on the training folds before predicting on the held-out fold.
        trainer.train()
        eval_metrics = trainer.evaluate()

        # Get pred_probs and insert into dataframe.
        pred_probs_fold = eval_metrics["eval_pred_probs"]
        pred_probs[test_index] = pred_probs_fold

        # Since we don't have labels for the unlabeled pool, we compute pred_probs at each round of CV
        # and then average the results at the end.
        if X_unlabeled is not None:
            dataset_unlabeled = Dataset.from_dict({"text": X_unlabeled["text"].values})
            unlabeled_tokenized_dataset = dataset_unlabeled.map(tokenize_function, batched=True)
            logits = trainer.predict(unlabeled_tokenized_dataset).predictions
            curr_pred_probs_unlabeled = softmax(logits, axis=1)
            pred_probs_unlabeled[split_num] = curr_pred_probs_unlabeled

    # Here we average the pred_probs from each round of CV to get pred_probs for the unlabeled pool.
    if X_unlabeled is not None:
        pred_probs_unlabeled = np.mean(np.array(pred_probs_unlabeled), axis=0)

    return pred_probs, pred_probs_unlabeled
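The idea of out-of-sample predicted probabilities can be sketched without a Transformer. The example below is only an illustration under strong simplifying assumptions: the data are invented 1-D features, the "model" is a nearest-centroid classifier swapped in for DistilBERT, and the folds are assigned by a simple modulo rule instead of StratifiedKFold. What it preserves is the key property that every example is scored by a model that never saw it during fitting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D features: class 0 centered at -1, class 1 centered at +1.
X = np.concatenate([rng.normal(-1, 1, 30), rng.normal(1, 1, 30)])
y = np.array([0] * 30 + [1] * 30)

n_splits = 3
fold_of = np.arange(len(X)) % n_splits  # simple fold assignment; classes stay balanced here
pred_probs = np.full((len(X), 2), np.nan)

for fold in range(n_splits):
    train, held_out = fold_of != fold, fold_of == fold
    # "Train": estimate one centroid per class from the training folds only.
    centroids = np.array([X[train & (y == c)].mean() for c in (0, 1)])
    # "Predict": softmax over negative squared distances to each centroid.
    scores = -((X[held_out][:, None] - centroids[None, :]) ** 2)
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    pred_probs[held_out] = probs / probs.sum(axis=1, keepdims=True)

# Every row has been filled exactly once, by a model that did not train on it.
print(np.isnan(pred_probs).any())
```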

The get_annotator function determines which annotator to collect a new annotation from for a specific example, preferring annotators who have already labeled other examples. The get_annotation function then collects the actual annotation for that example from the chosen annotator and deletes it from the pool so that it cannot be selected again.

# Helper method to determine which annotator to collect annotation from for given example.
def get_annotator(example_id):
    # Update who has already annotated at least one example.
    existing_annotators = set(X_labeled_full.drop("text", axis=1).columns)
    # Returns the annotator we want to collect annotation from.
    # Chooses existing annotators first.
    annotators = extra_annotations[example_id]
    chosen_annotator = choose_existing(annotators, existing_annotators)
    return chosen_annotator

# Helper method to collect an annotation for given text example.
def get_annotation(example_id, chosen_annotator):

    # Collect new annotation.
    new_annotation = extra_annotations[example_id][chosen_annotator]

    # Remove annotation.
    del extra_annotations[example_id][chosen_annotator]

    return new_annotation
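The two helpers above can be exercised on a toy pool. This sketch uses invented annotator names and a hypothetical pick_annotator function that mirrors the "prefer existing annotators" rule. It is a standalone illustration and does not touch the notebook's global state.

```python
import random

# Hypothetical pool: example id -> {annotator: label}; names and labels are invented.
pool = {7: {"annotator_x": 1, "annotator_y": 0}}
existing = {"annotator_y", "annotator_z"}  # annotators already present in the training set


def pick_annotator(candidates, existing_annotators):
    # Prefer an annotator who has labeled something before; otherwise pick at random.
    for annotator in candidates:
        if annotator in existing_annotators:
            return annotator
    return random.choice(list(candidates))


chosen = pick_annotator(pool[7], existing)
label = pool[7].pop(chosen)  # pop returns the label and removes it from the pool
print(chosen, label, pool[7])
```

Removing the annotation as it is collected is what guarantees that an example can never receive the same annotator's label twice.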

Run the following cell to hide the HTML output from the next model training block.

%%html
<style>
    div.output_stderr {
    display: none;
    }
</style>

Methodology Used

For each active learning round we:

  1. Compute ActiveLab consensus labels for each training example derived from all annotations collected thus far.
  2. Train our Transformer classification model on the current training set using these consensus labels.
  3. Evaluate test accuracy on the test set (which has high-quality ground truth labels).
  4. Run cross-validation to get out-of-sample predicted class probabilities from our model for the entire training set and unlabeled set.
  5. Get ActiveLab active learning scores for each example in the training set and unlabeled set. These scores estimate how informative it would be to collect another annotation for each example.
  6. Select a subset (n = batch_size) of examples with the lowest active learning scores.
  7. Collect one additional annotation for each of the n selected examples.
  8. Add the new annotations (and new previously non-annotated examples if selected) to our training set for the next iteration.

I subsequently compare models trained on data labeled via active learning vs. data labeled via random selection. For each random selection round, I use majority vote consensus instead of ActiveLab consensus (in Step 1) and then just randomly select the n examples to collect an additional label for instead of using ActiveLab scores (in Step 6).
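For intuition about the majority vote baseline, here is a simplified sketch on invented annotations. It is not cleanlab's get_majority_vote_label: in particular, the tie-breaking rule (take the smallest label) is an assumption made for this example, and cleanlab's own function may break ties differently.

```python
from collections import Counter

# Hypothetical annotations per example; None marks a missing annotation.
annotations = {
    "ex1": [1, 1, 0],
    "ex2": [0, None, 0],
    "ex3": [1, 0, None],  # tie between the two classes
}


def majority_vote(labels):
    counts = Counter(label for label in labels if label is not None)
    top = max(counts.values())
    # Simplifying assumption: break ties by taking the smallest label.
    return min(label for label, count in counts.items() if count == top)


consensus = {ex: majority_vote(labels) for ex, labels in annotations.items()}
print(consensus)
```

Majority vote treats every annotator as equally reliable, which is the main thing the ActiveLab consensus in Step 1 is meant to improve on.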

More intuition on ActiveLab consensus labels and active learning scores is shared later in the notebook.


Model Training and Evaluation

I first tokenize my test and train sets, and then initialize a pre-trained DistilBert Transformer model. Fine-tuning DistilBert with 300 training steps produced a good balance between accuracy and training time for my data. This classifier outputs predicted class probabilities which I convert to class predictions before evaluating their accuracy.

Use Active Learning Scores to Decide what to Label Next

During each round of Active Learning, we fit our Transformer model via 3-fold cross-validation on the current training set. This allows us to get out-of-sample predicted class probabilities for each example in the training set and we can also use the trained Transformer to get out-of-sample predicted class probabilities for each example in the unlabeled pool. All of this is internally implemented in the get_pred_probs helper method. The use of out-of-sample predictions helps us avoid bias due to potential overfitting.

Once I have these probabilistic predictions, I pass them into the get_active_learning_scores method from the open-source cleanlab package, which implements the ActiveLab algorithm. This method provides us with scores for all of our labeled and unlabeled data. Lower scores indicate data points for which collecting one additional label should be most informative for our current model (scores are directly comparable between labeled and unlabeled data).

I form a batch of examples with the lowest scores as the examples to collect an annotation for (via the get_idx_to_label method). Here I always collect the exact same number of annotations in each round (under both the active learning and random selection approaches). For this application, I limit the maximum number of annotations per example to 5, to avoid spending effort labeling the same example over and over again.

Adding new Annotations

The combined_example_ids are the ids of the text examples we want to collect an annotation for. For each of these, we use the get_annotation helper method to collect a new annotation from an annotator. Here, we prioritize selecting annotations from annotators who have already annotated another example. If none of the annotators for the given example exist in the training set, we randomly select one. In this case, we add a new column to our training set which represents the new annotator. Finally, we add the newly collected annotation to the training set. If the corresponding example was previously non-annotated, we also add it to the training set and remove it from the unlabeled collection.

We have now completed one round of collecting new annotations, and we retrain the Transformer model on the updated training set. We repeat this process over multiple rounds to keep growing the training dataset and improving our model.

# For this Active Learning demo, we add 25 additional annotations to the training set
# each iteration, for 25 rounds.
num_rounds = 25
batch_size_to_label = 25
model_accuracy_arr = np.full(num_rounds, np.nan)

# The 'selection_method' variable determines if we use ActiveLab or random selection
# to choose the new annotations each round.
selection_method = "random"
# selection_method = 'active_learning'

# Each round we:
# - train our model
# - evaluate on unchanging test set
# - collect and add new annotations to training set
for i in range(num_rounds):

    # X_labeled_full is updated each iteration. We drop the text column which leaves us with just the annotations.
    multiannotator_labels = X_labeled_full.drop(["text"], axis=1)

    # Use majority vote when using random selection to select the consensus label for each example.
    if i == 0 or selection_method == "random":
        consensus_labels = get_majority_vote_label(multiannotator_labels)

    # When using ActiveLab, use cleanlab's CrowdLab to select the consensus label for each example.
    else:
        results = get_label_quality_multiannotator(
            multiannotator_labels,
            pred_probs_labeled,
        )
        consensus_labels = results["label_quality"]["consensus_label"].values

    # We only need the text and label columns.
    train_set = X_labeled_full[["text"]].copy()
    train_set["label"] = consensus_labels
    test_set = test[["text", "label"]]

    # Train our Transformer model on the full set of labeled data to evaluate model accuracy for the current round.
    # This is an optional step for demonstration purposes, in practical applications
    # you may not have ground truth labels.
    trainer = get_trainer(train_set, test_set)
    trainer.train()
    eval_metrics = trainer.evaluate()
    # set statistics
    model_accuracy_arr[i] = eval_metrics["eval_accuracy"]

    # For ActiveLab, we need to run cross-validation to get out-of-sample predicted probabilities.
    if selection_method == "active_learning":
        pred_probs, pred_probs_unlabeled = get_pred_probs(train_set, X_unlabeled)

        # Compute active learning scores.
        active_learning_scores, active_learning_scores_unlabeled = get_active_learning_scores(
            multiannotator_labels, pred_probs, pred_probs_unlabeled
        )

        # Get the indices of examples to collect more labels for.
        chosen_examples_labeled, chosen_examples_unlabeled = get_idx_to_label(
            X_labeled_full,
            X_unlabeled,
            extra_annotations,
            batch_size_to_label,
            active_learning_scores,
            active_learning_scores_unlabeled,
        )

    # We don't need to run cross-validation, just get random examples to collect annotations for.
    if selection_method == "random":
        chosen_examples_labeled, chosen_examples_unlabeled = get_idx_to_label_random(
            X_labeled_full, X_unlabeled, extra_annotations, batch_size_to_label
        )

    unlabeled_example_ids = np.array([])
    # Check to see if we still have unlabeled examples left.
    if X_unlabeled is not None:
        # Get unlabeled text examples we want to collect annotations for.
        new_text = X_unlabeled.iloc[chosen_examples_unlabeled]
        unlabeled_example_ids = new_text.index.values
        num_ex, num_annot = len(new_text), multiannotator_labels.shape[1]
        empty_annot = pd.DataFrame(
            data=np.full((num_ex, num_annot), np.nan),
            columns=multiannotator_labels.columns,
            index=unlabeled_example_ids,
        )
        new_unlabeled_df = pd.concat([new_text, empty_annot], axis=1)

        # Combine unlabeled text examples with existing, labeled examples.
        X_labeled_full = pd.concat([X_labeled_full, new_unlabeled_df], axis=0)

        # Remove examples from X_unlabeled and check if empty.
        # Once it is empty we set it to None to handle appropriately elsewhere.
        X_unlabeled = X_unlabeled.drop(new_text.index)
        if X_unlabeled.empty:
            X_unlabeled = None

    if selection_method == "active_learning":
        # Update pred_prob arrays with newly added examples if necessary.
        if pred_probs_unlabeled is not None and len(chosen_examples_unlabeled) != 0:
            pred_probs_new = pred_probs_unlabeled[chosen_examples_unlabeled, :]
            pred_probs_labeled = np.concatenate((pred_probs, pred_probs_new))
            pred_probs_unlabeled = np.delete(pred_probs_unlabeled, chosen_examples_unlabeled, axis=0)
        # Otherwise we have nothing to modify.
        else:
            pred_probs_labeled = pred_probs

    # Get combined list of text ID's to relabel.
    labeled_example_ids = X_labeled_full.iloc[chosen_examples_labeled].index.values
    combined_example_ids = np.concatenate([labeled_example_ids, unlabeled_example_ids])

    # Now we collect annotations for the selected examples.
    for example_id in combined_example_ids:
        # Choose which annotator to collect annotation from.
        chosen_annotator = get_annotator(example_id)
        # Collect new annotation.
        new_annotation = get_annotation(example_id, chosen_annotator)
        # New annotator has been selected.
        if chosen_annotator not in X_labeled_full.columns.values:
            empty_col = np.full((len(X_labeled_full),), np.nan)
            X_labeled_full[chosen_annotator] = empty_col

        # Add selected annotation to the training set.
        X_labeled_full.at[example_id, chosen_annotator] = new_annotation


I ran 25 rounds of active learning (labeling batches of data and retraining the Transformer model), collecting 25 annotations in each round. I then repeated all of this using random selection to choose which examples to annotate in each round, as a baseline comparison. Before additional data are annotated, both approaches start with the same initial training set of 100 examples (hence achieving roughly the same Transformer accuracy in the first round). Because of inherent stochasticity in training Transformers, I ran this entire process five times (for each data labeling strategy) and report the standard deviation (shaded region) and mean (solid line) of test accuracies across the five replicate runs.

# Get numpy array of results.
!wget -nc -O 'random_acc.npy' ''
!wget -nc -O 'activelearn_acc.npy' ''
al_acc = np.load("activelearn_acc.npy")
rand_acc = np.load("random_acc.npy")

rand_acc_std = compute_std_dev(rand_acc)
al_acc_std = compute_std_dev(al_acc)

plt.plot(range(1, al_acc.shape[1] + 1), np.mean(al_acc, axis=0), label="active learning", color="green")
plt.fill_between(range(1, al_acc.shape[1] + 1), al_acc_std[0], al_acc_std[1], alpha=0.3, color="green")

plt.plot(range(1, rand_acc.shape[1] + 1), np.mean(rand_acc, axis=0), label="random", color="red")
plt.fill_between(range(1, rand_acc.shape[1] + 1), rand_acc_std[0], rand_acc_std[1], alpha=0.1, color="red")

plt.hlines(y=0.9, xmin=1.0, xmax=25.0, color="black", linestyle="dotted")
plt.legend()
plt.xlabel("Round Number")
plt.ylabel("Test Accuracy")
plt.title("ActiveLab vs Random Annotation Selection --- 5 Runs")
plt.savefig("al-results.png")

We see that choosing what data to annotate next has drastic effects on model performance. Active learning using ActiveLab consistently outperforms random selection by a significant margin at each round. For example, in round 4 with 275 total annotations in the training set, we obtain 91% accuracy via active learning vs. only 76% accuracy without a clever selection strategy of what to annotate. Overall, the Transformer models fit on the dataset constructed via active learning have roughly half the error rate of those fit on randomly selected data, no matter the total labeling budget!
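The round 4 figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this notebook. The helper annotations_at_round is a hypothetical name, and it assumes the count refers to the annotations available when a round's model is trained.

```python
# Accuracies quoted above for round 4.
acc_active, acc_random = 0.91, 0.76

err_active = 1 - acc_active
err_random = 1 - acc_random
error_ratio = err_active / err_random  # active learning error relative to random selection


# 100 initial examples x 2 annotations each, plus 25 annotations per completed round.
def annotations_at_round(round_number, initial=100 * 2, per_round=25):
    return initial + per_round * (round_number - 1)


print(round(error_ratio, 3), annotations_at_round(4))
```

For this particular round the error rate under active learning is well under half of the random-selection error rate, which is in line with the "roughly half" summary across rounds.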

When labeling data for text classification, you should consider active learning with the re-labeling option to better account for imperfect annotators.
