Vision Transformers for Object Detection

This section will describe how object detection tasks are achieved using Vision Transformers. We will understand how to fine-tune existing pre-trained object detection models for our use case. Before starting, check out this HuggingFace Space, where you can play around with the final output.

Introduction

Object detection example

Object detection is a computer vision task that involves identifying and localizing objects within an image or video. It consists of two main steps:

  • First, recognizing the types of objects present (such as cars, people, or animals),
  • Second, determining their precise locations by drawing bounding boxes around them.

These models typically receive images (static or frames from videos) as their inputs, with multiple objects present in each image. For example, consider an image containing several objects such as cars, people, bicycles, and so on. Upon processing the input, these models produce a set of numbers that convey the following information:

  • Location of the object (XY coordinates of the bounding box)
  • Class of the object.
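
For example, for a single image a detector might return something like the following (a hypothetical sketch with made-up numbers, using the score/label/box dictionary format that 🤗 Transformers object-detection pipelines produce):

# Hypothetical output for one image: one entry per detected object, with a
# confidence score, a class label, and bounding box corners in pixels.
detections = [
    {"score": 0.97, "label": "person", "box": {"xmin": 52, "ymin": 40, "xmax": 210, "ymax": 330}},
    {"score": 0.91, "label": "bicycle", "box": {"xmin": 80, "ymin": 160, "xmax": 250, "ymax": 340}},
]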

There are many applications of object detection. One of the most significant is in the field of autonomous driving, where object detection is used to detect different objects around the car (such as pedestrians, road signs, and traffic lights) that become inputs for the car’s decision-making.

To deepen your understanding of the ins-and-outs of object detection, check out our dedicated chapter on Object Detection 🤗

The Need to Fine-tune Models in Object Detection 🤔

Why fine-tune instead of training from scratch? Training an object detection model from scratch means:

  • Repeating research that has already been done.
  • Writing repetitive model code, training it, and maintaining separate repositories for different use cases.
  • A lot of experimentation and wasted resources.

Rather than doing all this, take a well-performing pre-trained model (one that already does an awesome job of recognizing general features) and tweak or re-tune its weights (or some part of them) to adapt it to your use case. We assume that the pre-trained model has already learned enough to extract the significant features inside an image needed to locate and classify objects. So, if new objects are introduced, the same model can be trained for a short time with little compute to start detecting those new objects, using the features it has already learned along with the new ones.

By the end of this tutorial, you should be able to build a full pipeline (from loading datasets, to fine-tuning a model, to running inference) for an object detection use case.

Installing Necessary Libraries

Let’s start with installation. Just execute the cell below to install the necessary packages. For this tutorial, we will be using Hugging Face Transformers and PyTorch.

!pip install -U -q datasets transformers[torch] evaluate timm albumentations accelerate

Scenario

To make this tutorial interesting, let’s consider a real-world example. Consider this scenario: construction workers require the utmost safety when working in construction areas. Basic safety protocol requires wearing a helmet at all times. Since there are many construction workers, it is hard to keep an eye on everyone all the time.

But, if we can have a camera system that can detect persons and whether the person is wearing a helmet or not in real-time, that would be awesome, right?

So, we are going to fine-tune a lightweight object detection model for doing just that. Let’s dive in.

Dataset

For the above scenario, we will use the hardhat dataset provided by Northeastern University, China. We can download and load this dataset with 🤗 datasets.

from datasets import load_dataset

dataset = load_dataset("anindya64/hardhat")
dataset

This will give you the following data structure:

DatasetDict({
    train: Dataset({
        features: ['image', 'image_id', 'width', 'height', 'objects'],
        num_rows: 5297
    })
    test: Dataset({
        features: ['image', 'image_id', 'width', 'height', 'objects'],
        num_rows: 1766
    })
})

Above is a DatasetDict, an efficient dict-like structure containing the whole dataset in train and test splits. As you can see, under each split (train and test) we have features and num_rows. Under features we have image (a Pillow object), the id of the image, its height and width, and objects. Now let’s see what a datapoint (in the train/test set) looks like. To do that, run the following line:

dataset["train"][0]

And this will give you the following structure:

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=500x375>,
 'image_id': 1,
 'width': 500,
 'height': 375,
 'objects': {'id': [1, 1],
  'area': [3068.0, 690.0],
  'bbox': [[178.0, 84.0, 52.0, 59.0], [111.0, 144.0, 23.0, 30.0]],
  'category': ['helmet', 'helmet']}}

As you can see, objects is another dict containing the object ids (which are the class ids here), the areas of the objects, the bounding box coordinates (bbox), and the categories (labels). Here is a more detailed explanation of each key and value of a data element.

  • image: A Pillow Image object that lets you look at the image directly, without loading it from a path first.
  • image_id: The identifier of the image within the training set.
  • width: The width of the image.
  • height: The height of the image.
  • objects: Another dictionary containing the annotation information. It contains the following:
    • id: A list whose length is the number of objects; each value is the class index of an object.
    • area: The areas of the objects.
    • bbox: The bounding box coordinates of each object.
    • category: The class (string) of each object.

Now let’s properly extract the train and test samples. For this tutorial, we have around 5000 training samples and 1700 test samples.

# First, extract out the train and test set

train_dataset = dataset["train"]
test_dataset = dataset["test"]
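
Later, when we load the model, we will need id2label and label2id maps for the classes. We will define them explicitly there, but here is a small sketch (assuming the id and category fields of each annotation line up one-to-one) of how they could be derived directly from the training annotations:

# Sketch: build id2label / label2id by scanning the training annotations.
id2label = {}
for objects in train_dataset["objects"]:
    for class_id, class_name in zip(objects["id"], objects["category"]):
        id2label[class_id] = class_name

label2id = {name: idx for idx, name in id2label.items()}
print(id2label)  # expected to contain entries like {0: 'head', 1: 'helmet', 2: 'person'}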

Now that we know what a sample data point contains, let’s start by plotting that sample. Here we will first draw the image and then draw the corresponding bounding boxes.

Here is what we are going to do:

  1. Get the image and its corresponding height and width.
  2. Make a draw object that can easily draw text and lines on image.
  3. Get the annotations dict from the sample.
  4. Iterate over it.
  5. For each annotation, get the bounding box coordinates: x (where the bounding box starts horizontally), y (where the bounding box starts vertically), w (the width of the bounding box), and h (the height of the bounding box).
  6. If the bounding box values are normalized, scale them to pixel coordinates; otherwise leave them as they are.
  7. Finally, draw the rectangle and the class category text.

import numpy as np
from PIL import Image, ImageDraw


def draw_image_from_idx(dataset, idx):
    sample = dataset[idx]
    image = sample["image"]
    annotations = sample["objects"]
    draw = ImageDraw.Draw(image)
    width, height = sample["width"], sample["height"]

    for i in range(len(annotations["id"])):
        box = annotations["bbox"][i]
        class_idx = annotations["id"][i]
        x, y, w, h = tuple(box)
        if max(box) > 1.0:
            x1, y1 = int(x), int(y)
            x2, y2 = int(x + w), int(y + h)
        else:
            x1 = int(x * width)
            y1 = int(y * height)
            x2 = int((x + w) * width)
            y2 = int((y + h) * height)
        draw.rectangle((x1, y1, x2, y2), outline="red", width=1)
        draw.text((x1, y1), annotations["category"][i], fill="white")
    return image


draw_image_from_idx(dataset=train_dataset, idx=10)

Now that we have a function to plot a single image, let’s write a simple function on top of it to plot multiple images. This will help us with some analysis.

import matplotlib.pyplot as plt


def plot_images(dataset, indices):
    """
    Plot images and their annotations.
    """
    num_rows = len(indices) // 3
    num_cols = 3
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 10))

    for i, idx in enumerate(indices):
        row = i // num_cols
        col = i % num_cols

        # Draw image
        image = draw_image_from_idx(dataset, idx)

        # Display image on the corresponding subplot
        axes[row, col].imshow(image)
        axes[row, col].axis("off")

    plt.tight_layout()
    plt.show()


# Now use the function to plot images

plot_images(train_dataset, range(9))

Running the function will give us a beautiful collage shown below.

input-image-plot

AutoImageProcessor

Before fine-tuning the model, we must preprocess the data so that it matches exactly the approach used during pre-training. The Hugging Face AutoImageProcessor takes care of processing the image data to create the pixel_values, pixel_mask, and labels that a DETR model can train with.

Now, let’s instantiate the image processor from the same checkpoint as the model we want to fine-tune.

from transformers import AutoImageProcessor

checkpoint = "facebook/detr-resnet-50-dc5"
image_processor = AutoImageProcessor.from_pretrained(checkpoint)
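
As an optional sanity check, you can run the processor on a single raw image and inspect the tensor it produces; the exact spatial size depends on the processor’s resizing configuration:

# Run the image processor on one sample image (no annotations yet).
sample = train_dataset[0]
encoding = image_processor(images=sample["image"], return_tensors="pt")
print(encoding["pixel_values"].shape)  # e.g. torch.Size([1, 3, H, W])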

Preprocessing the Dataset

Before passing the images to the image_processor, let’s also apply different types of augmentations to the images along with their corresponding bounding boxes.

In simple terms, augmentations are a set of random transformations such as rotations and resizing. They are applied to get more varied samples and to make the vision model more robust to different image conditions. We will use the albumentations library to achieve this. It lets you create random transformations of the images so that your effective sample size for training increases.

import albumentations
import numpy as np
import torch

transform = albumentations.Compose(
    [
        albumentations.Resize(480, 480),
        albumentations.HorizontalFlip(p=1.0),
        albumentations.RandomBrightnessContrast(p=1.0),
    ],
    bbox_params=albumentations.BboxParams(format="coco", label_fields=["category"]),
)
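
Before wiring this into the dataset, you can sanity-check the transform on a single sample (a small optional sketch; note that the keyword category must match the label_fields we declared above):

# Apply the augmentation pipeline to one sample and inspect the outputs.
sample = train_dataset[0]
out = transform(
    image=np.array(sample["image"].convert("RGB")),
    bboxes=sample["objects"]["bbox"],
    category=sample["objects"]["id"],
)
print(out["image"].shape)  # (480, 480, 3) after resizing
print(out["bboxes"])       # transformed COCO-format boxes
print(out["category"])     # labels carried along with the boxes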

Once we have initialized all the transformations, we need to make a function that formats the annotations and returns a list of annotations in a very specific format.

This is because the image_processor expects the annotations to be in the following format: {'image_id': int, 'annotations': List[Dict]}, where each dictionary is a COCO object annotation.

def formatted_anns(image_id, category, area, bbox):
    annotations = []
    for i in range(0, len(category)):
        new_ann = {
            "image_id": image_id,
            "category_id": category[i],
            "isCrowd": 0,
            "area": area[i],
            "bbox": list(bbox[i]),
        }
        annotations.append(new_ann)

    return annotations
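
For example, running it on the first training sample (a quick sketch) produces one COCO-style annotation dict per object:

# Sketch: format the annotations of a single training sample.
sample = train_dataset[0]
anns = formatted_anns(
    sample["image_id"],
    sample["objects"]["id"],
    sample["objects"]["area"],
    sample["objects"]["bbox"],
)
print(anns[0])  # {'image_id': 1, 'category_id': 1, 'isCrowd': 0, 'area': 3068.0, 'bbox': [178.0, 84.0, 52.0, 59.0]}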

Finally, we combine the image and annotation transformations to transform a whole batch of the dataset.

Here is the final code to do so:

# transforming a batch


def transform_aug_ann(examples):
    image_ids = examples["image_id"]
    images, bboxes, area, categories = [], [], [], []
    for image, objects in zip(examples["image"], examples["objects"]):
        image = np.array(image.convert("RGB"))[:, :, ::-1]
        out = transform(image=image, bboxes=objects["bbox"], category=objects["id"])

        area.append(objects["area"])
        images.append(out["image"])
        bboxes.append(out["bboxes"])
        categories.append(out["category"])

    targets = [
        {"image_id": id_, "annotations": formatted_anns(id_, cat_, ar_, box_)}
        for id_, cat_, ar_, box_ in zip(image_ids, categories, area, bboxes)
    ]

    return image_processor(images=images, annotations=targets, return_tensors="pt")

Finally, all you have to do is apply this preprocessing function to the entire dataset. You can achieve this by using the 🤗 Datasets with_transform method.

# Apply transformations for both train and test dataset

train_dataset_transformed = train_dataset.with_transform(transform_aug_ann)
test_dataset_transformed = test_dataset.with_transform(transform_aug_ann)

Now let’s see what a transformed train dataset sample looks like:

train_dataset_transformed[0]

This returns a dictionary of tensors. What we mainly need here are pixel_values, which represent the image, pixel_mask, which is the attention mask, and labels. Here is what one data point looks like:

{'pixel_values': tensor([[[-0.1657, -0.1657, -0.1657,  ..., -0.3369, -0.4739, -0.5767],
          [-0.1657, -0.1657, -0.1657,  ..., -0.3369, -0.4739, -0.5767],
          [-0.1657, -0.1657, -0.1828,  ..., -0.3541, -0.4911, -0.5938],
          ...,
          [-0.4911, -0.5596, -0.6623,  ..., -0.7137, -0.7650, -0.7993],
          [-0.4911, -0.5596, -0.6794,  ..., -0.7308, -0.7993, -0.8335],
          [-0.4911, -0.5596, -0.6794,  ..., -0.7479, -0.8164, -0.8507]],
 
         [[-0.0924, -0.0924, -0.0924,  ...,  0.0651, -0.0749, -0.1800],
          [-0.0924, -0.0924, -0.0924,  ...,  0.0651, -0.0924, -0.2150],
          [-0.0924, -0.0924, -0.1099,  ...,  0.0476, -0.1275, -0.2500],
          ...,
          [-0.0924, -0.1800, -0.3200,  ..., -0.4426, -0.4951, -0.5301],
          [-0.0924, -0.1800, -0.3200,  ..., -0.4601, -0.5126, -0.5651],
          [-0.0924, -0.1800, -0.3200,  ..., -0.4601, -0.5301, -0.5826]],
 
         [[ 0.1999,  0.1999,  0.1999,  ...,  0.6705,  0.5136,  0.4091],
          [ 0.1999,  0.1999,  0.1999,  ...,  0.6531,  0.4962,  0.3916],
          [ 0.1999,  0.1999,  0.1825,  ...,  0.6356,  0.4614,  0.3568],
          ...,
          [ 0.4788,  0.3916,  0.2696,  ...,  0.1825,  0.1302,  0.0953],
          [ 0.4788,  0.3916,  0.2696,  ...,  0.1651,  0.0953,  0.0605],
          [ 0.4788,  0.3916,  0.2696,  ...,  0.1476,  0.0779,  0.0431]]]),
 'pixel_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         ...,
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1]]),
 'labels': {'size': tensor([800, 800]), 'image_id': tensor([1]), 'class_labels': tensor([1, 1]), 'boxes': tensor([[0.5920, 0.3027, 0.1040, 0.1573],
         [0.7550, 0.4240, 0.0460, 0.0800]]), 'area': tensor([8522.2217, 1916.6666]), 'iscrowd': tensor([0, 0]), 'orig_size': tensor([480, 480])}}
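
Note that the boxes inside labels are no longer in the COCO (x, y, w, h) pixel format: the image processor converts them to normalized (center_x, center_y, width, height) values, which is what DETR expects. If you ever need pixel corner coordinates back (for plotting, for instance), a small helper like the one below converts them; this function is our own sketch, not part of any library:

import torch


def cxcywh_to_xyxy(boxes, image_width, image_height):
    # Convert normalized (cx, cy, w, h) boxes to pixel (x1, y1, x2, y2).
    cx, cy, w, h = boxes.unbind(-1)
    x1 = (cx - 0.5 * w) * image_width
    y1 = (cy - 0.5 * h) * image_height
    x2 = (cx + 0.5 * w) * image_width
    y2 = (cy + 0.5 * h) * image_height
    return torch.stack([x1, y1, x2, y2], dim=-1)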

We are almost there 🚀. As a last preprocessing step, we need to write a custom collate_fn. Now, what is a collate_fn?

A collate_fn is responsible for taking a list of samples from a dataset and converting them into a batch that matches the model’s input format.

In general, a data collator performs tasks such as padding and truncation. In a custom collate function, we define how the data should be grouped into batches, or simply, how each batch is represented.

The data collator mainly puts the data together and then preprocesses it. Let’s write our collate function.

def collate_fn(batch):
    pixel_values = [item["pixel_values"] for item in batch]
    encoding = image_processor.pad(pixel_values, return_tensors="pt")
    labels = [item["labels"] for item in batch]
    batch = {}
    batch["pixel_values"] = encoding["pixel_values"]
    batch["pixel_mask"] = encoding["pixel_mask"]
    batch["labels"] = labels
    return batch
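
To convince yourself the collator works, you can batch a couple of transformed samples through a PyTorch DataLoader (an optional sketch; the Trainer will do the equivalent internally):

from torch.utils.data import DataLoader

# Batch two transformed samples through our collate function.
loader = DataLoader(train_dataset_transformed, batch_size=2, collate_fn=collate_fn)
batch = next(iter(loader))
print(batch["pixel_values"].shape)  # (2, 3, H, W) after padding to the largest image
print(batch["pixel_mask"].shape)    # (2, H, W)
print(len(batch["labels"]))         # one label dict per image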

Training a DETR Model

So, all the heavy lifting is done. Now, all that is left is to assemble the pieces of the puzzle one by one. Let’s go!

The training procedure involves the following steps:

  1. Loading the base (pre-trained) model with AutoModelForObjectDetection using the same checkpoint as in the preprocessing.

  2. Defining all the hyperparameters and additional arguments inside TrainingArguments.

  3. Passing the training arguments to the Hugging Face Trainer, along with the model, the datasets, and the image processor.

  4. Calling the train() method to fine-tune your model.

When loading the model from the same checkpoint that you used for the preprocessing, remember to pass the label2id and id2label maps for the dataset’s classes, defined below. Additionally, we specify ignore_mismatched_sizes=True to replace the existing classification head with a new one.

from transformers import AutoModelForObjectDetection

id2label = {0: "head", 1: "helmet", 2: "person"}
label2id = {v: k for k, v in id2label.items()}


model = AutoModelForObjectDetection.from_pretrained(
    checkpoint,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,
)

Before proceeding further, log in to the Hugging Face Hub so that you can upload your model on the fly while training. This way, you do not need to handle the checkpoints and save them somewhere yourself.

from huggingface_hub import notebook_login

notebook_login()

Once done, let’s start training the model. We start by defining the training arguments and creating a Trainer object that uses them, as shown here:

from transformers import TrainingArguments
from transformers import Trainer

# Define the training arguments

training_args = TrainingArguments(
    output_dir="detr-resnet-50-hardhat-finetuned",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    max_steps=1000,
    fp16=True,
    save_steps=10,
    logging_steps=30,
    learning_rate=1e-5,
    weight_decay=1e-4,
    save_total_limit=2,
    remove_unused_columns=False,
    push_to_hub=True,
)

# Define the trainer

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    train_dataset=train_dataset_transformed,
    eval_dataset=test_dataset_transformed,
    tokenizer=image_processor,
)

trainer.train()
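
Since push_to_hub=True, checkpoints are pushed to the Hub during training, but you can also push the final weights and a model card explicitly once training ends:

# Optional: push the final model and a model card to the Hub.
trainer.push_to_hub()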

Once training is finished, you can delete the model, because the checkpoints are already uploaded to the Hugging Face Hub.

del model
torch.cuda.empty_cache()  # release the cached GPU memory

Testing and Inference

Now we will run inference with our newly fine-tuned model. For this tutorial, we will test it on this image:

input-test-image

Here we first write some very simple code to run object detection inference on new images. We start by inferencing on a single image, and after that we will club everything together and make a function out of it.

import requests
from transformers import pipeline

# download a sample image

url = "https://huggingface.co/datasets/hf-vision/course-assets/blob/main/test-helmet-object-detection.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# make the object detection pipeline

obj_detector = pipeline(
    "object-detection", model="anindya64/detr-resnet-50-dc5-hardhat-finetuned"
)
results = obj_detector(image)

print(results)

Now let’s make a very simple function to plot the results on our image. We get the score, label, and corresponding bounding box coordinates from results, which we then use to draw on the image.

def plot_results(image, results, threshold=0.7):
    image = Image.fromarray(np.uint8(image))
    draw = ImageDraw.Draw(image)
    for result in results:
        score = result["score"]
        label = result["label"]
        box = list(result["box"].values())
        if score > threshold:
            x, y, x2, y2 = tuple(box)
            draw.rectangle((x, y, x2, y2), outline="red", width=1)
            draw.text((x, y), label, fill="white")
            draw.text(
                (x + 0.5, y - 0.5),
                text=str(score),
                fill="green" if score > 0.7 else "red",
            )
    return image

And finally, let’s use this function on the same test image.

plot_results(image, results)

And this will plot the output below:

output-test-image-plot

Now, let’s club everything together into a simple function.

def predict(image, pipeline, threshold=0.7):
    results = pipeline(image)
    return plot_results(image, results, threshold)


# Let's test for another test image

img = test_dataset[0]["image"]
predict(img, obj_detector)

Let’s also plot multiple images using our inference function on a small test sample.

from tqdm.auto import tqdm


def plot_images(dataset, indices):
    """
    Plot images and their annotations.
    """
    num_rows = len(indices) // 3
    num_cols = 3
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 10))

    for i, idx in tqdm(enumerate(indices), total=len(indices)):
        row = i // num_cols
        col = i % num_cols

        # Draw image
        image = predict(dataset[idx]["image"], obj_detector)

        # Display image on the corresponding subplot
        axes[row, col].imshow(image)
        axes[row, col].axis("off")

    plt.tight_layout()
    plt.show()


plot_images(test_dataset, range(6))

Running this function will give us an output like this:

test-sample-output-plot

Well, that’s not bad. We can improve the results if we fine-tune further. You can find this fine-tuned checkpoint here.
