Image Segmentation

Image segmentation is the task of dividing an image into meaningful segments by creating masks that highlight each object in the picture. Intuitively, the task can be viewed as a classification of each pixel of the image. Segmentation models are core models in various industries, such as agriculture and autonomous driving. In farming, they are used to identify different land sections and assess the growth stage of crops. They are also key players in self-driving cars, where they identify lanes, sidewalks, and other road users.

Different types of segmentation can be applied depending on the context and the intended goal. The most commonly used types are the following.

  • Semantic Segmentation: This involves assigning the most probable class to each pixel. For example, a semantic segmentation model does not distinguish between two individual cats; it only cares about each pixel's class, so both cats end up in the same mask.
  • Instance Segmentation: This type involves identifying each instance of an object with a unique mask. It combines aspects of object detection and segmentation to differentiate between individual objects of the same class.
  • Panoptic Segmentation: A hybrid approach that combines elements of semantic and instance segmentation. It assigns a class and an instance to each pixel, effectively integrating the what and where aspects of the image.

(Figure: comparison of semantic, instance, and panoptic segmentation)

Choosing the right segmentation type depends on the context and the intended goal. One nice property of recent models is that they can perform all three segmentation types with a single model. We recommend checking out this article, which introduces Mask2Former, a model by Meta that achieves all three segmentation types while being trained only on a panoptic dataset.
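
As a quick illustration, here is a minimal sketch of how a single Mask2Former checkpoint can be post-processed into semantic, instance, and panoptic outputs with transformers (the facebook/mask2former-swin-tiny-coco-panoptic checkpoint below is just one example choice):

import torch
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

# Sketch: one Mask2Former checkpoint, three kinds of segmentation outputs
checkpoint = "facebook/mask2former-swin-tiny-coco-panoptic"
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = Mask2FormerForUniversalSegmentation.from_pretrained(checkpoint)

image = Image.open("path/to/image").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The same raw outputs can be post-processed into each segmentation type
target_sizes = [image.size[::-1]]  # (height, width) of the original image
semantic_map = processor.post_process_semantic_segmentation(outputs, target_sizes=target_sizes)[0]
instance_result = processor.post_process_instance_segmentation(outputs, target_sizes=target_sizes)[0]
panoptic_result = processor.post_process_panoptic_segmentation(outputs, target_sizes=target_sizes)[0]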

Modern Approach: Vision Transformer-based Segmentation

You’ve probably heard of U-Net, a popular network used for image segmentation. It is built from several convolutional layers and works in two main phases: a downsampling phase, which compresses the image to extract its features, and an upsampling phase, which expands the compressed features back to the original resolution to produce a detailed segmentation map.
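
To make the two phases concrete, here is a minimal, illustrative PyTorch sketch (not the full U-Net architecture) with one downsampling block, one upsampling block, and a skip connection between them:

import torch
import torch.nn as nn

# Minimal encoder-decoder sketch with a single skip connection (illustrative only)
class TinyUNet(nn.Module):
    def __init__(self, in_channels=3, num_classes=2):
        super().__init__()
        # Downsampling phase: compress the image while extracting features
        self.down = nn.Sequential(nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU())
        # Upsampling phase: expand back to the original resolution
        self.up = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.out = nn.Conv2d(128, num_classes, 1)  # 128 = 64 (skip) + 64 (upsampled)

    def forward(self, x):
        features = self.down(x)                          # full-resolution features
        bottleneck = self.bottleneck(self.pool(features))
        upsampled = self.up(bottleneck)                  # back to full resolution
        merged = torch.cat([features, upsampled], dim=1) # skip connection
        return self.out(merged)                          # per-pixel class logits

logits = TinyUNet()(torch.randn(1, 3, 64, 64))  # shape: (1, num_classes, 64, 64)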

Computer vision was once dominated by convolutional models, but it has recently shifted towards vision transformer-based approaches. One example is the Segment Anything Model (SAM), a popular promptable model introduced in April 2023 by Meta AI Research, FAIR. The model is based on the Vision Transformer (ViT) and focuses on being promptable (i.e., you can provide prompts such as points or boxes indicating what you would like to segment in the image) and capable of zero-shot transfer to new images. The strength of the model comes from its training on the largest segmentation dataset available to date, which includes over 1 billion masks on 11 million images. We recommend playing with Meta’s demo on a few images, and even better, you can try the model in transformers.

Here is an example of how to use the model in transformers. First, we initialize the mask-generation pipeline. Then, we pass an image to the pipeline for inference.

from PIL import Image
from transformers import pipeline

# Initialize the mask-generation pipeline with a SAM checkpoint
# (device=0 runs on the first GPU; drop it to run on CPU)
pipe = pipeline("mask-generation", model="facebook/sam-vit-base", device=0)

# Load an image and generate masks for the objects in it
raw_image = Image.open("path/to/image").convert("RGB")
masks = pipe(raw_image)

More details on how to use the model can be found in the documentation.
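
Beyond generating every mask in an image, SAM can also be prompted directly. Here is a sketch, reusing the same facebook/sam-vit-base checkpoint (the point coordinates are just an example), that segments whatever object lies under a single 2D point:

import torch
from PIL import Image
from transformers import SamModel, SamProcessor

processor = SamProcessor.from_pretrained("facebook/sam-vit-base")
model = SamModel.from_pretrained("facebook/sam-vit-base")

raw_image = Image.open("path/to/image").convert("RGB")
input_points = [[[450, 600]]]  # example (x, y) point; pick a point on the object you care about

inputs = processor(raw_image, input_points=input_points, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Post-process the low-resolution predicted masks back to the original image size
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu()
)
scores = outputs.iou_scores  # the model's own estimate of mask quality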

How to Evaluate a Segmentation Model?

You have now seen how to use a segmentation model, but how can you evaluate it? As demonstrated in the previous section, segmentation is primarily a supervised learning task: the dataset is composed of images and their corresponding masks, which serve as the ground truth. A few metrics can be used to evaluate your model (a small code sketch computing them follows this list). The most common ones are:

  • The Intersection over Union (IoU) or Jaccard index metric is the ratio between the intersection and the union of the predicted mask and the ground truth. IoU is arguably the most common metric used in segmentation tasks. Its advantage lies in being less sensitive to class imbalance, making it often a good choice when you begin modeling.

$$\text{IoU} = \frac{|A \cap B|}{|A \cup B|}$$

where A is the predicted mask and B is the ground truth mask.

  • Pixel accuracy: Pixel accuracy is calculated as the ratio of the number of correctly classified pixels to the total number of pixels. While intuitive, this metric can be misleading due to its sensitivity to class imbalance.

$$\text{Pixel accuracy} = \frac{\text{number of correctly classified pixels}}{\text{total number of pixels}}$$

  • Dice coefficient: The Dice coefficient is the ratio between twice the intersection and the sum of the sizes of the predicted mask and the ground truth. It measures the overlap between the prediction and the ground truth, and it’s a good metric to use when you need sensitivity to small differences in overlap.

$$\text{Dice} = \frac{2\,|A \cap B|}{|A| + |B|}$$
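
As a concrete illustration, here is a small NumPy sketch computing the three metrics for a pair of binary masks (boolean arrays of the same shape). It is illustrative only; for real evaluations you would typically rely on an existing implementation, such as the mean IoU metric in the evaluate library.

import numpy as np

# Illustrative metric computations for binary masks (assumes non-empty masks)
def iou(pred, gt):
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return intersection / union

def pixel_accuracy(pred, gt):
    return (pred == gt).mean()

def dice(pred, gt):
    intersection = np.logical_and(pred, gt).sum()
    return 2 * intersection / (pred.sum() + gt.sum())

pred = np.array([[1, 1, 0], [0, 1, 0]], dtype=bool)
gt = np.array([[1, 0, 0], [0, 1, 1]], dtype=bool)
print(iou(pred, gt), pixel_accuracy(pred, gt), dice(pred, gt))  # 0.5, 0.666..., 0.666...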

Resources and Further Reading
