Losses
Introduction
Before diving into the different losses used to train models like CLIP, it is important to have a clear understanding of what contrastive learning is. Contrastive learning is an unsupervised deep learning method for representation learning. Its objective is to learn a representation of the data in which similar items are positioned close together and dissimilar items are clearly separated.
In the image below, we have an example where we want to keep the representations of dogs close to those of other dogs, but far from those of cats.
Training objectives
Contrastive Loss
Contrastive loss is one of the first training objectives used for contrastive learning. It takes as input a pair of samples that can be either similar or dissimilar, and the objective is to map similar samples close together in the embedding space and to push dissimilar samples apart.
Technically speaking, imagine that we have a list of input samples from multiple classes. We want an embedding function such that examples from the same class end up close together in the embedding space, while examples from different classes end up far apart. Translating this into a mathematical equation, one standard margin-based formulation is:
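$$
\mathcal{L}(x_i, x_j, y) = y \, \lVert f(x_i) - f(x_j) \rVert_2^{2} + (1 - y) \, \bigl[ \max\bigl(0, \, m - \lVert f(x_i) - f(x_j) \rVert_2 \bigr) \bigr]^{2}
$$

where $f$ denotes the embedding function, $y = 1$ if the pair $(x_i, x_j)$ is similar and $y = 0$ if it is dissimilar, and $m > 0$ is a margin hyperparameter that bounds how far dissimilar pairs are pushed apart (the notation here follows the usual margin-based convention).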
Explaining in simple terms:
- If the samples are similar ($y = 1$), we minimize the term corresponding to their (squared) Euclidean distance, i.e., we pull them closer together;
- If the samples are dissimilar ($y = 0$), we minimize the margin term, which is equivalent to maximizing their Euclidean distance up to the margin $m$, i.e., we push them away from each other.
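To make this concrete, below is a minimal PyTorch sketch of this margin-based contrastive loss. The function name, tensor shapes, and default margin are illustrative choices, not part of any specific library API.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(emb_a: torch.Tensor,
                     emb_b: torch.Tensor,
                     is_similar: torch.Tensor,
                     margin: float = 1.0) -> torch.Tensor:
    """Margin-based contrastive loss for a batch of embedding pairs.

    emb_a, emb_b: (batch, dim) embeddings of the two samples in each pair.
    is_similar:   (batch,) float tensor, 1.0 for similar pairs, 0.0 for dissimilar.
    """
    # Euclidean distance between the two embeddings of each pair
    distance = F.pairwise_distance(emb_a, emb_b)

    # Similar pairs: penalize the squared distance (pull them together)
    loss_similar = is_similar * distance.pow(2)

    # Dissimilar pairs: penalize them only while they are closer than the margin
    # (push them apart; no reward for going beyond the margin)
    loss_dissimilar = (1.0 - is_similar) * F.relu(margin - distance).pow(2)

    return (loss_similar + loss_dissimilar).mean()


# Toy usage: 8 pairs of 128-dimensional embeddings with random similarity labels
emb_a = torch.randn(8, 128)
emb_b = torch.randn(8, 128)
labels = torch.randint(0, 2, (8,)).float()
loss = contrastive_loss(emb_a, emb_b, labels)
```

Note that dissimilar pairs only contribute to the loss while their distance is smaller than the margin; once they are pushed beyond it, their gradient is zero.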