
# Unified Contrastive Learning in Image-Text-Label Space

"Unifiled Contrastive Learning in Image-Text-Label Space. CVPR 2022" by Jianwei Yang*, Chunyuan Li*, Pengchuan Zhang*, Bin Xiao*, Ce Liu, Lu Yuan and Jianfeng Gao.

## Motivation

In this paper, we introduce a new perspective on the commonly used image-label and image-text data by placing them in a shared image-text-label space. In this space, we propose a new learning paradigm, Unified Contrastive Learning (UniCL), with a single learning objective that seamlessly exploits the synergy between the two data types. We demonstrate that UniCL is an effective way of learning semantically rich yet discriminative representations, universally for image recognition in zero-shot, linear-probe, full-finetuning and transfer-learning scenarios. When scaled up to billion-scale data, UniCL alone can learn a powerful visual-semantic representation supporting dozens of downstream tasks, as shown in Florence.
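Concretely, the single objective is a bidirectional contrastive loss over the image-text similarity matrix, in which every (image, text) entry sharing a label counts as a positive; image-text pairs without a class annotation each receive a unique label, so on that data the loss reduces to the standard CLIP-style objective. The following is a minimal PyTorch sketch of such a label-aware contrastive loss, assuming pre-computed L2-normalized features; the function name and signature are illustrative, not the repository's API:

```python
import torch
import torch.nn.functional as F

def unicl_loss(image_feats, text_feats, labels, temperature=0.07):
    """Bidirectional label-aware contrastive loss (illustrative sketch).

    image_feats: (B, D) L2-normalized image embeddings.
    text_feats:  (B, D) L2-normalized text embeddings.
    labels:      (B,) integer class ids; web image-text pairs get unique
                 ids so that their only positive is their own caption.
    """
    logits = image_feats @ text_feats.t() / temperature              # (B, B)
    # Entry (i, j) is a positive when image i and text j share a label.
    positives = (labels.unsqueeze(1) == labels.unsqueeze(0)).float()

    # Image-to-text: softmax over texts (rows), log-prob averaged over positives.
    log_p_i2t = F.log_softmax(logits, dim=1)
    loss_i2t = -(positives * log_p_i2t).sum(1) / positives.sum(1)

    # Text-to-image: softmax over images (columns).
    log_p_t2i = F.log_softmax(logits, dim=0)
    loss_t2i = -(positives * log_p_t2i).sum(0) / positives.sum(0)

    return loss_i2t.mean() + loss_t2i.mean()
```

With all labels distinct this is the usual two-way InfoNCE loss; with repeated labels it behaves like supervised contrastive learning across the two modalities, which is what lets image-label and image-text data share one objective.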

## Benchmarking

### Image-label training augmented by image-text pairs

| Model | Training Set | Top-1 on IN-1K | Zero-shot on 14 datasets | Download |
|--------|--------------|----------------|--------------------------|----------|
| Swin-T | IN-1K | 79.9 | 30.2 | ckpt/config |
| Swin-T | IN-1K + GCC-3M | 80.2 | 39.0 | ckpt/config |
| Swin-T | IN-1K + GYFCC-14M | 81.1 | 40.0 | ckpt/config |
| Swin-T | IN-1K + GCC-15M | 81.8 | 45.1 | ckpt/config |

Note that all the above models are trained without strong data augmentations like mixup and cutmix.

### Image-text learning augmented by image-label data

| Model | Training Set | Zero-shot on IN-1K | Zero-shot on 14 datasets | Download |
|--------|--------------|--------------------|--------------------------|----------|
| Swin-T | YFCC-14M | 30.1 | 36.3 | ckpt/config |
| Swin-T | IN-21K | 28.5 | 37.8 | ckpt/config |
| Swin-T | IN-21K (half) + YFCC-14M (half) | 36.4 | 45.5 | ckpt/config |
| Swin-T | IN-21K + YFCC-14M | 40.5 | 49.1 | ckpt/config |
| Swin-B | YFCC-14M | 37.8 | - | ckpt/config |
| Swin-B | IN-21K | 29.9 | 42.4 | ckpt/config |
| Swin-B | IN-21K (half) + YFCC-14M (half) | 41.1 | 48.5 | ckpt/config |
| Swin-B | IN-21K + YFCC-14M | 44.3 | 52.2 | ckpt/config |
| Swin-B | IN-21K + YFCC-14M + GCC-15M | 57.9 | - | ckpt/config |
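For reference, the zero-shot numbers above follow the standard protocol: class names are wrapped in text prompts, embedded with the text encoder, and each image is assigned to the class with the most similar embedding. A minimal sketch, again assuming L2-normalized features and illustrative names:

```python
import torch

@torch.no_grad()
def zero_shot_predict(image_feats, class_text_feats):
    """Assign each image to the class whose prompt embedding
    (e.g. for "a photo of a {class}.") is most similar.

    image_feats:      (N, D) L2-normalized image embeddings.
    class_text_feats: (C, D) L2-normalized class-prompt embeddings.
    Returns: (N,) predicted class indices.
    """
    similarity = image_feats @ class_text_feats.t()  # (N, C) cosine similarities
    return similarity.argmax(dim=1)
```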