jw2yang committed
Commit 1500f04
1 Parent(s): 41fcb47

Update docs/intro.md

Files changed (1)
  1. docs/intro.md +1 -31
docs/intro.md CHANGED
@@ -1,34 +1,4 @@
- # [Unified Contrastive Learning in Image-Text-Label Space](https://arxiv.org/abs/2204.03610)
 
  ["**Unifiled Contrastive Learning in Image-Text-Label Space. CVPR 2022**"](https://arxiv.org/abs/2204.03610) by [Jianwei Yang*](https://jwyang.github.io/), [Chunyuan Li*](https://chunyuan.li/), [Pengchuan Zhang*](https://pzzhang.github.io/pzzhang/), [Bin Xiao*](https://www.microsoft.com/en-us/research/people/bixi/), [Ce Liu](http://people.csail.mit.edu/celiu/), [Lu Yuan](https://scholar.google.com/citations?user=k9TsUVsAAAAJ&hl=en) and [Jianfeng Gao](https://www.microsoft.com/en-us/research/people/jfgao/?from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fum%2Fpeople%2Fjfgao%2F).
 
- ## Motivation
-
- In this paper, we introduce a new perspective on commonly used image-label and image-text data by placing them in a shared image-text-label space. In this space, we propose a new learning paradigm, **Unified Contrastive Learning (UniCL)**, which uses a single learning objective to seamlessly exploit the synergy between the two data types. We demonstrate that UniCL is an effective way of learning **semantically rich yet discriminative representations**, universally for image recognition in zero-shot, linear-probe, full fine-tuning and transfer learning scenarios. When scaled up to billions of image-text-label data, UniCL alone can learn a powerful visual-semantic representation that supports dozens of downstream tasks, as shown in [Florence](https://arxiv.org/pdf/2111.11432v1.pdf).
-
- ## Benchmarking
-
- ### Image-label training augmented by image-text pairs
-
- | Model | Training Set | Top-1 Acc. on IN-1K (%) | Zero-Shot Acc. on 14 Datasets (%) | Download |
- | :----: | :---: | :---: | :---: | :---: |
- | Swin-T | IN-1K | 79.9 | 30.2 | [ckpt](https://projects4jw.blob.core.windows.net/unicl/release/in1k.pth)/[config](configs/unicl_swin_tiny.yaml)
- | Swin-T | IN-1K + GCC-3M | 80.2 | 39.0 | [ckpt](https://projects4jw.blob.core.windows.net/unicl/release/in1k_gcc3m.pth)/[config](configs/unicl_swin_tiny.yaml)
- | Swin-T | IN-1K + GYFCC-14M | 81.1 | 40.0 | [ckpt](https://projects4jw.blob.core.windows.net/unicl/release/in1k_yfcc14m.pth)/[config](configs/unicl_swin_tiny.yaml)
- | Swin-T | IN-1K + GCC-15M | 81.8 | 45.1 | [ckpt](https://projects4jw.blob.core.windows.net/unicl/release/in1k_gcc15m.pth)/[config](configs/unicl_swin_tiny.yaml)
-
- Note that all the above models are trained **without** strong data augmentations like mixup and cutmix.
-
- ### Image-text learning augmented by image-label data
-
- | Model | Training Set | Zero-Shot Acc. on IN-1K (%) | Zero-Shot Acc. on 14 Datasets (%) | Download |
- | :----: | :---: | :---: | :---: | :---: |
- | Swin-T | YFCC-14M | 30.1 | 36.3 | [ckpt](https://projects4jw.blob.core.windows.net/unicl/release/yfcc14m.pth)/[config](configs/unicl_swin_tiny.yaml)
- | Swin-T | IN-21K | 28.5 | 37.8 | [ckpt](https://projects4jw.blob.core.windows.net/unicl/release/in21k.pth)/[config](configs/unicl_swin_tiny.yaml)
- | Swin-T | IN-21K (half) + YFCC-14M (half) | 36.4 | 45.5 | [ckpt](https://projects4jw.blob.core.windows.net/unicl/release/in21k_yfcc14m_half.pth)/[config](configs/unicl_swin_tiny.yaml)
- | Swin-T | IN-21K + YFCC-14M | 40.5 | 49.1 | [ckpt](https://projects4jw.blob.core.windows.net/unicl/release/in21k_yfcc14m.pth)/[config](configs/unicl_swin_tiny.yaml)
- | Swin-B | YFCC-14M | 37.8 | - | [ckpt](https://projects4jw.blob.core.windows.net/unicl/release/yfcc14m_swin_base.pth)/[config](configs/unicl_swin_base.yaml)
- | Swin-B | IN-21K | 29.9 | 42.4 | [ckpt](https://projects4jw.blob.core.windows.net/unicl/release/in21k_swin_base.pth)/[config](configs/unicl_swin_base.yaml)
- | Swin-B | IN-21K (half) + YFCC-14M (half) | 41.1 | 48.5 | [ckpt](https://projects4jw.blob.core.windows.net/unicl/release/in21k_yfcc14m_half_swin_base.pth)/[config](configs/unicl_swin_base.yaml)
- | Swin-B | IN-21K + YFCC-14M | 44.3 | 52.2 | [ckpt](https://projects4jw.blob.core.windows.net/unicl/release/in21k_yfcc14m_swin_base.pth)/[config](configs/unicl_swin_base.yaml)
- | Swin-B | IN-21K + YFCC-14M + GCC-15M | 57.9 | - | [ckpt](https://projects4jw.blob.core.windows.net/unicl/release/in21k_yfcc14m_gcc15m_swin_base.pth)/[config](configs/unicl_swin_base.yaml)
 
+ In this paper, we introduce a new perspective on commonly used image-label and image-text data by placing them in a shared image-text-label space. In this space, we propose a new learning paradigm, **Unified Contrastive Learning (UniCL)**, which uses a single learning objective to seamlessly exploit the synergy between the two data types. We demonstrate that UniCL is an effective way of learning **semantically rich yet discriminative representations**, universally for image recognition in zero-shot, linear-probe, full fine-tuning and transfer learning scenarios. When scaled up to billions of image-text-label data, UniCL alone can learn a powerful visual-semantic representation that supports dozens of downstream tasks, as shown in [Florence](https://arxiv.org/pdf/2111.11432v1.pdf).
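
For readers who want a concrete picture of the objective described in the paragraph above, below is a minimal, unofficial sketch of a label-aware bidirectional contrastive loss in the spirit of UniCL, written in PyTorch. The function name `unicl_style_loss`, the default `temperature`, and the batch layout (one text embedding per image, with integer labels, where captions from image-text pairs can simply be assigned unique labels) are illustrative assumptions rather than the repository's actual API; refer to the configs and checkpoints linked in the diff for the official implementation.

```python
import torch
import torch.nn.functional as F

def unicl_style_loss(image_feats, text_feats, labels, temperature=0.07):
    """Illustrative label-aware bidirectional contrastive loss (not the official UniCL code).

    image_feats: (N, D) image embeddings
    text_feats:  (N, D) text embeddings (class-name prompts or captions)
    labels:      (N,) integer labels; image-text pairs sharing a label are positives.
                 Giving every caption a unique label recovers a CLIP-style pairwise loss.
    """
    # Cosine similarities scaled by temperature.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature  # (N, N)

    # Target matrix: entry (i, j) is positive iff labels match; normalize each row
    # into a distribution over its positives. The matrix is symmetric by construction.
    targets = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    targets = targets / targets.sum(dim=1, keepdim=True)

    # Cross-entropy against the soft targets in both directions (image-to-text, text-to-image).
    loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = -(targets * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

The point of the sketch is that image-label data contributes many positives per row (every image of a class matches that class's text prompt), while image-text data contributes a single positive per row, so one objective covers both data types in the shared image-text-label space.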