(Samples from the COCO Caption dataset. Image credit: https://arxiv.org/pdf/1504.00325.pdf)
Microsoft COCO Dataset (Captioning)
Description
The Microsoft COCO Captions dataset contains over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human-generated captions are provided for each image.
Task
(from https://paperswithcode.com/task/image-captioning)
Image captioning is the task of describing the content of an image in words. This task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, where an input image is encoded into an intermediate representation of the information in the image, and then decoded into a descriptive text sequence.
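As a concrete illustration of this encoder-decoder pipeline, the sketch below runs a pretrained captioning model on a single image. It is a minimal example, assuming the LAVIS `load_model_and_preprocess` helper, a pretrained `blip_caption` checkpoint, and a local image file `example.jpg`; it is illustrative only, not part of the dataset itself.

```python
# Minimal captioning sketch, assuming LAVIS is installed and a
# pretrained BLIP captioning checkpoint can be downloaded.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load a captioning model together with its matching image preprocessor.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device
)

# Encode the image and decode a descriptive sentence.
raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
captions = model.generate({"image": image})
print(captions)  # e.g. ["a group of people standing on a beach"]
```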
Metrics
Models are typically evaluated with the BLEU, CIDEr, METEOR, and SPICE metrics.
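The snippet below is a minimal sketch of how such metrics can be computed with the `pycocoevalcap` package (the reference implementation used by the COCO evaluation server). The image ids and captions are made-up toy data; real evaluations are run over the full test split after PTB tokenization.

```python
# Toy scoring example, assuming the `pycocoevalcap` package is installed.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# Both scorers take dicts keyed by image id: references may hold several
# captions per image, candidates exactly one caption per image.
# (Raw lowercase strings are used here for simplicity; the official
# pipeline first applies the PTBTokenizer.)
references = {
    "img1": ["a dog runs across a grassy field", "a brown dog running on grass"],
    "img2": ["a man riding a bike down a street", "a cyclist on a city road"],
}
candidates = {
    "img1": ["a dog running through a field"],
    "img2": ["a man rides a bicycle on the street"],
}

bleu, _ = Bleu(n=4).compute_score(references, candidates)
cider, _ = Cider().compute_score(references, candidates)
print("BLEU-4:", bleu[3])   # Bleu returns [BLEU-1, ..., BLEU-4]
print("CIDEr:", cider)      # CIDEr is corpus-level; toy corpora give noisy values
```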
Leaderboard
(Ranked by CIDEr)
Rank | Model | BLEU-4 | CIDEr | METEOR | SPICE | Resources
---|---|---|---|---|---|---
1 | OFA | 44.9 | 154.9 | 32.5 | 26.6 | paper, code |
2 | LEMON | 42.6 | 145.5 | 31.4 | 25.5 | paper |
3 | CoCa | 40.9 | 143.6 | 33.9 | 24.7 | paper |
4 | SimVLM | 40.6 | 143.3 | 33.7 | 25.4 | paper |
5 | VinVL | 41.0 | 140.9 | 31.1 | 25.2 | paper, code |
6 | OSCAR | 40.7 | 140.0 | 30.6 | 24.5 | paper, code |
7 | BLIP | 40.4 | 136.7 | 31.4 | 24.3 | paper, code, demo |
8 | M^2 | 39.1 | 131.2 | 29.2 | 22.6 | paper, code |
9 | BUTD | 36.5 | 113.5 | 27.0 | 20.3 | paper, code |
10 | ClipCap | 32.2 | 108.4 | 27.1 | 20.1 | paper, code |
Auto-Downloading
```bash
cd lavis/datasets/download_scripts && python download_coco.py
```
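Once the download finishes, the dataset can be loaded through LAVIS's dataset zoo. The following is a usage sketch based on the LAVIS `load_dataset` helper; exact field names such as `text_input` may vary across LAVIS versions.

```python
# Usage sketch: load the downloaded COCO caption splits via LAVIS.
from lavis.datasets.builders import load_dataset

coco_dataset = load_dataset("coco_caption")
print(coco_dataset.keys())        # e.g. dict_keys(['train', 'val', 'test'])

sample = coco_dataset["train"][0]
print(sample["text_input"])       # one of the human-written reference captions
```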
References
"Microsoft COCO Captions: Data Collection and Evaluation Server", Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, C. Lawrence Zitnick