# What's in here?

This repo contains the code for our EMNLP 2021 paper: [CLIPScore: A Reference-free Evaluation Metric for Image Captioning](https://arxiv.org/abs/2104.08718). CLIPScore is a metric that you can use to evaluate the quality of an automatic image captioning system. In our paper, we show that CLIPScore achieves high correlation with human judgment on literal image captioning tasks. However, unlike BLEU or CIDEr, CLIPScore doesn't require reference captions.

If you find the paper or this code useful, please consider citing:

```
@inproceedings{hessel2021clipscore,
  title={{CLIPScore:} A Reference-free Evaluation Metric for Image Captioning},
  author={Hessel, Jack and Holtzman, Ari and Forbes, Maxwell and Bras, Ronan Le and Choi, Yejin},
  booktitle={EMNLP},
  year={2021}
}
```

# How do I run the code?

## Command Line

Example usage:

```
> python clipscore.py example/good_captions.json example/images/
...
CLIPScore: 0.8584
```

If you optionally include some references, you will also see RefCLIPScore, alongside the usual set of caption generation evaluation metrics:

```
> python clipscore.py example/good_captions.json example/images/ --references_json example/refs.json
...
BLEU-1: 0.6667
BLEU-2: 0.4899
BLEU-3: 0.3469
BLEU-4: 0.0000
METEOR: 0.3444
ROUGE: 0.4280
CIDER: 0.5637
SPICE: 0.4000
CLIPScore: 0.8584
RefCLIPScore: 0.8450
```

Worse captions should get lower scores:

```
> python clipscore.py example/bad_captions.json example/images/ --references_json example/refs.json
...
BLEU-1: 0.4815
BLEU-2: 0.2404
BLEU-3: 0.1359
BLEU-4: 0.0000
METEOR: 0.1861
ROUGE: 0.3121
CIDER: 0.2790
SPICE: 0.1500
CLIPScore: 0.7153
RefCLIPScore: 0.7253
```

You can treat/report CLIPScore and RefCLIPScore similarly to the other evaluation metrics. See the paper for more details about CLIPScore and RefCLIPScore. Full usage options can be listed with `python clipscore.py -h`.

An example set of inputs, including a candidates json, an image directory, and a references json, is given in this repo under `example/`. The input files are formatted as follows.

The candidates json should be a dictionary that maps from `{"string_image_identifier": "candidate"}`, e.g.,

```
{'image1': 'an orange cat and a grey cat are lying together.',
 'image2': 'a black dog looks at the camera.'
 ...}
```

The image directory should contain the images whose filenames (minus the extension) are the keys in the candidates json, e.g.,

```
images/
├── image1.jpg
└── image2.jpg
```

and, finally, the references json should be a dictionary that maps from `{"string_image_identifier": ["list", "of", "references"]}`, e.g.,

```
{"image1": ["two cats are sleeping next to each other.",
            "a grey cat is cuddling with an orange cat on a blanket.",
            "the orange cat is happy that the black cat is close to it."],
 "image2": ["a dog is wearing ear muffs as it lies on a carpet.",
            "a black dog and an orange cat are looking at the photographer.",
            "headphones are placed on a dogs ears."]}
```
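If you're generating these files from your own captioning system, a minimal sketch using Python's standard `json` module might look like the following (the captions and output filenames here are just placeholders):

```python
import json

# Hypothetical candidate captions from your captioning system, keyed by image
# identifier; each key must match an image filename (minus the extension).
candidates = {
    "image1": "an orange cat and a grey cat are lying together.",
    "image2": "a black dog looks at the camera.",
}

# Optional: a list of one or more human-written references per image identifier.
references = {
    "image1": ["two cats are sleeping next to each other."],
    "image2": ["a dog is wearing ear muffs as it lies on a carpet."],
}

with open("my_candidates.json", "w") as f:
    json.dump(candidates, f)

with open("my_references.json", "w") as f:
    json.dump(references, f)
```

You could then score with, e.g., `python clipscore.py my_candidates.json example/images/ --references_json my_references.json`.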
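For intuition about what `clipscore.py` computes: per the paper, CLIPScore for an image/caption pair is `2.5 * max(cos(image embedding, caption embedding), 0)`, where the caption is prefixed with the prompt "A photo depicts" before encoding. Below is a rough, self-contained sketch of that computation for a single pair, using the [CLIP](https://github.com/openai/CLIP) package directly. It is for illustration only; batching, precision, and preprocessing details differ from this repo's actual implementation, so the numbers may not match exactly:

```python
import clip  # pip install git+https://github.com/openai/CLIP.git
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Encode one image and one prompted caption.
image = preprocess(Image.open("example/images/image1.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["A photo depicts an orange cat and a grey cat are lying together."]).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)

# Cosine similarity between the L2-normalized embeddings, clipped at zero
# and rescaled by w=2.5, as in the paper.
image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
clip_score = 2.5 * torch.clamp((image_feat * text_feat).sum(dim=-1), min=0)
print(f"CLIPScore: {clip_score.item():.4f}")
```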
## MSCOCO dataset in pycocoevalcap

If you're running on the MSCOCO dataset and using the standard evaluation toolkit, you can use our version of [pycocoevalcap](https://github.com/jmhessel/pycocoevalcap) to evaluate. You won't even need to download the original MSCOCO images, thanks to a bit of magic :-) To use `pycocoevalcap` on the MSCOCO dataset in the MSCOCO format, you can simply use:

```
pip install git+https://github.com/jmhessel/pycocoevalcap.git
```

There is an example evaluation in that repo under `examples/eval.py`. After pip installing, if you clone the `pycocoevalcap` repo and run

```
python eval.py
```

then after a bit of time, the output should be:

```
Bleu_1: 0.579
Bleu_2: 0.404
Bleu_3: 0.279
Bleu_4: 0.191
METEOR: 0.195
ROUGE_L: 0.396
CIDEr: 0.600
SPICE: 0.133
CLIPScore: 0.528
RefCLIPScore: 0.605
```

## Reproducibility notes

- CLIPScore can run on either CPU or GPU, but there are slight differences between the two due to floating point precision. As discussed [here](https://github.com/openai/CLIP/issues/30#issuecomment-771099118), on CPU all operations run in `float32`, while on GPU some operations run in `float16`. The differences are generally small: for the example run above, with `example/good_captions.json` captions and `example/images/` images, the output on CPU is `CLIPScore: 0.8585`, while on GPU it is `CLIPScore: 0.8584`. *All experiments in the paper were run on GPU, and this code will raise a warning if you're not using a GPU.*
- Because CLIPScore is computed directly from the images, resizing, compressing, etc. can all cause slight differences in the resulting score. Even saving a jpg twice can result in different compression, because that format is lossy! To this end, we release the checksums of the images we used for the paper; see `checksums/` for more info. For the pycocoevalcap repo, we have also included the checksums for MSCOCO --- see [here](https://github.com/jmhessel/pycocoevalcap/tree/master/clipscore) for more info.
- The prompt we used for the text side of CLIP, as mentioned in the paper, is "A photo depicts". This is hard-coded into this repo. Other prompts will give slightly different results, and we don't recommend them, for the sake of reproducibility.

## Acknowledgment

The authors would like to thank Jungo Kasai for being the first to use this repo. Thanks to Jungo, we fixed a few issues and added some information about reproducibility that was missing before.