Kosmos-2: Grounding Multimodal Large Language Models to the World
[paper] [dataset] [online demo hosted by HuggingFace]
- Aug 2023: We acknowledge ydshieh at HuggingFace for the online demo and the HuggingFace transformers implementation.
- June 2023: 🔥 We release the Kosmos-2: Grounding Multimodal Large Language Models to the World paper. Check out the paper.
- Feb 2023: Kosmos-1 (Language Is Not All You Need: Aligning Perception with Language Models)
- June 2022: MetaLM (Language Models are General-Purpose Interfaces)
Contents
- Kosmos-2: Grounding Multimodal Large Language Models to the World
Checkpoints
The model can be loaded with the HuggingFace transformers library; a minimal loading sketch is shown below.
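As a minimal sketch of that route, assuming the microsoft/kosmos-2-patch14-224 checkpoint on the HuggingFace Hub and a local example.jpg (adjust names and paths to your setup):

```python
# Minimal sketch: load Kosmos-2 through HuggingFace transformers and produce a
# grounded caption. The Hub checkpoint name and example.jpg are illustrative.
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224")
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")

image = Image.open("example.jpg")
prompt = "<grounding>An image of"  # <grounding> asks the model to emit location tokens

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    image_embeds_position_mask=inputs["image_embeds_position_mask"],
    max_new_tokens=128,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# post_process_generation separates the caption text from the grounded entities
caption, entities = processor.post_process_generation(generated_text)
print(caption)
print(entities)  # [(phrase, (start, end), [(x_min, y_min, x_max, y_max), ...]), ...]
```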
The checkpoint can be downloaded from here:
wget -O kosmos-2.pt "https://conversationhub.blob.core.windows.net/beit-share-public/kosmos-2/kosmos-2.pt?sv=2021-10-04&st=2023-06-08T11%3A16%3A02Z&se=2033-06-09T11%3A16%3A00Z&sr=c&sp=r&sig=N4pfCVmSeq4L4tS8QbrFVsX6f6q844eft8xSuXdxU48%3D"
Setup
- Download the recommended docker image and launch it:
alias=`whoami | cut -d'.' -f2`; docker run -it --rm --runtime=nvidia --ipc=host --privileged -v /home/${alias}:/home/${alias} nvcr.io/nvidia/pytorch:22.10-py3 bash
- Clone the repo:
git clone https://github.com/microsoft/unilm.git
cd unilm/kosmos-2
- Install the packages:
bash vl_setup_xl.sh
(Refer to the comment for detailed package information.)
Alternatively, you can refer to this guide to set up a conda environment.
Demo
We acknowledge ydshieh at HuggingFace for implementing an online demo.
If you would like to host a local Gradio demo, run the following command after setup:
bash run_gradio.sh
GRIT: Large-Scale Training Corpus of Grounded Image-Text Pairs
We introduce GRIT, a large-scale dataset of Grounded Image-Text pairs, created from image-text pairs in subsets of COYO-700M and LAION-2B. We construct a pipeline to extract text spans (i.e., noun phrases and referring expressions) in the caption and link them to their corresponding image regions. More details can be found in the paper.
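To make the span-extraction step concrete, here is a small sketch of noun-chunk extraction with spaCy; the en_core_web_sm model is an illustrative assumption, and the GLIP-based region linking that follows it in the pipeline is not shown.

```python
# Sketch of the span-extraction stage only: pull noun chunks out of a caption
# with spaCy. The en_core_web_sm model is an assumption for illustration; the
# pipeline then links each chunk to an image region with GLIP (not shown here).
import spacy

nlp = spacy.load("en_core_web_sm")  # install via: python -m spacy download en_core_web_sm

caption = "a wire hanger with a paper cover that reads we heart our customers"
doc = nlp(caption)

for chunk in doc.noun_chunks:
    # start_char/end_char correspond to the span indices stored in GRIT's noun_chunks field
    print(chunk.text, chunk.start_char, chunk.end_char)
```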
Download Data
- GrIT-20M: The split contains about 20M grounded image-caption pairs processed from COYO-700M. We also release it on huggingface.
The format of data instance is:
{
  'clip_similarity_vitb32': 0.353271484375,
  'clip_similarity_vitl14': 0.2958984375,
  'id': 1795296605919,
  'url': "https://www.thestrapsaver.com/wp-content/uploads/customerservice-1.jpg",
  'caption': 'a wire hanger with a paper cover that reads we heart our customers',
  'width': 1024,
  'height': 693,
  'noun_chunks': [[19, 32, 0.019644069503434333, 0.31054004033406574, 0.9622142865754519, 0.9603442351023356, 0.79298526], [0, 13, 0.019422357885505368, 0.027634161214033764, 0.9593302408854166, 0.969467560450236, 0.67520964]],
  'ref_exps': [[19, 66, 0.019644069503434333, 0.31054004033406574, 0.9622142865754519, 0.9603442351023356, 0.79298526], [0, 66, 0.019422357885505368, 0.027634161214033764, 0.9593302408854166, 0.969467560450236, 0.67520964]]
}
- clip_similarity_vitb32: The cosine similarity between text and image (ViT-B/32) embeddings computed by OpenAI CLIP, provided by COYO-700M.
- clip_similarity_vitl14: The cosine similarity between text and image (ViT-L/14) embeddings computed by OpenAI CLIP, provided by COYO-700M.
- id: Unique 64-bit integer ID in COYO-700M.
- url: The image URL.
- caption: The corresponding caption.
- width: The width of the image.
- height: The height of the image.
- noun_chunks: The noun chunks (extracted by spaCy) that have associated bounding boxes (predicted by GLIP). The items in each child list respectively represent 'start of the noun chunk in the caption', 'end of the noun chunk in the caption', 'normalized x_min', 'normalized y_min', 'normalized x_max', 'normalized y_max', and 'confidence score'.
- ref_exps: The corresponding referring expressions. If a noun chunk has no expansion, we simply copy it.
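As an illustration of how these fields fit together, the following self-contained sketch decodes the noun_chunks of the example instance above into caption spans and pixel-space boxes (coordinate values abbreviated for readability):

```python
# Decode the noun_chunks of the GRIT example above into caption spans and
# pixel-space boxes. Coordinate values are abbreviated for readability.
instance = {
    "caption": "a wire hanger with a paper cover that reads we heart our customers",
    "width": 1024,
    "height": 693,
    "noun_chunks": [
        [19, 32, 0.0196, 0.3105, 0.9622, 0.9603, 0.7930],
        [0, 13, 0.0194, 0.0276, 0.9593, 0.9695, 0.6752],
    ],
}

for start, end, x_min, y_min, x_max, y_max, score in instance["noun_chunks"]:
    phrase = instance["caption"][int(start):int(end)]  # e.g. 'a paper cover', 'a wire hanger'
    box = (
        round(x_min * instance["width"]),   # left  (pixels)
        round(y_min * instance["height"]),  # top
        round(x_max * instance["width"]),   # right
        round(y_max * instance["height"]),  # bottom
    )
    print(f"{phrase!r} -> box={box}, confidence={score:.2f}")
```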
Run the following commands to visualize it:
wget -O /tmp/grit_coyo.jsonl "https://conversationhub.blob.core.windows.net/beit-share-public/kosmos-2/data/grit_coyo.jsonl?sv=2021-10-04&st=2023-06-08T11%3A16%3A02Z&se=2033-06-09T11%3A16%3A00Z&sr=c&sp=r&sig=N4pfCVmSeq4L4tS8QbrFVsX6f6q844eft8xSuXdxU48%3D"
python data/visualize_grit.py
We recommend using img2dataset to download images, as detailed here.
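For orientation only, a rough sketch of a download via img2dataset's Python API is shown below; the parquet file name, output folder, and column choices are placeholders, and the linked guide remains the reference for the recommended settings.

```python
# Rough sketch of downloading GRIT images with img2dataset's Python API.
# grit_urls.parquet and grit_images/ are placeholder paths; the additional
# columns are the GRIT fields described above that we want carried through
# into the output shards.
from img2dataset import download

download(
    url_list="grit_urls.parquet",   # metadata exported from the HuggingFace release
    input_format="parquet",
    url_col="url",
    caption_col="caption",
    output_format="webdataset",     # tar shards, matching the training pipeline below
    output_folder="grit_images",
    save_additional_columns=[
        "id", "width", "height",
        "noun_chunks", "ref_exps",
        "clip_similarity_vitb32", "clip_similarity_vitl14",
    ],
)
```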
Evaluation
1. Phrase grounding
We evaluate the phrase grounding task on Flickr30k Entities under the zero-shot setting:

| Model | Recall@1 (val split) | Recall@1 (test split) |
|---|---|---|
| Kosmos-2 | 77.8 | 78.7 |

More results and the evaluation code can be found in evaluation/flickr/README.md.
2. Referring expression comprehension
We evaluate the referring expression comprehension task on RefCOCO, RefCOCO+, and RefCOCOg under the zero-shot setting. We report the accuracy metric here.

| Model | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val | RefCOCOg test |
|---|---|---|---|---|---|---|---|---|
| Kosmos-2 | 52.32 | 57.42 | 47.26 | 45.48 | 50.73 | 42.24 | 60.57 | 61.65 |

More results and the evaluation code can be found in evaluation/refcoco/README.md.
3. Referring expression generation
We evaluate the referring expression generation task on RefCOCOg under zero-shot and few-shot settings. We report the METEOR and CIDEr metrics here.

| Model | Setting | METEOR | CIDEr |
|---|---|---|---|
| Kosmos-2 | zero-shot | 12.2 | 60.3 |
| Kosmos-2 | few-shot (k=2) | 13.8 | 62.2 |
| Kosmos-2 | few-shot (k=4) | 14.1 | 62.2 |

We will release the evaluation code here.
4. Image captioning
We evaluate the image captioning task on the Flickr30K Karpathy split test set under the zero-shot setting. We report the CIDEr metric here.

| Model | CIDEr on Flickr30K |
|---|---|
| Flamingo-3B | 60.6 |
| Flamingo-9B | 61.5 |
| Kosmos-1 | 67.1 |
| Kosmos-2 | 80.5 |

We will release the evaluation code here.
5. Visual question answering
We evaluate the visual question-answering task on the test-dev set of VQAv2 under the zero-shot setting. We report VQA scores obtained from the VQAv2 evaluation server.
| Model | Accuracy on VQAv2 |
|---|---|
| Flamingo-3B | 49.2 |
| Flamingo-9B | 51.8 |
| Kosmos-1 | 51.0 |
| Kosmos-2 | 51.1 |

We will release the evaluation code here.
Training
Preparing dataset
GrIT
After downloading the data from huggingface using img2dataset, you will obtain a set of tar files. Extracting them yields the images and their corresponding JSON files. Then, modify the file path in prepare_grit.py and run it to produce the corresponding tsv files. If a tsv file is too large, you can split it into several smaller files.
After processing all the tar files into tsv files, run generate_config.py to obtain a config file that stores the paths of the tsv files. In train.sh, set --laion-data-dir to the config directory path.
Interleaved data
Interleaved image-text data also needs to be processed in this way. To be updated.
Text data
To be updated.
Train script
After preparing the data, run the following command to train the model.
bash train.sh
More training and instruction-tuning tasks will be added.
Citation
If you find this repository useful, please consider citing our work:
@article{kosmos-2,
  title={Kosmos-2: Grounding Multimodal Large Language Models to the World},
  author={Zhiliang Peng and Wenhui Wang and Li Dong and Yaru Hao and Shaohan Huang and Shuming Ma and Furu Wei},
  journal={ArXiv},
  year={2023},
  volume={abs/2306.14824}
}

@article{kosmos-1,
  title={Language Is Not All You Need: Aligning Perception with Language Models},
  author={Shaohan Huang and Li Dong and Wenhui Wang and Yaru Hao and Saksham Singhal and Shuming Ma and Tengchao Lv and Lei Cui and Owais Khan Mohammed and Qiang Liu and Kriti Aggarwal and Zewen Chi and Johan Bjorck and Vishrav Chaudhary and Subhojit Som and Xia Song and Furu Wei},
  journal={ArXiv},
  year={2023},
  volume={abs/2302.14045}
}

@article{metalm,
  title={Language Models are General-Purpose Interfaces},
  author={Yaru Hao and Haoyu Song and Li Dong and Shaohan Huang and Zewen Chi and Wenhui Wang and Shuming Ma and Furu Wei},
  journal={ArXiv},
  year={2022},
  volume={abs/2206.06336}
}
Acknowledgement
This repository is built using torchscale, fairseq, and openclip. We would also like to acknowledge the examples provided by WHOOPS!. We acknowledge ydshieh at HuggingFace for the online demo and the HuggingFace transformers implementation.
License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
Microsoft Open Source Code of Conduct
Contact Information
For help or issues using models, please submit a GitHub issue.