Kosmos-2: Grounding Multimodal Large Language Models to the World
[paper] [dataset] [online demo hosted by HuggingFace]
- Aug 2023: We acknowledge ydshieh at HuggingFace for the online demo and the HuggingFace transformers implementation.
- June 2023: 🔥 We release the Kosmos-2: Grounding Multimodal Large Language Models to the World paper. Check out the paper.
- Feb 2023: Kosmos-1 (Language Is Not All You Need: Aligning Perception with Language Models)
- June 2022: MetaLM (Language Models are General-Purpose Interfaces)
Contents
- Kosmos-2: Grounding Multimodal Large Language Models to the World
Checkpoints
The model can be loaded with the HuggingFace transformers library; a minimal loading sketch is shown below.
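As a minimal sketch of that route, assuming the microsoft/kosmos-2-patch14-224 checkpoint on the HuggingFace Hub and a local example.jpg (adjust names and paths to your setup):

```python
# Minimal sketch: load Kosmos-2 through HuggingFace transformers and produce a
# grounded caption. The Hub checkpoint name and example.jpg are illustrative.
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224")
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")

image = Image.open("example.jpg")
prompt = "<grounding>An image of"  # <grounding> asks the model to emit location tokens

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    image_embeds_position_mask=inputs["image_embeds_position_mask"],
    max_new_tokens=128,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# post_process_generation separates the caption text from the grounded entities
caption, entities = processor.post_process_generation(generated_text)
print(caption)
print(entities)  # [(phrase, (start, end), [(x_min, y_min, x_max, y_max), ...]), ...]
```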
The checkpoint can be downloaded from here:
wget -O kosmos-2.pt "https://conversationhub.blob.core.windows.net/beit-share-public/kosmos-2/kosmos-2.pt?sv=2021-10-04&st=2023-06-08T11%3A16%3A02Z&se=2033-06-09T11%3A16%3A00Z&sr=c&sp=r&sig=N4pfCVmSeq4L4tS8QbrFVsX6f6q844eft8xSuXdxU48%3D"
Setup
- Download the recommended docker image and launch it:
alias=`whoami | cut -d'.' -f2`; docker run -it --rm --runtime=nvidia --ipc=host --privileged -v /home/${alias}:/home/${alias} nvcr.io/nvidia/pytorch:22.10-py3 bash
- Clone the repo:
git clone https://github.com/microsoft/unilm.git
cd unilm/kosmos-2
- Install the packages:
bash vl_setup_xl.sh
(Refer to the comment for detailed package information.)
Alternatively, you can refer to this guide to set up a conda environment.
Demo
We acknowledge ydshieh at HuggingFace for implementing an online demo.
If you would like to host a local Gradio demo, run the following command after setup:
bash run_gradio.sh
GRIT: Large-Scale Training Corpus of Grounded Image-Text Pairs
We introduce GRIT, a large-scale dataset of Grounded Image-Text pairs, created from image-text pairs in subsets of COYO-700M and LAION-2B. We construct a pipeline to extract text spans (i.e., noun phrases and referring expressions) in the caption and link them to their corresponding image regions. More details can be found in the paper.
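To make the span-extraction step concrete, here is a small sketch of noun-chunk extraction with spaCy; the en_core_web_sm model is an illustrative assumption, and the GLIP-based region linking that follows it in the pipeline is not shown.

```python
# Sketch of the span-extraction stage only: pull noun chunks out of a caption
# with spaCy. The en_core_web_sm model is an assumption for illustration; the
# pipeline then links each chunk to an image region with GLIP (not shown here).
import spacy

nlp = spacy.load("en_core_web_sm")  # install via: python -m spacy download en_core_web_sm

caption = "a wire hanger with a paper cover that reads we heart our customers"
doc = nlp(caption)

for chunk in doc.noun_chunks:
    # start_char/end_char correspond to the span indices stored in GRIT's noun_chunks field
    print(chunk.text, chunk.start_char, chunk.end_char)
```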
Download Data
- GrIT-20M: The split contains about 20M grounded image-caption pairs processed from COYO-700M. We also release it on huggingface.
The format of data instance is:
{
  'clip_similarity_vitb32': 0.353271484375,
  'clip_similarity_vitl14': 0.2958984375,
  'id': 1795296605919,
  'url': "https://www.thestrapsaver.com/wp-content/uploads/customerservice-1.jpg",
  'caption': 'a wire hanger with a paper cover that reads we heart our customers',
  'width': 1024,
  'height': 693,
  'noun_chunks': [[19, 32, 0.019644069503434333, 0.31054004033406574, 0.9622142865754519, 0.9603442351023356, 0.79298526], [0, 13, 0.019422357885505368, 0.027634161214033764, 0.9593302408854166, 0.969467560450236, 0.67520964]],
  'ref_exps': [[19, 66, 0.019644069503434333, 0.31054004033406574, 0.9622142865754519, 0.9603442351023356, 0.79298526], [0, 66, 0.019422357885505368, 0.027634161214033764, 0.9593302408854166, 0.969467560450236, 0.67520964]]
}
- clip_similarity_vitb32: The cosine similarity between text and image (ViT-B/32) embeddings computed by OpenAI CLIP, provided by COYO-700M.
- clip_similarity_vitl14: The cosine similarity between text and image (ViT-L/14) embeddings computed by OpenAI CLIP, provided by COYO-700M.
- id: Unique 64-bit integer ID in COYO-700M.
- url: The image URL.
- caption: The corresponding caption.
- width: The width of the image.
- height: The height of the image.
- noun_chunks: The noun chunks (extracted by spaCy) that have associated bounding boxes (predicted by GLIP). The items in each child list respectively represent 'start of the noun chunk in the caption', 'end of the noun chunk in the caption', 'normalized x_min', 'normalized y_min', 'normalized x_max', 'normalized y_max', and 'confidence score'.
- ref_exps: The corresponding referring expressions. If a noun chunk has no expansion, we simply copy it.
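As an illustration of how these fields fit together, the following self-contained sketch decodes the noun_chunks of the example instance above into caption spans and pixel-space boxes (coordinate values abbreviated for readability):

```python
# Decode the noun_chunks of the GRIT example above into caption spans and
# pixel-space boxes. Coordinate values are abbreviated for readability.
instance = {
    "caption": "a wire hanger with a paper cover that reads we heart our customers",
    "width": 1024,
    "height": 693,
    "noun_chunks": [
        [19, 32, 0.0196, 0.3105, 0.9622, 0.9603, 0.7930],
        [0, 13, 0.0194, 0.0276, 0.9593, 0.9695, 0.6752],
    ],
}

for start, end, x_min, y_min, x_max, y_max, score in instance["noun_chunks"]:
    phrase = instance["caption"][int(start):int(end)]  # e.g. 'a paper cover', 'a wire hanger'
    box = (
        round(x_min * instance["width"]),   # left  (pixels)
        round(y_min * instance["height"]),  # top
        round(x_max * instance["width"]),   # right
        round(y_max * instance["height"]),  # bottom
    )
    print(f"{phrase!r} -> box={box}, confidence={score:.2f}")
```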
Run the following commands to visualize it:
wget -O /tmp/grit_coyo.jsonl "https://conversationhub.blob.core.windows.net/beit-share-public/kosmos-2/data/grit_coyo.jsonl?sv=2021-10-04&st=2023-06-08T11%3A16%3A02Z&se=2033-06-09T11%3A16%3A00Z&sr=c&sp=r&sig=N4pfCVmSeq4L4tS8QbrFVsX6f6q844eft8xSuXdxU48%3D"
python data/visualize_grit.py
We recommend using img2dataset to download images, as detailed here.
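For orientation only, a rough sketch of a download via img2dataset's Python API is shown below; the parquet file name, output folder, and column choices are placeholders, and the linked guide remains the reference for the recommended settings.

```python
# Rough sketch of downloading GRIT images with img2dataset's Python API.
# grit_urls.parquet and grit_images/ are placeholder paths; the additional
# columns are the GRIT fields described above that we want carried through
# into the output shards.
from img2dataset import download

download(
    url_list="grit_urls.parquet",   # metadata exported from the HuggingFace release
    input_format="parquet",
    url_col="url",
    caption_col="caption",
    output_format="webdataset",     # tar shards, matching the training pipeline below
    output_folder="grit_images",
    save_additional_columns=[
        "id", "width", "height",
        "noun_chunks", "ref_exps",
        "clip_similarity_vitb32", "clip_similarity_vitl14",
    ],
)
```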
Evaluation
1. Phrase grounding
We evaluate the phrase grounding task on Flickr30k Entities under the zero-shot setting:

| Model | Recall@1 (val split) | Recall@1 (test split) |
|---|---|---|
| Kosmos-2 | 77.8 | 78.7 |

More results and the evaluation code can be found in evaluation/flickr/README.md.
2. Referring expression comprehension
We evaluate the referring expression comprehension task on RefCOCO, RefCOCO+, and RefCOCOg under the zero-shot setting. We report the accuracy metric here.

| Model | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val | RefCOCOg test |
|---|---|---|---|---|---|---|---|---|
| Kosmos-2 | 52.32 | 57.42 | 47.26 | 45.48 | 50.73 | 42.24 | 60.57 | 61.65 |

More results and the evaluation code can be found in evaluation/refcoco/README.md.
3. Referring expression generation
We evaluate the referring expression generation task on RefCOCOg under zero-shot and few-shot settings. We report the METEOR and CIDEr metrics here.

| Model | Setting | METEOR | CIDEr |
|---|---|---|---|
| Kosmos-2 | zero-shot | 12.2 | 60.3 |
| Kosmos-2 | few-shot (k=2) | 13.8 | 62.2 |
| Kosmos-2 | few-shot (k=4) | 14.1 | 62.2 |

We will release the evaluation code here.
4. Image captioning
We evaluate the image captioning task on the Flickr30K Karpathy split test set under the zero-shot setting. We report the CIDEr metric here.

| Model | CIDEr on Flickr30K |
|---|---|
| Flamingo-3B | 60.6 |
| Flamingo-9B | 61.5 |
| Kosmos-1 | 67.1 |
| Kosmos-2 | 80.5 |

We will release the evaluation code here.
5. Visual question answering
We evaluate the visual question-answering task on the test-dev set of VQAv2 under the zero-shot setting. We report VQA scores obtained from the VQAv2 evaluation server.
| Model | Accuracy on VQAv2 |
|---|---|
| Flamingo-3B | 49.2 |
| Flamingo-9B | 51.8 |
| Kosmos-1 | 51.0 |
| Kosmos-2 | 51.1 |

We will release the evaluation code here.
Training
Preparing dataset
GrIT
After downloading the data from huggingface using img2dataset, you will obtain a set of tar files. Extracting them yields the images and their corresponding JSON files. Then, modify the file path in prepare_grit.py and run it to produce the corresponding tsv files. If a tsv file is too large, you can split it into several smaller files.
After processing all the tar files into tsv files, run generate_config.py to obtain a config file that stores the paths of the tsv files. In train.sh, set --laion-data-dir to the config directory path.
Interleaved data
Interleaved image-text data also needs to be processed in this way. To be updated.
Text data
To be updated.
Train script
After preparing the data, run the following command to train the model.
bash train.sh
More training and instruction-tuning tasks will be added.
Citation
If you find this repository useful, please consider citing our work:
@article{kosmos-2,
  title={Kosmos-2: Grounding Multimodal Large Language Models to the World},
  author={Zhiliang Peng and Wenhui Wang and Li Dong and Yaru Hao and Shaohan Huang and Shuming Ma and Furu Wei},
  journal={ArXiv},
  year={2023},
  volume={abs/2306.14824}
}

@article{kosmos-1,
  title={Language Is Not All You Need: Aligning Perception with Language Models},
  author={Shaohan Huang and Li Dong and Wenhui Wang and Yaru Hao and Saksham Singhal and Shuming Ma and Tengchao Lv and Lei Cui and Owais Khan Mohammed and Qiang Liu and Kriti Aggarwal and Zewen Chi and Johan Bjorck and Vishrav Chaudhary and Subhojit Som and Xia Song and Furu Wei},
  journal={ArXiv},
  year={2023},
  volume={abs/2302.14045}
}

@article{metalm,
  title={Language Models are General-Purpose Interfaces},
  author={Yaru Hao and Haoyu Song and Li Dong and Shaohan Huang and Zewen Chi and Wenhui Wang and Shuming Ma and Furu Wei},
  journal={ArXiv},
  year={2022},
  volume={abs/2206.06336}
}
Acknowledgement
This repository is built using torchscale, fairseq, and openclip. We would also like to acknowledge the examples provided by WHOOPS!. We acknowledge ydshieh at HuggingFace for the online demo and the HuggingFace transformers implementation.
License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
Microsoft Open Source Code of Conduct
Contact Information
For help or issues using models, please submit a GitHub issue.