# :label: Tag2Text: Guiding Vision-Language Model via Image Tagging

Official PyTorch implementation of Tag2Text, an efficient and controllable vision-language model with tagging guidance.

Code is available now!

Welcome to try out the [Tag2Text Web demo🤗](https://huggingface.co/spaces/xinyu1205/Tag2Text)! Both tagging and captioning are included.

Tag2Text is now combined with [Grounded-SAM](https://github.com/IDEA-Research/Grounded-Segment-Anything), which can automatically recognize, detect, and segment objects in an image! Tag2Text showcases powerful image recognition capabilities:

![](./images/tag2text_grounded_sam.jpg)

## :fire: News

- **`2023/05/20`**: Tag2Text is combined with [VideoChat](https://github.com/OpenGVLab/Ask-Anything), providing powerful tagging and captioning capabilities as a fundamental component!
- **`2023/04/20`**: We marry [Tag2Text with Grounded-SAM](https://github.com/IDEA-Research/Grounded-Segment-Anything) to provide powerful image recognition capabilities!
- **`2023/04/10`**: Code and checkpoint are available now!
- **`2023/03/14`**: [Tag2Text web demo 🤗](https://huggingface.co/spaces/xinyu1205/Tag2Text) is available on Hugging Face Spaces!

## :bulb: Highlight

- **Tagging.** Without manual annotations, Tag2Text achieves **superior** image tag recognition across [**3,429**](./data/tag_list.txt) categories commonly used by humans.
- **Efficient.** Tagging guidance effectively enhances the performance of vision-language models on both **generation-based** and **alignment-based** tasks.
- **Controllable.** Tag2Text lets users input **desired tags**, providing flexibility in composing the corresponding text based on those tags.

## :writing_hand: TODO

- [x] Release demo.
- [x] Release checkpoints.
- [x] Release inference code.
- [ ] Release training codes.
- [ ] Release training datasets.

## :toolbox: Checkpoints

|   | Name | Backbone | Data | Illustration | Checkpoint |
|---|------|----------|------|--------------|------------|
| 1 | Tag2Text-Swin | Swin-Base | COCO, VG, SBU, CC-3M, CC-12M | Demo version with comprehensive captions. | Download link |

## :running: Model Inference

1. Install the dependencies:

   ```
   pip install -r requirements.txt
   ```

2. Download the Tag2Text pretrained checkpoint; the commands in step 3 expect it at `pretrained/tag2text_swin_14m.pth`.

3. Get the tagging and captioning results:

   ```
   python inference.py --image images/1641173_2291260800.jpg \
       --pretrained pretrained/tag2text_swin_14m.pth
   ```

   Or get the tagging and specified captioning results (optional):

   ```
   python inference.py --image images/1641173_2291260800.jpg \
       --pretrained pretrained/tag2text_swin_14m.pth \
       --specified-tags "cloud,sky"
   ```

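
To run the same command over several images, a small wrapper can assemble the arguments for you. The following is only a minimal sketch, not part of the official code: `build_command` and `run_on_folder` are hypothetical helper names, the `*.jpg` pattern is an assumption about how your images are stored, and it uses only the `--image`, `--pretrained`, and `--specified-tags` flags shown above. It assumes you run it from the repository root so that `inference.py` is found.

```python
import subprocess
import sys
from pathlib import Path
from typing import List, Optional, Sequence


def build_command(
    image: str,
    pretrained: str,
    specified_tags: Optional[Sequence[str]] = None,
) -> List[str]:
    """Build the argument list for one inference.py call (hypothetical helper)."""
    cmd = [
        sys.executable, "inference.py",
        "--image", str(image),
        "--pretrained", str(pretrained),
    ]
    if specified_tags:
        # The README passes the tags as one comma-separated string, e.g. "cloud,sky".
        cmd += ["--specified-tags", ",".join(tag.strip() for tag in specified_tags)]
    return cmd


def run_on_folder(
    image_dir: str,
    pretrained: str,
    specified_tags: Optional[Sequence[str]] = None,
) -> None:
    """Call inference.py once per .jpg file in image_dir (hypothetical helper)."""
    for image in sorted(Path(image_dir).glob("*.jpg")):
        # check=True raises CalledProcessError if inference.py exits with an error.
        subprocess.run(build_command(image, pretrained, specified_tags), check=True)
```

For example, `run_on_folder("images", "pretrained/tag2text_swin_14m.pth", ["cloud", "sky"])` would issue one call per image with the specified tags. Each call starts a new process and so presumably reloads the checkpoint, which is convenient for a handful of images but slow for large batches.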
## :black_nib: Citation

If you find our work to be useful for your research, please consider citing.

```
@article{huang2023tag2text,
  title={Tag2Text: Guiding Vision-Language Model via Image Tagging},
  author={Huang, Xinyu and Zhang, Youcai and Ma, Jinyu and Tian, Weiwei and Feng, Rui and Zhang, Yuejie and Li, Yaqian and Guo, Yandong and Zhang, Lei},
  journal={arXiv preprint arXiv:2303.05657},
  year={2023}
}
```

## :hearts: Acknowledgements

This work is done with the help of the amazing code base of [BLIP](https://github.com/salesforce/BLIP), thanks very much!

We also want to thank @Cheng Rui @Shilong Liu @Ren Tianhe for their help in [marrying Tag2Text with Grounded-SAM](https://github.com/IDEA-Research/Grounded-Segment-Anything).