Spaces:

Vishakaraj
/

Dense_Captioning_-_GRiT

Runtime error

App Files Files Community

Dense_Captioning_-_GRiT / README.md

Vishakaraj

Upload folder using huggingface_hub

c709b60 10 months ago

preview code

raw

history blame contribute delete

No virus

6.78 kB

	---
	title: Dense_Captioning_-_GRiT
	app_file: app.py
	sdk: gradio
	sdk_version: 3.42.0
	---
	# GRiT: A Generative Region-to-text Transformer for Object Understanding
	GRiT is a general and open-set object understanding framework that localizes objects and
	describes them with any style of free-form texts it was trained with, e.g., class names, descriptive sentences
	(including object attributes, actions, counts and many more).

	> [GRiT: A Generative Region-to-text Transformer for Object Understanding](https://arxiv.org/abs/2212.00280) \
	> Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, Lijuan Wang \
	> <sup>1</sup>State University of New York at Buffalo, <sup>2</sup>Microsoft \
	> arXiv technical report ([PDF](https://arxiv.org/pdf/2212.00280.pdf))

	<p align="center"> <img src='docs/grit.png' align="center" height="400px"> </p>

	## Installation

	Please follow [Installation instructions](docs/INSTALL.md).

	## ChatGPT with GRiT
	We give ChatGPT GRiT's dense captioning outputs (object location and description) to have it
	describe the scene and even write poetry. ChatGPT can generate amazing scene descriptions given our dense
	captioning outputs. An example is shown below: :star_struck::star_struck::star_struck:

	<p align="center"> <img src='docs/chatgpt.png' align="center"> </p>


	## Object Understanding Demo - One Model Two tasks

	[Download the GRiT model](https://datarelease.blob.core.windows.net/grit/models/grit_b_densecap_objectdet.pth) or use the following commend to download:
	~~~
	mkdir models && cd models
	wget https://datarelease.blob.core.windows.net/grit/models/grit_b_densecap_objectdet.pth && cd ..
	~~~
	The downloaded GRiT model was jointly trained on dense captioning
	task and object detection task. With the same trained model, it can
	output both rich descriptive sentences and short class names by varying
	the flag `--test-task`. Play it as follows! :star_struck:

	### Output for Dense Captioning (rich descriptive sentences)

	~~~
	python demo.py --test-task DenseCap --config-file configs/GRiT_B_DenseCap_ObjectDet.yaml --input demo_images --output visualization --opts MODEL.WEIGHTS models/grit_b_densecap_objectdet.pth
	~~~

	### Output for Object Detection (short class names)

	~~~
	python demo.py --test-task ObjectDet --config-file configs/GRiT_B_DenseCap_ObjectDet.yaml --input demo_images --output visualization --opts MODEL.WEIGHTS models/grit_b_densecap_objectdet.pth
	~~~
	Output images will be saved under the `visualization` folder, which looks like:
	<p align="center"> <img src='docs/demo.png' align="center"> </p>

	You can also try the Colab demo provided by the [TWC team](https://github.com/taskswithcode): [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/taskswithcode/GriT/blob/master/TWCGRiT.ipynb)


	## Benchmark Inference and Evaluation
	Please follow [dataset preparation instructions](datasets/DATASETS.md) to download datasets.

	Download our trained models and put them to `models/` for evaluation.
	### Object Detection on COCO 2017 Dataset

	\| Model \| val AP \| test-dev AP \| Download \|
	\|-----------------------\|-----------------\|----------\|----------\|
	\|[GRiT (ViT-B)](configs/GRiT_B_ObjectDet.yaml)\|53.7\|53.8\| [model](https://datarelease.blob.core.windows.net/grit/models/grit_b_objectdet.pth) \|
	\|[GRiT (ViT-L)](configs/GRiT_L_ObjectDet.yaml)\|56.4\|56.6\| [model](https://datarelease.blob.core.windows.net/grit/models/grit_l_objectdet.pth) \|
	\|[GRiT (ViT-H)](configs/GRiT_H_ObjectDet.yaml)\|60.4\|60.4\| [model](https://datarelease.blob.core.windows.net/grit/models/grit_h_objectdet.pth) \|

	To evaluate the trained GRiT on coco 2017 val, run:
	~~~
	# GRiT (ViT-B)
	python train_net.py --num-gpus-per-machine 8 --config-file configs/GRiT_B_ObjectDet.yaml --output-dir-name ./output/grit_b_objectdet --eval-only MODEL.WEIGHTS models/grit_b_objectdet.pth
	# GRiT (ViT-L)
	python train_net.py --num-gpus-per-machine 8 --config-file configs/GRiT_L_ObjectDet.yaml --output-dir-name ./output/grit_l_objectdet --eval-only MODEL.WEIGHTS models/grit_l_objectdet.pth
	# GRiT (ViT-H)
	python train_net.py --num-gpus-per-machine 8 --config-file configs/GRiT_H_ObjectDet.yaml --output-dir-name ./output/grit_h_objectdet --eval-only MODEL.WEIGHTS models/grit_h_objectdet.pth
	~~~

	### Dense Captioning on VG Dataset
	\| Model \| mAP \| Download \|
	\|-----------------------\|-----------------\|----------\|
	\|[GRiT (ViT-B)](configs/GRiT_B_DenseCap.yaml)\|15.5\| [model](https://datarelease.blob.core.windows.net/grit/models/grit_b_densecap.pth) \|

	To test on VG test set, run:
	~~~
	python train_net.py --num-gpus-per-machine 8 --config-file configs/GRiT_B_DenseCap.yaml --output-dir-name ./output/grit_b_densecap --eval-only MODEL.WEIGHTS models/grit_b_densecap.pth
	~~~
	It will save the inference results to `output/grit_b_densecap/vg_instances_results.json`.
	We use the VG dense captioning [official evaluation codebase](https://github.com/jcjohnson/densecap)
	to report the results. We didn't integrate the evaluation code into our project as it was written in Lua.
	To evaluate on VG, please follow the original codebase's instructions and test based upon it. We're happy to discuss
	in our issue section about the issues you may encounter when using their code.

	## Training
	To save training memory, we use [DeepSpeed](https://github.com/microsoft/DeepSpeed) for training which can work well for
	[activation checkpointing](https://pytorch.org/docs/stable/checkpoint.html) in distributed training.

	To train on single machine node, run:
	~~~
	python train_deepspeed.py --num-gpus-per-machine 8 --config-file configs/GRiT_B_ObjectDet.yaml --output-dir-name ./output/grit_b_objectdet
	~~~

	To train on multiple machine nodes, run:
	~~~
	python train_deepspeed.py --num-machines 4 --num-gpus-per-machine 8 --config-file configs/GRiT_B_ObjectDet.yaml --output-dir-name ./output/grit_b_objectdet
	~~~

	## Acknowledgement
	Our code is in part based on [Detic](https://github.com/facebookresearch/Detic),
	[CenterNet2](https://github.com/xingyizhou/CenterNet2),
	[detectron2](https://github.com/facebookresearch/detectron2),
	[GIT](https://github.com/microsoft/GenerativeImage2Text), and
	[transformers](https://github.com/huggingface/transformers).
	We thank the authors and appreciate their great works!

	## Citation

	If you find our work interesting and would like to cite it, please use the following BibTeX entry.

	@article{wu2022grit,
	title={GRiT: A Generative Region-to-text Transformer for Object Understanding},
	author={Wu, Jialian and Wang, Jianfeng and Yang, Zhengyuan and Gan, Zhe and Liu, Zicheng and Yuan, Junsong and Wang, Lijuan},
	journal={arXiv preprint arXiv:2212.00280},
	year={2022}
	}