---
license: openrail
inference: false
pipeline_tag: image-to-text
tags:
- image-to-text
- visual-question-answering
- image-captioning
datasets:
- coco
- textvqa
- VQAv2
- OK-VQA
- A-OKVQA
language:
- en
---
# QuickStart
## Installation
```
pip install promptcap
```
Two pipelines are included. One is for image captioning, and the other is for visual question answering.
## Captioning Pipeline
Please follow the prompt format below, which gives the best performance.
Generate a prompt-guided caption as follows:
```python
import torch
from promptcap import PromptCap
model = PromptCap("vqascore/promptcap-coco-vqa")  # also supports OFA checkpoints, e.g. "OFA-Sys/ofa-large"
if torch.cuda.is_available():
    model.cuda()
prompt = "please describe this image according to the given question: what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"
print(model.caption(prompt, image))
```
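The question-conditioned prompt is just the template "please describe this image according to the given question: {question}". If you generate prompts for many questions, a small helper keeps the format consistent (a minimal sketch; `build_prompt` is a hypothetical convenience function, not part of the promptcap package):
```python
# Hypothetical helper (not part of promptcap): fills the documented prompt template.
def build_prompt(question: str) -> str:
    return f"please describe this image according to the given question: {question}"

print(model.caption(build_prompt("what piece of clothing is this boy putting on?"), "glove_boy.jpeg"))
```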
To try generic captioning, just use the prompt "what does the image describe?":
```python
prompt = "what does the image describe?"
image = "glove_boy.jpeg"
print(model.caption(prompt, image))
```
PromptCap also supports OCR inputs:
```python
prompt = "please describe this image according to the given question: what year was this taken?"
image = "dvds.jpg"
ocr = "yip AE Mht juor 02/14/2012"
print(model.caption(prompt, image, ocr))
```
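The `ocr` argument is a plain string, so it can come from any OCR engine. A minimal sketch, assuming `pytesseract` and the Tesseract binary are installed (an assumption; any OCR output works):
```python
# Assumption: pytesseract + Tesseract are available; PromptCap only needs
# the recognized text as a plain string.
from PIL import Image
import pytesseract

image = "dvds.jpg"
ocr = pytesseract.image_to_string(Image.open(image)).strip()
print(model.caption("please describe this image according to the given question: what year was this taken?", image, ocr))
```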
## Visual Question Answering Pipeline
Unlike typical VQA models, which treat VQAv2 as a classification task, PromptCap is open-domain and can be paired with arbitrary text-QA models.
Here we provide a pipeline for combining PromptCap with UnifiedQA.
```python
import torch
from promptcap import PromptCap_VQA
# the QA model supports all UnifiedQA variants, e.g. "allenai/unifiedqa-v2-t5-large-1251000"
vqa_model = PromptCap_VQA(promptcap_model="vqascore/promptcap-coco-vqa", qa_model="allenai/unifiedqa-t5-base")
if torch.cuda.is_available():
    vqa_model.cuda()
question = "what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"
print(vqa_model.vqa(question, image))
```
Similarly, the VQA pipeline supports OCR inputs:
```python
question = "what year was this taken?"
image = "dvds.jpg"
ocr = "yip AE Mht juor 02/14/2012"
print(vqa_model.vqa(question, image, ocr=ocr))
```
Thanks to the flexibility of UnifiedQA, PromptCap also supports multiple-choice VQA:
```python
question = "what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"
choices = ["gloves", "socks", "shoes", "coats"]
print(vqa_model.vqa_multiple_choice(question, image, choices))
```
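Since the pipeline is an ordinary Python object, asking several questions about one image is just a loop over the documented `vqa` call (the second question below is illustrative):
```python
# Reuses only the documented vqa() call; the second question is illustrative.
image = "glove_boy.jpeg"
questions = [
    "what piece of clothing is this boy putting on?",
    "what sport is the boy about to play?",
]
for question in questions:
    print(question, "->", vqa_model.vqa(question, image))
```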
## Bibtex
```
@article{hu2022promptcap,
  title={PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3},
  author={Hu, Yushi and Hua, Hang and Yang, Zhengyuan and Shi, Weijia and Smith, Noah A and Luo, Jiebo},
  journal={arXiv preprint arXiv:2211.09699},
  year={2022}
}
```