---
license: openrail
inference: false
pipeline_tag: image-to-text
tags:
- image-to-text
- visual-question-answering
- image-captioning
datasets:
- coco
- textvqa
- VQAv2
- OK-VQA
- A-OKVQA
language:
- en
---

# QuickStart

## Installation

```
pip install promptcap
```

## Captioning Pipeline

Generate a prompt-guided caption as follows:

```
import torch
from promptcap import PromptCap

# OFA checkpoints are also supported, e.g. "OFA-Sys/ofa-base"
model = PromptCap("vqascore/promptcap-coco-vqa")

if torch.cuda.is_available():
    model.cuda()

prompt = "please describe this image according to the given question: what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"

print(model.caption(prompt, image))
```

To try generic captioning, use the prompt "please describe this image according to the given question: what does the image describe?"

PromptCap also accepts OCR text as input:

```
# Reuses `model` from the example above
prompt = "please describe this image according to the given question: what year was this taken?"
image = "dvds.jpg"
ocr = "yip AE Mht juor 02/14/2012"

print(model.caption(prompt, image, ocr))
```

## Visual Question Answering Pipeline

Unlike typical VQA models, which treat VQAv2 as a classification task, PromptCap is open-domain and can be paired with arbitrary text-QA models. Here we provide a pipeline that combines PromptCap with UnifiedQA.

```
import torch
from promptcap import PromptCap_VQA

# The QA model can be any UnifiedQA variant, e.g. "allenai/unifiedqa-v2-t5-large-1251000"
vqa_model = PromptCap_VQA(promptcap_model="vqascore/promptcap-coco-vqa", qa_model="allenai/unifiedqa-t5-base")

if torch.cuda.is_available():
    vqa_model.cuda()

question = "what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"

print(vqa_model.vqa(question, image))
```

Similarly, the VQA pipeline supports OCR inputs:

```
question = "what year was this taken?"
image = "dvds.jpg"
ocr = "yip AE Mht juor 02/14/2012"

print(vqa_model.vqa(question, image, ocr=ocr))
```

Because UnifiedQA is flexible, PromptCap also supports multiple-choice VQA:

```
question = "what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"
choices = ["gloves", "socks", "shoes", "coats"]

print(vqa_model.vqa_multiple_choice(question, image, choices))
```
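Every captioning prompt in this README starts with the same instruction prefix followed by a question. As a minimal sketch, assuming that prefix is a fixed template (the README only shows it by example), a small helper can build the prompt from a bare question. `build_prompt`, `PROMPT_PREFIX`, and `GENERIC_QUESTION` are hypothetical names used for illustration; they are not part of the `promptcap` package.

```python
# Prefix copied from the captioning examples in this README.
# Treating it as a fixed template is an assumption.
PROMPT_PREFIX = "please describe this image according to the given question: "

# Question the README suggests for generic captioning.
GENERIC_QUESTION = "what does the image describe?"


def build_prompt(question: str = GENERIC_QUESTION) -> str:
    """Hypothetical helper: prepend the instruction prefix to a question.

    An empty or whitespace-only question falls back to the generic one.
    """
    question = question.strip()
    if not question:
        question = GENERIC_QUESTION
    return PROMPT_PREFIX + question


print(build_prompt("what year was this taken?"))
print(build_prompt())
```

The returned string can be passed as the first argument to `model.caption`, for example `model.caption(build_prompt(question), image)`. The VQA pipeline takes the bare question directly, so it does not need this helper.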