Edit model card

Vision-and-Language Transformer (ViLT), fine-tuned on VQAv2

Vision-and-Language Transformer (ViLT) model fine-tuned on VQAv2. It was introduced in the paper ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision by Kim et al. and first released in this repository.

Disclaimer: The team releasing ViLT did not write a model card for this model so this model card has been written by the Hugging Face team.

Intended uses & limitations

You can use the raw model for visual question answering.

How to use

Here is how to use this model in PyTorch:

from transformers import ViltProcessor, ViltForQuestionAnswering
import requests
from PIL import Image

# prepare image + question
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "How many cats are there?"

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

# prepare inputs
encoding = processor(image, text, return_tensors="pt")

# forward pass
outputs = model(**encoding)
logits = outputs.logits
idx = logits.argmax(-1).item()
print("Predicted answer:", model.config.id2label[idx])

Training data

(to do)

Training procedure

Preprocessing

(to do)

Pretraining

(to do)

Evaluation results

(to do)

BibTeX entry and citation info

@misc{kim2021vilt,
      title={ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision}, 
      author={Wonjae Kim and Bokyung Son and Ildoo Kim},
      year={2021},
      eprint={2102.03334},
      archivePrefix={arXiv},
      primaryClass={stat.ML}
}
Downloads last month
12,182
Hosted inference API
Visual Question Answering
Examples
Examples
Drag image file here or click to browse from your device
This model can be loaded on the Inference API on-demand.

Spaces using dandelin/vilt-b32-finetuned-vqa 5