Visual Question Answering (VQA) is the task of answering open-ended questions based on an image. VQA models take an image and a natural language question as input and output a natural language answer.

For example, given an image of an elephant and the question "What is in this image?", a VQA model might return a ranked list of answers with confidence scores, such as "elephant" (0.970), "elephants" (0.060), and "animal" (0.003).

## Use Cases

### Aiding the Visually Impaired

VQA models can be used to reduce visual barriers for visually impaired individuals by allowing them to get information about images from the web and the real world.

### Education

VQA models can be used to improve museum experiences by allowing visitors to directly ask questions about the exhibits they are interested in.

### Improved Image Retrieval

Visual question answering models can be used to retrieve images with specific characteristics. For example, the user can ask "Is there a dog?" to find all images with dogs from a set of images.
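
A minimal sketch of this idea, assuming a folder named `photos` of JPEG images and a confidence threshold of 0.9 (both illustrative), using the `visual-question-answering` pipeline described in the Inference section below:

```python
from pathlib import Path

from PIL import Image
from transformers import pipeline

vqa_pipeline = pipeline("visual-question-answering")

# Keep only the images for which the model answers "yes" with high confidence.
matches = []
for path in Path("photos").glob("*.jpg"):  # hypothetical image folder
    image = Image.open(path)
    result = vqa_pipeline(image, "Is there a dog?", top_k=1)[0]
    if result["answer"].lower() == "yes" and result["score"] > 0.9:
        matches.append(path)

print(matches)
```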

### Video Search

Specific snippets/timestamps of a video can be retrieved based on search queries. For example, the user can ask "At which part of the video does the guitar appear?" and get a specific timestamp range from the whole video.
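
A minimal sketch under the assumption that frames have already been extracted from the video (the `frames/frame_XXXX.jpg` paths and the one-frame-per-second sampling are illustrative; the frame extraction itself, e.g. with ffmpeg, is omitted):

```python
from PIL import Image
from transformers import pipeline

vqa_pipeline = pipeline("visual-question-answering")

# Hypothetical list of (timestamp in seconds, frame) pairs, one frame per
# second, extracted beforehand from a two-minute video.
frames = [(t, Image.open(f"frames/frame_{t:04d}.jpg")) for t in range(120)]

# Keep the timestamps where the model answers "yes".
timestamps = [
    t
    for t, frame in frames
    if vqa_pipeline(frame, "Is there a guitar?", top_k=1)[0]["answer"] == "yes"
]
print(timestamps)  # e.g. [42, 43, 44] -> the guitar appears around 0:42-0:44
```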

## Inference

You can infer with Visual Question Answering models using the `vqa` (or `visual-question-answering`) pipeline. This pipeline requires the Python Imaging Library (PIL) to process images; you can install it with `pip install pillow`.

```python
from PIL import Image
from transformers import pipeline

# "vqa" is an alias for the "visual-question-answering" pipeline.
vqa_pipeline = pipeline("visual-question-answering")

image = Image.open("elephant.jpeg")
question = "Is there an elephant?"

vqa_pipeline(image, question, top_k=1)
# e.g. [{'score': 0.9998, 'answer': 'yes'}]
```
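
You can also pin the pipeline to a specific checkpoint instead of relying on the default. A short sketch using the `dandelin/vilt-b32-finetuned-vqa` checkpoint, a ViLT model fine-tuned on the VQAv2 dataset:

```python
from transformers import pipeline

# Pin a specific checkpoint instead of the pipeline default.
vqa_pipeline = pipeline(
    "visual-question-answering",
    model="dandelin/vilt-b32-finetuned-vqa",
)
```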


The contents of this page are contributed by Bharat Raghunathan and Jose Londono Botero.
