What libraries can I use for Visual Question Answering?

The transformers and transformers.js libraries are compatible with Visual Question Answering.

What models can I use for Visual Question Answering?

The google/deplot, google/matcha-base, and google/pix2struct-ocrvqa-large models can be used for Visual Question Answering.

What datasets can I use for Visual Question Answering?

The Graphcore/vqa and facebook/textvqa datasets can be used for Visual Question Answering.

What metrics can I use for Visual Question Answering?

The accuracy and wu-palmer similarity metrics can be used for Visual Question Answering.

Visual Question Answering

Visual Question Answering (VQA) is the task of answering open-ended questions based on an image. VQA models output natural language responses to natural language questions.

Inputs

Question

What is in this image?

Visual Question Answering Model

Output

elephant

0.970

elephants

0.060

animal

0.003

About Visual Question Answering

Use Cases

Aiding Visually Impaired Persons

VQA models can be used to reduce visual barriers for visually impaired individuals by allowing them to get information about images from the web and the real world.

Education

VQA models can be used to improve experiences at museums by allowing visitors to directly ask the questions they are interested in.

Improved Image Retrieval

Visual question answering models can be used to retrieve images with specific characteristics. For example, the user can ask "Is there a dog?" to find all images with dogs from a set of images.
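The retrieval idea above can be sketched as a filter over a collection of images. This is a minimal sketch under stated assumptions, not a definitive implementation: the function name `filter_images`, the `threshold` value, and the stand-in `fake_vqa` model are all hypothetical. `fake_vqa` only mimics a list-of-dicts output with `answer` and `score` keys; a real VQA model would be passed in its place.

```python
from typing import Callable, Dict, Iterable, List

# A VQA callable takes (image, question) and returns a list of
# {"answer": str, "score": float} dicts, best answer first (assumed format).
VqaFn = Callable[[object, str], List[Dict[str, object]]]


def filter_images(images: Iterable[object], question: str, vqa_fn: VqaFn,
                  threshold: float = 0.5) -> List[object]:
    """Keep images whose top answer to a yes/no question is a confident 'yes'."""
    matches = []
    for image in images:
        predictions = vqa_fn(image, question)
        if not predictions:
            continue
        top = predictions[0]
        if str(top["answer"]).lower() == "yes" and float(top["score"]) >= threshold:
            matches.append(image)
    return matches


# Hypothetical stand-in for a real model: images are file names here, and the
# "model" simply checks whether the name contains the word "dog".
def fake_vqa(image: object, question: str) -> List[Dict[str, object]]:
    answer = "yes" if "dog" in str(image) else "no"
    return [{"answer": answer, "score": 0.9}]


photos = ["dog_park.jpg", "cat_sofa.jpg", "dog_beach.jpg"]
dog_photos = filter_images(photos, "Is there a dog?", fake_vqa)
print(dog_photos)
```

Passing the model in as a callable keeps the filtering logic independent of any particular library, so the same function could wrap a local pipeline or a remote endpoint.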

Video Search

Specific snippets/timestamps of a video can be retrieved based on search queries. For example, the user can ask "At which part of the video does the guitar appear?" and get a specific timestamp range from the whole video.
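One possible way to obtain such a timestamp range, assuming the video is sampled into frames and a VQA model is asked a yes/no question about each frame, is to merge the timestamps of the matching frames into contiguous ranges. The frame-by-frame approach, the `merge_hits` helper, and the `max_gap` parameter are assumptions made for illustration; the source does not describe how video search is implemented.

```python
from typing import List, Tuple


def merge_hits(hit_times: List[float], max_gap: float = 1.0) -> List[Tuple[float, float]]:
    """Merge timestamps (in seconds) of matching frames into (start, end) ranges.

    Two consecutive hits belong to the same range when they are at most
    `max_gap` seconds apart.
    """
    ranges: List[Tuple[float, float]] = []
    for t in sorted(hit_times):
        if ranges and t - ranges[-1][1] <= max_gap:
            # Extend the current range to include this hit.
            ranges[-1] = (ranges[-1][0], t)
        else:
            # Start a new range at this hit.
            ranges.append((t, t))
    return ranges


# Hypothetical per-frame results: timestamps at which a model answered "yes"
# to "Is there a guitar?" on frames sampled once per second.
guitar_hits = [12.0, 13.0, 14.0, 40.0, 41.0]
print(merge_hits(guitar_hits))  # [(12.0, 14.0), (40.0, 41.0)]
```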

Task Variants

Video Question Answering

Video Question Answering aims to answer questions asked about the content of a video.

Inference

You can run inference with Visual Question Answering models using the vqa (or visual-question-answering) pipeline. This pipeline requires the Python Imaging Library (PIL) to process images. You can install it with pip install pillow.

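from PIL import Image
from transformers import pipeline

# Load the default visual question answering pipeline
vqa_pipeline = pipeline("visual-question-answering")

# Open the image and define the question to ask about it
image = Image.open("elephant.jpeg")
question = "Is there an elephant?"

# top_k=1 returns only the highest-scoring answer
vqa_pipeline(image, question, top_k=1)
# [{'score': 0.9998154044151306, 'answer': 'yes'}]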
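
The pipeline returns a list of dictionaries with `score` and `answer` keys, as in the output above. The following is a minimal sketch of how such a list might be post-processed; the helper name `best_answer` and the `min_score` cutoff are assumptions for illustration, and the example predictions are written by hand rather than produced by a model.

```python
from typing import Dict, List, Optional


def best_answer(predictions: List[Dict[str, object]],
                min_score: float = 0.5) -> Optional[str]:
    """Return the highest-scoring answer, or None if it scores below min_score."""
    if not predictions:
        return None
    top = max(predictions, key=lambda p: float(p["score"]))
    return str(top["answer"]) if float(top["score"]) >= min_score else None


# Hand-written predictions in the same list-of-dicts format as the pipeline output.
predictions = [
    {"score": 0.970, "answer": "elephant"},
    {"score": 0.060, "answer": "elephants"},
    {"score": 0.003, "answer": "animal"},
]
print(best_answer(predictions))  # elephant
```

Returning None for low-confidence results lets the calling code decide whether to fall back to another model or ask the user to rephrase the question.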

Useful Resources

The contents of this page are contributed by Bharat Raghunathan and Jose Londono Botero.

Compatible libraries

Transformers

Transformers.js

Visual Question Answering demo

using dandelin/vilt-b32-finetuned-vqa

Models for Visual Question Answering

google/deplot

Note A visual question answering model trained to convert charts and plots to text.

google/matcha-base

Note A visual question answering model trained for mathematical reasoning and chart derendering from images.

google/pix2struct-ocrvqa-large

Note A strong visual question answering model that answers questions about book covers.

Datasets for Visual Question Answering

Graphcore/vqa

Note A widely used dataset containing questions (with answers) about images.

facebook/textvqa

Note A dataset to benchmark visual reasoning based on text in images.

Spaces using Visual Question Answering

merve/pix2struct

Note An application that compares visual question answering models across different tasks.

nielsr/vilt-vqa

Note An application that can answer questions based on images.

Salesforce/BLIP

Note An application that can caption images and answer questions about a given image.

Metrics for Visual Question Answering

accuracy: Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.
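
As a rough illustration, accuracy can be computed either from predicted and reference answers or directly from the four counts in the formula. This is a sketch under stated assumptions: the function names are hypothetical, and comparing answers after lowercasing and stripping whitespace is an assumed normalization, not something the source specifies.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that match their reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    # Assumed normalization: case-insensitive, surrounding whitespace ignored.
    correct = sum(p.strip().lower() == r.strip().lower()
                  for p, r in zip(predictions, references))
    return correct / len(references)


def accuracy_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


preds = ["yes", "2", "Elephant", "red"]
refs = ["yes", "3", "elephant", "red"]
print(accuracy(preds, refs))                 # 3 of 4 answers match
print(accuracy_from_counts(40, 45, 5, 10))   # 85 correct out of 100
```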

wu-palmer similarity: Measures how much a predicted answer differs from the ground truth based on the difference in their semantic meaning.
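
Wu-Palmer similarity is commonly defined over a concept taxonomy as 2 * depth(LCS) / (depth(a) + depth(b)), where LCS is the deepest common ancestor of the two concepts. The sketch below applies that formula to a tiny hand-made taxonomy so that it runs without external data. The taxonomy, the function names, and the convention that the root has depth 1 are assumptions for illustration; practical implementations typically rely on a lexical database such as WordNet.

```python
from typing import Dict, List, Optional

# Toy taxonomy (child -> parent), invented for illustration only.
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "animal": "entity",
    "mammal": "animal",
    "elephant": "mammal",
    "dog": "mammal",
    "bird": "animal",
    "vehicle": "entity",
    "car": "vehicle",
}


def path_to_root(concept: str) -> List[str]:
    """Return [concept, parent, ..., root]."""
    path: List[str] = []
    node: Optional[str] = concept
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(concept: str) -> int:
    """Number of nodes from the concept up to the root (root has depth 1)."""
    return len(path_to_root(concept))


def wu_palmer(a: str, b: str) -> float:
    """Wu-Palmer similarity: 2 * depth(LCS) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b))
    # Walking upward from `a`, the first node shared with `b` is the
    # deepest common ancestor (LCS).
    lcs = next(node for node in path_to_root(a) if node in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wu_palmer("elephant", "elephant"))  # identical concepts
print(wu_palmer("elephant", "dog"))       # share the ancestor "mammal"
print(wu_palmer("elephant", "car"))       # share only the root "entity"
```

With this metric, a prediction of "dog" for a ground truth of "elephant" scores higher than a prediction of "car", whereas plain accuracy would count both as equally wrong.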