Transformers documentation

You are viewing the documentation for v4.39.2. A newer version, v4.46.3, is available.

BLIP

Overview

The BLIP model was proposed in "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation" by Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi.

BLIP can perform several multimodal tasks, including:

Visual question answering
Image-text retrieval (image-text matching)
Image captioning
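
As a minimal sketch of the image captioning task listed above, the helper below wraps the processor and model calls. It assumes the `Salesforce/blip-image-captioning-base` checkpoint and a sample COCO image URL, neither of which is named on this page. `caption_image` and `run_demo` are illustrative names, not part of the library.

```python
def caption_image(image, processor, model, prompt=None, max_new_tokens=30):
    """Return a caption for one image.

    `processor` is expected to behave like BlipProcessor and `model` like
    BlipForConditionalGeneration. If `prompt` is given, the caption is
    generated as a continuation of that text.
    """
    if prompt is None:
        inputs = processor(images=image, return_tensors="pt")
    else:
        inputs = processor(images=image, text=prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()


def run_demo():
    """Download an assumed checkpoint and sample image, then print captions."""
    # Heavy imports stay local so caption_image can be reused without them.
    import requests
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    checkpoint = "Salesforce/blip-image-captioning-base"  # assumed checkpoint
    processor = BlipProcessor.from_pretrained(checkpoint)
    model = BlipForConditionalGeneration.from_pretrained(checkpoint)

    url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # assumed sample image
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

    print(caption_image(image, processor, model))                         # unconditional caption
    print(caption_image(image, processor, model, prompt="a picture of"))  # prompted caption
```

Calling `run_demo()` requires network access plus the `transformers`, `torch`, `Pillow`, and `requests` packages.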

The abstract from the paper is the following:

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released.
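
The caption bootstrapping idea in the abstract can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' implementation: `bootstrap_captions`, `toy_captioner`, and `toy_filter` are hypothetical names, and the word-overlap filter merely stands in for a learned image-text matching model. Following the paper's description, the filter is applied to both the original web caption and the synthetic one.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (image_id, caption)


def bootstrap_captions(
    web_pairs: Iterable[Pair],
    captioner: Callable[[str], str],
    is_match: Callable[[str, str], bool],
) -> List[Pair]:
    """Keep only the (image, caption) pairs that the filter accepts.

    For each web image, the captioner proposes a synthetic caption; the
    filter then judges the web caption and the synthetic caption separately.
    """
    cleaned = []
    for image_id, web_caption in web_pairs:
        for caption in (web_caption, captioner(image_id)):
            if is_match(image_id, caption):
                cleaned.append((image_id, caption))
    return cleaned


# Hypothetical stand-ins: what each image "really" shows.
toy_content = {"img1": "a dog on a beach", "img2": "a red bicycle"}


def toy_captioner(image_id: str) -> str:
    # A real captioner would be an image-grounded text decoder.
    return toy_content[image_id]


def toy_filter(image_id: str, caption: str) -> bool:
    # A real filter would be an image-text matching model; here we just
    # require at least two words in common with the image content.
    truth = set(toy_content[image_id].split())
    return len(truth & set(caption.lower().split())) >= 2


web_pairs = [("img1", "buy cheap flights now"), ("img2", "red bicycle for sale")]
cleaned = bootstrap_captions(web_pairs, toy_captioner, toy_filter)
print(cleaned)
```

In this toy run the unrelated web caption for `img1` is dropped, while `img2` keeps both its web caption and its synthetic one.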

This model was contributed by ybelkada. The original code is available in the salesforce/BLIP repository on GitHub.

Resources

Jupyter notebook on how to fine-tune BLIP for image captioning on a custom dataset

BlipConfig

class transformers.BlipConfig

from_text_vision_configs

BlipTextConfig

class transformers.BlipTextConfig

BlipVisionConfig

class transformers.BlipVisionConfig

BlipProcessor

class transformers.BlipProcessor

batch_decode

decode

BlipImageProcessor

class transformers.BlipImageProcessor

preprocess

BlipModel

class transformers.BlipModel

forward

get_text_features

get_image_features

BlipTextModel

class transformers.BlipTextModel

forward

BlipVisionModel

class transformers.BlipVisionModel

forward

BlipForConditionalGeneration

class transformers.BlipForConditionalGeneration

forward

BlipForImageTextRetrieval

class transformers.BlipForImageTextRetrieval

forward

BlipForQuestionAnswering

class transformers.BlipForQuestionAnswering

forward
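
A short sketch of how visual question answering might look with this class, assuming the `Salesforce/blip-vqa-base` checkpoint and a sample COCO image URL (neither is named on this page). `answer_question` and `run_vqa_demo` are illustrative names, not library functions.

```python
def answer_question(image, question, processor, model, max_new_tokens=10):
    """Return the model's answer to `question` about `image`.

    `processor` is expected to behave like BlipProcessor and `model` like
    BlipForQuestionAnswering.
    """
    # The processor tokenizes the question and preprocesses the image together.
    inputs = processor(images=image, text=question, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()


def run_vqa_demo():
    """Download an assumed checkpoint and sample image, then print an answer."""
    # Heavy imports stay local so answer_question can be reused without them.
    import requests
    from PIL import Image
    from transformers import BlipForQuestionAnswering, BlipProcessor

    checkpoint = "Salesforce/blip-vqa-base"  # assumed checkpoint
    processor = BlipProcessor.from_pretrained(checkpoint)
    model = BlipForQuestionAnswering.from_pretrained(checkpoint)

    url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # assumed sample image
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

    print(answer_question(image, "How many cats are in the picture?", processor, model))
```

Calling `run_vqa_demo()` requires network access plus the `transformers`, `torch`, `Pillow`, and `requests` packages.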

TFBlipModel

class transformers.TFBlipModel

call

get_text_features

get_image_features

TFBlipTextModel

class transformers.TFBlipTextModel

call

TFBlipVisionModel

class transformers.TFBlipVisionModel

call

TFBlipForConditionalGeneration

class transformers.TFBlipForConditionalGeneration

call

TFBlipForImageTextRetrieval

class transformers.TFBlipForImageTextRetrieval

call

TFBlipForQuestionAnswering

class transformers.TFBlipForQuestionAnswering

call