BLIP

Overview

The BLIP model was proposed in BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation by Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi.

BLIP is a model that can perform a variety of multimodal tasks, including:

Visual question answering
Image-text retrieval (image-text matching)
Image captioning

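The abstract from the paper is the following:

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released.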

This model was contributed by ybelkada. The original code can be found here.

Resources

Jupyter notebook on how to fine-tune BLIP for image captioning on a custom dataset
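As a rough illustration of what a captioning fine-tuning step looks like, the sketch below runs a single optimization step on a tiny, randomly initialized BlipForConditionalGeneration. It is not taken from the notebook: the model sizes, token ids, learning rate, and random tensors are all assumptions chosen so the example runs offline. In a real setup the model would be loaded from a pretrained checkpoint and the batch would come from BlipProcessor.

```python
import torch
from transformers import BlipConfig, BlipForConditionalGeneration

# Tiny random model so the sketch runs without downloads (sizes are illustrative).
config = BlipConfig(
    text_config=dict(
        vocab_size=100, hidden_size=32, intermediate_size=64,
        num_hidden_layers=2, num_attention_heads=4, max_position_embeddings=32,
        pad_token_id=0, bos_token_id=1, eos_token_id=2, sep_token_id=3,
    ),
    vision_config=dict(
        hidden_size=32, intermediate_size=64, num_hidden_layers=2,
        num_attention_heads=4, image_size=32, patch_size=8,
    ),
)
model = BlipForConditionalGeneration(config)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Stand-ins for a processed batch: 2 images and 2 tokenized captions of length 6.
pixel_values = torch.randn(2, 3, 32, 32)
input_ids = torch.randint(4, 100, (2, 6))
input_ids[:, 0] = 1  # start every caption with the (assumed) BOS id

# Passing the caption ids as labels makes the model return a language-modeling loss.
outputs = model(pixel_values=pixel_values, input_ids=input_ids, labels=input_ids)
loss = outputs.loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(float(loss))
```

In practice this step would sit inside a loop over a DataLoader, with padding positions in the labels handled according to the dataset.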

BlipConfig

class transformers.BlipConfig

from_text_vision_configs
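A minimal sketch of how from_text_vision_configs might be used: it combines a separately built text configuration and vision configuration into one BlipConfig. The small sizes below are assumptions chosen for illustration, not the defaults of any released checkpoint.

```python
from transformers import BlipConfig, BlipTextConfig, BlipVisionConfig

# Illustrative (assumed) sizes; released checkpoints use much larger values.
text_config = BlipTextConfig(
    vocab_size=100, hidden_size=64, intermediate_size=128,
    num_hidden_layers=2, num_attention_heads=4,
)
vision_config = BlipVisionConfig(
    hidden_size=64, intermediate_size=128, num_hidden_layers=2,
    num_attention_heads=4, image_size=32, patch_size=8,
)

# Merge the two sub-configurations into a single BLIP configuration.
config = BlipConfig.from_text_vision_configs(text_config, vision_config)

print(config.text_config.hidden_size)
print(config.vision_config.image_size)
```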

BlipTextConfig

class transformers.BlipTextConfig

BlipVisionConfig

class transformers.BlipVisionConfig

BlipProcessor

class transformers.BlipProcessor

BlipImageProcessor

class transformers.BlipImageProcessor

preprocess
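The following sketch shows what calling the image processor on a single image could look like. It assumes the default BlipImageProcessor settings, which resize to 384 x 384 and then rescale and normalize, and it uses a synthetic solid-color image in place of a real photo.

```python
from PIL import Image
from transformers import BlipImageProcessor

# Default settings; constructing the processor this way needs no download.
image_processor = BlipImageProcessor()

# Synthetic 640x480 RGB image used as a placeholder for a real photo.
image = Image.new("RGB", (640, 480), color=(120, 60, 200))

# Calling the processor runs preprocess and returns batched pixel values.
inputs = image_processor(images=image, return_tensors="pt")
pixel_values = inputs["pixel_values"]
print(tuple(pixel_values.shape))
```

With the default settings the result is a batch of one image in channels-first layout, ready to pass to a BLIP model as pixel_values.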

BlipImageProcessorFast

class transformers.BlipImageProcessorFast

preprocess

BlipModel

class transformers.BlipModel

forward

get_text_features

get_image_features
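To make the shape of BlipModel's output concrete, here is a sketch that scores 2 images against 3 texts with a tiny, randomly initialized model. The configuration values and random inputs are assumptions for illustration, so the scores themselves are meaningless. Only the shapes carry over to a pretrained model.

```python
import torch
from transformers import BlipConfig, BlipModel

# Tiny random model (assumed sizes) so the example runs offline.
config = BlipConfig(
    text_config=dict(
        vocab_size=100, hidden_size=32, intermediate_size=64,
        num_hidden_layers=2, num_attention_heads=4, max_position_embeddings=32,
        pad_token_id=0, bos_token_id=1, eos_token_id=2, sep_token_id=3,
    ),
    vision_config=dict(
        hidden_size=32, intermediate_size=64, num_hidden_layers=2,
        num_attention_heads=4, image_size=32, patch_size=8,
    ),
    projection_dim=16,
)
model = BlipModel(config).eval()

pixel_values = torch.randn(2, 3, 32, 32)    # 2 images
input_ids = torch.randint(4, 100, (3, 6))   # 3 tokenized texts of length 6

with torch.no_grad():
    outputs = model(input_ids=input_ids, pixel_values=pixel_values)

# One row per image, one column per text.
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)  # distribution over texts for each image
print(tuple(logits_per_image.shape))
```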

BlipTextModel

class transformers.BlipTextModel

forward

BlipVisionModel

class transformers.BlipVisionModel

forward

BlipForConditionalGeneration

class transformers.BlipForConditionalGeneration

forward
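For image captioning with pretrained weights, a sketch along the following lines is typical. It assumes network access and the Salesforce/blip-image-captioning-base checkpoint on the Hugging Face Hub. The helper name caption_image and the max_new_tokens value are illustrative choices, not part of the library.

```python
from transformers import BlipForConditionalGeneration, BlipProcessor

# Assumed checkpoint name; loading it requires network access on first use.
CHECKPOINT = "Salesforce/blip-image-captioning-base"


def caption_image(image, prompt=None, max_new_tokens=30):
    """Hypothetical helper: return a caption for a PIL image.

    If `prompt` is given, the caption is generated as a continuation of it.
    """
    processor = BlipProcessor.from_pretrained(CHECKPOINT)
    model = BlipForConditionalGeneration.from_pretrained(CHECKPOINT)

    if prompt is None:
        inputs = processor(images=image, return_tensors="pt")
    else:
        inputs = processor(images=image, text=prompt, return_tensors="pt")

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

A call such as caption_image(Image.open("photo.jpg").convert("RGB"), prompt="a photography of") would then return a string, where photo.jpg is a placeholder path and Image comes from PIL. For repeated use, the processor and model should be loaded once outside the helper.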

BlipForImageTextRetrieval

class transformers.BlipForImageTextRetrieval

forward
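The retrieval model can score image-text pairs in two ways, selected by the use_itm_head argument. The sketch below shows both on a tiny, randomly initialized model. All sizes and inputs are assumptions, and reading index 1 of the matching head as the "match" class follows the convention of the original BLIP code rather than anything stated on this page.

```python
import torch
from transformers import BlipConfig, BlipForImageTextRetrieval

# Tiny random model (assumed sizes) so the example runs offline.
config = BlipConfig(
    text_config=dict(
        vocab_size=100, hidden_size=32, intermediate_size=64,
        num_hidden_layers=2, num_attention_heads=4, max_position_embeddings=32,
        pad_token_id=0, bos_token_id=1, eos_token_id=2, sep_token_id=3,
    ),
    vision_config=dict(
        hidden_size=32, intermediate_size=64, num_hidden_layers=2,
        num_attention_heads=4, image_size=32, patch_size=8,
    ),
    image_text_hidden_size=16,
)
model = BlipForImageTextRetrieval(config).eval()

pixel_values = torch.randn(2, 3, 32, 32)    # 2 images
input_ids = torch.randint(4, 100, (2, 6))   # 2 tokenized texts, paired with the images

with torch.no_grad():
    # Image-text matching head: two logits per (image, text) pair.
    itm = model(input_ids=input_ids, pixel_values=pixel_values, use_itm_head=True)
    # Without the head: cosine similarity between every image and every text.
    itc = model(input_ids=input_ids, pixel_values=pixel_values, use_itm_head=False)

match_prob = itm.itm_score.softmax(dim=1)[:, 1]  # assumed: index 1 means "match"
similarity = itc.itm_score                        # shape (num_images, num_texts)
print(tuple(itm.itm_score.shape), tuple(similarity.shape))
```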

BlipForQuestionAnswering

class transformers.BlipForQuestionAnswering

forward

TFBlipModel

class transformers.TFBlipModel

call

get_text_features

get_image_features

TFBlipTextModel

class transformers.TFBlipTextModel

call

TFBlipVisionModel

class transformers.TFBlipVisionModel

call

TFBlipForConditionalGeneration

class transformers.TFBlipForConditionalGeneration

call

TFBlipForImageTextRetrieval

class transformers.TFBlipForImageTextRetrieval

call

TFBlipForQuestionAnswering

class transformers.TFBlipForQuestionAnswering

call