<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# VisionTextDualEncoder | |
## Overview | |
The [`VisionTextDualEncoderModel`] can be used to initialize a vision-text dual encoder model with
any pretrained vision autoencoding model as the vision encoder (*e.g.* [ViT](vit), [BEiT](beit), [DeiT](deit)) and any pretrained text autoencoding model as the text encoder (*e.g.* [RoBERTa](roberta), [BERT](bert)).
Two projection layers are added on top of both the vision and text encoders to project the output embeddings
to a shared latent space. The projection layers are randomly initialized, so the model should be fine-tuned on a
downstream task. This model can be used to align the vision-text embeddings with CLIP-like contrastive image-text
training and can then be used for zero-shot vision tasks such as image classification or retrieval.

In [LiT: Zero-Shot Transfer with Locked-image Text Tuning](https://arxiv.org/abs/2111.07991) it is shown that
leveraging pretrained (locked/frozen) image and text models for contrastive learning yields significant improvement on
new zero-shot vision tasks such as image classification or retrieval.
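
Below is a minimal sketch of how such a dual encoder could be assembled from pretrained checkpoints and queried with an image-text pair. The checkpoint names and example image URL are only illustrative, and because the projection layers are freshly initialized, the similarity scores are meaningful only after contrastive fine-tuning.

```python
import requests
import torch
from PIL import Image

from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    VisionTextDualEncoderModel,
    VisionTextDualEncoderProcessor,
)

# Combine a pretrained ViT vision encoder with a pretrained BERT text encoder.
# The projection layers on top of both encoders are randomly initialized.
model = VisionTextDualEncoderModel.from_vision_text_pretrained(
    "google/vit-base-patch16-224", "google-bert/bert-base-uncased"
)

# Wrap the matching image processor and tokenizer in a single processor.
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
processor = VisionTextDualEncoderProcessor(image_processor, tokenizer)

# Example image (illustrative URL) and candidate captions.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=image,
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    outputs = model(**inputs)

# CLIP-like image-text similarity scores; these are only meaningful
# once the model has been fine-tuned with contrastive training.
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
```

After contrastive fine-tuning, the same forward pass can be used for zero-shot image classification or image-text retrieval by ranking the candidate texts (or images) by their similarity scores.
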
## VisionTextDualEncoderConfig | |
[[autodoc]] VisionTextDualEncoderConfig | |
## VisionTextDualEncoderProcessor | |
[[autodoc]] VisionTextDualEncoderProcessor | |
## VisionTextDualEncoderModel | |
[[autodoc]] VisionTextDualEncoderModel | |
- forward | |
## FlaxVisionTextDualEncoderModel | |
[[autodoc]] FlaxVisionTextDualEncoderModel | |
- __call__ | |
## TFVisionTextDualEncoderModel | |
[[autodoc]] TFVisionTextDualEncoderModel | |
- call | |