<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# GIT

## Overview

The GIT model was proposed in [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by
Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang. GIT is a decoder-only Transformer
that leverages [CLIP](clip)'s vision encoder to condition the model on vision inputs besides text. The model obtains state-of-the-art results on
image captioning and visual question answering benchmarks.

The abstract from the paper is the following:

*In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture as one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost the model performance. Without bells and whistles, our GIT establishes new state of the arts on 12 challenging benchmarks with a large margin. For instance, our model surpasses the human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks.*
Tips:

- GIT is implemented in a very similar way to GPT-2, the only difference being that the model is also conditioned on `pixel_values`.
- One can use [`GitProcessor`] to prepare images for the model, and the `generate` method for autoregressive generation, as shown in the sketch below.
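A minimal sketch of image captioning along these lines, assuming a GIT checkpoint fine-tuned for captioning (here `microsoft/git-base-coco`) is available on the Hub:

```python
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# assumed checkpoint; any GIT checkpoint fine-tuned for image captioning should work similarly
processor = AutoProcessor.from_pretrained("microsoft/git-base-coco")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-coco")

# load an example image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# the processor turns the image into the pixel_values the model is conditioned on
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# autoregressive generation of a caption
generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```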
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/git_architecture.jpg"
alt="drawing" width="600"/>

<small> GIT architecture. Taken from the <a href="https://arxiv.org/abs/2205.14100" target="_blank">original paper</a>. </small>

This model was contributed by [nielsr](https://huggingface.co/nielsr).
The original code can be found [here](https://github.com/microsoft/GenerativeImage2Text).
## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with GIT.

- Demo notebooks regarding inference + fine-tuning GIT on custom data can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/GIT).
- See also: [Causal language modeling task guide](../tasks/language_modeling)

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we will review it.
The resource should ideally demonstrate something new instead of duplicating an existing resource.
## GitVisionConfig

[[autodoc]] GitVisionConfig

## GitVisionModel

[[autodoc]] GitVisionModel
    - forward

## GitConfig

[[autodoc]] GitConfig
    - all

## GitProcessor

[[autodoc]] GitProcessor
    - __call__

## GitModel

[[autodoc]] GitModel
    - forward

## GitForCausalLM

[[autodoc]] GitForCausalLM
    - forward