
We follow an encoder-decoder approach for image captioning, where the image encoder is the CLIP Vision model (a ViT transformer). The pre-training task is image-to-text generation. To build the decoder inputs, we shift the input tokens one position to the right, prepending the <eos> token, while the original input tokens serve as the labels. The model is trained on the dataset in an end-to-end fashion.
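
As a minimal sketch of the shifting step described above (the function name and `-100` label-padding convention are assumptions for illustration, not necessarily the exact training code), the decoder inputs can be built like this:

```python
import numpy as np

def shift_tokens_right(input_ids: np.ndarray, pad_token_id: int, decoder_start_token_id: int) -> np.ndarray:
    """Shift token ids one position to the right, prepending the decoder start token (<eos> here).

    The unshifted `input_ids` remain the labels; the shifted copy is fed to the decoder.
    """
    shifted = np.zeros_like(input_ids)
    shifted[:, 1:] = input_ids[:, :-1]          # move every token one step to the right
    shifted[:, 0] = decoder_start_token_id      # first decoder input is the <eos>/start token
    # If labels use -100 to mask padding, map those positions back to the pad token id
    shifted = np.where(shifted == -100, pad_token_id, shifted)
    return shifted
```

In this setup the decoder learns to predict token t of the caption given the image features and tokens 0..t-1, which is the standard teacher-forced image-to-text generation objective.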