
We follow an encoder-decoder approach for image captioning, where the image encoder is the CLIP Vision model (a ViT transformer). The pre-training task is image-to-text generation. To build the decoder inputs, we shift the input tokens one position to the right, prepending the <eos> token, while the original input tokens serve as the labels. The model is trained on the dataset in an end-to-end fashion.
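
As a minimal sketch of the shifting step described above (the function name and `-100` label-padding convention are assumptions for illustration, not necessarily the exact training code), the decoder inputs can be built like this:

```python
import numpy as np

def shift_tokens_right(input_ids: np.ndarray, pad_token_id: int, decoder_start_token_id: int) -> np.ndarray:
    """Shift token ids one position to the right, prepending the decoder start token (<eos> here).

    The unshifted `input_ids` remain the labels; the shifted copy is fed to the decoder.
    """
    shifted = np.zeros_like(input_ids)
    shifted[:, 1:] = input_ids[:, :-1]          # move every token one step to the right
    shifted[:, 0] = decoder_start_token_id      # first decoder input is the <eos>/start token
    # If labels use -100 to mask padding, map those positions back to the pad token id
    shifted = np.where(shifted == -100, pad_token_id, shifted)
    return shifted
```

In this setup the decoder learns to predict token t of the caption given the image features and tokens 0..t-1, which is the standard teacher-forced image-to-text generation objective.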