vit-gpt2 / README.md

๐Ÿ–ผ๏ธ When ViT meets GPT-2 ๐Ÿ“

ViT-GPT2 is an image captioning model that combines the ViT vision encoder with a French GPT-2 decoder.

Part of the Hugging Face JAX/Flax community event.

The GPT-2 model source code is modified so it can accept an encoder's output. The pretrained weights of both models are loaded, with a set of randomly initialized cross-attention weights. The model is trained on 65,000 images from the COCO dataset for about 1,500 steps (batch_size=256), with the original English captions translated to French for training purposes.
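The mechanism described above — decoder queries attending over the image encoder's output through freshly initialized cross-attention projections — can be sketched in a minimal NumPy form. All dimensions, initialization scales, and function names below are illustrative assumptions, not the actual model code:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes for illustration only.
d_model = 64                 # hidden size shared by encoder and decoder
enc_len, dec_len = 197, 10   # ViT patch tokens vs. GPT-2 text tokens

# Cross-attention projections start from random weights,
# as described in the paragraph above.
W_q = rng.standard_normal((d_model, d_model)) * 0.02
W_k = rng.standard_normal((d_model, d_model)) * 0.02
W_v = rng.standard_normal((d_model, d_model)) * 0.02

def cross_attention(decoder_hidden, encoder_output):
    """Decoder token states (queries) attend over the encoder's output (keys/values)."""
    q = decoder_hidden @ W_q             # (dec_len, d_model)
    k = encoder_output @ W_k             # (enc_len, d_model)
    v = encoder_output @ W_v             # (enc_len, d_model)
    scores = q @ k.T / np.sqrt(d_model)  # (dec_len, enc_len)
    return softmax(scores) @ v           # (dec_len, d_model)

enc_out = rng.standard_normal((enc_len, d_model))  # stand-in for ViT features
dec_hid = rng.standard_normal((dec_len, d_model))  # stand-in for GPT-2 states
out = cross_attention(dec_hid, enc_out)
print(out.shape)  # (10, 64)
```

During training, these projections learn to route visual information into the text decoder, which is why they can be initialized randomly even though both backbones start from pretrained weights.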

A Hugging Face Space demo for this model: 🖼️ French Image Captioning Demo 📝