ydshieh (HF staff) committed
Commit 792da97
1 Parent(s): 2f35862

Update README.md

Files changed (1):
1. README.md +9 -1
README.md CHANGED
# 🖼️ When ViT meets GPT-2 📝

An image captioning model, [ViT-GPT2](https://huggingface.co/flax-community/vit-gpt2/tree/main), built by combining the ViT model and a French GPT-2 model.

The GPT2 model source code is modified so it can accept an encoder's output.

The pretrained weights of both models are loaded, with a set of randomly initialized cross-attention weights.
The model is trained on 65,000 images from the COCO dataset for about 1,500 steps (`batch_size=256`), with the original English captions translated to French for training purposes.
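
For a quick sense of how the combined encoder-decoder is used, here is a minimal, hedged inference sketch. The model class name and import path are assumptions based on the repo's file names, and the French GPT-2 tokenizer checkpoint is a stand-in, not necessarily the one used for training:

```python
# A minimal sketch, assuming the repo exposes a Flax model class matching its
# module names; the exact API may differ in practice.
import requests
from PIL import Image
from transformers import GPT2Tokenizer, ViTFeatureExtractor

# Assumed import path, mirroring vit_gpt2/modeling_flax_vit_gpt2_lm.py.
from vit_gpt2.modeling_flax_vit_gpt2_lm import FlaxViTGPT2LMForConditionalGeneration

feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = GPT2Tokenizer.from_pretrained("asi/gpt-fr-cased-base")  # stand-in French GPT-2 tokenizer
model = FlaxViTGPT2LMForConditionalGeneration.from_pretrained("flax-community/vit-gpt2")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# ViT consumes fixed-size pixel values; there is no attention_mask on the encoder side.
pixel_values = feature_extractor(images=image, return_tensors="np").pixel_values

# Autoregressive decoding of a French caption with the GPT-2 decoder.
outputs = model.generate(pixel_values, max_length=32, num_beams=4)
caption = tokenizer.batch_decode(outputs.sequences, skip_special_tokens=True)[0]
print(caption)
```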
 
**Technical challenges**

- The source code of Flax's version of GPT-2 is modified so it can accept an encoder's outputs and be used as the decoder in an encoder-decoder architecture.

- Originally, we created [**FlaxViTGPT2ForConditionalGenerationModule**](https://huggingface.co/flax-community/vit-gpt2/blob/main/vit_gpt2/modeling_flax_vit_gpt2.py#L86), which is [**FlaxViTGPT2Module**](https://huggingface.co/flax-community/vit-gpt2/blob/main/vit_gpt2/modeling_flax_vit_gpt2.py#L28) (ViT + [GPT-2 without LM head]) with an extra LM head on top. However, when loading the pretrained French GPT-2 model, the LM head's weights were not loaded. We therefore created [**FlaxViTGPT2LMForConditionalGenerationModule**](https://huggingface.co/flax-community/vit-gpt2/blob/main/vit_gpt2/modeling_flax_vit_gpt2_lm.py#L101), which is `ViT + [GPT-2 with LM head]`, so no extra LM head needs to be added. This way, the pretrained LM head's weights are also loaded, and the only randomly initialized weights are the cross-attention weights (see the parameter-inspection sketch after this list).

- The provided training script `run_summarization.py` is modified to send pixel values to the model instead of a sequence of input token ids, a change made necessary in part because the ViT model does not accept an `attention_mask` argument (a sketch of the modified batch preparation also follows this list).
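
To make the second point concrete, here is a hedged sanity-check sketch. It assumes the class and import path used above, and that cross-attention parameters carry "crossattention" in their path, mirroring the naming of transformers' GPT-2 blocks; both are assumptions, not verified facts about this repo:

```python
# A minimal sketch, assuming the class below exists as named and that Flax
# parameter paths label cross-attention blocks with "crossattention".
from flax.traverse_util import flatten_dict

from vit_gpt2.modeling_flax_vit_gpt2_lm import FlaxViTGPT2LMForConditionalGeneration

model = FlaxViTGPT2LMForConditionalGeneration.from_pretrained("flax-community/vit-gpt2")

# model.params is a nested dict of arrays; flatten it to inspect parameter paths.
flat_params = flatten_dict(model.params)

cross_attention = [
    "/".join(map(str, path))
    for path in flat_params
    if any("crossattention" in str(key) for key in path)
]
print(f"{len(cross_attention)} cross-attention tensors (the only randomly initialized weights):")
for name in cross_attention[:5]:
    print(" ", name)
```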
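
And a sketch of the kind of change described in the last bullet. This paraphrases the idea rather than reproducing the actual diff to `run_summarization.py`; the dataset column names (`image_path`, `caption_fr`) and checkpoints are hypothetical:

```python
# A hedged sketch of the modified batch preparation: encoder inputs become
# pixel values, and no encoder attention_mask is built, since ViT does not
# accept one. Column names and checkpoints are hypothetical.
from PIL import Image
from transformers import GPT2Tokenizer, ViTFeatureExtractor

feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = GPT2Tokenizer.from_pretrained("asi/gpt-fr-cased-base")  # stand-in French GPT-2 tokenizer
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

def preprocess_function(examples):
    images = [Image.open(path).convert("RGB") for path in examples["image_path"]]
    model_inputs = {
        # Pixel values replace the tokenized source text of the original script.
        "pixel_values": feature_extractor(images=images, return_tensors="np").pixel_values
    }

    # Decoder targets: tokenized French captions, as in the original script.
    labels = tokenizer(
        examples["caption_fr"],
        max_length=64,
        padding="max_length",
        truncation=True,
        return_tensors="np",
    )
    model_inputs["labels"] = labels["input_ids"]
    # Deliberately no encoder-side "attention_mask" key.
    return model_inputs
```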
A HuggingFace Space demo for this model: [🖼️ French Image Captioning Demo 📝](https://huggingface.co/spaces/flax-community/image-caption-french)