gchhablani committed
Commit 7a6097b
1 Parent(s): 6929d06

Update README.md

Files changed (1)
  1. README.md +8 -8
README.md CHANGED
@@ -1,12 +1,12 @@
 # CLIP-Vision-Marian Seq2Seq Encoder-Decoder Model
 
-Pretrained CLIP-Vision-Marian pre-trained on subset of Spanish translated Conceptual-12M image-text pairs using a seq2seq model training objective. 2.5M cleaned English image-text pairs are translated using Spanish Marian Model. We trained CLIP-Vision-Marian model during community week hosted by Huggingface 🤗 using JAX/Flax.
+Pretrained CLIP-Vision-Marian pre-trained on a subset of Spanish-translated Conceptual-12M image-text pairs using a seq2seq model training objective. 2.5M cleaned English image-text pairs are translated using Spanish Marian Model. We trained CLIP-Vision-Marian model during community week hosted by Huggingface 🤗 using JAX/Flax.
 
 ## Model description
-CLIP-Vision-Marian is a modified Marian model which takes in visual embeddings from CLIP-Vision transformer and concatenates them with Marian textual embeddings before passing them to the self-attention layers of Marian. This is done for deep cross-modal interaction between the two modes.
+CLIP-Vision-Marian is a modified transformers model which takes in visual embeddings from CLIP-Vision transformer and feeds into the `encoder_hidden_states` of a Marian decoder. This is done for deep cross-modal interaction via `cross-attention` between the two modes. The decoder then predicts logits for the `input_ids` provided and can be used for generation.
 
 ## Intended uses & limitations❗️
-You can use the raw model for encoder decoder network where you want the encoder to encode images and decoder to decode text.
+You can use the raw model for encoder-decoder network where you want the encoder to encode images and the decoder to decode text.
 
 Note that this model is primarily aimed at being fine-tuned on tasks like Spanish image captioning.
 
@@ -25,7 +25,7 @@ img = wget.download("https://huggingface.co/streamlitiframe/flax-community/spani
 >>> clip_outputs = clip_processor(images=img)
 >>> clip_outputs['pixel_values'][0] = clip_outputs['pixel_values'][0].transpose(1,2,0) # Need to transpose images as model expected channel last images.
 >>> tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-es')
->>> model = FlaxCLIPVisionMarianMT.from_pretrained('flax-community/clip-vit-base-patch32_marian')
+>>> model = FlaxCLIPVisionMarianMT.from_pretrained('flax-community/clip-vit-base-patch32_marian-es')
 >>> output_ids = model.generate(batch["pixel_values"], early_stopping=True, num_beams=4, max_length=64).sequences
 >>> output_string = tokenizer.batch_decode(output_ids.reshape(-1, 64), skip_special_tokens=True, max_length=64)
 >>> output_string
@@ -39,20 +39,20 @@ The Spanish image captioning model was trained on a subset of Conceptual 12M dat
 [Conceptual 12M](https://github.com/google-research-datasets/conceptual-12m), Introduced by Changpinyo et al. in [Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts](https://arxiv.org/abs/2102.08981).
 
 ### Please update the dataset link here
-The translated dataset can be downloaded from [conceptual-12m-multilingual-marian](https://huggingface.co/datasets/flax-community/conceptual-12m-multilingual-marian-es). We do not provide images as we do not own any of them. One can download images from the `image_url` section of the original Conceptual 12M dataset.
+The translated dataset can be downloaded from [conceptual-12m-multilingual-marian](https://huggingface.co/datasets/flax-community/conceptual-12m-multilingual-marian). We do not provide images as we do not own any of them. One can download images from the `image_url` section of the original Conceptual 12M dataset.
 
 ## Data Cleaning 🧹
 Though the original dataset contains 12M image-text pairs, a lot of the URLs are invalid now, and in some cases, images are corrupt or broken. We remove such examples from our data, which leaves us with approximately 10M image-text pairs, out of which we took only 2.5M image, caption pairs.
 
 #### **Train set:**
 Total data: <br>
-2500000 captions <br>
-2500000 images <br>
+2475000 captions <br>
+2475000 images <br>
 
 #### **Validation set**
 Total data: <br>
 25000 captions <br>
-25000 images <br><br>
+25000 images <br>
 
 
 ## Training procedure 👨🏻‍💻
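The updated model description above says the CLIP-Vision visual embeddings are fed in as the `encoder_hidden_states` that the Marian decoder cross-attends to. The sketch below illustrates that wiring using only stock `transformers` Flax classes; it is a minimal sketch, not the released `FlaxCLIPVisionMarianMT` implementation. In particular, the random matrix mapping CLIP's 768-dimensional states to Marian's 512-dimensional hidden size is a placeholder assumption, and the only checkpoints used are the public `openai/clip-vit-base-patch32` and `Helsinki-NLP/opus-mt-en-es`.

```python
import jax
import jax.numpy as jnp
from PIL import Image
from transformers import CLIPProcessor, FlaxCLIPVisionModel, FlaxMarianMTModel

# 1. Encode an image with the CLIP vision tower (a blank image stands in for real data).
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
vision_model = FlaxCLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
pixel_values = processor(images=Image.new("RGB", (224, 224)), return_tensors="np")["pixel_values"]
visual_states = vision_model(pixel_values).last_hidden_state        # (1, 50, 768)

# 2. Map the visual states to Marian's hidden size (768 -> 512).
#    Placeholder random projection -- NOT the mapping the released checkpoint learned.
marian = FlaxMarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-es")
proj = 0.02 * jax.random.normal(jax.random.PRNGKey(0), (visual_states.shape[-1], marian.config.d_model))
encoder_states = visual_states @ proj                                # (1, 50, 512)

# 3. The Marian decoder cross-attends to the visual states passed in as `encoder_outputs`.
decoder_input_ids = jnp.full((1, 1), marian.config.decoder_start_token_id, dtype="i4")
outputs = marian.decode(decoder_input_ids, encoder_outputs=(encoder_states,))
print(outputs.logits.shape)                                          # (1, 1, vocab_size)
```

The released model presumably packages this wiring, with a trained projection, behind the single `from_pretrained` + `generate` call shown in the usage snippet of the diff.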
 
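The "Data Cleaning 🧹" section of the README describes dropping pairs whose URLs are invalid or whose images are corrupt or broken. Below is a minimal sketch of that kind of filter, assuming a TSV input with `image_url` and `caption` columns as in the original Conceptual 12M release; it is an illustration, not the authors' actual cleaning script.

```python
import csv
import io

import requests
from PIL import Image


def is_valid_pair(image_url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL resolves and its bytes decode as an image."""
    try:
        resp = requests.get(image_url, timeout=timeout)
        resp.raise_for_status()                        # rejects dead or invalid URLs
        Image.open(io.BytesIO(resp.content)).verify()  # raises on corrupt or broken files
        return True
    except Exception:
        return False


def clean_pairs(in_tsv: str, out_tsv: str) -> None:
    """Keep only the image-caption pairs that pass the validity check."""
    with open(in_tsv, newline="", encoding="utf-8") as fin, \
         open(out_tsv, "w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin, delimiter="\t")
        writer = csv.DictWriter(fout, fieldnames=["image_url", "caption"], delimiter="\t")
        writer.writeheader()
        for row in reader:
            if is_valid_pair(row["image_url"]):
                writer.writerow({"image_url": row["image_url"], "caption": row["caption"]})
```

At the scale of 12M URLs one would parallelize the requests, but the per-example check stays the same.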