gchhablani committed "Update README.md"
Commit: 7a6097b
Parent(s): 6929d06

README.md CHANGED
@@ -1,12 +1,12 @@
 # CLIP-Vision-Marian Seq2Seq Encoder-Decoder Model

-Pretrained CLIP-Vision-Marian pre-trained on subset of Spanish
+Pretrained CLIP-Vision-Marian pre-trained on a subset of Spanish-translated Conceptual-12M image-text pairs using a seq2seq model training objective. 2.5M cleaned English image-text pairs are translated using Spanish Marian Model. We trained CLIP-Vision-Marian model during community week hosted by Huggingface 🤗 using JAX/Flax.

 ## Model description
-CLIP-Vision-Marian is a modified
+CLIP-Vision-Marian is a modified transformers model which takes in visual embeddings from CLIP-Vision transformer and feeds into the `encoder_hidden_states` of a Marian decoder. This is done for deep cross-modal interaction via `cross-attention` between the two modes. The decoder then predicts logits for the `input_ids` provided and can be used for generation.

 ## Intended uses & limitations❗️
-You can use the raw model for encoder
+You can use the raw model for encoder-decoder network where you want the encoder to encode images and the decoder to decode text.

 Note that this model is primarily aimed at being fine-tuned on tasks like Spanish image captioning.

@@ -25,7 +25,7 @@ img = wget.download("https://huggingface.co/streamlitiframe/flax-community/spani
 >>> clip_outputs = clip_processor(images=img)
 >>> clip_outputs['pixel_values'][0] = clip_outputs['pixel_values'][0].transpose(1,2,0) # Need to transpose images as model expected channel last images.
 >>> tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-es')
->>> model = FlaxCLIPVisionMarianMT.from_pretrained('flax-community/clip-vit-base-patch32_marian')
+>>> model = FlaxCLIPVisionMarianMT.from_pretrained('flax-community/clip-vit-base-patch32_marian-es')
 >>> output_ids = model.generate(batch["pixel_values"], early_stopping=True, num_beams=4, max_length=64).sequences
 >>> output_string = tokenizer.batch_decode(output_ids.reshape(-1, 64), skip_special_tokens=True, max_length=64)
 >>> output_string
@@ -39,20 +39,20 @@ The Spanish image captioning model was trained on a subset of Conceptual 12M dat
 [Conceptual 12M](https://github.com/google-research-datasets/conceptual-12m), Introduced by Changpinyo et al. in [Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts](https://arxiv.org/abs/2102.08981).

 ### Please update the dataset link here
-The translated dataset can be downloaded from [conceptual-12m-multilingual-marian](https://huggingface.co/datasets/flax-community/conceptual-12m-multilingual-marian
+The translated dataset can be downloaded from [conceptual-12m-multilingual-marian](https://huggingface.co/datasets/flax-community/conceptual-12m-multilingual-marian). We do not provide images as we do not own any of them. One can download images from the `image_url` section of the original Conceptual 12M dataset.

 ## Data Cleaning 🧹
 Though the original dataset contains 12M image-text pairs, a lot of the URLs are invalid now, and in some cases, images are corrupt or broken. We remove such examples from our data, which leaves us with approximately 10M image-text pairs, out of which we took only 2.5M image, caption pairs.

 #### **Train set:**
 Total data: <br>
-
-
+2475000 captions <br>
+2475000 images <br>

 #### **Validation set**
 Total data: <br>
 25000 captions <br>
-25000 images <br
+25000 images <br>


 ## Training procedure 👨🏻‍💻
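The model description added in this commit says that visual embeddings from the CLIP-Vision transformer are fed to the Marian decoder as `encoder_hidden_states` and consumed through cross-attention. The block below is only a minimal, self-contained sketch of that wiring, not the project's actual implementation; the decoder width (512), the number of heads, and the 768-dim CLIP feature size are illustrative assumptions.

```python
import jax
import jax.numpy as jnp
import flax.linen as nn

class VisualCrossAttentionBlock(nn.Module):
    d_model: int = 512    # Marian decoder width (illustrative assumption)
    num_heads: int = 8

    @nn.compact
    def __call__(self, decoder_states, visual_embeddings):
        # Project CLIP-Vision patch embeddings (e.g. 768-dim for ViT-B/32) to the
        # decoder width so they can play the role of `encoder_hidden_states`.
        encoder_hidden_states = nn.Dense(self.d_model)(visual_embeddings)
        # Cross-attention: queries come from the text decoder, keys/values from the image.
        attended = nn.MultiHeadDotProductAttention(num_heads=self.num_heads)(
            decoder_states, encoder_hidden_states
        )
        return nn.LayerNorm()(decoder_states + attended)

# Toy shapes: batch of 2, 10 target tokens, 50 image patches with 768-dim CLIP features.
block = VisualCrossAttentionBlock()
decoder_states = jnp.zeros((2, 10, 512))
visual_embeddings = jnp.zeros((2, 50, 768))
params = block.init(jax.random.PRNGKey(0), decoder_states, visual_embeddings)
out = block.apply(params, decoder_states, visual_embeddings)  # -> (2, 10, 512)
```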
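The usage hunk in the diff shows only the middle of the README's example (the `wget` URL is truncated in the hunk header, and the imports and `batch` construction fall outside the diff). The sketch below fills in the surrounding flow under stated assumptions: the import path for `FlaxCLIPVisionMarianMT`, the `openai/clip-vit-base-patch32` processor checkpoint, and the local image file are guesses, not taken from the original README.

```python
import numpy as np
from PIL import Image
from transformers import CLIPProcessor, MarianTokenizer

# FlaxCLIPVisionMarianMT is the project's custom class; this module path is an assumption.
from clip_vision_marian.modeling_clip_vision_marian import FlaxCLIPVisionMarianMT

clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")  # assumed checkpoint
tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")
model = FlaxCLIPVisionMarianMT.from_pretrained("flax-community/clip-vit-base-patch32_marian-es")

img = Image.open("example.jpg").convert("RGB")   # any RGB image stands in for the wget download
clip_outputs = clip_processor(images=img)
# The model expects channel-last images, so transpose each (C, H, W) array to (H, W, C).
pixel_values = np.stack([x.transpose(1, 2, 0) for x in clip_outputs["pixel_values"]])

output_ids = model.generate(pixel_values, early_stopping=True, num_beams=4, max_length=64).sequences
captions = tokenizer.batch_decode(output_ids.reshape(-1, 64), skip_special_tokens=True, max_length=64)
print(captions)
```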
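The dataset note added in this commit explains that images are not redistributed and must be fetched from the `image_url` field of the original Conceptual 12M data. A minimal sketch of that step follows, with the error handling the Data Cleaning section implies (many URLs are dead or return corrupt files); the helper name is hypothetical.

```python
import io
import requests
from PIL import Image

def fetch_image(url, timeout=10.0):
    """Download one image referenced by an `image_url` entry.

    Returns None if the URL is dead, times out, or does not point at a decodable image,
    so broken examples can simply be skipped during dataset preparation.
    """
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
        return Image.open(io.BytesIO(resp.content)).convert("RGB")
    except Exception:
        return None
```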