medmac01
committed
Commit: 3bd5293
Parent(s): 9f5fbfb
Added multilingual_clip module
This view is limited to 50 files because it contains too many changes.
See raw diff
- .DS_Store +0 -0
- Multilingual_CLIP/HISTORY.md +39 -0
- Multilingual_CLIP/Images/Multilingual-CLIP.png +0 -0
- Multilingual_CLIP/Images/Orange Apple.png +0 -0
- Multilingual_CLIP/Images/Smile.jpg +0 -0
- Multilingual_CLIP/Images/bananas.jpg +0 -0
- Multilingual_CLIP/Images/fruit bowl.jpg +0 -0
- Multilingual_CLIP/Images/green apple.jpg +0 -0
- Multilingual_CLIP/Images/happy person.jpg +0 -0
- Multilingual_CLIP/Images/man on bike.jpg +0 -0
- Multilingual_CLIP/Images/purple apple.png +0 -0
- Multilingual_CLIP/Images/red apple.jpg +0 -0
- Multilingual_CLIP/Images/sad.jpg +0 -0
- Multilingual_CLIP/LICENSE +21 -0
- Multilingual_CLIP/Makefile +3 -0
- Multilingual_CLIP/Model Cards/M-BERT Base 69/Fine-Tune-Languages.md +42 -0
- Multilingual_CLIP/Model Cards/M-BERT Base 69/Images/French-Both.png +0 -0
- Multilingual_CLIP/Model Cards/M-BERT Base 69/Images/German-Both.png +0 -0
- Multilingual_CLIP/Model Cards/M-BERT Base 69/Images/Greek-Both.png +0 -0
- Multilingual_CLIP/Model Cards/M-BERT Base 69/Images/Kannada-Both.png +0 -0
- Multilingual_CLIP/Model Cards/M-BERT Base 69/Images/M-Swedish-Both.png +0 -0
- Multilingual_CLIP/Model Cards/M-BERT Base 69/Images/Russian-Both.png +0 -0
- Multilingual_CLIP/Model Cards/M-BERT Base 69/Images/Spanish-Both.png +0 -0
- Multilingual_CLIP/Model Cards/M-BERT Base 69/README.md +74 -0
- Multilingual_CLIP/Model Cards/M-BERT Base ViT-B/Fine-Tune-Languages.md +42 -0
- Multilingual_CLIP/Model Cards/M-BERT Base ViT-B/Images/French-Both.png +0 -0
- Multilingual_CLIP/Model Cards/M-BERT Base ViT-B/Images/German-Both.png +0 -0
- Multilingual_CLIP/Model Cards/M-BERT Base ViT-B/Images/Greek-Both.png +0 -0
- Multilingual_CLIP/Model Cards/M-BERT Base ViT-B/Images/Kannada-Both.png +0 -0
- Multilingual_CLIP/Model Cards/M-BERT Base ViT-B/Images/M-Swedish-Both.png +0 -0
- Multilingual_CLIP/Model Cards/M-BERT Base ViT-B/Images/Russian-Both.png +0 -0
- Multilingual_CLIP/Model Cards/M-BERT Base ViT-B/Images/Spanish-Both.png +0 -0
- Multilingual_CLIP/Model Cards/M-BERT Base ViT-B/README.md +74 -0
- Multilingual_CLIP/Model Cards/M-BERT Distil 40/Fine-Tune-Languages.md +42 -0
- Multilingual_CLIP/Model Cards/M-BERT Distil 40/Images/French-Both.png +0 -0
- Multilingual_CLIP/Model Cards/M-BERT Distil 40/Images/German-Both.png +0 -0
- Multilingual_CLIP/Model Cards/M-BERT Distil 40/Images/Greek-Both.png +0 -0
- Multilingual_CLIP/Model Cards/M-BERT Distil 40/Images/Kannada-Both.png +0 -0
- Multilingual_CLIP/Model Cards/M-BERT Distil 40/Images/M-Swedish-Both.png +0 -0
- Multilingual_CLIP/Model Cards/M-BERT Distil 40/Images/Russian-Both.png +0 -0
- Multilingual_CLIP/Model Cards/M-BERT Distil 40/Images/Spanish-Both.png +0 -0
- Multilingual_CLIP/Model Cards/M-BERT Distil 40/README.md +72 -0
- Multilingual_CLIP/Model Cards/Swe-CLIP 2M/README.md +29 -0
- Multilingual_CLIP/Model Cards/Swe-CLIP 500k/README.md +29 -0
- Multilingual_CLIP/Multilingual_CLIP.ipynb +0 -0
- Multilingual_CLIP/README.md +236 -0
- Multilingual_CLIP/inference_example.py +34 -0
- Multilingual_CLIP/larger_mclip.md +60 -0
- Multilingual_CLIP/legacy_get-weights.sh +20 -0
- Multilingual_CLIP/legacy_inference.py +13 -0
.DS_Store
ADDED
Binary file (6.15 kB)
Multilingual_CLIP/HISTORY.md
ADDED
@@ -0,0 +1,39 @@
## 1.0.10

* it works

## 1.0.8

* small fix

## 1.0.7

* small fix

## 1.0.6

* small fix

## 1.0.5

* small fix

## 1.0.4

* small fix

## 1.0.3

* rename all mentions to multilingual_clip

## 1.0.2

* Multilingual-clip

## 1.0.1

* name it m-clip

## 1.0.0

* first pypi release of multilingual_clip
Multilingual_CLIP/Images/Multilingual-CLIP.png
ADDED
Multilingual_CLIP/Images/Orange Apple.png
ADDED
Multilingual_CLIP/Images/Smile.jpg
ADDED
Multilingual_CLIP/Images/bananas.jpg
ADDED
Multilingual_CLIP/Images/fruit bowl.jpg
ADDED
Multilingual_CLIP/Images/green apple.jpg
ADDED
Multilingual_CLIP/Images/happy person.jpg
ADDED
Multilingual_CLIP/Images/man on bike.jpg
ADDED
Multilingual_CLIP/Images/purple apple.png
ADDED
Multilingual_CLIP/Images/red apple.jpg
ADDED
Multilingual_CLIP/Images/sad.jpg
ADDED
Multilingual_CLIP/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 Fredrik Carlsson

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Multilingual_CLIP/Makefile
ADDED
@@ -0,0 +1,3 @@
install: ## [Local development] Upgrade pip, install requirements, install package.
	python -m pip install -U pip
	python -m pip install -e .
Multilingual_CLIP/Model Cards/M-BERT Base 69/Fine-Tune-Languages.md
ADDED
@@ -0,0 +1,42 @@
### List of languages included during CLIP fine-tuning

* Albanian
* Amharic
* Arabic
* Azerbaijani
* Bengali
* Bulgarian
* Catalan
* Chinese (Simplified)
* Chinese (Traditional)
* Dutch
* English
* Estonian
* Farsi
* French
* Georgian
* German
* Greek
* Hindi
* Hungarian
* Icelandic
* Indonesian
* Italian
* Japanese
* Kazakh
* Korean
* Latvian
* Macedonian
* Malay
* Pashto
* Polish
* Romanian
* Russian
* Slovenian
* Spanish
* Swedish
* Tagalog
* Thai
* Turkish
* Urdu
* Vietnamese
Multilingual_CLIP/Model Cards/M-BERT Base 69/Images/French-Both.png
ADDED
Multilingual_CLIP/Model Cards/M-BERT Base 69/Images/German-Both.png
ADDED
Multilingual_CLIP/Model Cards/M-BERT Base 69/Images/Greek-Both.png
ADDED
Multilingual_CLIP/Model Cards/M-BERT Base 69/Images/Kannada-Both.png
ADDED
Multilingual_CLIP/Model Cards/M-BERT Base 69/Images/M-Swedish-Both.png
ADDED
Multilingual_CLIP/Model Cards/M-BERT Base 69/Images/Russian-Both.png
ADDED
Multilingual_CLIP/Model Cards/M-BERT Base 69/Images/Spanish-Both.png
ADDED
Multilingual_CLIP/Model Cards/M-BERT Base 69/README.md
ADDED
@@ -0,0 +1,74 @@
<br />
<p align="center">
  <h1 align="center">M-BERT Base 69</h1>

  <p align="center">
    <a href="https://huggingface.co/M-CLIP/M-BERT-Base-69">Huggingface Model</a>
    ·
    <a href="https://huggingface.co/bert-base-multilingual-cased">Huggingface Base Model</a>
  </p>
</p>

## Usage
To use this model along with the original CLIP vision encoder, follow the [main page usage instructions](https://github.com/FreddeFrallan/Multilingual-CLIP) to download the additional linear weights.
Once this is done, you can load and use the model with the following code:
```python
from multilingual_clip import multilingual_clip

model = multilingual_clip.load_model('M-BERT-Base-69')
embeddings = model(['Älgen är skogens konung!', 'Wie leben Eisbären in der Antarktis?', 'Вы знали, что все белые медведи левши?'])
print(embeddings.shape)
# Yields: torch.Size([3, 640])
```

<!-- ABOUT THE PROJECT -->
## About
A [bert-base-multilingual](https://huggingface.co/bert-base-multilingual-cased) tuned to match the embedding space for 69 languages to the embedding space of the CLIP text encoder that accompanies the Res50x4 vision encoder. <br>
A full list of the 100 languages used during pre-training can be found [here](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages), and a list of the 69 languages used during fine-tuning can be found in [Fine-Tune-Languages.md](Fine-Tune-Languages.md).

Training data pairs were generated by sampling 40k sentences per language from the combined descriptions of [GCC](https://ai.google.com/research/ConceptualCaptions/) + [MSCOCO](https://cocodataset.org/#home) + [VizWiz](https://vizwiz.org/tasks-and-datasets/image-captioning/) and translating them into the corresponding language.
All translation was done using the [AWS translate service](https://aws.amazon.com/translate/). The quality of these translations has not yet been analyzed, but one can assume it varies between the languages.

<!---
## Evaluation
A non-rigorous qualitative evaluation shows that for French, German, Spanish, Russian, Swedish and Greek the model seemingly yields respectable results for most instances. The exception being that Greeks are apparently unable to recognize happy persons. <br>
When testing on Kannada, a language which was included during pre-training but not fine-tuning, it performed close to random.

<!---
The qualitative test was organized into two sets of images and their corresponding text descriptions. The texts were manually translated into each test language, where the two sets include the following images:
#### Set Nr 1
* A man on a motorcycle
* A green apple
* A bowl of fruits
* A bunch of bananas hanging from a tree
* A happy person laughing/smiling
* A sad person crying
#### Set Nr 2
The second set included only images of fruits, and non-realistic photoshopped images, in an attempt to increase the difficulty.
* A green apple
* A red apple
* A purple apple (photoshopped)
* An orange apple (photoshopped)
* A bowl of fruits
* A bunch of bananas hanging from a tree

<!---
### Results
The results depicted below are formatted so that each <b>column</b> represents the Softmax prediction over all the texts given the corresponding image. The images and matching texts are ordered identically, hence a perfect solution would have 100 across the diagonal.

<!---
#### French
![Alt](Images/French-Both.png)
#### German
![Alt](Images/German-Both.png)
#### Spanish
![Alt](Images/Spanish-Both.png)
#### Russian
![Alt](Images/Russian-Both.png)
#### Swedish
![Alt](Images/M-Swedish-Both.png)
#### Greek
![Alt](Images/Greek-Both.png)
#### Kannada
Kannada was <b>not included</b> in the fine-tuning languages, but was included during language modelling pre-training.
![Alt](Images/Kannada-Both.png)
Multilingual_CLIP/Model Cards/M-BERT Base ViT-B/Fine-Tune-Languages.md
ADDED
@@ -0,0 +1,42 @@
### List of languages included during CLIP fine-tuning

* Albanian
* Amharic
* Arabic
* Azerbaijani
* Bengali
* Bulgarian
* Catalan
* Chinese (Simplified)
* Chinese (Traditional)
* Dutch
* English
* Estonian
* Farsi
* French
* Georgian
* German
* Greek
* Hindi
* Hungarian
* Icelandic
* Indonesian
* Italian
* Japanese
* Kazakh
* Korean
* Latvian
* Macedonian
* Malay
* Pashto
* Polish
* Romanian
* Russian
* Slovenian
* Spanish
* Swedish
* Tagalog
* Thai
* Turkish
* Urdu
* Vietnamese
Multilingual_CLIP/Model Cards/M-BERT Base ViT-B/Images/French-Both.png
ADDED
Multilingual_CLIP/Model Cards/M-BERT Base ViT-B/Images/German-Both.png
ADDED
Multilingual_CLIP/Model Cards/M-BERT Base ViT-B/Images/Greek-Both.png
ADDED
Multilingual_CLIP/Model Cards/M-BERT Base ViT-B/Images/Kannada-Both.png
ADDED
Multilingual_CLIP/Model Cards/M-BERT Base ViT-B/Images/M-Swedish-Both.png
ADDED
Multilingual_CLIP/Model Cards/M-BERT Base ViT-B/Images/Russian-Both.png
ADDED
Multilingual_CLIP/Model Cards/M-BERT Base ViT-B/Images/Spanish-Both.png
ADDED
Multilingual_CLIP/Model Cards/M-BERT Base ViT-B/README.md
ADDED
@@ -0,0 +1,74 @@
<br />
<p align="center">
  <h1 align="center">M-BERT Base ViT-B</h1>

  <p align="center">
    <a href="https://huggingface.co/M-CLIP/M-BERT-Base-ViT-B">Huggingface Model</a>
    ·
    <a href="https://huggingface.co/bert-base-multilingual-cased">Huggingface Base Model</a>
  </p>
</p>

## Usage
To use this model along with the original CLIP vision encoder, follow the [main page usage instructions](https://github.com/FreddeFrallan/Multilingual-CLIP) to download the additional linear weights.
Once this is done, you can load and use the model with the following code:
```python
from multilingual_clip import multilingual_clip

model = multilingual_clip.load_model('M-BERT-Base-ViT-B')
embeddings = model(['Älgen är skogens konung!', 'Wie leben Eisbären in der Antarktis?', 'Вы знали, что все белые медведи левши?'])
print(embeddings.shape)
# Yields: torch.Size([3, 640])
```

<!-- ABOUT THE PROJECT -->
## About
A [bert-base-multilingual](https://huggingface.co/bert-base-multilingual-cased) tuned to match the embedding space for 69 languages to the embedding space of the CLIP text encoder which accompanies the Res50x4 vision encoder. <br>
A full list of the 100 languages used during pre-training can be found [here](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages), and a list of the 69 languages used during fine-tuning can be found in [Fine-Tune-Languages.md](Fine-Tune-Languages.md).

Training data pairs were generated by sampling 40k sentences per language from the combined descriptions of [GCC](https://ai.google.com/research/ConceptualCaptions/) + [MSCOCO](https://cocodataset.org/#home) + [VizWiz](https://vizwiz.org/tasks-and-datasets/image-captioning/) and translating them into the corresponding language.
All translation was done using the [AWS translate service](https://aws.amazon.com/translate/). The quality of these translations has not yet been analyzed, but one can assume it varies between the languages.

<!---
## Evaluation
A non-rigorous qualitative evaluation shows that for French, German, Spanish, Russian, Swedish and Greek the model seemingly yields respectable results for most instances. The exception being that Greeks are apparently unable to recognize happy persons. <br>
When testing on Kannada, a language which was included during pre-training but not fine-tuning, it performed close to random.

<!---
The qualitative test was organized into two sets of images and their corresponding text descriptions. The texts were manually translated into each test language, where the two sets include the following images:
#### Set Nr 1
* A man on a motorcycle
* A green apple
* A bowl of fruits
* A bunch of bananas hanging from a tree
* A happy person laughing/smiling
* A sad person crying
#### Set Nr 2
The second set included only images of fruits, and non-realistic photoshopped images, in an attempt to increase the difficulty.
* A green apple
* A red apple
* A purple apple (photoshopped)
* An orange apple (photoshopped)
* A bowl of fruits
* A bunch of bananas hanging from a tree

<!---
### Results
The results depicted below are formatted so that each <b>column</b> represents the Softmax prediction over all the texts given the corresponding image. The images and matching texts are ordered identically, hence a perfect solution would have 100 across the diagonal.

<!---
#### French
![Alt](Images/French-Both.png)
#### German
![Alt](Images/German-Both.png)
#### Spanish
![Alt](Images/Spanish-Both.png)
#### Russian
![Alt](Images/Russian-Both.png)
#### Swedish
![Alt](Images/M-Swedish-Both.png)
#### Greek
![Alt](Images/Greek-Both.png)
#### Kannada
Kannada was <b>not included</b> in the fine-tuning languages, but was included during language modelling pre-training.
![Alt](Images/Kannada-Both.png)
Multilingual_CLIP/Model Cards/M-BERT Distil 40/Fine-Tune-Languages.md
ADDED
@@ -0,0 +1,42 @@
### List of languages included during CLIP fine-tuning

* Albanian
* Amharic
* Arabic
* Azerbaijani
* Bengali
* Bulgarian
* Catalan
* Chinese (Simplified)
* Chinese (Traditional)
* Dutch
* English
* Estonian
* Farsi
* French
* Georgian
* German
* Greek
* Hindi
* Hungarian
* Icelandic
* Indonesian
* Italian
* Japanese
* Kazakh
* Korean
* Latvian
* Macedonian
* Malay
* Pashto
* Polish
* Romanian
* Russian
* Slovenian
* Spanish
* Swedish
* Tagalog
* Thai
* Turkish
* Urdu
* Vietnamese
Multilingual_CLIP/Model Cards/M-BERT Distil 40/Images/French-Both.png
ADDED
Multilingual_CLIP/Model Cards/M-BERT Distil 40/Images/German-Both.png
ADDED
Multilingual_CLIP/Model Cards/M-BERT Distil 40/Images/Greek-Both.png
ADDED
Multilingual_CLIP/Model Cards/M-BERT Distil 40/Images/Kannada-Both.png
ADDED
Multilingual_CLIP/Model Cards/M-BERT Distil 40/Images/M-Swedish-Both.png
ADDED
Multilingual_CLIP/Model Cards/M-BERT Distil 40/Images/Russian-Both.png
ADDED
Multilingual_CLIP/Model Cards/M-BERT Distil 40/Images/Spanish-Both.png
ADDED
Multilingual_CLIP/Model Cards/M-BERT Distil 40/README.md
ADDED
@@ -0,0 +1,72 @@
<br />
<p align="center">
  <h1 align="center">M-BERT Distil 40</h1>

  <p align="center">
    <a href="https://huggingface.co/M-CLIP/M-BERT-Distil-40">Huggingface Model</a>
    ·
    <a href="https://huggingface.co/distilbert-base-multilingual-cased">Huggingface Base Model</a>
  </p>
</p>

## Usage
To use this model along with the original CLIP vision encoder, follow the [main page usage instructions](https://github.com/FreddeFrallan/Multilingual-CLIP) to download the additional linear weights.
Once this is done, you can load and use the model with the following code:
```python
from multilingual_clip import multilingual_clip

model = multilingual_clip.load_model('M-BERT-Distil-40')
embeddings = model(['Älgen är skogens konung!', 'Wie leben Eisbären in der Antarktis?', 'Вы знали, что все белые медведи левши?'])
print(embeddings.shape)
# Yields: torch.Size([3, 640])
```

<!-- ABOUT THE PROJECT -->
## About
A [distilbert-base-multilingual](https://huggingface.co/distilbert-base-multilingual-cased) tuned to match the embedding space for 40 languages to the embedding space of the CLIP text encoder that accompanies the Res50x4 vision encoder. <br>
A full list of the 100 languages used during pre-training can be found [here](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages), and a list of the 40 languages used during fine-tuning can be found in [Fine-Tune-Languages.md](Fine-Tune-Languages.md).

Training data pairs were generated by sampling 40k sentences per language from the combined descriptions of [GCC](https://ai.google.com/research/ConceptualCaptions/) + [MSCOCO](https://cocodataset.org/#home) + [VizWiz](https://vizwiz.org/tasks-and-datasets/image-captioning/) and translating them into the corresponding language.
All translation was done using the [AWS translate service](https://aws.amazon.com/translate/). The quality of these translations has not yet been analyzed, but one can assume it varies between the 40 languages.


## Evaluation
A non-rigorous qualitative evaluation shows that for French, German, Spanish, Russian, Swedish and Greek the model seemingly yields respectable results for most instances. The exception being that Greeks are apparently unable to recognize happy persons. <br>
When testing on Kannada, a language which was included during pre-training but not fine-tuning, it performed close to random.

The qualitative test was organized into two sets of images and their corresponding text descriptions. The texts were manually translated into each test language, where the two sets include the following images:
#### Set Nr 1
* A man on a motorcycle
* A green apple
* A bowl of fruits
* A bunch of bananas hanging from a tree
* A happy person laughing/smiling
* A sad person crying
#### Set Nr 2
The second set included only images of fruits, and non-realistic photoshopped images, in an attempt to increase the difficulty.
* A green apple
* A red apple
* A purple apple (photoshopped)
* An orange apple (photoshopped)
* A bowl of fruits
* A bunch of bananas hanging from a tree

### Results
The results depicted below are formatted so that each <b>column</b> represents the Softmax prediction over all the texts given the corresponding image. The images and matching texts are ordered identically, hence a perfect solution would have 100 across the diagonal. A rough sketch of how such a prediction can be computed is shown below.
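The following is a minimal sketch, not code from this repository, of how one column of these tables can be reproduced: embed the translated captions with M-BERT Distil 40, embed an image with the original CLIP RN50x4 vision encoder, and softmax the similarities over the candidate texts. The example captions and image paths are purely illustrative, and it assumes the legacy linear weights and the `clip` package have been installed as described on the main page.

```python
# Sketch: softmax over candidate texts for one image, using M-CLIP text embeddings
# together with the original CLIP RN50x4 image encoder.
import clip
import torch
from PIL import Image
from multilingual_clip import multilingual_clip

device = 'cuda' if torch.cuda.is_available() else 'cpu'
clip_model, preprocess = clip.load('RN50x4', device=device)    # vision encoder matching the 640-dim text space
text_model = multilingual_clip.load_model('M-BERT-Distil-40')  # multilingual text encoder

texts = ['Une pomme verte', 'Une pomme rouge']                  # illustrative French captions
image_paths = ['Images/green apple.jpg', 'Images/red apple.jpg']  # illustrative paths

with torch.no_grad():
    txt = text_model(texts).float()
    txt = txt / txt.norm(dim=-1, keepdim=True)
    for path in image_paths:
        img = preprocess(Image.open(path)).unsqueeze(0).to(device)
        img_emb = clip_model.encode_image(img).float().cpu()
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        probs = (100.0 * img_emb @ txt.T).softmax(dim=-1)       # one column of the tables below
        print(path, probs.tolist())
```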
#### French
![Alt](Images/French-Both.png)
#### German
![Alt](Images/German-Both.png)
#### Spanish
![Alt](Images/Spanish-Both.png)
#### Russian
![Alt](Images/Russian-Both.png)
#### Swedish
![Alt](Images/M-Swedish-Both.png)
#### Greek
![Alt](Images/Greek-Both.png)
#### Kannada
Kannada was <b>not included</b> in the 40 fine-tuning languages, but was included during language modelling pre-training.
![Alt](Images/Kannada-Both.png)
Multilingual_CLIP/Model Cards/Swe-CLIP 2M/README.md
ADDED
@@ -0,0 +1,29 @@
<br />
<p align="center">
  <h1 align="center">Swe-CLIP 2M</h1>

  <p align="center">
    <a href="https://huggingface.co/M-CLIP/Swedish-2M">Huggingface Model</a>
    ·
    <a href="https://huggingface.co/KB/bert-base-swedish-cased">Huggingface Base Model</a>
  </p>
</p>

## Usage
To use this model along with the original CLIP vision encoder, follow the [main page usage instructions](https://github.com/FreddeFrallan/Multilingual-CLIP) to download the additional linear weights.
Once this is done, you can load and use the model with the following code:
```python
from multilingual_clip import multilingual_clip

model = multilingual_clip.load_model('Swe-CLIP-2M')
embeddings = model(['Älgen är skogens konung!', 'Alla isbjörnar är vänsterhänta'])
print(embeddings.shape)
# Yields: torch.Size([2, 640])
```

<!-- ABOUT THE PROJECT -->
## About
A [KB/Bert-Swedish-Cased](https://huggingface.co/KB/bert-base-swedish-cased) tuned to match the embedding space of the CLIP text encoder that accompanies the Res50x4 vision encoder. <br>

Training data pairs were generated by sampling 2 million sentences from the combined descriptions of [GCC](https://ai.google.com/research/ConceptualCaptions/) + [MSCOCO](https://cocodataset.org/#home) + [VizWiz](https://vizwiz.org/tasks-and-datasets/image-captioning/) and translating them into Swedish.
All translation was done using the [Huggingface Opus Model](https://huggingface.co/Helsinki-NLP/opus-mt-en-sv), which seemingly produces higher-quality translations than relying on the [AWS translate service](https://aws.amazon.com/translate/).
Multilingual_CLIP/Model Cards/Swe-CLIP 500k/README.md
ADDED
@@ -0,0 +1,29 @@
<br />
<p align="center">
  <h1 align="center">Swe-CLIP 500k</h1>

  <p align="center">
    <a href="https://huggingface.co/M-CLIP/Swedish-500k">Huggingface Model</a>
    ·
    <a href="https://huggingface.co/KB/bert-base-swedish-cased">Huggingface Base Model</a>
  </p>
</p>

## Usage
To use this model along with the original CLIP vision encoder, follow the [main page usage instructions](https://github.com/FreddeFrallan/Multilingual-CLIP) to download the additional linear weights.
Once this is done, you can load and use the model with the following code:
```python
from multilingual_clip import multilingual_clip

model = multilingual_clip.load_model('Swe-CLIP-500k')
embeddings = model(['Älgen är skogens konung!', 'Alla isbjörnar är vänsterhänta'])
print(embeddings.shape)
# Yields: torch.Size([2, 640])
```

<!-- ABOUT THE PROJECT -->
## About
A [KB/Bert-Swedish-Cased](https://huggingface.co/KB/bert-base-swedish-cased) tuned to match the embedding space of the CLIP text encoder that accompanies the Res50x4 vision encoder. <br>

Training data pairs were generated by sampling 500k sentences from the combined descriptions of [GCC](https://ai.google.com/research/ConceptualCaptions/) + [MSCOCO](https://cocodataset.org/#home) + [VizWiz](https://vizwiz.org/tasks-and-datasets/image-captioning/) and translating them into Swedish.
All translation was done using the [Huggingface Opus Model](https://huggingface.co/Helsinki-NLP/opus-mt-en-sv), which seemingly produces higher-quality translations than relying on the [AWS translate service](https://aws.amazon.com/translate/).
Multilingual_CLIP/Multilingual_CLIP.ipynb
ADDED
The diff for this file is too large to render.
See raw diff
Multilingual_CLIP/README.md
ADDED
@@ -0,0 +1,236 @@
<br />
<p align="center">
  <h1 align="center">Multilingual-CLIP</h1>
  <h3 align="center">OpenAI CLIP text encoders for any language</h3>

  <p align="center">
    <a href="https://rom1504.github.io/clip-retrieval/?back=https%3A%2F%2Fknn5.laion.ai&index=laion_400m&useMclip=true">Live Demo</a>
    ·
    <a href="https://huggingface.co/M-CLIP">Pre-trained Models</a>
    ·
    <a href="https://github.com/FreddeFrallan/Contrastive-Tension/issues">Report Bug</a>
  </p>
</p>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FreddeFrallan/Multilingual-CLIP/blob/master/Multilingual_CLIP.ipynb)
[![pypi](https://img.shields.io/pypi/v/multilingual-clip.svg)](https://pypi.python.org/pypi/multilingual-clip)


<!-- ABOUT THE PROJECT -->
## Overview
![Alt text](Images/Multilingual-CLIP.png?raw=true "Title")

[OpenAI](https://openai.com/) recently released the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020), in which they present the CLIP (Contrastive Language–Image Pre-training) model. This model is trained to connect text and images by matching their corresponding vector representations using a contrastive learning objective.
CLIP consists of two separate models, a visual encoder and a text encoder. These were trained on a whopping 400 million images and corresponding captions.
OpenAI has since released a set of their smaller CLIP models, which can be found on the [official CLIP GitHub](https://github.com/openai/CLIP).

## Demo
A live demonstration of multilingual text-image retrieval using M-CLIP can be found [here!](https://rom1504.github.io/clip-retrieval/?back=https%3A%2F%2Fknn5.laion.ai&index=laion_400m&useMclip=true) This demo was created by [Rom1504](https://github.com/rom1504), and it allows you to search the LAION-400M dataset in various languages using M-CLIP.

#### This repository contains
* Pre-trained CLIP-Text encoders for multiple languages
* PyTorch & TensorFlow inference code
* TensorFlow training code

### Requirements
While it is possible that other versions work equally well, we have worked with the following:

* Python = 3.6.9
* Transformers = 4.8.1

## Install

`pip install multilingual-clip torch`

You can also choose to `pip install tensorflow` instead of torch.


## Inference Usage

Inference code for TensorFlow is also available in [inference_example.py](https://github.com/FreddeFrallan/Multilingual-CLIP/blob/main/inference_example.py).

```python
from multilingual_clip import pt_multilingual_clip
import transformers

texts = [
    'Three blind horses listening to Mozart.',
    'Älgen är skogens konung!',
    'Wie leben Eisbären in der Antarktis?',
    'Вы знали, что все белые медведи левши?'
]
model_name = 'M-CLIP/XLM-Roberta-Large-Vit-L-14'

# Load Model & Tokenizer
model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

embeddings = model.forward(texts, tokenizer)
print(embeddings.shape)
```

## Install for development

Setup a virtualenv:

```
python3 -m venv .env
source .env/bin/activate
pip install -e .
```

## Pre-trained Models
Every text encoder is a [Huggingface](https://huggingface.co/) available transformer, with an additional linear layer on top. For more information about a specific model, click the Model Name to see its model card.
<br>
<br>

| Name |Model Base|Vision Model | Vision Dimensions | Pre-trained Languages | #Parameters|
| ----------------------------------|:-----: |:-----: |:-----: |:-----: | :-----: |
| [LABSE Vit-L/14](https://huggingface.co/M-CLIP/LABSE-Vit-L-14)| [LaBSE](https://huggingface.co/sentence-transformers/LaBSE)| [OpenAI ViT-L/14](https://github.com/openai/CLIP) | 768 | [109 Languages](https://arxiv.org/pdf/2007.01852.pdf) | 110 M|
| [XLM-R Large Vit-B/32](https://huggingface.co/M-CLIP/XLM-Roberta-Large-Vit-B-32)| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large)| [OpenAI ViT-B/32](https://github.com/openai/CLIP) | 512 | [100 Languages](https://github.com/facebookresearch/fairseq/tree/main/examples/xlmr#Introduction) | 344 M|
| [XLM-R Large Vit-L/14](https://huggingface.co/M-CLIP/XLM-Roberta-Large-Vit-L-14)| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large)| [OpenAI ViT-L/14](https://github.com/openai/CLIP) | 768 | [100 Languages](https://github.com/facebookresearch/fairseq/tree/main/examples/xlmr#Introduction)| 344 M|
| [XLM-R Large Vit-B/16+](https://huggingface.co/M-CLIP/XLM-Roberta-Large-Vit-B-16Plus)| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large)| [Open CLIP ViT-B-16-plus-240](https://github.com/mlfoundations/open_clip) | 640 | [100 Languages](https://github.com/facebookresearch/fairseq/tree/main/examples/xlmr#Introduction)| 344 M|

### Validation & Training Curves
Following is a table of the <b>Txt2Img @10-Recall</b> for the humanly translated [MS-COCO testset](https://arxiv.org/abs/2109.07622).

| Name | En | De | Es | Fr | Zh | It | Pl | Ko | Ru | Tr | Jp |
| ----------------------------------|:-----: |:-----: |:-----: |:-----: | :-----: |:-----: |:-----: |:-----: |:-----: |:-----: |:-----: |
| [OpenAI CLIP Vit-B/32](https://github.com/openai/CLIP)| 90.3 | - | - | - | - | - | - | - | - | - | - |
| [OpenAI CLIP Vit-L/14](https://github.com/openai/CLIP)| 91.8 | - | - | - | - | - | - | - | - | - | - |
| [OpenCLIP ViT-B-16+-](https://github.com/openai/CLIP)| 94.3 | - | - | - | - | - | - | - | - | - | - |
| [LABSE Vit-L/14](https://huggingface.co/M-CLIP/LABSE-Vit-L-14)| 91.6 | 89.6 | 89.5 | 89.9 | 88.9 | 90.1 | 89.8 | 80.8 | 85.5 | 89.8 | 73.9 |
| [XLM-R Large Vit-B/32](https://huggingface.co/M-CLIP/XLM-Roberta-Large-Vit-B-32)| 91.8 | 88.7 | 89.1 | 89.4 | 89.3 | 89.8| 91.4 | 82.1 | 86.1 | 88.8 | 81.0 |
| [XLM-R Vit-L/14](https://huggingface.co/M-CLIP/XLM-Roberta-Large-Vit-L-14)| 92.4 | 90.6 | 91.0 | 90.0 | 89.7 | 91.1 | 91.3 | 85.2 | 85.8 | 90.3 | 81.9 |
| [XLM-R Large Vit-B/16+](https://huggingface.co/M-CLIP/XLM-Roberta-Large-Vit-B-16Plus)| <b>95.0</b> | <b>93.0</b> | <b>93.6</b> | <b>93.1</b> | <b>94.0</b> | <b>93.1</b> | <b>94.4</b> | <b>89.0</b> | <b>90.0</b> | <b>93.0</b> | <b>84.2</b> |

The training curves for these models are available at this [Weights and Biases Report](https://wandb.ai/freddefrallan/M-CLIP/reports/M-CLIP-2-6-2022--VmlldzoyMTE1MjU1/edit?firstReport&runsetFilter); the results for other non-successful and ongoing experiments can be found in the [Weights and Biases Project](https://wandb.ai/freddefrallan/M-CLIP?workspace=user-freddefrallan).

## Legacy Usage and Models
Older versions of M-CLIP had the linear weights stored separately from Huggingface, whilst the new models have them directly incorporated in the Huggingface repository. More information about these older models can be found in this section.

<details>
<summary>Click for more information</summary>

##### Download CLIP Model
```bash
$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git
```
Replace `cudatoolkit=11.0` above with the appropriate CUDA version on your machine or `cpuonly` when installing on a machine without a GPU.
For more information please see the official [CLIP repository](https://github.com/openai/CLIP).
##### Download Linear Weights
```bash
# Linear Model Weights
$ bash legacy_get-weights.sh
```

### Inference
```python
from multilingual_clip import multilingual_clip

print(multilingual_clip.AVAILABLE_MODELS.keys())

model = multilingual_clip.load_model('M-BERT-Distil-40')

embeddings = model(['Älgen är skogens konung!', 'Wie leben Eisbären in der Antarktis?', 'Вы знали, что все белые медведи левши?'])
print(embeddings.shape)
# Yields: torch.Size([3, 640])
```

<!--- For a more elaborative example see this [Google Colab](https://colab.research.google.com/github/FreddeFrallan/Multilingual-CLIP/blob/master/Multilingual_CLIP.ipynb). --->

For a more elaborate example, comparing the textual embeddings to the CLIP image embeddings, see this [colab notebook](https://colab.research.google.com/github/FreddeFrallan/Multilingual-CLIP/blob/master/Multilingual_CLIP.ipynb).

<!-- GETTING STARTED -->
## Legacy Pre-trained Models
Every text encoder is a [Huggingface](https://huggingface.co/) available transformer, with an additional linear layer on top. None of the models have been extensively tested, but for more information and qualitative test results for a specific model, click the Model Name to see its model card.
<br>
<br>
<b>*** Make sure to update to the most recent version of the repository when downloading a new model, and re-run the shell script to download the Linear Weights. ***</b>


| Name |Model Base|Vision Model | Pre-trained Languages | Target Languages | #Parameters|
| ----------------------------------|:-----: |:-----: |:-----: |:-----: |:-----: |
|**Multilingual** ||
| [M-BERT Distil 40](https://github.com/FreddeFrallan/Multilingual-CLIP/tree/main/Model%20Cards/M-BERT%20Distil%2040) | [M-BERT Distil](https://huggingface.co/bert-base-multilingual-uncased)| RN50x4 | [101 Languages](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages) | [40 Languages](https://github.com/FreddeFrallan/Multilingual-CLIP/blob/main/Model%20Cards/M-BERT%20Distil%2040/Fine-Tune-Languages.md) | 66 M|
| [M-BERT Base 69](https://github.com/FreddeFrallan/Multilingual-CLIP/tree/main/Model%20Cards/M-BERT%20Base%2069) | [M-BERT Base](https://huggingface.co/bert-base-multilingual-uncased)|RN50x4 | [101 Languages](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages) | 68 Languages | 110 M|
| [M-BERT Base ViT-B](https://github.com/FreddeFrallan/Multilingual-CLIP/tree/main/Model%20Cards/M-BERT%20Base%20ViT-B) | [M-BERT Base](https://huggingface.co/bert-base-multilingual-uncased)|ViT-B/32 | [101 Languages](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages) | 68 Languages | 110 M|
|**Monolingual** ||
|[Swe-CLIP 500k](https://github.com/FreddeFrallan/Multilingual-CLIP/tree/main/Model%20Cards/Swe-CLIP%20500k)| [KB-BERT](https://huggingface.co/KB/bert-base-swedish-cased)| RN50x4 | Swedish | Swedish | 110 M|
|[Swe-CLIP 2M](https://github.com/FreddeFrallan/Multilingual-CLIP/tree/main/Model%20Cards/Swe-CLIP%202M)| [KB-BERT](https://huggingface.co/KB/bert-base-swedish-cased)| RN50x4 | Swedish | Swedish | 110 M|

</details>

## Training a new model
[This folder](https://github.com/FreddeFrallan/Multilingual-CLIP/tree/main/multilingual_clip/TeacherLearning) contains the code used for training the above models. If you wish to train your own model, you must do the following things (a minimal sketch of the objective is shown after this list):

* Prepare a set of translated sentence pairs from English -> Your Language(s)
* Compute regular CLIP-Text embeddings for the English sentences.
* Edit [Training.py](https://github.com/FreddeFrallan/Multilingual-CLIP/blob/main/multilingual_clip/TeacherLearning/Training.py) to load your data.
* Train a new CLIP-Text encoder via Teacher Learning

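As a rough illustration of the teacher-learning objective, and not the repository's actual training code (which lives in the TeacherLearning folder and is TensorFlow-based), the core step can be sketched in PyTorch as below. The student model name, the toy batch, and the placeholder teacher targets are all hypothetical; in practice the targets are the pre-computed CLIP-Text embeddings of the original English captions.

```python
# Minimal sketch of cross-lingual teacher learning: regress the student's embedding of a
# translated caption onto the frozen CLIP-Text embedding of the English caption (MSE loss).
import torch
import transformers

student_name = 'xlm-roberta-large'                      # hypothetical choice of student encoder
tokenizer = transformers.AutoTokenizer.from_pretrained(student_name)
encoder = transformers.AutoModel.from_pretrained(student_name)
head = torch.nn.Linear(encoder.config.hidden_size, 768)  # project to the CLIP text dimension (e.g. 768 for ViT-L/14)

optimizer = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-5)
loss_fn = torch.nn.MSELoss()

def student_embed(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    hidden = encoder(**batch).last_hidden_state          # [B, T, H]
    mask = batch['attention_mask'].unsqueeze(-1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)        # mean pooling over tokens
    return head(pooled)                                  # [B, 768]

# One training step on a toy batch.
translated_batch = ['Älgen är skogens konung!']          # hypothetical translated caption
teacher_targets = torch.randn(1, 768)                    # placeholder for pre-computed CLIP embeddings

loss = loss_fn(student_embed(translated_batch), teacher_targets)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```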
### Pre-computed CLIP Embeddings & Translation Data
[This Google Drive folder](https://drive.google.com/drive/folders/1I9a7naSZubUATWzLFv61DQMWyFlF7wR5?usp=sharing) contains pre-computed CLIP-Text embeddings for a large portion of the image captions of [GCC](https://ai.google.com/research/ConceptualCaptions/) + [MSCOCO](https://cocodataset.org/#home) + [VizWiz](https://vizwiz.org/tasks-and-datasets/image-captioning/).

The Google Drive folder also contains the translation data used to train the currently available models.
Good luck!

## Contribution
If you have trained a CLIP text encoder specific to your language, or another model covering a language not supported here, please feel free to contact us and we will either upload your model and credit you, or simply link to your already uploaded model.

<!-- CONTACT -->
## Contact
If you have questions regarding the code or otherwise related to this GitHub page, please open an [issue](https://github.com/FreddeFrallan/Contrastive-Tension/issues).

For other purposes, feel free to contact me directly at: Fredrik.Carlsson@ri.se

<!-- ACKNOWLEDGEMENTS -->
## Acknowledgements
* [Stability.ai](https://stability.ai/) for providing much appreciated compute during training.
* [CLIP](https://openai.com/blog/clip/)
* [OpenAI](https://openai.com/)
* [Huggingface](https://huggingface.co/)
* [Best Readme Template](https://github.com/othneildrew/Best-README-Template)
* ["Two Cats" Image by pl1602](https://search.creativecommons.org/photos/8dfd802b-58e5-4cc5-889d-96abba540de1)

<!-- LICENSE -->
## License
Distributed under the MIT License. See `LICENSE` for more information.

<!-- CITATION -->
## Citing
If you found this repository useful, please consider citing:

```bibtex
@InProceedings{carlsson-EtAl:2022:LREC,
  author    = {Carlsson, Fredrik and Eisen, Philipp and Rekathati, Faton and Sahlgren, Magnus},
  title     = {Cross-lingual and Multilingual CLIP},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference},
  month     = {June},
  year      = {2022},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {6848--6854},
  abstract  = {The long-standing endeavor of relating the textual and the visual domain recently underwent a pivotal breakthrough, as OpenAI released CLIP. This model distinguishes how well an English text corresponds with a given image with unprecedented accuracy. Trained via a contrastive learning objective over a huge dataset of 400M of images and captions, it is a work that is not easily replicated, especially for low resource languages. Capitalizing on the modularization of the CLIP architecture, we propose to use cross-lingual teacher learning to re-train the textual encoder for various non-English languages. Our method requires no image data and relies entirely on machine translation which removes the need for data in the target language. We find that our method can efficiently train a new textual encoder with relatively low computational cost, whilst still outperforming previous baselines on multilingual image-text retrieval.},
  url       = {https://aclanthology.org/2022.lrec-1.739}
}
```


<!-- MARKDOWN LINKS & IMAGES -->
<!-- https://www.markdownguide.org/basic-syntax/#reference-style-links -->
[contributors-shield]: https://img.shields.io/github/contributors/othneildrew/Best-README-Template.svg?style=for-the-badge
[contributors-url]: https://github.com/othneildrew/Best-README-Template/graphs/contributors
[forks-shield]: https://img.shields.io/github/forks/othneildrew/Best-README-Template.svg?style=for-the-badge
[forks-url]: https://github.com/othneildrew/Best-README-Template/network/members
[stars-shield]: https://img.shields.io/github/stars/othneildrew/Best-README-Template.svg?style=for-the-badge
[stars-url]: https://github.com/othneildrew/Best-README-Template/stargazers
[issues-shield]: https://img.shields.io/github/issues/othneildrew/Best-README-Template.svg?style=for-the-badge
[issues-url]: https://github.com/othneildrew/Best-README-Template/issues
[license-shield]: https://img.shields.io/github/license/othneildrew/Best-README-Template.svg?style=for-the-badge
[license-url]: https://github.com/othneildrew/Best-README-Template/blob/master/LICENSE.txt
[linkedin-shield]: https://img.shields.io/badge/-LinkedIn-black.svg?style=for-the-badge&logo=linkedin&colorB=555
[linkedin-url]: https://linkedin.com/in/othneildrew
[product-screenshot]: images/screenshot.png
Multilingual_CLIP/inference_example.py
ADDED
@@ -0,0 +1,34 @@
import transformers


def tf_example(texts, model_name='M-CLIP/XLM-Roberta-Large-Vit-L-14'):
    from multilingual_clip import tf_multilingual_clip

    model = tf_multilingual_clip.MultiLingualCLIP.from_pretrained(model_name)
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

    inData = tokenizer.batch_encode_plus(texts, return_tensors='tf', padding=True)
    embeddings = model(inData)
    print(embeddings.shape)


def pt_example(texts, model_name='M-CLIP/XLM-Roberta-Large-Vit-L-14'):
    from multilingual_clip import pt_multilingual_clip

    model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

    embeddings = model.forward(texts, tokenizer)
    print(embeddings.shape)


if __name__ == '__main__':
    exampleTexts = [
        'Three blind horses listening to Mozart.',
        'Älgen är skogens konung!',
        'Wie leben Eisbären in der Antarktis?',
        'Вы знали, что все белые медведи левши?'
    ]

    # tf_example(exampleTexts)
    pt_example(exampleTexts)
Multilingual_CLIP/larger_mclip.md
ADDED
@@ -0,0 +1,60 @@
# Multilingual CLIP 2/6-2022

## Overview
Recently, OpenAI released some of their [bigger CLIP models](https://github.com/openai/CLIP/blob/main/model-card.md). Additionally, [OpenCLIP](https://github.com/mlfoundations/open_clip) is continuing to provide their large models, which have proven to match or even outperform the OpenAI models.

Thanks to the compute provided by [Stability.ai](https://stability.ai/) and [laion.ai](https://laion.ai/), we are now happy to announce that we provide multilingual text encoders for these models!
Along with:
- Updated Inference & Training Code
- The Corresponding Machine Translated Image Caption Dataset
- PyPi package installer

<br>

None of the M-CLIP models have been extensively evaluated, but testing them on Txt2Img retrieval on the humanly translated MS-COCO dataset, we see the following **R@10** results:
| Name | En | De | Es | Fr | Zh | It | Pl | Ko | Ru | Tr | Jp |
| ----------------------------------|:-----: |:-----: |:-----: |:-----: | :-----: |:-----: |:-----: |:-----: |:-----: |:-----: |:-----: |
| [OpenAI CLIP Vit-B/32](https://github.com/openai/CLIP)| 90.3 | - | - | - | - | - | - | - | - | - | - |
| [OpenAI CLIP Vit-L/14](https://github.com/openai/CLIP)| 91.8 | - | - | - | - | - | - | - | - | - | - |
| [OpenCLIP ViT-B-16+-](https://github.com/openai/CLIP)| 94.3 | - | - | - | - | - | - | - | - | - | - |
| [LABSE Vit-L/14](https://huggingface.co/M-CLIP/LABSE-Vit-L-14)| 91.6 | 89.6 | 89.5 | 89.9 | 88.9 | 90.1 | 89.8 | 80.8 | 85.5 | 89.8 | 73.9 |
| [XLM-R Large Vit-B/32](https://huggingface.co/M-CLIP/XLM-Roberta-Large-Vit-B-32)| 91.8 | 88.7 | 89.1 | 89.4 | 89.3 | 89.8| 91.4 | 82.1 | 86.1 | 88.8 | 81.0 |
| [XLM-R Vit-L/14](https://huggingface.co/M-CLIP/XLM-Roberta-Large-Vit-L-14)| 92.4 | 90.6 | 91.0 | 90.0 | 89.7 | 91.1 | 91.3 | 85.2 | 85.8 | 90.3 | 81.9 |
| [XLM-R Large Vit-B/16+](https://huggingface.co/M-CLIP/XLM-Roberta-Large-Vit-B-16Plus)| <b>95.0</b> | <b>93.0</b> | <b>93.6</b> | <b>93.1</b> | <b>94.0</b> | <b>93.1</b> | <b>94.4</b> | <b>89.0</b> | <b>90.0</b> | <b>93.0</b> | <b>84.2</b> |

To our surprise, using M-CLIP with XLM-RoBERTa Large outperforms the original English models for English. Exactly why this is the case remains to be determined, and we plan to follow up with more extensive testing.

The ViT-L/14 model is integrated into clip-retrieval; you can test the retrieval capabilities of this multilingual encoder [there](https://rom1504.github.io/clip-retrieval/?useMclip=true&query=%E9%BB%84%E8%89%B2%E3%81%84%E7%8C%AB). This is a search over 5 billion CLIP embeddings of the LAION-5B dataset, implemented with an efficient knn index.

The training curves for these models can be found at the [Weights and Biases report](https://wandb.ai/freddefrallan/M-CLIP/reports/M-CLIP-2-6-2022--VmlldzoyMTE1MjU1/edit?firstReport&runsetFilter).

## Training Data & Machine Translation
English image captions were taken from the ViT-L filtered captions of the datasets [CC3M+CC12M+SBU](https://github.com/salesforce/BLIP#pre-training-datasets-download), which are provided by the BLIP repository.

From these 14 million captions we sampled 7 million, divided them into 48 equally sized buckets, and translated each bucket into one of the [48 target languages](https://github.com/FreddeFrallan/Multilingual-CLIP/blob/main/translation/data/fine_tune_languages.csv). This means that after translation we still end up with a total of 7 million captions, of which 7M/48 ≈ 145,833 are in, for example, Dutch.
The machine-translated captions are available at [Huggingface](https://huggingface.co/datasets/M-CLIP/ImageCaptions-7M-Translations).

Each translation was performed with the corresponding Opus model (a minimal sketch is shown at the end of this section). For more information see the [machine translation instructions](https://github.com/FreddeFrallan/Multilingual-CLIP/tree/main/translation).

It should be noted that only translated captions were used during training, meaning that none of the original English captions were included. This entails that all results for English (and for other languages not included in the 48 target languages) are due to transfer learning.

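To make the translation step concrete, here is a minimal sketch (not the repository's actual translation pipeline) of translating one bucket of English captions with a Helsinki-NLP Opus-MT model via `transformers`. The English-to-German model name and the example captions are illustrative; the real pipeline picks the Opus model matching each bucket's target language.

```python
# Sketch: translate a bucket of English captions with an Opus-MT model.
from transformers import MarianMTModel, MarianTokenizer

model_name = 'Helsinki-NLP/opus-mt-en-de'  # one example target language
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

captions = ['A man riding a bike down a city street.', 'A bowl of fruit on a table.']  # illustrative

batch = tokenizer(captions, return_tensors='pt', padding=True, truncation=True)
generated = model.generate(**batch)
translations = tokenizer.batch_decode(generated, skip_special_tokens=True)
print(translations)
```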
## Training Details
All released models used essentially the same hyperparameters. The details are available at the [Weights and Biases project](https://wandb.ai/freddefrallan/M-CLIP?workspace=user-freddefrallan).

Following is a short list of some of the shared hyperparameters (a small sketch of the warmup schedule is shown below):
- Batch size of 2048 samples.
- Adam optimizer with a target learning rate of 10^-5, with a linear warmup schedule for 1k update steps.
- 5000 randomly sampled validation samples.

All models were allowed to train until the validation MSE loss had converged. For most models this took about 24 hours, using 8 Nvidia A100 GPUs. No early stopping was performed in regard to the Image-Text retrieval tasks.

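As a small illustration of the listed optimizer settings (a sketch only; the released training code is TensorFlow-based), a linear ramp to the target learning rate over the first 1k update steps could look like this in PyTorch. The tiny stand-in model and random data are placeholders.

```python
# Sketch: Adam with a linear warmup to the target learning rate over 1k steps, MSE objective.
import torch

model = torch.nn.Linear(1024, 768)                      # stand-in for the text encoder + projection head
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

warmup_steps = 1000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))

for step in range(3000):                                # training loop placeholder
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(torch.randn(8, 1024)), torch.randn(8, 768))
    loss.backward()
    optimizer.step()
    scheduler.step()                                    # lr ramps linearly, then stays at 1e-5
```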
## Additional Experiments
In addition to the released models, we also performed some experiments that yielded negative or unsubstantial results. The training curves and specific settings for most of these additional experiments can be found at the [Weights and Biases project](https://wandb.ai/freddefrallan/M-CLIP?workspace=user-freddefrallan).

Following is a summary of things we tried:

- Optimizing the Cosine-Similarity instead of minimizing the mean-squared error: **No noticeable performance difference**.
- MBERT-BASE as encoder: **Worse performance than LaBSE**
- USE-CML: **Worse performance than LaBSE**
- Adding an additional TanH layer to XLM-R Large: **No substantial performance difference, although it achieved slightly faster learning at the start.**
- Using the first *([CLS]?)* token as sentence embedding, instead of mean-pooling, for XLM-R Large: **Significantly worse performance. *(Perhaps due to the lack of a Next-Sentence Prediction task in the RoBERTa pre-training?)***
Multilingual_CLIP/legacy_get-weights.sh
ADDED
@@ -0,0 +1,20 @@
#

OUTPATH=$PWD/data/weights

mkdir -p $OUTPATH

URLSWECLIP=https://www.dropbox.com/s/s77xw5308jeljlp/Swedish-500k%20Linear%20Weights.pkl
wget -c "${URLSWECLIP}" -P $OUTPATH

URLSWECLIP2M=https://www.dropbox.com/s/82c54rsvlry3kwh/Swedish-2M%20Linear%20Weights.pkl
wget -c "${URLSWECLIP2M}" -P $OUTPATH

URLMCLIP=https://www.dropbox.com/s/oihqzctnty5e9kk/M-BERT%20Distil%2040%20Linear%20Weights.pkl
wget -c "${URLMCLIP}" -P $OUTPATH

URLMCLIPBASE=https://www.dropbox.com/s/y4pycinv0eapeb3/M-BERT-Base-69%20Linear%20Weights.pkl
wget -c "${URLMCLIPBASE}" -P $OUTPATH

URLMCLIPBASEVIT=https://www.dropbox.com/s/2oxu7hw0y9fwdqs/M-BERT-Base-69-ViT%20Linear%20Weights.pkl
wget -c "${URLMCLIPBASEVIT}" -P $OUTPATH
Multilingual_CLIP/legacy_inference.py
ADDED
@@ -0,0 +1,13 @@
from multilingual_clip.legacy_multilingual_clip import MultilingualClip

model_path = 'M-CLIP/Swedish-500k'
tok_path = 'M-CLIP/Swedish-500k'
head_weight_path = 'data/weights/Swe-CLIP Linear Weights.pkl'

sweclip_args = {'model_name': model_path,
                'tokenizer_name': tok_path,
                'head_path': head_weight_path}

sweclip = MultilingualClip(**sweclip_args)

print(sweclip('test'))