SajjadAyoubi committed
Commit: 199a6ad
Parent(s): c18ee2b
Update README.md
README.md CHANGED
@@ -19,8 +19,10 @@ tokenizer = AutoTokenizer.from_pretrained('SajjadAyoubi/clip-fa-text')
 text = 'something'
 image = PIL.Image.open('my_favorite_image.jpg')
 # compute embeddings
-text_embedding = text_encoder(**tokenizer(text,
-
+text_embedding = text_encoder(**tokenizer(text,
+                                          return_tensors='pt')).pooler_output
+image_embedding = vision_encoder(**preprocessor(image,
+                                                return_tensors='pt')).pooler_output
 text_embedding.shape == image_embedding.shape
 ```
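For reference, a self-contained version of the snippet this hunk completes might look like the sketch below. Only the text checkpoint `SajjadAyoubi/clip-fa-text` appears in the hunk header; the vision checkpoint name `SajjadAyoubi/clip-fa-vision`, the encoder classes, and the final cosine-similarity step are assumptions added for illustration.

```python
import PIL.Image
import torch
from transformers import AutoTokenizer, CLIPFeatureExtractor, CLIPVisionModel, RobertaModel

# Text tower (checkpoint name taken from the hunk header above).
tokenizer = AutoTokenizer.from_pretrained('SajjadAyoubi/clip-fa-text')
text_encoder = RobertaModel.from_pretrained('SajjadAyoubi/clip-fa-text')

# Vision tower (checkpoint name and classes are assumptions, not shown in this diff).
preprocessor = CLIPFeatureExtractor.from_pretrained('SajjadAyoubi/clip-fa-vision')
vision_encoder = CLIPVisionModel.from_pretrained('SajjadAyoubi/clip-fa-vision')

text = 'something'
image = PIL.Image.open('my_favorite_image.jpg')

# compute embeddings: both encoders expose a pooled vector of the same size
with torch.no_grad():
    text_embedding = text_encoder(**tokenizer(text, return_tensors='pt')).pooler_output
    image_embedding = vision_encoder(**preprocessor(image, return_tensors='pt')).pooler_output
assert text_embedding.shape == image_embedding.shape

# in a shared space, cosine similarity scores how well the caption matches the image
print(torch.nn.functional.cosine_similarity(text_embedding, image_embedding).item())
```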
|
@@ -30,7 +32,7 @@ The following are just some use cases of CLIPfa on 25K [`Unsplash images`](http
 ```python
 from clipfa import CLIPDemo
 demo = CLIPDemo(vision_encoder, text_encoder, tokenizer)
-demo.compute_text_embeddings(['
+demo.compute_text_embeddings(['گاو' ,'اسب' ,'ماهی'])  # Persian for 'cow', 'horse', 'fish'
 demo.compute_image_embeddings(test_df.image_path.to_list())
 ```
 ### Image Search:
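The image-search demo built on these embeddings lives in the CLIPfa repo; as an illustration of what such a search reduces to once `compute_image_embeddings` has run, here is a minimal sketch. The function name `search_images` and its argument layout are hypothetical, not part of the `CLIPDemo` API shown above.

```python
import torch
import torch.nn.functional as F

def search_images(query, image_paths, image_embeddings, text_encoder, tokenizer, k=5):
    """Hypothetical helper: rank images by cosine similarity to a Persian text query.

    `image_embeddings` is assumed to be a (num_images, dim) tensor of pooled
    vision-encoder outputs, e.g. what CLIPDemo.compute_image_embeddings produces.
    """
    with torch.no_grad():
        query_embedding = text_encoder(**tokenizer(query, return_tensors='pt')).pooler_output
    # normalize both sides so the dot product is a cosine similarity
    query_embedding = F.normalize(query_embedding, dim=-1)
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    scores = (image_embeddings @ query_embedding.T).squeeze(-1)
    top = torch.topk(scores, k=min(k, len(image_paths)))
    return [(image_paths[i], scores[i].item()) for i in top.indices.tolist()]

# e.g. search_images('غروب خورشید', test_df.image_path.to_list(), image_embeddings, text_encoder, tokenizer)
# ('غروب خورشید' is Persian for 'sunset'; the query string is just an example)
```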
@@ -74,7 +76,7 @@ We used a small set of images (25K) to keep this app almost real-time, but it's
 ## Dataset: 400K
 We started with the question of how much the original CLIP model depends on the large number of conceptual samples in its training set. Our model shows that it is possible to reach an acceptable target with only a small amount of data, even though it may not have seen enough concepts and subjects to be used widely. The model was trained on a dataset gathered from several sources, such as Flickr30k, MS-COCO 2017, and Google CC3M, which we translated into Persian with a [`tool`](https://github.com/sajjjadayobi/CLIPfa/blob/main/clipfa/data/translation.py) we built ourselves. Combining Google Translate with a multilingual similarity check, this automatic translator takes a list of English captions and keeps only the best translations (a sketch of the filtering idea appears after this hunk).

-- Note: We used [`image2ds`](https://github.com/rom1504/img2dataset) a great tool to download large scale image datasets such as MS-COCO. It can download, resize and package 100M
+- Note: We used [`img2dataset`](https://github.com/rom1504/img2dataset), a great tool for downloading large-scale image datasets such as MS-COCO. It can download, resize, and package 100M URLs in 20 hours on one machine, and it also supports saving captions for url+caption datasets (a usage sketch appears below).
 - [`coco-flickr-fa 130K on Kaggle`](https://www.kaggle.com/navidkanaani/coco-flickr-farsi)
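The actual translator is the `translation.py` tool linked above; the sketch below only illustrates the "Google Translate plus multilingual similarity check" idea it describes. The `translate_to_persian` helper is a hypothetical stand-in for the translation backend, and the sentence-embedding model and threshold are assumptions.

```python
from sentence_transformers import SentenceTransformer, util

# Multilingual sentence encoder used only to score translation quality (model choice is an assumption).
scorer = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

def translate_to_persian(caption):
    """Hypothetical stand-in for the Google Translate call made by the repo's translation tool."""
    raise NotImplementedError

def filter_translations(english_captions, threshold=0.8):
    """Keep only caption pairs whose embeddings stay close across languages."""
    kept = []
    for en in english_captions:
        fa = translate_to_persian(en)
        emb_en, emb_fa = scorer.encode([en, fa], convert_to_tensor=True)
        if util.cos_sim(emb_en, emb_fa).item() >= threshold:
            kept.append((en, fa))
    return kept
```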
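For completeness, driving img2dataset from Python over a url+caption file looks roughly like the sketch below. The keyword-argument names are recalled from the img2dataset documentation and should be checked against the installed version; the file and column names are placeholders.

```python
from img2dataset import download

# Download, resize, and package a url+caption dataset (all names below are placeholders;
# verify the keyword arguments against the img2dataset docs for your installed version).
download(
    url_list='captions_fa.parquet',   # placeholder parquet with one URL and one Persian caption per row
    input_format='parquet',
    url_col='url',
    caption_col='caption',
    image_size=256,
    output_format='webdataset',
    output_folder='downloaded_images',
    processes_count=8,
    thread_count=32,
)
```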