SajjadAyoubi committed
Commit: 199a6ad
Parent(s): c18ee2b
Update README.md
README.md CHANGED
@@ -19,8 +19,10 @@ tokenizer = AutoTokenizer.from_pretrained('SajjadAyoubi/clip-fa-text')
 text = 'something'
 image = PIL.Image.open('my_favorite_image.jpg')
 # compute embeddings
-text_embedding = text_encoder(**tokenizer(text,
-
+text_embedding = text_encoder(**tokenizer(text,
+                                          return_tensors='pt')).pooler_output
+image_embedding = vision_encoder(**preprocessor(image,
+                                                return_tensors='pt')).pooler_output
 text_embedding.shape == image_embedding.shape
 ```
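For reference, a self-contained version of the snippet this hunk completes might look like the sketch below. Only the text checkpoint `SajjadAyoubi/clip-fa-text` appears in the hunk header; the vision checkpoint name `SajjadAyoubi/clip-fa-vision`, the encoder classes, and the final cosine-similarity step are assumptions added for illustration.

```python
import PIL.Image
import torch
from transformers import AutoTokenizer, CLIPFeatureExtractor, CLIPVisionModel, RobertaModel

# Text tower (checkpoint name taken from the hunk header above).
tokenizer = AutoTokenizer.from_pretrained('SajjadAyoubi/clip-fa-text')
text_encoder = RobertaModel.from_pretrained('SajjadAyoubi/clip-fa-text')

# Vision tower (checkpoint name and classes are assumptions, not shown in this diff).
preprocessor = CLIPFeatureExtractor.from_pretrained('SajjadAyoubi/clip-fa-vision')
vision_encoder = CLIPVisionModel.from_pretrained('SajjadAyoubi/clip-fa-vision')

text = 'something'
image = PIL.Image.open('my_favorite_image.jpg')

# compute embeddings: both encoders expose a pooled vector of the same size
with torch.no_grad():
    text_embedding = text_encoder(**tokenizer(text, return_tensors='pt')).pooler_output
    image_embedding = vision_encoder(**preprocessor(image, return_tensors='pt')).pooler_output
assert text_embedding.shape == image_embedding.shape

# in a shared space, cosine similarity scores how well the caption matches the image
print(torch.nn.functional.cosine_similarity(text_embedding, image_embedding).item())
```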
|
@@ -30,7 +32,7 @@ The following are just some use cases of CLIPfa on 25K [`Unsplash images`](http
 ```python
 from clipfa import CLIPDemo
 demo = CLIPDemo(vision_encoder, text_encoder, tokenizer)
-demo.compute_text_embeddings(['
+demo.compute_text_embeddings(['گاو' ,'اسب' ,'ماهی'])  # Persian for 'cow', 'horse', 'fish'
 demo.compute_image_embeddings(test_df.image_path.to_list())
 ```
 ### Image Search:
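The image-search demo built on these embeddings lives in the CLIPfa repo; as an illustration of what such a search reduces to once `compute_image_embeddings` has run, here is a minimal sketch. The function name `search_images` and its argument layout are hypothetical, not part of the `CLIPDemo` API shown above.

```python
import torch
import torch.nn.functional as F

def search_images(query, image_paths, image_embeddings, text_encoder, tokenizer, k=5):
    """Hypothetical helper: rank images by cosine similarity to a Persian text query.

    `image_embeddings` is assumed to be a (num_images, dim) tensor of pooled
    vision-encoder outputs, e.g. what CLIPDemo.compute_image_embeddings produces.
    """
    with torch.no_grad():
        query_embedding = text_encoder(**tokenizer(query, return_tensors='pt')).pooler_output
    # normalize both sides so the dot product is a cosine similarity
    query_embedding = F.normalize(query_embedding, dim=-1)
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    scores = (image_embeddings @ query_embedding.T).squeeze(-1)
    top = torch.topk(scores, k=min(k, len(image_paths)))
    return [(image_paths[i], scores[i].item()) for i in top.indices.tolist()]

# e.g. search_images('غروب خورشید', test_df.image_path.to_list(), image_embeddings, text_encoder, tokenizer)
# ('غروب خورشید' is Persian for 'sunset'; the query string is just an example)
```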
@@ -74,7 +76,7 @@ We used a small set of images (25K) to keep this app almost real-time, but it's
 ## Dataset: 400K
 We started with the question of how much the original CLIP model depends on the large number of conceptual samples in its training set. Our model shows that it is possible to reach an acceptable target with only a small amount of data, even though it may not have seen enough concepts and subjects to be used widely. The model was trained on a dataset gathered from several sources, such as Flickr30k, MS-COCO 2017, and Google CC3M, which we translated into Persian with a [`tool`](https://github.com/sajjjadayobi/CLIPfa/blob/main/clipfa/data/translation.py) we built ourselves. Combining Google Translate with a multilingual similarity check, this automatic translator takes a list of English captions and keeps only the best translations (a sketch of the filtering idea appears after this hunk).

-- Note: We used [`image2ds`](https://github.com/rom1504/img2dataset) a great tool to download large scale image datasets such as MS-COCO. It can download, resize and package 100M
+- Note: We used [`img2dataset`](https://github.com/rom1504/img2dataset), a great tool for downloading large-scale image datasets such as MS-COCO. It can download, resize, and package 100M URLs in 20 hours on one machine, and it also supports saving captions for url+caption datasets (a usage sketch appears below).
 - [`coco-flickr-fa 130K on Kaggle`](https://www.kaggle.com/navidkanaani/coco-flickr-farsi)
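The actual translator is the `translation.py` tool linked above; the sketch below only illustrates the "Google Translate plus multilingual similarity check" idea it describes. The `translate_to_persian` helper is a hypothetical stand-in for the translation backend, and the sentence-embedding model and threshold are assumptions.

```python
from sentence_transformers import SentenceTransformer, util

# Multilingual sentence encoder used only to score translation quality (model choice is an assumption).
scorer = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

def translate_to_persian(caption):
    """Hypothetical stand-in for the Google Translate call made by the repo's translation tool."""
    raise NotImplementedError

def filter_translations(english_captions, threshold=0.8):
    """Keep only caption pairs whose embeddings stay close across languages."""
    kept = []
    for en in english_captions:
        fa = translate_to_persian(en)
        emb_en, emb_fa = scorer.encode([en, fa], convert_to_tensor=True)
        if util.cos_sim(emb_en, emb_fa).item() >= threshold:
            kept.append((en, fa))
    return kept
```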
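For completeness, driving img2dataset from Python over a url+caption file looks roughly like the sketch below. The keyword-argument names are recalled from the img2dataset documentation and should be checked against the installed version; the file and column names are placeholders.

```python
from img2dataset import download

# Download, resize, and package a url+caption dataset (all names below are placeholders;
# verify the keyword arguments against the img2dataset docs for your installed version).
download(
    url_list='captions_fa.parquet',   # placeholder parquet with one URL and one Persian caption per row
    input_format='parquet',
    url_col='url',
    caption_col='caption',
    image_size=256,
    output_format='webdataset',
    output_folder='downloaded_images',
    processes_count=8,
    thread_count=32,
)
```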