SajjadAyoubi committed · Commit 729414a · Parent: 5806018

Update README.md

---
language:
- fa
---

<span align="center">
<a href="https://huggingface.co/spaces/SajjadAyoubi/CLIPfa-Demo"><img src="https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Hugging%20Face&message=HF Demo&color=blue"></a>
<a href="https://huggingface.co/SajjadAyoubi/"><img src="https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Hugging%20Face&message=Models&color=red"></a>
</span>

# CLIPfa: Connecting Farsi Text and Images
OpenAI released the paper [`Learning Transferable Visual Models From Natural Language Supervision`](https://arxiv.org/abs/2103.00020), in which they present the CLIP (Contrastive Language–Image Pre-training) model. CLIP is trained to connect text and images by matching their vector representations with a contrastive learning objective. It consists of two separate models, a vision encoder and a text encoder, which were trained on 400 million images and their corresponding captions. In this work, we've trained a **Tiny Farsi (Persian)** version of [`OpenAI's CLIP`](https://openai.com/blog/clip/) on a crawled dataset of 400,000 (image, text) pairs. We used [`Farahani's RoBERTa-fa`](https://huggingface.co/m3hrdadfi/roberta-zwnj-wnli-mean-tokens) as the text encoder and the original [`CLIP ViT`](https://huggingface.co/openai/clip-vit-base-patch32) as the vision encoder, and fine-tuned both.
![CLIPfa image](https://github.com/sajjjadayobi/CLIPfa/blob/main/assets/clipfa.png)
Keep in mind that this model was trained on only 400K pairs, whereas the original CLIP was trained on 400M pairs, a process that took 30 days across 592 V100 GPUs.

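For intuition, the contrastive objective scores every image against every caption in a batch and trains both encoders so that the matching pairs come out on top. The snippet below is a minimal sketch of such a loss in PyTorch; it is illustrative only, not the exact training code of this repository, and the temperature value is an assumption.
```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (batch, 768) outputs of the vision and text encoders
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # pairwise cosine similarities, scaled by a temperature
    logits = image_emb @ text_emb.t() / temperature
    # the i-th image in the batch matches the i-th caption
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```
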
## How to use?
You can use these models off the shelf. Both encoders produce 768-dimensional vectors.
```python
from transformers import CLIPVisionModel, RobertaModel, AutoTokenizer, CLIPFeatureExtractor
from PIL import Image

# download the pre-trained models
vision_encoder = CLIPVisionModel.from_pretrained('SajjadAyoubi/clip-fa-vision')
preprocessor = CLIPFeatureExtractor.from_pretrained('SajjadAyoubi/clip-fa-vision')
text_encoder = RobertaModel.from_pretrained('SajjadAyoubi/clip-fa-text')
tokenizer = AutoTokenizer.from_pretrained('SajjadAyoubi/clip-fa-text')

# define the input text and input image
text = 'whatever you want'
image = Image.open(image_path)  # image_path: path to any local image file

# compute embeddings
text_embedding = text_encoder(**tokenizer(text, return_tensors='pt')).pooler_output
image_embedding = vision_encoder(**preprocessor(image, return_tensors='pt')).pooler_output
assert text_embedding.shape == image_embedding.shape  # both are (1, 768)
```
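
Once you have both embeddings, comparing them is all you need for retrieval or classification. Continuing from the snippet above, a minimal sketch of scoring the caption against the image with cosine similarity:
```python
import torch.nn.functional as F

# cosine similarity between the (1, 768) text and image embeddings;
# a higher value means the caption describes the image better
similarity = F.cosine_similarity(text_embedding, image_embedding)
print(similarity.item())
```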

## Demo:
The following are a few example use cases of CLIPfa on 25K [`Unsplash images`](https://github.com/unsplash/datasets).
- Install the demo package with `pip install -q git+https://github.com/sajjjadayobi/clipfa.git`
```python
from clipfa import CLIPDemo

demo = CLIPDemo(vision_encoder, text_encoder, tokenizer)
demo.compute_text_embeddings(['سیب', 'موز', 'آلبالو'])  # 'apple', 'banana', 'sour cherry'
demo.compute_image_embeddings(test_df.image_path.to_list())  # test_df holds the image paths
```
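
`compute_image_embeddings` presumably just batches the images through the vision encoder and caches the resulting vectors. A rough, hypothetical sketch of that step (the actual `CLIPDemo` implementation in the repository may differ; `embed_images` is a made-up helper name):
```python
import torch
from PIL import Image

@torch.no_grad()
def embed_images(image_paths, vision_encoder, preprocessor, batch_size=32):
    """Hypothetical helper: encode images in batches and return one matrix of embeddings."""
    embeddings = []
    for i in range(0, len(image_paths), batch_size):
        images = [Image.open(p).convert('RGB') for p in image_paths[i:i + batch_size]]
        inputs = preprocessor(images, return_tensors='pt')
        embeddings.append(vision_encoder(**inputs).pooler_output)
    return torch.cat(embeddings)  # (num_images, 768)
```
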
### Image Search:
```python
demo.image_search(query='غروب خورشید')  # 'sunset'
```
![](https://github.com/sajjjadayobi/CLIPfa/blob/main/assets/image_search.png)

```python
demo.image_search(query='جنگل در زمستان برفی')  # 'a forest in snowy winter'
```
![](https://github.com/sajjjadayobi/CLIPfa/blob/main/assets/forest%20in%20winter.png)

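Image search with embeddings is essentially nearest-neighbour lookup: encode the query text once, then rank the pre-computed image embeddings by cosine similarity. A hypothetical sketch of that ranking step (not the repository's actual `image_search` code; `rank_images` is a made-up helper):
```python
import torch.nn.functional as F

def rank_images(query_embedding, image_embeddings, image_paths, top_k=5):
    """Hypothetical ranking step: return the top_k image paths most similar to the query."""
    scores = F.cosine_similarity(query_embedding, image_embeddings)  # (num_images,)
    best = scores.topk(top_k).indices.tolist()
    return [image_paths[i] for i in best]
```
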
### Analogy:
```python
demo.anology('sunset.jpg', additional_text='دریا')  # 'sea'
```
![](https://github.com/sajjjadayobi/CLIPfa/blob/main/assets/analogy-sea.png)

```python
demo.anology('sunset.jpg', additional_text='برف')  # 'snow'
```
![](https://github.com/sajjjadayobi/CLIPfa/blob/main/assets/analogy-snow.png)

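The analogy demo presumably mixes the embedding of the reference image with the embedding of the additional text and retrieves images with the combined vector, in the spirit of word-vector arithmetic. A hypothetical sketch under that assumption (`analogy_query` is a made-up helper, not the repository's `anology` method):
```python
import torch.nn.functional as F

def analogy_query(image_embedding, text_embedding, image_embeddings, image_paths, top_k=5):
    """Hypothetical: mix an image embedding with a text embedding, then retrieve similar images."""
    combined = F.normalize(image_embedding, dim=-1) + F.normalize(text_embedding, dim=-1)
    scores = F.cosine_similarity(combined, image_embeddings)  # (num_images,)
    best = scores.topk(top_k).indices.tolist()
    return [image_paths[i] for i in best]
```
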
### Zero Shot Image Classification:
```python
demo.zero_shot(image_path='apples.jpg')
```
- Probabilities (in percent) for the provided labels گاو (cow), ماهی (fish), and اسب (horse) on each image; the highest score per image is shown in bold.

| گاو: 36, ماهی: 22, اسب: **42** | گاو: **41**, ماهی: 23, اسب: 36 | گاو: 26, ماهی: **45**, اسب: 27 |
| :---: | :---: | :---: |
| ![image](https://github.com/sajjjadayobi/CLIPfa/blob/main/assets/horse.jpg) | ![image](https://github.com/sajjjadayobi/CLIPfa/blob/main/assets/cow.jpg) | ![image](https://github.com/sajjjadayobi/CLIPfa/blob/main/assets/fish.jpg) |

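Zero-shot classification with a CLIP-style model amounts to scoring the image embedding against the embedding of each candidate label and turning the scores into probabilities with a softmax. A minimal, hypothetical sketch of that idea (not the repository's `zero_shot` implementation; the scaling factor is an assumption):
```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image, labels, vision_encoder, text_encoder, tokenizer, preprocessor):
    """Hypothetical zero-shot classifier: softmax over image-label cosine similarities."""
    image_emb = vision_encoder(**preprocessor(image, return_tensors='pt')).pooler_output
    label_embs = text_encoder(**tokenizer(labels, return_tensors='pt', padding=True)).pooler_output
    scores = F.cosine_similarity(image_emb, label_embs)  # (num_labels,)
    probs = F.softmax(scores * 100, dim=0)                # CLIP-style sharpening before softmax
    return dict(zip(labels, probs.tolist()))
```
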
### Online Demo: [CLIPfa at Huggingface🤗 spaces](https://huggingface.co/spaces/SajjadAyoubi/CLIPfa-Demo)
We used a small set of images (25K) to keep this app close to real-time, but the quality of image search naturally depends heavily on the size of the image database.

![](https://github.com/sajjjadayobi/CLIPfa/blob/main/assets/hf-spaces.png)

## Dataset:
I was curious how much of CLIP's power comes from training on a huge dataset. Our dataset consists of 400K (image, text) pairs drawn from filtered and translated versions of Flickr30K, MS-COCO, and CC3M.
- Note: We used [`img2dataset`](https://github.com/rom1504/img2dataset), a great tool for downloading large-scale image datasets such as MS-COCO. It can download, resize, and package 100M URLs in 20h on a single machine, and it also supports saving captions for url+caption datasets; a hypothetical invocation is sketched below.

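A rough example of how such a download could look with img2dataset's Python API; the file name, column names, and argument values here are placeholders, and the exact options should be checked against the img2dataset documentation:
```python
from img2dataset import download

# placeholder input: a TSV file with 'url' and 'caption' columns
download(
    url_list='pairs.tsv',
    input_format='tsv',
    url_col='url',
    caption_col='caption',
    image_size=224,
    output_folder='dataset',
    output_format='webdataset',
    thread_count=64,
)
```
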
## Training: <a href="https://colab.research.google.com/github/sajjjadayobi/CLIPfa/blob/main/notebook/CLIPfa_Training.ipynb"><img src="https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Hugging%20Face&message=CLIPfa Training&color=white"></a>
Any dataset can be used with little change to the [`training code`](https://github.com/sajjjadayobi/CLIPfa/tree/main/clipfa). CLIPfa can also be trained with other encoders, as long as they have the same hidden size in their last layer; a quick compatibility check is sketched below. In [`this notebook`](https://github.com/sajjjadayobi/CLIPfa/blob/main/notebook/CLIPfa_Training.ipynb) I used the [`training code`](https://github.com/sajjjadayobi/CLIPfa/tree/main/clipfa) to train a small CLIP on the translated Flickr30K dataset.

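For example, continuing from the "How to use?" snippet, a one-line sanity check that a pair of encoders can be plugged into the same contrastive setup:
```python
# both encoders must project into the same embedding width (768 for CLIPfa)
assert vision_encoder.config.hidden_size == text_encoder.config.hidden_size
```
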
## Citation: ↩️
If you have a technical question regarding the model, code, or publication, please create an issue in the repository.
We didn't publish any paper on this work; however, if you use it, please cite us with an entry like the one below.
```bibtex
@misc{CLIPfa,
  author = {Sajjad Ayoubi},
  title = {CLIPfa: Connecting Farsi Text and Images},
  year = 2021,
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/SajjjadAyobi/CLIPfa}},
}
```
> Made with ❤️ in my basement🤫