---
language:
- th
- en
tags:
- openthaigpt
license: mit
---
<br>

### Introduction
Language-image pretraining models such as CLIP (Contrastive Language-Image Pre-Training) are the foundational technology for prompt-driven generative models. CLIP aligns the latent spaces of an image encoder and a text encoder, so the resulting latent vectors can be used for zero-shot classification and image search. For a prompt-driven generative model, we can train the generative model on a frozen image encoder and then, in the inference pipeline, replace the image encoder with the text encoder so that text serves as the prompt.

**Scope of work**

Because computing resources, datasets, and engineering time are limited, we propose to train the CLIP model in two stages:
- **Stage 1:** Language encoder distillation training
We will train a Thai (or bilingual EN-TH) text encoder with the original CLIP encoder as the teacher, following Multilingual-CLIP, using EN-EN and EN-TH text pairs from machine translation datasets.
- **Stage 2:** Continued CLIP pretraining with a frozen image encoder
A model trained by distillation may not understand every token, especially domain-specific words. We therefore continue CLIP (or LiT, or SigLiT) pretraining with a frozen image encoder so that the text encoder learns these words in detail.

Once we have our own CLIP model, we will replace the text encoder of CLIP-based applications with our own, or fine-tune the application model to further improve performance.
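
The two stages can be sketched as two loss functions. This is a minimal NumPy illustration, not the project's training code: the mean-squared-error distillation objective, the symmetric contrastive objective, the `temperature` value, and all function names are assumptions chosen for clarity.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def distillation_loss(student_emb: np.ndarray, teacher_emb: np.ndarray) -> float:
    """Stage 1 (assumed objective): MSE between the student embedding of a
    Thai/English sentence and the teacher embedding of its English pair."""
    return float(np.mean((student_emb - teacher_emb) ** 2))


def contrastive_loss(image_emb: np.ndarray, text_emb: np.ndarray,
                     temperature: float = 0.07) -> float:
    """Stage 2 (assumed objective): symmetric cross-entropy over image-text
    cosine similarities. `image_emb` would come from the frozen image
    encoder, so only the text encoder receives updates."""
    logits = l2_normalize(image_emb) @ l2_normalize(text_emb).T / temperature

    def matched_pair_nll(scores: np.ndarray) -> float:
        # Log-softmax per row; the matching pair sits on the diagonal.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return float(-np.mean(np.diag(log_probs)))

    return 0.5 * (matched_pair_nll(logits) + matched_pair_nll(logits.T))
```

In a real training loop these would be written with an autodiff framework; the NumPy version only shows what each stage optimizes.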

## How to use
- #### Install Python package
```bash
pip install thai2transformers==0.1.2
```
- ### Preprocessing

Texts are preprocessed by [process_transformers](https://github.com/vistec-AI/thai2transformers/blob/master/thai2transformers/preprocess.py), which applies the following rules:

- Replace HTML forms of characters with the actual characters, such as `&nbsp;` with a space and `<br />` with a line break [[Howard and Ruder, 2018]](https://arxiv.org/abs/1801.06146).
- Remove empty brackets ((), {}, and []) that sometimes come up as a result of text extraction, such as from Wikipedia.
- Replace line breaks with spaces.
- Replace multiple consecutive spaces with a single space.
- Remove more than 3 repetitive characters, such as ดีมากกก to ดีมาก [[Howard and Ruder, 2018]](https://arxiv.org/abs/1801.06146).
- Word-level tokenization using the `newmm` dictionary-based maximal matching tokenizer of [[Phatthiyaphaibun et al., 2020]](https://zenodo.org/record/4319685#.YA4xEGQzaDU).
- Replace repetitive words; this is done post-tokenization, unlike [[Howard and Ruder, 2018]](https://arxiv.org/abs/1801.06146), since Thai has no delimitation by space as English does.
- Replace spaces with `<_>`. The SentencePiece tokenizer combines spaces with other tokens. Since spaces serve as punctuation in Thai, marking boundaries such as sentence ends much like periods in English, combining them with other tokens would omit an important feature for tasks such as word tokenization and sentence breaking. Therefore, we opt to explicitly mark spaces with `<_>`.
<br>
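
The character-level rules above can be approximated with the standard library. This is a rough sketch for illustration only, not the `thai2transformers` implementation: `simple_preprocess` is a hypothetical helper, the regular expressions are assumptions based on the wording of the rules, and the word-level steps (`newmm` tokenization and repeated-word replacement) are left out.

```python
import html
import re

SPACE_TOKEN = "<_>"


def simple_preprocess(text: str) -> str:
    # Decode HTML entities; &nbsp; decodes to U+00A0, so map it to a plain space.
    text = html.unescape(text).replace("\xa0", " ")
    # Turn <br> / <br /> tags into line breaks.
    text = re.sub(r"<br\s*/?>", "\n", text)
    # Drop empty brackets left over from text extraction.
    text = re.sub(r"\(\s*\)|\{\s*\}|\[\s*\]", "", text)
    # Line breaks become spaces, then runs of spaces collapse to one.
    text = text.replace("\n", " ")
    text = re.sub(r" {2,}", " ", text).strip()
    # Collapse a character repeated more than 3 times into a single one.
    text = re.sub(r"(\S)\1{3,}", r"\1", text)
    # Mark spaces explicitly so the subword tokenizer cannot merge them.
    return text.replace(" ", SPACE_TOKEN)


print(simple_preprocess("good&nbsp;day<br />see  you () soooo much"))
# good<_>day<_>see<_>you<_>so<_>much
```

For real inputs, use `process_transformers` itself so that the text matches what the model saw during training.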

- #### How to load the text encoder

```python
from transformers import AutoModel,AutoProcessor
from thai2transformers.preprocess import process_transformers
model = AutoModel.from_pretrained("openthaigpt/CLIPTextCamembertModelWithProjection", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("openthaigpt/CLIPTextCamembertModelWithProjection", trust_remote_code=True)

input_text = ["This is dog",
              "how are you today",
              "สวัสดีครับ วันนี้อากาศร้อนมาก"]
processed_input_text = [process_transformers(input_text_) for input_text_ in input_text ]

text_tokens = processor(text=processed_input_text, padding=True, return_tensors="pt")
embedding = model(**text_tokens).text_embeds

print(embedding,embedding.shape)
```
- #### Output:
```python
tensor([[ 0.0318,  0.0341, -0.1317,  ..., -0.2763, -0.2103,  0.0968],
        [ 0.0579, -0.1373, -0.0293,  ..., -0.3926, -0.2002, -0.0497],
        [ 0.0303,  0.0440,  0.0217,  ..., -0.3282, -0.0100, -0.0757]],
       grad_fn=<MmBackward0>) torch.Size([3, 512])
```

## Example of model usage

- ### Zero shot classification

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoModel, AutoProcessor, CLIPModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load image model and processor.
image_processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
image_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)

# Load text model and processor.
text_processor = AutoProcessor.from_pretrained("openthaigpt/CLIPTextCamembertModelWithProjection", trust_remote_code=True)
text_model = AutoModel.from_pretrained("openthaigpt/CLIPTextCamembertModelWithProjection", trust_remote_code=True).to(device)

# Images to classify; the path is a placeholder for your own file.
images = [Image.open("path/to/image.jpg")]

class_labels = ['แมว', 'หมา', 'นก']  # cat, dog, bird
label2id = {label: i for i, label in enumerate(class_labels)}

with torch.no_grad():
    # Embed the class labels and L2-normalize them.
    inputs = text_processor(text=class_labels, padding=True, return_tensors="pt")
    inputs = {name: tensor.to(device) for name, tensor in inputs.items()}
    text_embeddings = text_model(**inputs).text_embeds
    text_embeddings /= text_embeddings.norm(dim=1, keepdim=True)

    # Embed the images and L2-normalize them.
    inputs = image_processor(images=images, return_tensors="pt")
    inputs = {name: tensor.to(device) for name, tensor in inputs.items()}
    image_embeddings = image_model.get_image_features(**inputs)
    image_embeddings /= image_embeddings.norm(dim=1, keepdim=True)

# Cosine similarity between every image and every label.
similarities = torch.mm(image_embeddings, text_embeddings.t())
probabilities = F.softmax(similarities, dim=1)
indices = torch.argmax(probabilities, dim=1)

probabilities = probabilities.cpu()
indices = indices.cpu()

predict = [class_labels[i] for i in indices]

```
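
The scoring step can be checked in isolation with small hand-made vectors in place of real embeddings. The sketch below is illustrative only: `zero_shot_predict` and its `scale` parameter are hypothetical, and the toy vectors are made up. CLIP-style models typically multiply the similarities by a scale factor before the softmax, which sharpens the probabilities without changing the predicted label.

```python
import numpy as np


def zero_shot_predict(image_emb, text_emb, labels, scale=1.0):
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = scale * (image_emb @ text_emb.T)
    # Numerically stable softmax over the label axis.
    exp = np.exp(sims - sims.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    return [labels[i] for i in probs.argmax(axis=1)], probs


labels = ["แมว", "หมา", "นก"]
text_emb = np.eye(3)  # one toy embedding per label
image_emb = np.array([[0.9, 0.1, 0.0],   # closest to the first label
                      [0.1, 0.2, 0.7]])  # closest to the third label
predictions, probs = zero_shot_predict(image_emb, text_emb, labels)
print(predictions)  # ['แมว', 'นก']
```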

- ### Text-Image retrieval

```python
import faiss
import torch
from PIL import Image
from thai2transformers.preprocess import process_transformers
from transformers import AutoModel, AutoProcessor, CLIPModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load image model and processor.
image_processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
image_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)

# Load text model and processor.
text_processor = AutoProcessor.from_pretrained("openthaigpt/CLIPTextCamembertModelWithProjection", trust_remote_code=True)
text_model = AutoModel.from_pretrained("openthaigpt/CLIPTextCamembertModelWithProjection", trust_remote_code=True).to(device)

# Paired data: images[i] is assumed to match input_text[i]. The paths are placeholders.
input_text = ['แมวสีส้ม', 'หมาสีดำ', 'นกสีขาว']  # orange cat, black dog, white bird
images = [Image.open(path) for path in ["cat.jpg", "dog.jpg", "bird.jpg"]]
processed_input_text = [process_transformers(text) for text in input_text]

with torch.no_grad():
    inputs = text_processor(text=processed_input_text, padding=True, return_tensors="pt")
    inputs = {name: tensor.to(device) for name, tensor in inputs.items()}
    text_embeddings = text_model(**inputs).text_embeds
    text_embeddings /= text_embeddings.norm(dim=1, keepdim=True)

    inputs = image_processor(images=images, return_tensors="pt")
    inputs = {name: tensor.to(device) for name, tensor in inputs.items()}
    image_embeddings = image_model.get_image_features(**inputs)
    image_embeddings /= image_embeddings.norm(dim=1, keepdim=True)

# FAISS works on float32 NumPy arrays, not torch tensors.
text_embeddings = text_embeddings.cpu().numpy().astype("float32")
image_embeddings = image_embeddings.cpu().numpy().astype("float32")

num_pairs, dim = text_embeddings.shape
k = min(5, num_pairs)

# Inner product on normalized vectors equals cosine similarity.
text_index = faiss.IndexFlatIP(dim)
image_index = faiss.IndexFlatIP(dim)
text_index.add(text_embeddings)
image_index.add(image_embeddings)

# Image search recall@k: is image i among the top-k results for text i?
distances, retrieved_indices = image_index.search(text_embeddings, k)
recall_image_search = sum(
    1.0 if i in indices else 0.0 for i, indices in enumerate(retrieved_indices)
) / num_pairs

# Text search recall@k: is text i among the top-k results for image i?
distances, retrieved_indices = text_index.search(image_embeddings, k)
recall_text_search = sum(
    1.0 if i in indices else 0.0 for i, indices in enumerate(retrieved_indices)
) / num_pairs

```
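
The recall@k computation does not depend on FAISS and can be verified on toy data. The following sketch uses plain NumPy; `recall_at_k` is a hypothetical helper, and it assumes, as in the example above, that query `i` and gallery item `i` form the matching pair.

```python
import numpy as np


def recall_at_k(query_emb: np.ndarray, gallery_emb: np.ndarray, k: int) -> float:
    """Fraction of queries whose matching item (same row index)
    appears among the k most similar gallery items."""
    sims = query_emb @ gallery_emb.T
    # Sort each row by descending similarity and keep the first k indices.
    top_k = np.argsort(-sims, axis=1)[:, :k]
    hits = [i in row for i, row in enumerate(top_k)]
    return sum(hits) / len(hits)


text_emb = np.eye(3)
# Gallery in which items 0 and 1 are swapped, so only pair 2 lines up.
image_emb = np.eye(3)[[1, 0, 2]]
print(recall_at_k(text_emb, image_emb, k=1))
print(recall_at_k(text_emb, image_emb, k=3))
```

With `k=1` only one of the three queries retrieves its own pair; with `k=3` every gallery item is returned, so recall is 1.0.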
### Sponsors
[<img src="https://cdn-uploads.huggingface.co/production/uploads/647d9e689bb822b5cd3cc752/FDVgNU6mQ_OW6IsaXI8_r.png" width="725"/>](image.png)

### Authors
* Konthee Boonmeeprakob (konthee1995@gmail.com)
* Norrawich Jitaree (norrawichjitaree@gmail.com)
* Prapawin Sakdapetchsiri (prapawin.sak@gmail.com)
* Sirasit Tanrattanawong (might.la.fr@gmail.com)
* Phumiphat Charoentananuwat (phumiphatcn@gmail.com)
* Punnaruck Khapholdi (punnaruck@gmail.com)
* Isada Sukprapa (isada@nextai.co.th)
* Monthol Charattrakool (anthrax581@gmail.com)
* Peerawat Rojratchadakorn (peerawat.roj@gmail.com)


<br>