LinaAlhuri committed on
Commit
a5ac631
1 Parent(s): 79c52a5

Create README.md

  1. README.md +107 -0
README.md ADDED
@@ -0,0 +1,107 @@
1
+ ---
2
+ language:
3
+ - ar
4
+ pipeline_tag: text-to-image
5
+ ---
6
+
7
+
8
+
9
+
10
+ ## Model Details
11
+
12
+ Arabic CLIP is an adaptation of Contrastive Language-Image Pre-training (CLIP) for the Arabic language. CLIP is a model developed by OpenAI that learns visual concepts from images and relates them to textual descriptions. This work aims to improve the model's understanding and interpretation of visual information in the context of the Arabic language.
13
+ This version of Arabic CLIP is trained in two stages: in the first stage the vision encoder is frozen, and in the second stage it is unfrozen to allow fine-tuning.
14
+
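The two-stage schedule described above can be sketched in PyTorch as follows. This is a minimal illustration and not the authors' training code: the tiny `nn.Linear` modules are hypothetical stand-ins for the real vision and text encoders, the learning rates are assumed values, and it assumes the text encoder is the part being trained while the vision encoder is frozen.

```python
import torch
from torch import nn

# Hypothetical stand-ins for the real encoders (assumption for illustration).
vision_encoder = nn.Linear(8, 4)
text_encoder = nn.Linear(6, 4)


def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Enable or disable gradient updates for every parameter of a module."""
    for param in module.parameters():
        param.requires_grad = trainable


def trainable_parameters(*modules: nn.Module) -> list:
    """Collect only the parameters that currently require gradients."""
    return [p for m in modules for p in m.parameters() if p.requires_grad]


# Stage 1: freeze the vision encoder so only the text encoder is updated.
set_trainable(vision_encoder, False)
stage1_params = trainable_parameters(vision_encoder, text_encoder)
stage1_optimizer = torch.optim.AdamW(stage1_params, lr=1e-4)  # assumed learning rate

# ... the stage 1 training loop would run here ...

# Stage 2: unfreeze the vision encoder and fine-tune both encoders together.
set_trainable(vision_encoder, True)
stage2_params = trainable_parameters(vision_encoder, text_encoder)
stage2_optimizer = torch.optim.AdamW(stage2_params, lr=1e-5)  # assumed learning rate
```

Building a fresh optimizer for the second stage is one simple way to make sure the newly unfrozen parameters are actually updated.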
15
+
16
+ ## Model Use
17
+
18
+
19
+ ```python
20
+
21
+ from transformers import VisionTextDualEncoderModel, AutoTokenizer
22
+ model = VisionTextDualEncoderModel.from_pretrained("LinaAlhuri/clip-bert-lit-two-stages")
23
+ model.save_pretrained("arabic_clip")
24
+
25
+ tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic", cache_dir=None, use_fast=True)
26
+
27
+ ```
28
+
29
+
30
+
31
+ ## Data
32
+
33
+ The training data combines image-text pairs crawled from Wikipedia with widely used existing image datasets such as [CC](https://ai.google.com/research/ConceptualCaptions/). Two obstacles stand out for Arabic multimodal work: Arabic has few data resources, which makes building a large dataset difficult, and datasets translated from well-known public datasets lose quality in translation. Relying on a single source, whether translated or genuine, rarely gives the desired results, because each has its pros and cons. This work therefore aims to build the largest feasible collection of Arabic image-text pairs by merging diverse sources. Genuine datasets contribute rich information that compensates for what is lost in translation, while translated datasets contribute enough pairs to cover a wide range of domains, scenarios, and objects.
34
+
35
+
36
+ | Dataset name | Images |
37
+ | --- | --- |
38
+ |Arabic Conceptual Captions |1,427,210|
39
+ |Arabic COCO 2014 |414,113|
40
+ |Arabic WIT |109,366|
41
+ |Arabic Flickr8K |24,272|
42
+ |Proposed (WAP) dataset |151,252|
43
+ |Total |2,126,213|
44
+
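As a quick sanity check on the table, the per-dataset counts can be summed and each source's share of the total computed. The dictionary below simply copies the figures from the table; the variable names are illustrative.

```python
# Image counts copied from the dataset table.
dataset_sizes = {
    "Arabic Conceptual Captions": 1_427_210,
    "Arabic COCO 2014": 414_113,
    "Arabic WIT": 109_366,
    "Arabic Flickr8K": 24_272,
    "Proposed (WAP) dataset": 151_252,
}

# The sum should match the "Total" row of the table.
total = sum(dataset_sizes.values())

# Share of each source in the merged collection, in percent.
shares = {name: 100.0 * count / total for name, count in dataset_sizes.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
print(f"Total: {total:,}")
```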
45
+
46
+ ## Performance
47
+
48
+ We evaluated Arabic CLIP on benchmarks for zero-shot learning, image retrieval, localization, and image search, using the following datasets:
49
+ - Conceptual Captions
50
+ - COCO
51
+ - ImageNet
52
+ - Unsplash
53
+
54
+ ### Zero-shot Learning
55
+
56
+ | Multilingual CLIP| Top 1 | Top 5 | Top 10 | Top 100 |
57
+ |-----------------------|---------|---------|---------|---------|
58
+ | **Short translation** | 10.10 | 21.99 | 26.70 | 47.57 |
59
+ | **Long translation** | 9.518 | 20.942 | 25.54 | 45.59 |
60
+
61
+
62
+ | Two-stages Arabic CLIP| Top 1 | Top 5 | Top 10 | Top 100 |
63
+ |----------------------|---------|---------|---------|---------|
64
+ | **Short translation** | 17.28 | 36.97 | 45.43 | 73.85 |
65
+ | **Long translation** | 15.52 | 34.74 | 43.49 | 72.28 |
66
+
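For readers unfamiliar with the Top-k columns, the sketch below shows how a top-k accuracy can be computed from ranked predictions. It assumes the table values are percentages and uses toy labels; it is not the evaluation script behind the reported numbers.

```python
def top_k_accuracy(ranked_labels, true_labels, k):
    """Percentage of examples whose true label is among the first k predictions."""
    hits = sum(
        1 for ranked, truth in zip(ranked_labels, true_labels) if truth in ranked[:k]
    )
    return 100.0 * hits / len(true_labels)


# Toy example: three images, each with labels ranked from most to least similar.
ranked = [
    ["cat", "dog", "bird"],  # true label at rank 1
    ["dog", "cat", "bird"],  # true label at rank 2
    ["bird", "dog", "cat"],  # true label at rank 3
]
truth = ["cat", "cat", "cat"]

top1 = top_k_accuracy(ranked, truth, k=1)
top2 = top_k_accuracy(ranked, truth, k=2)
top3 = top_k_accuracy(ranked, truth, k=3)
```

In a zero-shot setting, the ranking for each image would come from sorting the class-name text embeddings by similarity to the image embedding.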
67
+
68
+
69
+
70
+ ### Image Retrieval
71
+ #### Conceptual Captions Evaluation
72
+
73
+
74
+ | Metric | MCLIP | Two-stages Arabic CLIP |
75
+ |---------|-------|-------------------------|
76
+ | **MRR@1** | 0.064 | 0.151 |
77
+ | **MRR@5** | 0.093 | 0.215 |
78
+ | **MRR@10** | 0.100 | 0.227 |
79
+
80
+
81
+
82
+ #### COCO Evaluation
83
+
84
+ | Metric | MCLIP | Two-stages Arabic CLIP |
85
+ |---------|-------|-------------------------|
86
+ | **MRR@1** | 0.043 | 0.062 |
87
+ | **MRR@5** | 0.068 | 0.097 |
88
+ | **MRR@10** | 0.074 | 0.106 |
89
+
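The MRR@k metric used in the retrieval tables can be sketched as below, assuming each query has exactly one relevant item and that items ranked below position k contribute zero. The identifiers and toy data are illustrative and not taken from the evaluation code.

```python
def mrr_at_k(ranked_ids, relevant_ids, k):
    """Mean reciprocal rank, counting only hits within the top k results."""
    total = 0.0
    for ranked, relevant in zip(ranked_ids, relevant_ids):
        for rank, candidate in enumerate(ranked[:k], start=1):
            if candidate == relevant:
                total += 1.0 / rank
                break  # only the first (and only) relevant item counts
    return total / len(relevant_ids)


# Toy example: three text queries, each with one correct image id.
ranked = [
    ["img_a", "img_b", "img_c"],  # correct image at rank 1
    ["img_b", "img_a", "img_c"],  # correct image at rank 2
    ["img_b", "img_c", "img_d"],  # correct image not retrieved
]
relevant = ["img_a", "img_a", "img_a"]

mrr1 = mrr_at_k(ranked, relevant, k=1)
mrr3 = mrr_at_k(ranked, relevant, k=3)
```

Under this definition MRR@1 equals the fraction of queries whose correct item is ranked first, and the score can only grow as k increases, which matches the pattern in the tables.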
90
+
91
+
92
+
93
+
94
+
95
+
96
+ ## Limitations
97
+ The main limitations are:
98
+ - Arabic CLIP struggles to count beyond 3.
99
+ - Limited genuine samples for the Arabic language.
100
+ - Noise and bias may be present in Arabic CLIP, because no studies have yet examined these issues in the published Arabic datasets or Arabic language models.
101
+
102
+ ### Bias
103
+ Regarding gender bias, note that Arabic uses a two-gender system in which all nouns are classified as masculine or feminine.
104
+ English has no such system, so translating text from English to Arabic may lose information or introduce gender bias.
105
+
106
+
107
+