gmastrapas committed
Commit b8b8f72
1 Parent(s): 2d6a2ce

feat: push last checkpoint

.gitattributes CHANGED
@@ -34,3 +34,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
  tokenizer.json filter=lfs diff=lfs merge=lfs -text
+ *.original filter=lfs diff=lfs merge=lfs -text
+ onnx/model.onnx_data filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,13 +1,106 @@
+ ---
+ library_name: transformers
+ license: cc-by-nc-4.0
+ tags:
+ - xlm-roberta
+ - eva02
+ - clip
+ - feature-extraction
+ - sentence-similarity
+ - retrieval
+ - multimodal
+ - multi-modal
+ - crossmodal
+ - cross-modal
+ - mteb
+ - clip-benchmark
+ - vidore
+ - transformers
+ - sentence-transformers
+ - onnx
+ - safetensors
+ - transformers.js
+ language:
+ - multilingual
+ - ar
+ - bn
+ - da
+ - de
+ - el
+ - en
+ - es
+ - fi
+ - fr
+ - hi
+ - id
+ - it
+ - ja
+ - ka
+ - ko
+ - lv
+ - nl
+ - no
+ - pl
+ - pt
+ - ro
+ - ru
+ - sk
+ - sv
+ - th
+ - tr
+ - uk
+ - ur
+ - vi
+ - zh
+ inference: false
+ ---
+
+ <br><br>
+
+ <p align="center">
+ <img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
+ </p>
+
+
+ <p align="center">
+ <b>The embedding set trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
+ </p>
+
+ <p align="center">
+ <b>Jina CLIP: your CLIP model is also your text retriever!</b>
+ </p>
+
+
+ ## Intended Usage & Model Info
+
+ `jina-clip-v2` is a state-of-the-art **multilingual and multimodal (text-image) embedding model**.
+
+ `jina-clip-v2` is the successor to the [`jina-clip-v1`](https://huggingface.co/jinaai/jina-clip-v1) model and brings new features and capabilities, such as:
+ * *support for multiple languages* - the text tower now supports 30 languages, including `en`, `zh`, `de`, `ar`, `hi`, `es`
+ * *embedding truncation on both image and text vectors* - both towers are trained using [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147), which makes it possible to slice the output vectors and, as a result, reduce computation and storage costs
+ * *visual document retrieval performance boost* - with an image resolution of 384 (compared to 224 on `jina-clip-v1`), the image tower can now capture finer visual details. This, along with a more diverse training set, enables the model to perform much better on visual document retrieval tasks, as evidenced by the performance gains over `jina-clip-v1` on the [ViDoRe Benchmark](https://huggingface.co/spaces/vidore/vidore-leaderboard)
+
+ Like its predecessor, `jina-clip-v2` bridges the gap between text-to-text and cross-modal retrieval. Via a single vector space, `jina-clip-v2` offers state-of-the-art performance on both tasks.
+ This dual capability makes it an excellent tool for multimodal retrieval-augmented generation (MuRAG) applications, enabling seamless text-to-text and text-to-image searches within a single model.
+
+
+ ## Data & Parameters
+
+ [Check out our paper](https://arxiv.org/abs/2405.20204). An updated technical report for v2 is coming soon!
+
  ## Usage
- Similar usage to jina-clip-v1, the only difference is that it can use matryoshka embeddings, through the truncate_dim argument on the encode_text and encode_image features
+
+ 1. The easiest way to start using jina-clip-v2 is via Jina AI's [Embeddings API](https://jina.ai/embeddings/).
+ 2. Alternatively, you can use the model directly via the transformers/sentence-transformers packages.
+
  ```python
- !pip install transformers einops timm pillow
+ # !pip install transformers einops timm pillow
  from transformers import AutoModel

  # Initialize the model
- model = AutoModel.from_pretrained('jinaai/jina-clip-v2-test', trust_remote_code=True)
+ model = AutoModel.from_pretrained('jinaai/jina-clip-v2', trust_remote_code=True)

- # New meaningful sentences
+ # Sentences
  sentences = ['A blue cat', 'A red cat']

  # Public image URLs
@@ -16,10 +109,12 @@ image_urls = [
  'https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg'
  ]

+ # Choose a matryoshka dimension, set to None to get the full 1024-dim vectors
+ truncate_dim = 512
+
  # Encode text and images
- truncate = 512
- text_embeddings = model.encode_text(sentences, truncate_dim = truncate)
- image_embeddings = model.encode_image(image_urls, truncate_dim = truncate) # also accepts PIL.image, local filenames, dataURI
+ text_embeddings = model.encode_text(sentences, truncate_dim=truncate_dim)
+ image_embeddings = model.encode_image(image_urls, truncate_dim=truncate_dim) # also accepts PIL.image, local filenames, dataURI

  # Compute similarities
  print(text_embeddings[0] @ text_embeddings[1].T) # text embedding similarity
@@ -28,3 +123,125 @@ print(text_embeddings[0] @ image_embeddings[1].T) # text-image cross-modal simil
  print(text_embeddings[1] @ image_embeddings[0].T) # text-image cross-modal similarity
  print(text_embeddings[1] @ image_embeddings[1].T) # text-image cross-modal similarity
  ```
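
The `truncate_dim` argument in the block above slices the Matryoshka-trained output vectors to the requested size. For illustration, a minimal sketch of equivalent manual slicing, assuming `text_embeddings` and `image_embeddings` come back as unit-normalized numpy arrays; the re-normalization after slicing is an assumption commonly used with Matryoshka embeddings, not something documented here:

```python
import numpy as np

def truncate_embeddings(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# e.g. compare similarities using 256-dim slices instead of the full 1024 dims
text_small = truncate_embeddings(text_embeddings, 256)
image_small = truncate_embeddings(image_embeddings, 256)
print(text_small[0] @ image_small[0].T)  # cross-modal similarity at 256 dims
```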
+
+ or via sentence-transformers:
+
+ ```python
+ # !pip install sentence-transformers
+ from sentence_transformers import SentenceTransformer
+
+ # Initialize the model
+ model = SentenceTransformer('jinaai/jina-clip-v2', trust_remote_code=True)
+
+ # Sentences
+ sentences = ['A blue cat', 'A red cat']
+
+ # Public image URLs
+ image_urls = [
+     'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg',
+     'https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg'
+ ]
+
+ text_embeddings = model.encode(sentences)
+ image_embeddings = model.encode(image_urls)
+ ```
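
If you want truncated (Matryoshka) vectors through this interface as well, recent sentence-transformers releases accept a `truncate_dim` argument in the constructor; whether it applies cleanly to this remote-code model is an assumption worth verifying on your setup:

```python
from sentence_transformers import SentenceTransformer

# assumes a sentence-transformers version that supports truncate_dim (>= 2.7)
model_512 = SentenceTransformer('jinaai/jina-clip-v2', trust_remote_code=True, truncate_dim=512)
text_embeddings_512 = model_512.encode(['A blue cat', 'A red cat'])
```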
+
+ JavaScript developers can use Jina CLIP via the [transformers.js](https://huggingface.co/docs/transformers.js) library. Note that to use this model, you need to install transformers.js [v3](https://github.com/xenova/transformers.js/tree/v3) from source using `npm install xenova/transformers.js#v3`.
+
+ ```js
+ import { AutoTokenizer, CLIPTextModelWithProjection, AutoProcessor, CLIPVisionModelWithProjection, RawImage, cos_sim } from '@xenova/transformers';
+
+ // Load tokenizer and text model
+ const tokenizer = await AutoTokenizer.from_pretrained('jinaai/jina-clip-v2');
+ const text_model = await CLIPTextModelWithProjection.from_pretrained('jinaai/jina-clip-v2');
+
+ // Load processor and vision model
+ const processor = await AutoProcessor.from_pretrained('Xenova/clip-vit-base-patch32');
+ const vision_model = await CLIPVisionModelWithProjection.from_pretrained('jinaai/jina-clip-v2');
+
+ // Run tokenization
+ const texts = ['A blue cat', 'A red cat'];
+ const text_inputs = tokenizer(texts, { padding: true, truncation: true });
+
+ // Compute text embeddings
+ const { text_embeds } = await text_model(text_inputs);
+
+ // Read images and run processor
+ const urls = [
+     'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg',
+     'https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg'
+ ];
+ const image = await Promise.all(urls.map(url => RawImage.read(url)));
+ const image_inputs = await processor(image);
+
+ // Compute vision embeddings
+ const { image_embeds } = await vision_model(image_inputs);
+
+ // Compute similarities
+ console.log(cos_sim(text_embeds[0].data, text_embeds[1].data)) // text embedding similarity
+ console.log(cos_sim(text_embeds[0].data, image_embeds[0].data)) // text-image cross-modal similarity
+ console.log(cos_sim(text_embeds[0].data, image_embeds[1].data)) // text-image cross-modal similarity
+ console.log(cos_sim(text_embeds[1].data, image_embeds[0].data)) // text-image cross-modal similarity
+ console.log(cos_sim(text_embeds[1].data, image_embeds[1].data)) // text-image cross-modal similarity
+ ```
+
+ ## Performance
+
+ ### Text-Image Retrieval
+
+ Coming soon!
+
+ ### Text-Text Retrieval
+
+ Coming soon!
+
+ ## Contact
+
+ Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.
+
+ ## Citation
+
+ If you find `jina-clip-v2` useful in your research, please cite the following paper:
+
+ ```bibtex
+ @misc{2405.20204,
+ Author = {Andreas Koukounas and Georgios Mastrapas and Michael Günther and Bo Wang and Scott Martens and Isabelle Mohr and Saba Sturua and Mohammad Kalim Akram and Joan Fontanals Martínez and Saahil Ognawala and Susana Guzman and Maximilian Werk and Nan Wang and Han Xiao},
+ Title = {Jina CLIP: Your CLIP Model Is Also Your Text Retriever},
+ Year = {2024},
+ Eprint = {arXiv:2405.20204},
+ }
+ ```
+
+ ## FAQ
+
+ ### I encountered this error, what should I do?
+
+ ```
+ ValueError: The model class you are passing has a `config_class` attribute that is not consistent with the config class you passed (model has <class 'transformers_modules.jinaai.jina-clip-implementation.7f069e2d54d609ef1ad2eb578c7bf07b5a51de41.configuration_clip.JinaCLIPConfig'> and you passed <class 'transformers_modules.jinaai.jina-clip-implementation.7f069e2d54d609ef1ad2eb578c7bf07b5a51de41.configuration_cli.JinaCLIPConfig'>. Fix one of those so they match!
+ ```
+
+ There was a bug in the `transformers` library between versions 4.40.x and 4.41.1. You can upgrade `transformers` to a version >4.41.2 or downgrade to <=4.40.0.
+
+ ### Given one query, how can I merge its text-text and text-image cosine similarity?
+
+ Our empirical study shows that text-text cosine similarity is normally larger than text-image cosine similarity!
+ If you want to merge the two scores, we recommend two approaches:
+
+ 1. Weighted average of the text-text and text-image similarities:
+
+ ```python
+ combined_scores = sim(text, text) + lambda * sim(text, image) # optimal lambda depends on your dataset, but in general lambda=2 can be a good choice.
+ ```
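
As a rough, runnable illustration of this weighted average, assuming unit-normalized `text_embeddings` and `image_embeddings` as computed in the usage section above (the helper name and the default lambda value are assumptions, not an official recipe):

```python
import numpy as np

def merged_score(query_emb, doc_text_emb, doc_image_emb, lam: float = 2.0) -> float:
    """Combine text-text and text-image cosine similarities for one query-document pair."""
    text_text_sim = float(query_emb @ doc_text_emb)    # dot product == cosine for unit vectors
    text_image_sim = float(query_emb @ doc_image_emb)
    return text_text_sim + lam * text_image_sim

score = merged_score(text_embeddings[0], text_embeddings[1], image_embeddings[1])
print(score)
```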
+
+ 2. Apply z-score normalization before merging the scores:
+
+ ```python
+ # pseudo code
+ query_document_mean = np.mean(cos_sim_query_documents)
+ query_document_std = np.std(cos_sim_query_documents)
+ text_image_mean = np.mean(cos_sim_text_images)
+ text_image_std = np.std(cos_sim_text_images)
+
+ query_document_sim_normalized = (cos_sim_query_documents - query_document_mean) / query_document_std
+ text_image_sim_normalized = (cos_sim_text_images - text_image_mean) / text_image_std
+
+ combined_scores = query_document_sim_normalized + text_image_sim_normalized
+ ```
config.json CHANGED
@@ -1,6 +1,4 @@
  {
- "_commit_hash": null,
- "_name_or_path": "jinaai/jina-clip-v2-test",
  "add_projections": false,
  "architectures": [
  "JinaCLIPModel"
@@ -11,166 +9,55 @@
  },
  "initializer_factor": 1.0,
  "logit_scale_init_value": 2.6592,
+ "matryoshka_dimensions": [32, 64, 128, 256, 512, 768, 1024],
  "model_type": "jina_clip",
  "projection_dim": 1024,
- "matryoshka_dimensions": [32, 64, 128, 256, 512, 768, 1024],
  "text_config": {
- "_name_or_path": "",
- "add_cross_attention": false,
- "architectures": null,
- "bad_words_ids": null,
- "begin_suppress_tokens": null,
- "bos_token_id": null,
- "chunk_size_feed_forward": 0,
- "cross_attention_hidden_size": null,
- "decoder_start_token_id": null,
- "diversity_penalty": 0.0,
- "do_sample": false,
- "early_stopping": false,
+ "default_instruction_task": null,
+ "default_lora_task": "retrieval.query",
  "embed_dim": 1024,
- "encoder_no_repeat_ngram_size": 0,
- "eos_token_id": null,
- "exponential_decay_length_penalty": null,
- "finetuning_task": null,
- "forced_bos_token_id": null,
- "forced_eos_token_id": null,
  "hf_model_config_kwargs": {
+ "load_trained_adapters": false,
+ "lora_adaptations": [
+ "retrieval.query"
+ ],
+ "lora_alpha": 4,
+ "lora_dropout_p": 0.0,
+ "lora_main_params_trainable": false,
+ "lora_rank": 4,
+ "task_instructions": {
+ "retrieval.query": "Represent the query for retrieving evidence documents: "
+ },
  "use_flash_attn": false
  },
- "hf_model_name_or_path": "jinaai/jina-xlm-roberta-large-rope-8k",
- "id2label": {
- "0": "LABEL_0",
- "1": "LABEL_1"
- },
- "is_decoder": false,
- "is_encoder_decoder": false,
- "label2id": {
- "LABEL_0": 0,
- "LABEL_1": 1
- },
- "length_penalty": 1.0,
- "max_length": 20,
- "min_length": 0,
+ "hf_model_name_or_path": "jinaai/jina-embeddings-v3",
  "model_type": "jina_clip_text",
- "no_repeat_ngram_size": 0,
- "num_beam_groups": 1,
- "num_beams": 1,
- "num_return_sequences": 1,
- "output_attentions": false,
- "output_hidden_states": false,
- "output_scores": false,
- "pad_token_id": null,
  "pooler_type": "mean_pooler",
- "prefix": null,
- "problem_type": null,
  "proj_bias": false,
- "proj_type": null,
- "pruned_heads": {},
- "remove_invalid_values": false,
- "repetition_penalty": 1.0,
- "return_dict": true,
- "return_dict_in_generate": false,
- "sep_token_id": null,
- "suppress_tokens": null,
- "task_specific_params": null,
- "temperature": 1.0,
- "tf_legacy_loss": false,
- "tie_encoder_decoder": false,
- "tie_word_embeddings": true,
- "tokenizer_class": null,
- "top_k": 50,
- "top_p": 1.0,
- "torch_dtype": null,
- "torchscript": false,
- "transformers_version": "4.42.4",
- "typical_p": 1.0,
- "use_bfloat16": false
+ "proj_type": null
  },
- "torch_dtype": "float16",
- "transformers_version": null,
+ "truncate_dim": null,
  "use_text_flash_attn": null,
  "use_vision_xformers": null,
  "vision_config": {
- "_name_or_path": "",
- "add_cross_attention": false,
- "architectures": null,
- "bad_words_ids": null,
- "begin_suppress_tokens": null,
- "bos_token_id": null,
- "chunk_size_feed_forward": 0,
- "cross_attention_hidden_size": null,
- "decoder_start_token_id": null,
- "diversity_penalty": 0.0,
- "do_sample": false,
- "drop_path_rate": 0.0,
- "early_stopping": false,
  "embed_dim": 1024,
- "encoder_no_repeat_ngram_size": 0,
- "eos_token_id": null,
- "exponential_decay_length_penalty": null,
- "finetuning_task": null,
- "forced_bos_token_id": null,
- "forced_eos_token_id": null,
  "fused_layer_norm": false,
  "head_width": 64,
- "id2label": {
- "0": "LABEL_0",
- "1": "LABEL_1"
- },
  "image_size": 384,
  "intp_freq": true,
- "is_decoder": false,
- "is_encoder_decoder": false,
- "label2id": {
- "LABEL_0": 0,
- "LABEL_1": 1
- },
  "layers": 24,
- "length_penalty": 1.0,
  "ls_init_value": null,
- "max_length": 20,
- "min_length": 0,
  "mlp_ratio": 2.6667,
  "model_type": "jina_clip_vision",
  "naive_swiglu": true,
- "no_repeat_ngram_size": 0,
- "num_beam_groups": 1,
- "num_beams": 1,
- "num_return_sequences": 1,
- "output_attentions": false,
- "output_hidden_states": false,
- "output_scores": false,
- "pad_token_id": null,
  "patch_dropout": 0.1,
  "patch_size": 14,
  "post_norm": false,
- "prefix": null,
- "problem_type": null,
  "proj_type": null,
- "pruned_heads": {},
  "pt_hw_seq_len": 16,
  "qkv_bias": true,
- "remove_invalid_values": false,
- "repetition_penalty": 1.0,
- "return_dict": true,
- "return_dict_in_generate": false,
  "rope_embeddings": true,
- "sep_token_id": null,
  "subln": true,
- "suppress_tokens": null,
- "task_specific_params": null,
- "temperature": 1.0,
- "tf_legacy_loss": false,
- "tie_encoder_decoder": false,
- "tie_word_embeddings": true,
- "tokenizer_class": null,
- "top_k": 50,
- "top_p": 1.0,
- "torch_dtype": null,
- "torchscript": false,
- "transformers_version": "4.42.4",
- "typical_p": 1.0,
- "use_bfloat16": false,
  "width": 1024,
  "x_attention": false
  }
custom_st.py CHANGED
@@ -2,7 +2,7 @@ import base64
  import json
  import os
  from io import BytesIO
- from typing import Any, Dict, List, Optional, Tuple, Union
+ from typing import Any, Dict, List, Optional, Union

  import requests
  import torch
@@ -45,7 +45,7 @@ class Transformer(nn.Module):
  tokenizer_name_or_path: str = None,
  ) -> None:
  super(Transformer, self).__init__()
- self.config_keys = ["max_seq_length", "do_lower_case"]
+ self.config_keys = ['max_seq_length', 'do_lower_case']
  self.do_lower_case = do_lower_case
  if model_args is None:
  model_args = {}
@@ -60,9 +60,8 @@ class Transformer(nn.Module):
  self.jina_clip = AutoModel.from_pretrained(
  model_name_or_path, config=config, cache_dir=cache_dir, **model_args
  )
-
- if max_seq_length is not None and "model_max_length" not in tokenizer_args:
- tokenizer_args["model_max_length"] = max_seq_length
+ if max_seq_length is not None and 'model_max_length' not in tokenizer_args:
+ tokenizer_args['model_max_length'] = max_seq_length
  self.tokenizer = AutoTokenizer.from_pretrained(
  (
  tokenizer_name_or_path
@@ -85,9 +84,9 @@ class Transformer(nn.Module):
  # No max_seq_length set. Try to infer from model
  if max_seq_length is None:
  if (
- hasattr(self.jina_clip, "config")
- and hasattr(self.jina_clip.config, "max_position_embeddings")
- and hasattr(self.tokenizer, "model_max_length")
+ hasattr(self.jina_clip, 'config')
+ and hasattr(self.jina_clip.config, 'max_position_embeddings')
+ and hasattr(self.tokenizer, 'model_max_length')
  ):
  max_seq_length = min(
  self.jina_clip.config.max_position_embeddings,
@@ -99,23 +98,22 @@ class Transformer(nn.Module):
  if tokenizer_name_or_path is not None:
  self.jina_clip.config.tokenizer_class = self.tokenizer.__class__.__name__

- def forward(
- self, features: Dict[str, torch.Tensor]
- ) -> Dict[str, torch.Tensor]:
+ def forward(self, features: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
  """Returns token_embeddings, cls_token"""
- if "input_ids" in features:
+ if 'input_ids' in features:
  embedding = self.jina_clip.get_text_features(
- input_ids=features["input_ids"]
+ input_ids=features['input_ids']
  )
  else:
  embedding = self.jina_clip.get_image_features(
- pixel_values=features["pixel_values"]
+ pixel_values=features['pixel_values']
  )
- return {"sentence_embedding": embedding}
+ return {'sentence_embedding': embedding}

  def get_word_embedding_dimension(self) -> int:
  return self.config.text_config.embed_dim

+ @staticmethod
  def decode_data_image(data_image_str):
  header, data = data_image_str.split(',', 1)
  image_data = base64.b64decode(data)
@@ -135,10 +133,10 @@ class Transformer(nn.Module):
  elif sample.startswith('data:image/'):
  images.append(self.decode_data_image(sample).convert('RGB'))
  else:
- # TODO: Make sure that Image.open fails for non-image files
  try:
  images.append(Image.open(sample).convert('RGB'))
- except:
+ except Exception as e:
+ _ = str(e)
  texts.append(sample)
  elif isinstance(sample, Image.Image):
  images.append(sample.convert('RGB'))
@@ -150,8 +148,8 @@ class Transformer(nn.Module):
  return self.tokenizer(
  texts,
  padding=padding,
- truncation="longest_first",
- return_tensors="pt",
+ truncation='longest_first',
+ return_tensors='pt',
  max_length=self.max_seq_length,
  )
  elif images:
@@ -166,16 +164,16 @@ class Transformer(nn.Module):
  self.preprocessor.save_pretrained(output_path)

  @staticmethod
- def load(input_path: str) -> "Transformer":
+ def load(input_path: str) -> 'Transformer':
  # Old classes used other config names than 'sentence_bert_config.json'
  for config_name in [
- "sentence_bert_config.json",
- "sentence_roberta_config.json",
- "sentence_distilbert_config.json",
- "sentence_camembert_config.json",
- "sentence_albert_config.json",
- "sentence_xlm-roberta_config.json",
- "sentence_xlnet_config.json",
+ 'sentence_bert_config.json',
+ 'sentence_roberta_config.json',
+ 'sentence_distilbert_config.json',
+ 'sentence_camembert_config.json',
+ 'sentence_albert_config.json',
+ 'sentence_xlm-roberta_config.json',
+ 'sentence_xlnet_config.json',
  ]:
  sbert_config_path = os.path.join(input_path, config_name)
  if os.path.exists(sbert_config_path):
@@ -183,14 +181,16 @@ class Transformer(nn.Module):

  with open(sbert_config_path) as fIn:
  config = json.load(fIn)
+
  # Don't allow configs to set trust_remote_code
- if "model_args" in config and "trust_remote_code" in config["model_args"]:
- config["model_args"].pop("trust_remote_code")
+ if 'model_args' in config and 'trust_remote_code' in config['model_args']:
+ config['model_args'].pop('trust_remote_code')
  if (
- "tokenizer_args" in config
- and "trust_remote_code" in config["tokenizer_args"]
+ 'tokenizer_args' in config
+ and 'trust_remote_code' in config['tokenizer_args']
  ):
- config["tokenizer_args"].pop("trust_remote_code")
- if "config_args" in config and "trust_remote_code" in config["config_args"]:
- config["config_args"].pop("trust_remote_code")
- return Transformer(model_name_or_path=input_path, **config)
+ config['tokenizer_args'].pop('trust_remote_code')
+ if 'config_args' in config and 'trust_remote_code' in config['config_args']:
+ config['config_args'].pop('trust_remote_code')
+
+ return Transformer(model_name_or_path=input_path, **config)
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a753294ed5d3d6dc4ae43f784824cdc3a6cbb7e8a815bff2ab200a3f411141a0
+ size 1729527426
modules.json CHANGED
@@ -1,14 +1,14 @@
  [
  {
- "idx":0,
- "name":"0",
- "path":"",
- "type":"custom_st.Transformer"
+ "idx": 0,
+ "name": "0",
+ "path": "",
+ "type": "custom_st.Transformer"
  },
  {
- "idx":2,
- "name":"2",
- "path":"2_Normalize",
- "type":"sentence_transformers.models.Normalize"
+ "idx": 2,
+ "name": "2",
+ "path": "2_Normalize",
+ "type": "sentence_transformers.models.Normalize"
  }
  ]
preprocessor_config.json CHANGED
@@ -19,4 +19,4 @@
  0.26130258,
  0.27577711
  ]
- }
+ }
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:24141c2796fdf99890be60711ee6f96978be3667d49d199ae22ae9dc51bfc951
- size 1724686494
+ oid sha256:7dcfd3e9d325dd8a59bbce810b59be028f41fc5c6a478e4cc9b5ba0701f61004
+ size 1729735014
tokenizer.json CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:3a56def25aa40facc030ea8b0b87f3688e4b3c39eb8b45d5702b3a1300fe2a20
- size 17082734
+ oid sha256:6601c4120779a1a3863897ba332fe3481d548e363bec2c91eba10ef8640a5e93
+ size 17082997