gmastrapas committed
Commit b8b8f72
Parent(s): 2d6a2ce

feat: push last checkpoint

Files changed:
- .gitattributes +2 -0
- README.md +224 -7
- config.json +17 -130
- custom_st.py +35 -35
- model.safetensors +3 -0
- modules.json +8 -8
- preprocessor_config.json +1 -1
- pytorch_model.bin +2 -2
- tokenizer.json +2 -2
.gitattributes CHANGED
@@ -34,3 +34,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 tokenizer.json filter=lfs diff=lfs merge=lfs -text
+*.original filter=lfs diff=lfs merge=lfs -text
+onnx/model.onnx_data filter=lfs diff=lfs merge=lfs -text
README.md CHANGED

---
library_name: transformers
license: cc-by-nc-4.0
tags:
- xlm-roberta
- eva02
- clip
- feature-extraction
- sentence-similarity
- retrieval
- multimodal
- multi-modal
- crossmodal
- cross-modal
- mteb
- clip-benchmark
- vidore
- transformers
- sentence-transformers
- onnx
- safetensors
- transformers.js
language:
- multilingual
- ar
- bn
- da
- de
- el
- en
- es
- fi
- fr
- hi
- id
- it
- ja
- ka
- ko
- lv
- nl
- no
- pl
- pt
- ro
- ru
- sk
- sv
- th
- tr
- uk
- ur
- vi
- zh
inference: false
---

<br><br>

<p align="center">
<img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
</p>

<p align="center">
<b>The embedding set trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
</p>

<p align="center">
<b>Jina CLIP: your CLIP model is also your text retriever!</b>
</p>

## Intended Usage & Model Info

`jina-clip-v2` is a state-of-the-art **multilingual and multimodal (text-image) embedding model**.

`jina-clip-v2` is a successor to the [`jina-clip-v1`](https://huggingface.co/jinaai/jina-clip-v1) model and brings new features and capabilities, such as:
* *support for multiple languages* - the text tower now supports 30 languages, including `en`, `zh`, `de`, `ar`, `hi` and `es`
* *embedding truncation on both image and text vectors* - both towers are trained using [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147), which makes it possible to slice the output vectors and, as a result, cut computation and storage costs (see the sketch after this list)
* *visual document retrieval performance boost* - with an image resolution of 384 (compared to 224 on `jina-clip-v1`), the image tower can now capture finer visual details. This feature, along with a more diverse training set, enables the model to perform much better on visual document retrieval tasks, as is evident from the performance gains on the [ViDoRe Benchmark](https://huggingface.co/spaces/vidore/vidore-leaderboard) compared to `jina-clip-v1`
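
As a concrete illustration of the Matryoshka truncation described in the list above, the sketch below slices a full 1024-dimensional embedding down to a smaller dimension and re-normalizes it with numpy. The `full_embedding` placeholder stands in for a vector returned by the `encode_text` / `encode_image` calls shown in the Usage section, where the same effect is available via the `truncate_dim` argument.

```python
import numpy as np

# Placeholder for a full 1024-dim embedding, e.g. model.encode_text(['A blue cat'])[0]
full_embedding = np.random.rand(1024).astype(np.float32)

dim = 256  # any of the trained matryoshka dimensions: 32, 64, 128, 256, 512, 768, 1024
truncated = full_embedding[:dim]
truncated = truncated / np.linalg.norm(truncated)  # re-normalize before computing cosine similarities
```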

Like its predecessor, `jina-clip-v2` bridges the gap between text-to-text and cross-modal retrieval. Via a single vector space, `jina-clip-v2` offers state-of-the-art performance on both tasks.
This dual capability makes it an excellent tool for multimodal retrieval-augmented generation (MuRAG) applications, enabling seamless text-to-text and text-to-image searches within a single model.

## Data & Parameters

[Check out our paper](https://arxiv.org/abs/2405.20204). An updated technical report for v2 is coming soon!

## Usage

1. The easiest way to start using jina-clip-v2 is via Jina AI's [Embeddings API](https://jina.ai/embeddings/).
2. Alternatively, you can use the model directly via the `transformers` or `sentence-transformers` packages.

```python
# !pip install transformers einops timm pillow
from transformers import AutoModel

# Initialize the model
model = AutoModel.from_pretrained('jinaai/jina-clip-v2', trust_remote_code=True)

# Sentences
sentences = ['A blue cat', 'A red cat']

# Public image URLs
image_urls = [
    'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg',
    'https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg'
]

# Choose a matryoshka dimension, set to None to get the full 1024-dim vectors
truncate_dim = 512

# Encode text and images
text_embeddings = model.encode_text(sentences, truncate_dim=truncate_dim)
image_embeddings = model.encode_image(image_urls, truncate_dim=truncate_dim)  # also accepts PIL.Image, local filenames, dataURI

# Compute similarities
print(text_embeddings[0] @ text_embeddings[1].T)   # text embedding similarity
print(text_embeddings[0] @ image_embeddings[0].T)  # text-image cross-modal similarity
print(text_embeddings[0] @ image_embeddings[1].T)  # text-image cross-modal similarity
print(text_embeddings[1] @ image_embeddings[0].T)  # text-image cross-modal similarity
print(text_embeddings[1] @ image_embeddings[1].T)  # text-image cross-modal similarity
```

or via sentence-transformers:

```python
# !pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# Initialize the model
model = SentenceTransformer('jinaai/jina-clip-v2', trust_remote_code=True)

# Sentences
sentences = ['A blue cat', 'A red cat']

# Public image URLs
image_urls = [
    'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg',
    'https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg'
]

text_embeddings = model.encode(sentences)
image_embeddings = model.encode(image_urls)
```

JavaScript developers can use Jina CLIP via the [transformers.js](https://huggingface.co/docs/transformers.js) library. Note that to use this model, you need to install transformers.js [v3](https://github.com/xenova/transformers.js/tree/v3) from source using `npm install xenova/transformers.js#v3`.

```js
import { AutoTokenizer, CLIPTextModelWithProjection, AutoProcessor, CLIPVisionModelWithProjection, RawImage, cos_sim } from '@xenova/transformers';

// Load tokenizer and text model
const tokenizer = await AutoTokenizer.from_pretrained('jinaai/jina-clip-v2');
const text_model = await CLIPTextModelWithProjection.from_pretrained('jinaai/jina-clip-v2');

// Load processor and vision model
const processor = await AutoProcessor.from_pretrained('Xenova/clip-vit-base-patch32');
const vision_model = await CLIPVisionModelWithProjection.from_pretrained('jinaai/jina-clip-v2');

// Run tokenization
const texts = ['A blue cat', 'A red cat'];
const text_inputs = tokenizer(texts, { padding: true, truncation: true });

// Compute text embeddings
const { text_embeds } = await text_model(text_inputs);

// Read images and run processor
const urls = [
    'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg',
    'https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg'
];
const image = await Promise.all(urls.map(url => RawImage.read(url)));
const image_inputs = await processor(image);

// Compute vision embeddings
const { image_embeds } = await vision_model(image_inputs);

// Compute similarities
console.log(cos_sim(text_embeds[0].data, text_embeds[1].data))   // text embedding similarity
console.log(cos_sim(text_embeds[0].data, image_embeds[0].data))  // text-image cross-modal similarity
console.log(cos_sim(text_embeds[0].data, image_embeds[1].data))  // text-image cross-modal similarity
console.log(cos_sim(text_embeds[1].data, image_embeds[0].data))  // text-image cross-modal similarity
console.log(cos_sim(text_embeds[1].data, image_embeds[1].data))  // text-image cross-modal similarity
```

## Performance

### Text-Image Retrieval

Coming soon!

### Text-Text Retrieval

Coming soon!

## Contact

Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.

## Citation

If you find `jina-clip-v2` useful in your research, please cite the following paper:

```bibtex
@misc{2405.20204,
  Author = {Andreas Koukounas and Georgios Mastrapas and Michael Günther and Bo Wang and Scott Martens and Isabelle Mohr and Saba Sturua and Mohammad Kalim Akram and Joan Fontanals Martínez and Saahil Ognawala and Susana Guzman and Maximilian Werk and Nan Wang and Han Xiao},
  Title = {Jina CLIP: Your CLIP Model Is Also Your Text Retriever},
  Year = {2024},
  Eprint = {arXiv:2405.20204},
}
```

## FAQ

### I encounter this problem, what should I do?

```
ValueError: The model class you are passing has a `config_class` attribute that is not consistent with the config class you passed (model has <class 'transformers_modules.jinaai.jina-clip-implementation.7f069e2d54d609ef1ad2eb578c7bf07b5a51de41.configuration_clip.JinaCLIPConfig'> and you passed <class 'transformers_modules.jinaai.jina-clip-implementation.7f069e2d54d609ef1ad2eb578c7bf07b5a51de41.configuration_clip.JinaCLIPConfig'>. Fix one of those so they match!
```

There was a bug in the Transformers library between versions 4.40.x and 4.41.1. You can update transformers to >4.41.2 or <=4.40.0.

### Given one query, how can I merge its text-text and text-image cosine similarity?

Our empirical study shows that text-text cosine similarity is normally larger than text-image cosine similarity!
If you want to merge the two scores, we recommend two ways:

1. Weighted average of text-text and text-image similarity (a runnable sketch follows this FAQ):

```python
combined_scores = sim(text, text) + lambda * sim(text, image)  # optimal lambda depends on your dataset, but in general lambda=2 can be a good choice.
```

2. Apply z-score normalization before merging the scores:

```python
# pseudo code
query_document_mean = np.mean(cos_sim_query_documents)
query_document_std = np.std(cos_sim_query_documents)
text_image_mean = np.mean(cos_sim_text_images)
text_image_std = np.std(cos_sim_text_images)

query_document_sim_normalized = (cos_sim_query_documents - query_document_mean) / query_document_std
text_image_sim_normalized = (cos_sim_text_images - text_image_mean) / text_image_std
```
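
Both snippets above are pseudo-code (`lambda` is a reserved word in Python, and `sim` / `cos_sim_*` are not defined). As referenced in option 1, here is a minimal runnable sketch of the weighted merge; it assumes L2-normalized embeddings such as those produced by the usage examples above, and the names `combined_score` and `weight` (standing in for lambda) are illustrative only:

```python
import numpy as np

def combined_score(query_emb: np.ndarray,
                   doc_text_emb: np.ndarray,
                   doc_image_emb: np.ndarray,
                   weight: float = 2.0) -> float:
    """Weighted merge of text-text and text-image similarity for one query/document pair."""
    # With L2-normalized embeddings, the dot product equals the cosine similarity.
    text_text_sim = float(query_emb @ doc_text_emb)
    text_image_sim = float(query_emb @ doc_image_emb)
    return text_text_sim + weight * text_image_sim
```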
config.json CHANGED
@@ -1,6 +1,4 @@
 {
-  "_commit_hash": null,
-  "_name_or_path": "jinaai/jina-clip-v2-test",
   "add_projections": false,
   "architectures": [
     "JinaCLIPModel"
@@ -11,166 +9,55 @@
   },
   "initializer_factor": 1.0,
   "logit_scale_init_value": 2.6592,
+  "matryoshka_dimensions": [32, 64, 128, 256, 512, 768, 1024],
   "model_type": "jina_clip",
   "projection_dim": 1024,
-  "matryoshka_dimensions": [32, 64, 128, 256, 512, 768, 1024],
   "text_config": {
-    "architectures": null,
-    "bad_words_ids": null,
-    "begin_suppress_tokens": null,
-    "bos_token_id": null,
-    "chunk_size_feed_forward": 0,
-    "cross_attention_hidden_size": null,
-    "decoder_start_token_id": null,
-    "diversity_penalty": 0.0,
-    "do_sample": false,
-    "early_stopping": false,
+    "default_instruction_task": null,
+    "default_lora_task": "retrieval.query",
     "embed_dim": 1024,
-    "encoder_no_repeat_ngram_size": 0,
-    "eos_token_id": null,
-    "exponential_decay_length_penalty": null,
-    "finetuning_task": null,
-    "forced_bos_token_id": null,
-    "forced_eos_token_id": null,
     "hf_model_config_kwargs": {
+      "load_trained_adapters": false,
+      "lora_adaptations": [
+        "retrieval.query"
+      ],
+      "lora_alpha": 4,
+      "lora_dropout_p": 0.0,
+      "lora_main_params_trainable": false,
+      "lora_rank": 4,
+      "task_instructions": {
+        "retrieval.query": "Represent the query for retrieving evidence documents: "
+      },
       "use_flash_attn": false
     },
-    "hf_model_name_or_path": "jinaai/jina-
-    "id2label": {
-      "0": "LABEL_0",
-      "1": "LABEL_1"
-    },
-    "is_decoder": false,
-    "is_encoder_decoder": false,
-    "label2id": {
-      "LABEL_0": 0,
-      "LABEL_1": 1
-    },
-    "length_penalty": 1.0,
-    "max_length": 20,
-    "min_length": 0,
+    "hf_model_name_or_path": "jinaai/jina-embeddings-v3",
     "model_type": "jina_clip_text",
-    "no_repeat_ngram_size": 0,
-    "num_beam_groups": 1,
-    "num_beams": 1,
-    "num_return_sequences": 1,
-    "output_attentions": false,
-    "output_hidden_states": false,
-    "output_scores": false,
-    "pad_token_id": null,
     "pooler_type": "mean_pooler",
-    "prefix": null,
-    "problem_type": null,
     "proj_bias": false,
-    "proj_type": null,
-    "pruned_heads": {},
-    "remove_invalid_values": false,
-    "repetition_penalty": 1.0,
-    "return_dict": true,
-    "return_dict_in_generate": false,
-    "sep_token_id": null,
-    "suppress_tokens": null,
-    "task_specific_params": null,
-    "temperature": 1.0,
-    "tf_legacy_loss": false,
-    "tie_encoder_decoder": false,
-    "tie_word_embeddings": true,
-    "tokenizer_class": null,
-    "top_k": 50,
-    "top_p": 1.0,
-    "torch_dtype": null,
-    "torchscript": false,
-    "transformers_version": "4.42.4",
-    "typical_p": 1.0,
-    "use_bfloat16": false
+    "proj_type": null
   },
-  "transformers_version": null,
+  "truncate_dim": null,
   "use_text_flash_attn": null,
   "use_vision_xformers": null,
   "vision_config": {
-    "_name_or_path": "",
-    "add_cross_attention": false,
-    "architectures": null,
-    "bad_words_ids": null,
-    "begin_suppress_tokens": null,
-    "bos_token_id": null,
-    "chunk_size_feed_forward": 0,
-    "cross_attention_hidden_size": null,
-    "decoder_start_token_id": null,
-    "diversity_penalty": 0.0,
-    "do_sample": false,
-    "drop_path_rate": 0.0,
-    "early_stopping": false,
     "embed_dim": 1024,
-    "encoder_no_repeat_ngram_size": 0,
-    "eos_token_id": null,
-    "exponential_decay_length_penalty": null,
-    "finetuning_task": null,
-    "forced_bos_token_id": null,
-    "forced_eos_token_id": null,
     "fused_layer_norm": false,
     "head_width": 64,
-    "id2label": {
-      "0": "LABEL_0",
-      "1": "LABEL_1"
-    },
     "image_size": 384,
     "intp_freq": true,
-    "is_decoder": false,
-    "is_encoder_decoder": false,
-    "label2id": {
-      "LABEL_0": 0,
-      "LABEL_1": 1
-    },
     "layers": 24,
-    "length_penalty": 1.0,
     "ls_init_value": null,
-    "max_length": 20,
-    "min_length": 0,
     "mlp_ratio": 2.6667,
     "model_type": "jina_clip_vision",
     "naive_swiglu": true,
-    "no_repeat_ngram_size": 0,
-    "num_beam_groups": 1,
-    "num_beams": 1,
-    "num_return_sequences": 1,
-    "output_attentions": false,
-    "output_hidden_states": false,
-    "output_scores": false,
-    "pad_token_id": null,
     "patch_dropout": 0.1,
     "patch_size": 14,
     "post_norm": false,
-    "prefix": null,
-    "problem_type": null,
     "proj_type": null,
-    "pruned_heads": {},
     "pt_hw_seq_len": 16,
     "qkv_bias": true,
-    "remove_invalid_values": false,
-    "repetition_penalty": 1.0,
-    "return_dict": true,
-    "return_dict_in_generate": false,
     "rope_embeddings": true,
-    "sep_token_id": null,
     "subln": true,
-    "suppress_tokens": null,
-    "task_specific_params": null,
-    "temperature": 1.0,
-    "tf_legacy_loss": false,
-    "tie_encoder_decoder": false,
-    "tie_word_embeddings": true,
-    "tokenizer_class": null,
-    "top_k": 50,
-    "top_p": 1.0,
-    "torch_dtype": null,
-    "torchscript": false,
-    "transformers_version": "4.42.4",
-    "typical_p": 1.0,
-    "use_bfloat16": false,
     "width": 1024,
     "x_attention": false
   }
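
Besides stripping the auto-generated default keys, the updated config.json introduces Matryoshka and text-tower adapter settings: `matryoshka_dimensions`, a top-level `truncate_dim`, and the LoRA / task-instruction fields under `text_config`. Below is a hedged sketch of inspecting these fields from Python; it assumes the custom `JinaCLIPConfig` loaded via `trust_remote_code` exposes the JSON keys above as attributes.

```python
# Illustrative sketch: inspect the fields added to config.json in this commit.
from transformers import AutoConfig

config = AutoConfig.from_pretrained('jinaai/jina-clip-v2', trust_remote_code=True)

print(config.matryoshka_dimensions)              # [32, 64, 128, 256, 512, 768, 1024]
print(config.truncate_dim)                       # None -> full 1024-dim vectors by default
print(config.text_config.hf_model_name_or_path)  # 'jinaai/jina-embeddings-v3' text tower
```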
custom_st.py CHANGED
@@ -2,7 +2,7 @@ import base64
 import json
 import os
 from io import BytesIO
-from typing import Any, Dict, List, Optional,
+from typing import Any, Dict, List, Optional, Union
 
 import requests
 import torch
@@ -45,7 +45,7 @@ class Transformer(nn.Module):
         tokenizer_name_or_path: str = None,
     ) -> None:
         super(Transformer, self).__init__()
-        self.config_keys = [
+        self.config_keys = ['max_seq_length', 'do_lower_case']
         self.do_lower_case = do_lower_case
         if model_args is None:
             model_args = {}
@@ -60,9 +60,8 @@ class Transformer(nn.Module):
         self.jina_clip = AutoModel.from_pretrained(
             model_name_or_path, config=config, cache_dir=cache_dir, **model_args
         )
-        tokenizer_args["model_max_length"] = max_seq_length
+        if max_seq_length is not None and 'model_max_length' not in tokenizer_args:
+            tokenizer_args['model_max_length'] = max_seq_length
         self.tokenizer = AutoTokenizer.from_pretrained(
             (
                 tokenizer_name_or_path
@@ -85,9 +84,9 @@ class Transformer(nn.Module):
         # No max_seq_length set. Try to infer from model
         if max_seq_length is None:
             if (
-                hasattr(self.jina_clip,
-                and hasattr(self.jina_clip.config,
-                and hasattr(self.tokenizer,
+                hasattr(self.jina_clip, 'config')
+                and hasattr(self.jina_clip.config, 'max_position_embeddings')
+                and hasattr(self.tokenizer, 'model_max_length')
             ):
                 max_seq_length = min(
                     self.jina_clip.config.max_position_embeddings,
@@ -99,23 +98,22 @@ class Transformer(nn.Module):
         if tokenizer_name_or_path is not None:
             self.jina_clip.config.tokenizer_class = self.tokenizer.__class__.__name__
 
-    def forward(
-        self, features: Dict[str, torch.Tensor]
-    ) -> Dict[str, torch.Tensor]:
+    def forward(self, features: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
         """Returns token_embeddings, cls_token"""
-        if
+        if 'input_ids' in features:
             embedding = self.jina_clip.get_text_features(
-                input_ids=features[
+                input_ids=features['input_ids']
             )
         else:
             embedding = self.jina_clip.get_image_features(
-                pixel_values=features[
+                pixel_values=features['pixel_values']
             )
-        return {
+        return {'sentence_embedding': embedding}
 
     def get_word_embedding_dimension(self) -> int:
         return self.config.text_config.embed_dim
 
+    @staticmethod
     def decode_data_image(data_image_str):
         header, data = data_image_str.split(',', 1)
         image_data = base64.b64decode(data)
@@ -135,10 +133,10 @@ class Transformer(nn.Module):
                 elif sample.startswith('data:image/'):
                     images.append(self.decode_data_image(sample).convert('RGB'))
                 else:
-                    # TODO: Make sure that Image.open fails for non-image files
                     try:
                         images.append(Image.open(sample).convert('RGB'))
-                    except:
+                    except Exception as e:
+                        _ = str(e)
                         texts.append(sample)
             elif isinstance(sample, Image.Image):
                 images.append(sample.convert('RGB'))
@@ -150,8 +148,8 @@ class Transformer(nn.Module):
             return self.tokenizer(
                 texts,
                 padding=padding,
-                truncation=
-                return_tensors=
+                truncation='longest_first',
+                return_tensors='pt',
                 max_length=self.max_seq_length,
             )
         elif images:
@@ -166,16 +164,16 @@ class Transformer(nn.Module):
         self.preprocessor.save_pretrained(output_path)
 
     @staticmethod
-    def load(input_path: str) ->
+    def load(input_path: str) -> 'Transformer':
        # Old classes used other config names than 'sentence_bert_config.json'
        for config_name in [
+            'sentence_bert_config.json',
+            'sentence_roberta_config.json',
+            'sentence_distilbert_config.json',
+            'sentence_camembert_config.json',
+            'sentence_albert_config.json',
+            'sentence_xlm-roberta_config.json',
+            'sentence_xlnet_config.json',
        ]:
            sbert_config_path = os.path.join(input_path, config_name)
            if os.path.exists(sbert_config_path):
@@ -183,14 +181,16 @@ class Transformer(nn.Module):
 
         with open(sbert_config_path) as fIn:
             config = json.load(fIn)
+
         # Don't allow configs to set trust_remote_code
-        if
-            config[
+        if 'model_args' in config and 'trust_remote_code' in config['model_args']:
+            config['model_args'].pop('trust_remote_code')
         if (
-            and
+            'tokenizer_args' in config
+            and 'trust_remote_code' in config['tokenizer_args']
         ):
-            config[
+            config['tokenizer_args'].pop('trust_remote_code')
-        if
-            config[
+        if 'config_args' in config and 'trust_remote_code' in config['config_args']:
+            config['config_args'].pop('trust_remote_code')
+
+        return Transformer(model_name_or_path=input_path, **config)
model.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a753294ed5d3d6dc4ae43f784824cdc3a6cbb7e8a815bff2ab200a3f411141a0
+size 1729527426
modules.json CHANGED
@@ -1,14 +1,14 @@
 [
     {
-        "idx":0,
-        "name":"0",
-        "path":"",
-        "type":"custom_st.Transformer"
+        "idx": 0,
+        "name": "0",
+        "path": "",
+        "type": "custom_st.Transformer"
     },
     {
-        "idx":2,
-        "name":"2",
-        "path":"2_Normalize",
-        "type":"sentence_transformers.models.Normalize"
+        "idx": 2,
+        "name": "2",
+        "path": "2_Normalize",
+        "type": "sentence_transformers.models.Normalize"
     }
 ]
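
modules.json wires up the Sentence-Transformers pipeline for this repository: module 0 is the custom `custom_st.Transformer` wrapper shown above, and module 2 L2-normalizes the embeddings. A minimal sketch of loading that pipeline follows (it is the same call as in the README; `trust_remote_code=True` is needed so the custom module can be imported):

```python
# Minimal sketch: load the pipeline defined in modules.json.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('jinaai/jina-clip-v2', trust_remote_code=True)

# custom_st.Transformer routes plain strings to the text tower and image inputs
# (URLs, local paths, data URIs, PIL images) to the vision tower; the Normalize
# module then L2-normalizes the resulting embeddings.
text_emb = model.encode(['A blue cat'])
image_emb = model.encode(['https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg'])
```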
preprocessor_config.json CHANGED
@@ -19,4 +19,4 @@
         0.26130258,
         0.27577711
     ]
-}
+}
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:7dcfd3e9d325dd8a59bbce810b59be028f41fc5c6a478e4cc9b5ba0701f61004
+size 1729735014
tokenizer.json CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:6601c4120779a1a3863897ba332fe3481d548e363bec2c91eba10ef8640a5e93
+size 17082997