bwang0911 committed on
Commit
4b0e79b
1 Parent(s): 92823ef

Update README.md

Files changed (1)
  1. README.md +7 -11
README.md CHANGED
@@ -3115,11 +3115,6 @@ model-index:
 `jina-embeddings-v2-base-de` is a German/English bilingual text **embedding model** supporting **8192 sequence length**.
 It is based on a BERT architecture (JinaBERT) that supports the symmetric bidirectional variant of [ALiBi](https://arxiv.org/abs/2108.12409) to allow longer sequence lengths.
 We have designed it for high performance in monolingual & cross-language applications and trained it specifically to support mixed German-English input without bias.
-
- The embedding model was trained using 512 sequence length, but extrapolates to 8k sequence length (or even longer) thanks to ALiBi.
- This makes our model useful for a range of use cases, especially when processing long documents is needed, including long document retrieval, semantic textual similarity, text reranking, recommendation, RAG and LLM-based generative search, etc.
-
- With a standard size of 161 million parameters, the model enables fast inference while delivering better performance than our small model. It is recommended to use a single GPU for inference.
 Additionally, we provide the following embedding models:

 - [`jina-embeddings-v2-small-en`](https://huggingface.co/jinaai/jina-embeddings-v2-small-en): 33 million parameters.
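
Note: the context lines above describe ALiBi extrapolation from the 512-token training length out to 8192 tokens. A minimal sketch of what that looks like in practice, not part of the commit; it assumes the repo's custom `encode` method accepts a `max_length` argument, as documented for the jina-embeddings-v2 family:

```python
# Sketch (not from this diff): embedding a long document in one call.
from transformers import AutoModel

model = AutoModel.from_pretrained(
    'jinaai/jina-embeddings-v2-base-de',
    trust_remote_code=True,  # loads the custom JinaBERT code with ALiBi
)

# ALiBi lets the model extrapolate beyond its 512-token training length,
# so a single call can embed a document of up to 8192 tokens.
long_document = 'Ein sehr langes Dokument über das Wetter. ' * 500
embedding = model.encode([long_document], max_length=8192)  # max_length assumed per model-card docs
print(embedding.shape)
```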
@@ -3157,8 +3152,8 @@ def mean_pooling(model_output, attention_mask):

 sentences = ['How is the weather today?', 'What is the current weather like today?']

- tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-small-en')
- model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-small-en', trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-de')
+ model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True)

 encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

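For reference, a consolidated, runnable version of the snippet this hunk patches; the unchanged lines are assumed to follow the standard transformers mean-pooling recipe named in the hunk header (`def mean_pooling(...)`):

```python
# Consolidated sketch of the patched snippet, under the assumption that the
# surrounding README lines use the standard attention-mask mean-pooling recipe.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, ignoring padding positions via the attention mask.
    token_embeddings = model_output[0]
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

sentences = ['How is the weather today?', 'What is the current weather like today?']

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-de')
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True)

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)  # unit-normalize for cosine similarity
print(embeddings.shape)
```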
@@ -3179,8 +3174,8 @@ from transformers import AutoModel
 from numpy.linalg import norm

 cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
- model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True) # trust_remote_code is needed to use the encode method
- embeddings = model.encode(['How is the weather today?', 'What is the current weather like today?'])
+ model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True) # trust_remote_code is needed to use the encode method
+ embeddings = model.encode(['How is the weather today?', 'Wie ist das Wetter heute?'])
 print(cos_sim(embeddings[0], embeddings[1]))
 ```

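The `+` line above swaps the English paraphrase for its German translation, which is the bilingual claim of this model card. A small usage sketch contrasting a translation pair with an unrelated sentence; the third sentence is made up for illustration, and only the relative ordering of the scores is the point:

```python
# Sketch (not from this diff): cross-lingual similarity vs. an unrelated pair.
from numpy.linalg import norm
from transformers import AutoModel

cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True)

vecs = model.encode([
    'How is the weather today?',
    'Wie ist das Wetter heute?',   # German translation of the first sentence
    'Ich mag Kartoffelsalat.',     # unrelated German sentence ("I like potato salad.")
])
print('translated pair:', cos_sim(vecs[0], vecs[1]))  # expected to be high
print('unrelated pair: ', cos_sim(vecs[0], vecs[2]))  # expected to be lower
```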
@@ -3208,8 +3203,9 @@ According to the latest blog post from [LLamaIndex](https://blog.llamaindex.ai/b

 ## Plans

- The development of new bilingual models is currently underway. We will be targeting mainly the German and Spanish languages.
- The upcoming models will be called `jina-embeddings-v2-base-de/es`.
+ 1. Bilingual embedding models supporting more European & Asian languages, including Spanish, French, Italian and Japanese.
+ 2. Multimodal embedding models to enable multimodal RAG applications.
+ 3. High-performance rerankers.

 ## Contact
