bwang0911 committed on
Commit
92823ef
1 Parent(s): 0b5d55b

Update README.md

Files changed (1)
  1. README.md +104 -23
README.md CHANGED
@@ -3097,54 +3097,135 @@ model-index:
  - type: recall_at_5
  value: 13.517999999999999
  ---
-
-
  <!-- TODO: add evaluation results here -->
  <br><br>

  <p align="center">
- <img src="https://github.com/jina-ai/finetuner/blob/main/docs/_static/finetuner-logo-ani.svg?raw=true" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
  </p>


  <p align="center">
- <b>The text embedding set trained by <a href="https://jina.ai/"><b>Jina AI</b></a>, <a href="https://github.com/jina-ai/finetuner"><b>Finetuner</b></a> team.</b>
  </p>


  ## Intended Usage & Model Info

- `jina-embeddings-v2-base-de` is a German/English bilingual text **embedding model** supporting **8192 sequence length**. Our model has the same architecture as `jina-embeddings-v2-base-en` and has 161 million parameters.
- We have designed it for high performance in cross-language applications and trained it specifically to support mixed German-English input without bias.

- | Model | Language | Max Sequence Length | Dimension | Model Size |
- |:--------------------------------------------------------------------------------------:| :-----: |:-----: |:-----: |:----------:|
- | [jina-embeddings-v2-base-en](https://huggingface.co/jinaai/jina-embeddings-v2-base-en) | English | 8192 | 768 | 0.27GB |
- | [jina-embeddings-v2-base-de](https://huggingface.co/jinaai/jina-embeddings-v2-base-de) | German and English | 8192 | 768 | 0.31GB |

- You can use the embedding model either via the Jina AI's [Embedding platform](https://jina.ai/embeddings/), AWS SageMaker or in your private deployments.

- ## Usage Jina Embedding API

- The following code snippet shows the usage of the Jina Embedding API:
  ```
- curl https://api.jina.ai/v1/embeddings \
- -H "Content-Type: application/json" \
- -H "Authorization: Bearer jina_xxxxxxx" \
- -d '{
- "input": ["Ich spreche Deutsch", "or purely in English", "or like mixture of English and Deutsch"],
- "model": "jina-embeddings-v2-base-de"
- }'

  ```

- Get your free API key on: https://jina.ai/embeddings/

- ## Opensource

- We will add more information about this model and opensource the full model in a few days!

  ## Contact

  Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.

  - type: recall_at_5
  value: 13.517999999999999
  ---
  <!-- TODO: add evaluation results here -->
  <br><br>

  <p align="center">
+ <img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
  </p>


  <p align="center">
+ <b>The text embedding set trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
  </p>


  ## Intended Usage & Model Info

+ `jina-embeddings-v2-base-de` is a German/English bilingual text **embedding model** supporting **8192 sequence length**.
+ It is based on a BERT architecture (JinaBERT) that supports the symmetric bidirectional variant of [ALiBi](https://arxiv.org/abs/2108.12409) to allow longer sequence length.
+ We have designed it for high performance in monolingual & cross-language applications and trained it specifically to support mixed German-English input without bias.
+
+ The embedding model was trained using 512 sequence length, but extrapolates to 8k sequence length (or even longer) thanks to ALiBi.
+ This makes our model useful for a range of use cases, especially when long document processing is needed, including long document retrieval, semantic textual similarity, text reranking, recommendation, RAG and LLM-based generative search, etc.
+
+ With a standard size of 161 million parameters, the model enables fast inference while delivering better performance than our small model. It is recommended to use a single GPU for inference.
+ Additionally, we provide the following embedding models:
+
+ - [`jina-embeddings-v2-small-en`](https://huggingface.co/jinaai/jina-embeddings-v2-small-en): 33 million parameters.
+ - [`jina-embeddings-v2-base-en`](https://huggingface.co/jinaai/jina-embeddings-v2-base-en): 137 million parameters.
+ - [`jina-embeddings-v2-base-zh`](): Chinese-English bilingual embeddings (soon).
+ - [`jina-embeddings-v2-base-de`](): German-English bilingual embeddings (soon) **(you are here)**.
+ - [`jina-embeddings-v2-base-es`](): Spanish-English bilingual embeddings (soon).
+
+ ## Data & Parameters
+
+ Please refer to the Jina Embeddings V2 [technical report](https://arxiv.org/abs/2310.19923) for details on the training data and model parameters.
+
+ ## Usage

+ **<details><summary>Please apply mean pooling when integrating the model.</summary>**
+ <p>

+ ### Why mean pooling?

+ Mean pooling takes all token embeddings from the model output and averages them at the sentence/paragraph level.
+ It has proven to be the most effective way to produce high-quality sentence embeddings.
+ We offer an `encode` function that handles this for you.

+ However, if you would like to do it without using the default `encode` function:

+ ```python
+ import torch
+ import torch.nn.functional as F
+ from transformers import AutoTokenizer, AutoModel
+
+ def mean_pooling(model_output, attention_mask):
+     # average the token embeddings, weighted by the attention mask
+     token_embeddings = model_output[0]
+     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+
+ sentences = ['How is the weather today?', 'What is the current weather like today?']
+
+ tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-de')
+ model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True)
+
+ encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
+
+ with torch.no_grad():
+     model_output = model(**encoded_input)
+
+ embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
+ embeddings = F.normalize(embeddings, p=2, dim=1)
  ```

+ </p>
+ </details>
+
+ You can use Jina Embedding models directly from the transformers package:
+ ```python
+ !pip install transformers
+ from transformers import AutoModel
+ from numpy.linalg import norm
+
+ cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
+ model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True) # trust_remote_code is needed to use the encode method
+ embeddings = model.encode(['How is the weather today?', 'What is the current weather like today?'])
+ print(cos_sim(embeddings[0], embeddings[1]))
+ ```
+
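+ Since the model is bilingual, you can also compare German and English texts directly. A minimal sketch, reusing the `cos_sim` helper and `encode` call from above (the example sentences are only illustrative):
+
+ ```python
+ # German query vs. English query: both are mapped into the same vector space,
+ # so cross-language cosine similarity works the same way as the monolingual case.
+ embeddings = model.encode(['Wie ist das Wetter heute?', 'What is the current weather like today?'])
+ print(cos_sim(embeddings[0], embeddings[1]))
+ ```
+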
+ If you only want to handle shorter sequences, such as 2k, pass the `max_length` parameter to the `encode` function:
+
+ ```python
+ embeddings = model.encode(
+     ['Very long ... document'],
+     max_length=2048,
+ )
  ```
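+ Conversely, to make use of the full 8192-token context mentioned above, raise `max_length` accordingly; a minimal sketch, assuming `encode` accepts values up to the model's limit:
+
+ ```python
+ # use the model's maximum supported sequence length
+ embeddings = model.encode(['Very long ... document'], max_length=8192)
+ ```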
 
+ ## Fully-managed Embeddings Service
+
+ Alternatively, you can use Jina AI's [Embedding platform](https://jina.ai/embeddings/) for fully-managed access to Jina Embeddings models.
+
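+ As a sketch, a request to the hosted API looks roughly like the `curl` example shown earlier for this model (replace `jina_xxxxxxx` with your own API key; the exact request format may evolve):
+
+ ```
+ curl https://api.jina.ai/v1/embeddings \
+   -H "Content-Type: application/json" \
+   -H "Authorization: Bearer jina_xxxxxxx" \
+   -d '{
+     "input": ["Ich spreche Deutsch", "or purely in English", "or like mixture of English and Deutsch"],
+     "model": "jina-embeddings-v2-base-de"
+   }'
+ ```
+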
+ ## Use Jina Embeddings for RAG
+
+ According to the latest blog post from [LlamaIndex](https://blog.llamaindex.ai/boosting-rag-picking-the-best-embedding-reranker-models-42d079022e83),
+
+ > In summary, to achieve the peak performance in both hit rate and MRR, the combination of OpenAI or JinaAI-Base embeddings with the CohereRerank/bge-reranker-large reranker stands out.

+ <img src="https://miro.medium.com/v2/resize:fit:4800/format:webp/1*ZP2RVejCZovF3FDCg-Bx3A.png" width="780px">

+
+ ## Plans
+
+ The development of new bilingual models is currently underway. We will be targeting mainly the German and Spanish languages.
+ The upcoming models will be called `jina-embeddings-v2-base-de/es`.
 
  ## Contact

  Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.
+
+ ## Citation
+
+ If you find Jina Embeddings useful in your research, please cite the following paper:
+
+ ```
+ @misc{günther2023jina,
+   title={Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents},
+   author={Michael Günther and Jackmin Ong and Isabelle Mohr and Alaeddine Abdessalem and Tanguy Abel and Mohammad Kalim Akram and Susana Guzman and Georgios Mastrapas and Saba Sturua and Bo Wang and Maximilian Werk and Nan Wang and Han Xiao},
+   year={2023},
+   eprint={2310.19923},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
+ ```