---
pipeline_tag: sentence-similarity
tags:
- finetuner
- feature-extraction
- sentence-similarity
datasets:
- negation-dataset
language: en
license: apache-2.0
---
<br><br>
<p align="center">
<img src="https://github.com/jina-ai/finetuner/blob/main/docs/_static/finetuner-logo-ani.svg?raw=true" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
</p>
<p align="center">
<b>The text embedding suite trained by Jina AI, Finetuner team.</b>
</p>
## Intended Usage & Model Info
`jina-embedding-s-en-v1` is a language model that has been trained using Jina AI's Linnaeus-Clean dataset.
This dataset consists of 380 million sentence pairs, including query-document pairs.
These pairs were obtained from various domains and were carefully selected through a thorough cleaning process.
The Linnaeus-Full dataset, from which the Linnaeus-Clean dataset is derived, originally contained 1.6 billion sentence pairs.
The model has a range of use cases, including information retrieval, semantic textual similarity, text reranking, and more.
With a compact size of just 35 million parameters,
the model enables lightning-fast inference while still delivering impressive performance.
Additionally, we provide the following options:
- `jina-embedding-s-en-v1`: 35 million parameters **(you are here)**.
- `jina-embedding-b-en-v1`: 110 million parameters.
- `jina-embedding-l-en-v1`: 330 million parameters.
- `jina-embedding-1b-en-v1`: 1.2 billion parameters, 10x the size of bert-base (coming soon).
- `jina-embedding-6b-en-v1`: 6 billion parameters, 30x the size of bert-base (coming soon).
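The retrieval and reranking use cases listed above come down to comparing a query embedding against document embeddings. The sketch below is a minimal illustration under stated assumptions: the three-dimensional vectors are made-up placeholders standing in for real model output, and `normalize` and `top_k` are hypothetical helper names, not part of any library.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query_vec, doc_vecs, k=2):
    """Return (score, doc_id) pairs for the k documents closest to the query."""
    q = normalize(query_vec)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(vec))), doc_id)
        for doc_id, vec in doc_vecs.items()
    )
    return heapq.nlargest(k, scored)


# Placeholder embeddings (assumption): real ones would come from the model.
docs = {
    "weather": [0.9, 0.1, 0.0],
    "cooking": [0.0, 0.2, 0.9],
    "forecast": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]

for score, doc_id in top_k(query, docs):
    print(f"{doc_id}: {score:.3f}")
```

With these placeholder vectors, `weather` ranks first and `forecast` second, while `cooking` is dropped. Normalizing once and then taking dot products is a common design choice because it keeps the scoring step cheap when many documents are compared against one query.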
## Data & Parameters
More information will be released together with the technical report.
## Metrics
We compared the model against `all-minilm-l6-v2` and `all-mpnet-base-v2` from SBERT and `text-embedding-ada-002` from OpenAI:
|Name|Parameters|Context length (tokens)|
|------------------------------|-----|------|
|all-minilm-l6-v2|33m|128|
|all-mpnet-base-v2|110m|128|
|text-embedding-ada-002|Unknown/OpenAI API|8192|
|jina-embedding-s-en-v1|35m|512|
|jina-embedding-b-en-v1|110m|512|
|jina-embedding-l-en-v1|330m|512|

|Name|STS12|STS13|STS14|STS15|STS16|STS17|TRECOVID|Quora|SciFact|
|------------------------------|-----|-----|-----|-----|-----|-----|--------|-----|-----|
|all-minilm-l6-v2|0.724|0.806|0.756|0.854|0.79|0.876|0.473|0.876|0.645|
|all-mpnet-base-v2|0.726|0.835|**0.78**|0.857|0.8|**0.906**|0.513|0.875|0.656|
|text-embedding-ada-002|0.698|0.833|0.761|0.861|**0.86**|0.903|**0.685**|0.876|**0.726**|
|jina-embedding-s-en-v1|0.736|0.78|0.745|0.84|0.79|0.868|0.484|0.856|0.606|
|jina-embedding-b-en-v1|**0.74**|0.792|0.752|0.851|0.801|0.88|0.505|0.871|0.64|
|jina-embedding-l-en-v1|0.739|**0.844**|0.778|**0.863**|0.829|0.896|0.526|**0.882**|0.652|

For more tasks and metrics, please check out the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) benchmark.
## Usage
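```python
# Install first (in a notebook, prefix with "!"): pip install finetuner
import finetuner

# Load the small English embedding model described in this card.
model = finetuner.build_model('jinaai/jina-embedding-s-en-v1')

# Encode two sentences into embedding vectors.
embeddings = finetuner.encode(
    model=model,
    data=['how is the weather today', 'What is the current weather like today?']
)

# Cosine similarity between the two embeddings.
print(finetuner.cos_sim(embeddings[0], embeddings[1]))
```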
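The score printed on the last line is a cosine similarity. For readers who want to see what that computes, here is a plain-Python sketch; `cosine_similarity` is a hypothetical helper written for illustration (it is not the Finetuner implementation), and the short vectors are placeholders rather than real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Placeholder vectors (assumption): real embeddings have many more dimensions.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Vectors pointing the same way score close to 1 and unrelated (orthogonal) vectors score 0, so two paraphrased sentences should land near the top of that range.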
## Fine-tuning
To fine-tune the model on your own data, please consider [Finetuner](https://github.com/jina-ai/finetuner).
## Plans
1. The development of `jina-embedding-s-en-v2` is currently underway with two main objectives: improving performance and increasing the maximum sequence length.
2. We are also working on bilingual embedding models that pair English with a second language. The upcoming models will be called `jina-embedding-s/b/l-de-v1`.
## Contact
Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.