---
pipeline_tag: sentence-similarity
tags:
  - finetuner
  - feature-extraction
  - sentence-similarity
language: en
license: apache-2.0
---

<br><br>

<p align="center">
<img src="https://github.com/jina-ai/finetuner/blob/main/docs/_static/finetuner-logo-ani.svg?raw=true" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
</p>


<p align="center">
<b>The text embedding suite trained by Jina AI's Finetuner team.</b>
</p>


## Intended Usage & Model Info

`jina-embedding-s-en-v1` is a language model trained on Jina AI's Linnaeus-Clean dataset.
This dataset consists of 380 million sentence pairs, including query-document pairs.
The pairs were drawn from a variety of domains and selected through a thorough cleaning process.
The Linnaeus-Full dataset, from which Linnaeus-Clean is derived, originally contained 1.6 billion sentence pairs.

The model supports a range of use cases, including information retrieval, semantic textual similarity, and text reranking.

With a compact size of just 35 million parameters,
the model enables lightning-fast inference while still delivering impressive performance.
Additionally, we provide the following options:

- `jina-embedding-s-en-v1`: 35 million parameters **(you are here)**.
- `jina-embedding-b-en-v1`: 110 million parameters.
- `jina-embedding-l-en-v1`: 330 million parameters.
- `jina-embedding-1b-en-v1`: 1.2 billion parameters, 10× bert-base size (soon).
- `jina-embedding-6b-en-v1`: 6 billion parameters, 30× bert-base size (soon).

## Data & Parameters

More details will be released together with the technical report.

## Metrics

We compared the model against `all-minilm-l6-v2`/`all-mpnet-base-v2` from SBERT and `text-embedding-ada-002` from OpenAI:

|Name|Parameters|Context length|
|------------------------------|-----|------|
|all-minilm-l6-v2|33m|128|
|all-mpnet-base-v2|110m|128|
|text-embedding-ada-002|Unknown (OpenAI API)|8192|
|jina-embedding-s-en-v1|35m|512|
|jina-embedding-b-en-v1|110m|512|
|jina-embedding-l-en-v1|330m|512|


|Name|STS12|STS13|STS14|STS15|STS16|STS17|TRECOVID|Quora|SciFact|
|------------------------------|-----|-----|-----|-----|-----|-----|--------|-----|-----|
|all-minilm-l6-v2|0.724|0.806|0.756|0.854|0.79 |0.876|0.473   |**0.876**|0.645  |
|all-mpnet-base-v2|0.726|**0.835**|**0.78** |0.857|0.8  |**0.906**|0.513   |0.875|0.656  |
|ada-embedding-002|0.698|0.833|0.761|**0.861**|**0.86** |0.903|**0.685**   |**0.876**|**0.726**  |
|jina-embedding-s-en-v1|0.736|0.78|0.745|0.84|0.79|0.868|0.484   |0.856|0.606  |
|jina-embedding-b-en-v1|**0.74**|0.792|0.752|0.851|0.801|0.88|0.505   |0.871|0.64  |
|jina-embedding-l-en-v1|0.736|0.832|0.762|0.846|0.805|0.885|0.477   |**0.876**|0.65  |

For more tasks and metrics, please check out the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) benchmark.
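
To reproduce scores for individual tasks, the open-source `mteb` package can run the benchmark locally. A minimal sketch, assuming the checkpoint can be loaded through `sentence-transformers` (any object exposing an `encode` method also works with MTEB):

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Assumes the checkpoint is loadable as a sentence-transformers model.
model = SentenceTransformer('jinaai/jina-embedding-s-en-v1')

# Run a single STS task; results are written as JSON under `output_folder`.
evaluation = MTEB(tasks=['STS12'])
evaluation.run(model, output_folder='results')
```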

## Usage

```python
!pip install finetuner
import finetuner

# Load the model released with this card.
model = finetuner.build_model('jinaai/jina-embedding-s-en-v1')

# Encode two sentences and compare their embeddings.
embeddings = finetuner.encode(
    model=model,
    data=['how is the weather today', 'What is the current weather like today?'],
)
print(finetuner.cos_sim(embeddings[0], embeddings[1]))
```
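
Beyond pairwise similarity, the same embeddings cover the retrieval use case mentioned above. A minimal sketch that ranks candidate documents against a query, assuming `finetuner.encode` returns numpy-compatible vectors (consistent with the `cos_sim` call above); the document texts are illustrative:

```python
import numpy as np
import finetuner

model = finetuner.build_model('jinaai/jina-embedding-s-en-v1')

query = 'how is the weather today'
docs = [
    'The forecast predicts light rain this afternoon.',
    'Stock markets closed higher on Friday.',
]

# Embed the query and the candidate documents in one batch.
embeddings = finetuner.encode(model=model, data=[query] + docs)
query_emb, doc_embs = embeddings[0], embeddings[1:]

def cos_sim(a, b):
    """Cosine similarity: dot product over the product of L2 norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank documents by similarity to the query, highest first.
ranked = sorted(
    zip(docs, (cos_sim(query_emb, d) for d in doc_embs)),
    key=lambda pair: pair[1],
    reverse=True,
)
for doc, score in ranked:
    print(f'{score:.3f}  {doc}')
```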

## Fine-tuning

Please consider using [Finetuner](https://github.com/jina-ai/finetuner) to fine-tune this model on your own data.

## Plans

1. The development of `jina-embedding-s-en-v2` is underway, with two main objectives: improving performance and increasing the maximum sequence length.
2. We are working on bilingual embedding models that combine English with a second language, starting with German. The upcoming models will be called `jina-embedding-s/b/l-de-v1`.

## Contact

Join our [Discord community](https://discord.jina.ai) to chat with other members about ideas.