---
license: apache-2.0
language:
- en
inference: false
---

<br><br>

<p align="center">
<img src="https://github.com/jina-ai/finetuner/blob/main/docs/_static/finetuner-logo-ani.svg?raw=true" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
</p>


<p align="center">
<b>The text embedding suite trained by Jina AI's Finetuner team.</b>
</p>


## Intended Usage & Model Info

`jina-embedding-s-en-v1` is a language model trained on Jina AI's Linnaeus-Clean dataset.
This dataset consists of 380 million sentence pairs, including query-document pairs,
drawn from a variety of domains and carefully selected through a thorough cleaning process.
It is derived from the Linnaeus-Full dataset, which originally contained 1.6 billion sentence pairs.

The model has a range of use cases, including information retrieval, semantic textual similarity, text reranking, and more.

With a compact size of just 35 million parameters,
the model enables lightning-fast inference while still delivering impressive performance.
Additionally, we provide the following options:

- `jina-embedding-s-en-v1`: 35 million parameters **(you are here)**.
- `jina-embedding-b-en-v1`: 110 million parameters.
- `jina-embedding-l-en-v1`: 330 million parameters.
- `jina-embedding-1b-en-v1`: 1.2 billion parameters, 10× bert-base size (soon).
- `jina-embedding-6b-en-v1`: 6 billion parameters, 30× bert-base size (soon).

## Data & Parameters

More details will be released together with the technical report.

## Metrics

We compared the model against `all-minilm-l6-v2`/`all-mpnet-base-v2` from SBERT and `text-embedding-ada-002` from OpenAI:

|Name|Parameters|Context length|
|------------------------------|-----|------|
|all-minilm-l6-v2|33M|128|
|all-mpnet-base-v2|110M|128|
|text-embedding-ada-002|unknown (API-based)|8192|
|jina-embedding-s-en-v1|35M|512|
|jina-embedding-b-en-v1|110M|512|
|jina-embedding-l-en-v1|330M|512|


|Name|STS12|STS13|STS14|STS15|STS16|STS17|TRECOVID|Quora|SciFact|
|------------------------------|-----|-----|-----|-----|-----|-----|--------|-----|-----|
|all-minilm-l6-v2|0.724|0.806|0.756|0.854|0.790|0.876|0.473|0.876|0.645|
|all-mpnet-base-v2|0.726|0.835|0.780|0.857|0.800|0.906|0.513|0.875|0.656|
|text-embedding-ada-002|0.698|0.833|0.761|0.861|0.860|0.903|0.685|0.876|0.726|
|jina-embedding-s-en-v1|0.738|0.781|0.732|0.833|0.785|0.859|0.471|0.852|0.567|
|jina-embedding-b-en-v1|0.736|0.804|0.745|0.844|0.793|0.873|0.481|0.870|0.616|
|jina-embedding-l-en-v1|0.735|0.829|0.759|0.844|0.800|0.888|0.465|0.876|0.645|

For more tasks and metrics, please check out the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) benchmark.

## Usage [WIP]

```python
# Install Finetuner with text support first:
#   pip install "finetuner[text]"
import finetuner

model = finetuner.get_model('jinaai/jina-embedding-s-en-v1')
embeddings = model.encode(['sentence 1', 'sentence 2'])
```
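
Once you have the sentence vectors, a common follow-up is scoring their semantic similarity. A minimal sketch, assuming the `embeddings` returned above index as two NumPy-style vectors (the `cos_sim` helper below is ours, not part of the Finetuner API):

```python
import numpy as np

def cos_sim(a, b):
    # Cosine similarity between two 1-D vectors, in [-1, 1].
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score how semantically close the two encoded sentences are.
print(cos_sim(embeddings[0], embeddings[1]))
```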

## Fine-tuning [WIP]

To adapt the embeddings to your own domain or task, please consider [Finetuner](https://github.com/jina-ai/finetuner); a rough sketch follows below.
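
A hedged sketch of what a fine-tuning run could look like. The exact `finetuner.fit` arguments, supported data formats, and whether this checkpoint is accepted as a backbone depend on the Finetuner version; consult its documentation before relying on this:

```python
import finetuner

# Finetuner executes runs as a cloud service, so a login is required.
finetuner.login()

# Assumption: 'my-train-data.csv' is a hypothetical file of labeled text
# pairs in a format Finetuner accepts; adjust to your own data.
run = finetuner.fit(
    model='jinaai/jina-embedding-s-en-v1',
    train_data='my-train-data.csv',
)
print(run.name)  # inspect progress later via run.status() and run.logs()
```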