pipeline_tag: sentence-similarity
tags:
- finetuner
- feature-extraction
- sentence-similarity
language: en
license: apache-2.0
The text embedding suite trained by Jina AI's Finetuner team.
Intended Usage & Model Info
jina-embedding-s-en-v1 is a language model that has been trained using Jina AI's Linnaeus-Clean dataset.
This dataset consists of 380 million sentence pairs, including query-document pairs.
These pairs were obtained from various domains and were carefully selected through a thorough cleaning process.
The Linnaeus-Full dataset, from which the Linnaeus-Clean dataset is derived, originally contained 1.6 billion sentence pairs.
The model has a range of use cases, including information retrieval, semantic textual similarity, text reranking, and more.
With a compact size of just 35 million parameters, the model enables lightning-fast inference while still delivering impressive performance. Additionally, we provide the following options:
- jina-embedding-s-en-v1: 35 million parameters (you are here).
- jina-embedding-b-en-v1: 110 million parameters.
- jina-embedding-l-en-v1: 330 million parameters.
- jina-embedding-1b-en-v1: 1.2 billion parameters, 10x bert-base size (soon).
- jina-embedding-6b-en-v1: 6 billion parameters, 30x bert-base size (soon).
Data & Parameters
More information will be released together with the technical report.
Metrics
We compared the model against all-minilm-l6-v2 and all-mpnet-base-v2 from sbert, and text-embedding-ada-002 from OpenAI:
Name | Parameters | Context length (tokens) |
---|---|---|
all-minilm-l6-v2 | 33m | 128 |
all-mpnet-base-v2 | 110m | 128 |
ada-embedding-002 | Unknown/OpenAI API | 8192 |
jina-embedding-s-en-v1 | 35m | 512 |
jina-embedding-b-en-v1 | 110m | 512 |
jina-embedding-l-en-v1 | 330m | 512 |
Name | STS12 | STS13 | STS14 | STS15 | STS16 | STS17 | TRECOVID | Quora | SciFact |
---|---|---|---|---|---|---|---|---|---|
all-minilm-l6-v2 | 0.724 | 0.806 | 0.756 | 0.854 | 0.79 | 0.876 | 0.473 | 0.876 | 0.645 |
all-mpnet-base-v2 | 0.726 | 0.835 | 0.78 | 0.857 | 0.8 | 0.906 | 0.513 | 0.875 | 0.656 |
ada-embedding-002 | 0.698 | 0.833 | 0.761 | 0.861 | 0.86 | 0.903 | 0.685 | 0.876 | 0.726 |
jina-embedding-s-en-v1 | 0.736 | 0.78 | 0.745 | 0.84 | 0.79 | 0.868 | 0.484 | 0.856 | 0.606 |
jina-embedding-b-en-v1 | 0.74 | 0.792 | 0.752 | 0.851 | 0.801 | 0.88 | 0.505 | 0.871 | 0.64 |
jina-embedding-l-en-v1 | 0.739 | 0.844 | 0.778 | 0.863 | 0.829 | 0.896 | 0.526 | 0.882 | 0.652 |
For more tasks and metrics, please check out the MTEB benchmark.
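To make the STS columns easier to compare at a glance, the short sketch below averages the six STS scores (STS12 to STS17) for each model. The numbers are copied from the table above; the unweighted mean is only an illustrative summary chosen here, not an official metric from the authors or from MTEB.

```python
# STS12..STS17 scores copied from the table above, one list per model.
sts_scores = {
    "all-minilm-l6-v2":       [0.724, 0.806, 0.756, 0.854, 0.79, 0.876],
    "all-mpnet-base-v2":      [0.726, 0.835, 0.78, 0.857, 0.8, 0.906],
    "ada-embedding-002":      [0.698, 0.833, 0.761, 0.861, 0.86, 0.903],
    "jina-embedding-s-en-v1": [0.736, 0.78, 0.745, 0.84, 0.79, 0.868],
    "jina-embedding-b-en-v1": [0.74, 0.792, 0.752, 0.851, 0.801, 0.88],
    "jina-embedding-l-en-v1": [0.739, 0.844, 0.778, 0.863, 0.829, 0.896],
}


def mean(values):
    """Unweighted arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Illustrative summary only: the plain average over the six STS tasks.
sts_mean = {name: mean(scores) for name, scores in sts_scores.items()}

# Print models from highest to lowest average STS score.
for name, avg in sorted(sts_mean.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {avg:.3f}")
```

The retrieval columns (TRECOVID, Quora, SciFact) are left out on purpose, since they measure a different kind of task and averaging them together with STS would blur the comparison.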
Usage
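!pip install finetuner

import finetuner

# Load a pretrained embedding model. This example loads the large variant;
# the other sizes listed above would be loaded under their own names.
model = finetuner.build_model('jinaai/jina-embedding-l-en-v1')
# Encode both sentences into embedding vectors.
embeddings = finetuner.encode(
    model=model,
    data=['how is the weather today', 'What is the current weather like today?']
)
# Print the cosine similarity between the two sentence embeddings.
print(finetuner.cos_sim(embeddings[0], embeddings[1]))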
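Once embeddings are available, the information retrieval use case mentioned earlier comes down to ranking documents by their similarity to a query. The following is a minimal sketch in plain Python, assuming cosine similarity as the scoring function. The 3-dimensional vectors are made-up stand-ins for real model embeddings, and `cos_sim` and `rank_documents` are hypothetical helpers written for this example, not part of Finetuner.

```python
import math


def cos_sim(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs):
    """Return (document index, score) pairs, most similar first."""
    scored = [(i, cos_sim(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for embeddings (hypothetical values, not model output).
query = [0.9, 0.1, 0.0]
documents = [
    [0.0, 1.0, 0.0],  # points in a different direction than the query
    [0.8, 0.2, 0.1],  # nearly parallel to the query
    [0.5, 0.5, 0.5],  # partially aligned with the query
]

ranking = rank_documents(query, documents)
for index, score in ranking:
    print(f"document {index}: {score:.3f}")
```

With real embeddings, the same ranking step applies unchanged; only the vectors would come from the encoding call shown above instead of being written by hand.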
Fine-tuning
To fine-tune the model on your own data, please consider Finetuner.
Plans
- The development of jina-embedding-s-en-v2 is currently underway with two main objectives: improving performance and increasing the maximum sequence length.
- We are currently working on a bilingual embedding model that combines English with a second language X. The upcoming model will be called jina-embedding-s/b/l-de-v1.
Contact
Join our Discord community and chat with other community members about ideas.