license: apache-2.0
The text embedding suite trained by Jina AI, Finetuner team.
Intended Usage & Model Info
jina-embedding-s-en-v1 is a language model trained on Jina AI's Linnaeus-Clean dataset.
This dataset consists of 380 million sentence pairs, including query-document pairs.
These pairs were obtained from various domains and were carefully selected through a thorough cleaning process.
The Linnaeus-Full dataset, from which the Linnaeus-Clean dataset is derived, originally contained 1.6 billion sentence pairs.
The model has a range of use cases, including information retrieval, semantic textual similarity, text reranking, and more.
With a compact size of just 35 million parameters, the model enables lightning-fast inference while still delivering impressive performance. Additionally, we provide the following options:
jina-embedding-b-en-v1: 110 million parameters.
jina-embedding-l-en-v1: 800 million parameters.
jina-embedding-xl-en-v1: 3 billion parameters (soon).
jina-embedding-xxl-en-v1: 11 billion parameters (soon).
Data & Parameters
More information will be released together with the technical report.
Metrics
We compared the model against all-minilm-l6-v2 and all-mpnet-base-v2 from sbert and text-embedding-ada-002 from OpenAI:
Name | Parameters | Context length |
---|---|---|
all-minilm-l6-v2 | 33m | 256 |
all-mpnet-base-v2 | 110m | 256 |
ada-embedding-002 | Unknown/API based | 8192 |
jina-embedding-small | 35m | 512 |
Name | STS12 | STS13 | STS14 | STS15 | STS16 | STS17 | TRECOVID | Quora | SciFact |
---|---|---|---|---|---|---|---|---|---|
all-minilm-l6-v2 | 0.724 | 0.806 | 0.756 | 0.854 | 0.79 | 0.876 | 0.473 | 0.876 | 0.645 |
all-mpnet-base-v2 | 0.726 | 0.835 | 0.78 | 0.857 | 0.8 | 0.906 | 0.513 | 0.875 | 0.656 |
ada-embedding-002 | 0.698 | 0.833 | 0.761 | 0.861 | 0.86 | 0.903 | 0.685 | 0.876 | 0.726 |
jina-embedding-small | 0.738 | 0.781 | 0.732 | 0.833 | 0.785 | 0.859 | 0.471 | 0.852 | 0.567 |
For more tasks and metrics, please check out the MTEB benchmark.
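The tables above do not state which metric is reported. On MTEB, STS tasks are conventionally scored by the Spearman rank correlation between a model's cosine similarities and human similarity ratings. Assuming the STS columns follow that convention, the sketch below shows how such a score could be computed. The function names and the toy numbers are illustrative only and are not taken from the evaluation behind the tables.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy data (assumption): model cosine similarities vs. human ratings.
predicted = [0.91, 0.15, 0.55, 0.30]
gold = [4.8, 0.5, 3.0, 2.9]
print(spearman(predicted, gold))
```

Because only the ordering matters, a model can score well even if its raw similarity values are on a different scale from the human ratings.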
Usage
!pip install finetuner[text]
import finetuner
model = finetuner.get_model('jinaai/jina-embedding-s-en-v1')
embeddings = model.encode(['sentence 1', 'sentence 2'])
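Once sentences are encoded, semantic similarity and retrieval-style ranking reduce to comparing vectors. The sketch below is a minimal illustration, assuming each embedding can be treated as a plain sequence of floats. The helpers `cosine_similarity` and `rank_by_similarity` are hypothetical names and not part of Finetuner, and the toy vectors stand in for real model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy 3-dimensional vectors standing in for real embeddings (assumption).
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
ranking = rank_by_similarity(query, docs)
print([index for index, _ in ranking])
```

The same ranking step applies to reranking: encode the query and the candidate documents, then sort the candidates by their similarity to the query.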
Fine-tuning
To fine-tune the model on your own data, please consider Finetuner.