license: apache-2.0
language:
- en
inference: false
The text embedding suite trained by Jina AI's Finetuner team.
Intended Usage & Model Info
jina-embedding-s-en-v1 is a language model that has been trained using Jina AI's Linnaeus-Clean dataset.
This dataset consists of 380 million sentence pairs, including query-document pairs.
These pairs were obtained from various domains and were carefully selected through a thorough cleaning process.
The Linnaeus-Full dataset, from which the Linnaeus-Clean dataset is derived, originally contained 1.6 billion sentence pairs.
The model has a range of use cases, including information retrieval, semantic textual similarity, text reranking, and more.
With a compact size of just 35 million parameters, the model enables lightning-fast inference while still delivering impressive performance. Additionally, we provide the following options:
- jina-embedding-s-en-v1: 35 million parameters (you are here).
- jina-embedding-b-en-v1: 110 million parameters.
- jina-embedding-l-en-v1: 330 million parameters.
- jina-embedding-1b-en-v1: 1.2 billion parameters, 10x bert-base size (soon).
- jina-embedding-6b-en-v1: 6 billion parameters, 30x bert-base size (soon).
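As a rough illustration of what these sizes mean in practice, the sketch below estimates the memory needed just to hold the weights, assuming 4 bytes per parameter in float32 and 2 bytes in float16. The helper name `weight_memory_mb` is made up for this example, and the figures ignore activations, tokenizer data, and framework overhead.

```python
# Parameter counts as listed above (released models only).
PARAM_COUNTS = {
    "jina-embedding-s-en-v1": 35_000_000,
    "jina-embedding-b-en-v1": 110_000_000,
    "jina-embedding-l-en-v1": 330_000_000,
}


def weight_memory_mb(num_params: int, bytes_per_param: int = 4) -> float:
    """Approximate weight storage in megabytes (1 MB = 10**6 bytes).

    Assumption: every parameter is stored at the same precision.
    """
    return num_params * bytes_per_param / 1e6


for name, num_params in PARAM_COUNTS.items():
    fp32 = weight_memory_mb(num_params, bytes_per_param=4)
    fp16 = weight_memory_mb(num_params, bytes_per_param=2)
    print(f"{name}: ~{fp32:.0f} MB (float32), ~{fp16:.0f} MB (float16)")
```

Under these assumptions the smallest model needs on the order of 140 MB for its float32 weights, which is part of why it is practical on modest hardware.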
Data & Parameters
More info will be released together with the technical report.
Metrics
We compared the model against all-minilm-l6-v2 and all-mpnet-base-v2 from sbert, and text-embedding-ada-002 from OpenAI:
Name | Parameters | Context length (tokens) |
---|---|---|
all-minilm-l6-v2 | 33m | 128 |
all-mpnet-base-v2 | 110m | 128 |
text-embedding-ada-002 | Unknown/OpenAI API | 8192 |
jina-embedding-s-en-v1 | 35m | 512 |
jina-embedding-b-en-v1 | 110m | 512 |
jina-embedding-l-en-v1 | 330m | 512 |
Name | STS12 | STS13 | STS14 | STS15 | STS16 | STS17 | TRECOVID | Quora | SciFact |
---|---|---|---|---|---|---|---|---|---|
all-minilm-l6-v2 | 0.724 | 0.806 | 0.756 | 0.854 | 0.79 | 0.876 | 0.473 | 0.876 | 0.645 |
all-mpnet-base-v2 | 0.726 | 0.835 | 0.78 | 0.857 | 0.8 | 0.906 | 0.513 | 0.875 | 0.656 |
text-embedding-ada-002 | 0.698 | 0.833 | 0.761 | 0.861 | 0.86 | 0.903 | 0.685 | 0.876 | 0.726 |
jina-embedding-s-en-v1 | 0.738 | 0.781 | 0.732 | 0.833 | 0.785 | 0.859 | 0.471 | 0.852 | 0.567 |
jina-embedding-b-en-v1 | 0.736 | 0.804 | 0.745 | 0.844 | 0.793 | 0.873 | 0.481 | 0.87 | 0.616 |
jina-embedding-l-en-v1 | 0.736 | 0.832 | 0.762 | 0.846 | 0.805 | 0.885 | 0.477 | 0.876 | 0.65 |
For more tasks and metrics, please check out the MTEB benchmark.
Usage
!pip install finetuner
import finetuner
model = finetuner.build_model('jinaai/jina-embedding-s-en-v1')
embeddings = finetuner.encode(
    model=model,
    data=['how is the weather today', 'What is the current weather like today?']
)
print(finetuner.cos_sim(embeddings[0], embeddings[1]))
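The same cosine score can be used to rank candidate documents against a query, which is the basic pattern behind the retrieval and reranking use cases mentioned earlier. The sketch below is a minimal, dependency-free illustration: the three-dimensional vectors are made-up placeholders standing in for real model outputs, and `cosine_similarity` and `rank_documents` are hypothetical helper names, not part of Finetuner.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (document index, score) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder vectors (assumption): real embeddings would come from the model.
query = [1.0, 0.0, 0.0]
docs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction as the query, different length
    [1.0, 1.0, 0.0],  # 45 degrees away from the query
]

for index, score in rank_documents(query, docs):
    print(f"doc {index}: {score:.3f}")
```

Because cosine similarity depends only on direction, the second document scores highest even though its vector is longer than the query's.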
Fine-tuning
To fine-tune the model on your own data, please consider using Finetuner.
Plans
- The development of jina-embedding-s-en-v2 is currently underway with two main objectives: improving performance and increasing the maximum sequence length.
- We are currently working on a bilingual embedding model that combines English with a second language, X. The upcoming model will be called jina-embedding-s/b/l-de-v1.
Contact
Join our Discord community and chat with other community members about ideas.