---
license: apache-2.0
language:
- en
inference: false
---
The text embedding suite trained by Jina AI's Finetuner team.

## Intended Usage & Model Info
`jina-embedding-b-en-v1` is a language model that has been trained using Jina AI's Linnaeus-Clean dataset.
This dataset consists of 380 million sentence pairs, including query-document pairs.
These pairs were obtained from various domains and were carefully selected through a thorough cleaning process.
The Linnaeus-Full dataset, from which the Linnaeus-Clean dataset is derived, originally contained 1.6 billion sentence pairs.
The model has a range of use cases, including information retrieval, semantic textual similarity, text reranking, and more.
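All of these use cases boil down to comparing embedding vectors, most commonly by cosine similarity. As a minimal illustration of retrieval, the sketch below ranks documents against a query; the vectors are made-up stand-ins for what `model.encode(...)` would return (see the usage section below):

```python
import numpy as np

# Hypothetical 2-d embeddings standing in for real model outputs:
# rows are documents, and query_emb is the encoded query.
doc_emb = np.array([[0.9, 0.1], [0.1, 0.9], [0.7, 0.3]])
query_emb = np.array([0.8, 0.2])

# Normalize, then rank documents by cosine similarity to the query.
doc_norm = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
q_norm = query_emb / np.linalg.norm(query_emb)
scores = doc_norm @ q_norm
ranking = np.argsort(-scores)  # indices of documents, best match first
```

The same pattern covers reranking (re-score a candidate list) and semantic textual similarity (compare two sentences directly).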
With a standard size of 110 million parameters, the model enables fast inference while delivering better performance than our small model. It is recommended to use a single GPU for inference. Additionally, we provide the following options:
- `jina-embedding-s-en-v1`: 35 million parameters.
- `jina-embedding-b-en-v1`: 110 million parameters (you are here).
- `jina-embedding-l-en-v1`: 330 million parameters.
- `jina-embedding-xl-en-v1`: 1.2 billion parameters (soon).
- `jina-embedding-xxl-en-v1`: 6 billion parameters (soon).
## Data & Parameters
More information will be released together with the technical report.
## Metrics
We compared the model against `all-minilm-l6-v2`/`all-mpnet-base-v2` from sbert and `text-embeddings-ada-002` from OpenAI:
| Name | Params | Context length |
|---|---|---|
| all-minilm-l6-v2 | 33m | 128 |
| all-mpnet-base-v2 | 110m | 128 |
| ada-embedding-002 | Unknown (API-based) | 8192 |
| jina-embedding-s-en-v1 | 35m | 512 |
| jina-embedding-b-en-v1 | 110m | 512 |
| jina-embedding-l-en-v1 | 330m | 512 |
| Name | STS12 | STS13 | STS14 | STS15 | STS16 | STS17 | TRECOVID | Quora | SciFact |
|---|---|---|---|---|---|---|---|---|---|
| all-minilm-l6-v2 | 0.724 | 0.806 | 0.756 | 0.854 | 0.790 | 0.876 | 0.473 | 0.876 | 0.645 |
| all-mpnet-base-v2 | 0.726 | 0.835 | 0.780 | 0.857 | 0.800 | 0.906 | 0.513 | 0.875 | 0.656 |
| ada-embedding-002 | 0.698 | 0.833 | 0.761 | 0.861 | 0.860 | 0.903 | 0.685 | 0.876 | 0.726 |
| jina-embedding-s-en-v1 | 0.738 | 0.781 | 0.732 | 0.833 | 0.785 | 0.859 | 0.471 | 0.852 | 0.567 |
| jina-embedding-b-en-v1 | 0.736 | 0.804 | 0.745 | 0.844 | 0.793 | 0.873 | 0.481 | 0.870 | 0.616 |
| jina-embedding-l-en-v1 | 0.735 | 0.829 | 0.759 | 0.844 | 0.800 | 0.888 | 0.650 | 0.876 | 0.645 |
For more tasks and metrics, please check out the MTEB benchmark.
## Usage [WIP]

```
!pip install finetuner[text]
```

```python
import finetuner

model = finetuner.get_model('jinaai/jina-embedding-b-en-v1')
embeddings = model.encode(['sentence 1', 'sentence 2'])
```
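The returned embeddings can then be scored against each other with cosine similarity. A minimal helper (a sketch using numpy, not part of the finetuner API):

```python
import numpy as np

def cos_sim(a, b) -> float:
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# With the embeddings from the snippet above:
# score = cos_sim(embeddings[0], embeddings[1])
```

Scores close to 1 indicate semantically similar sentences; scores near 0 indicate unrelated ones.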
## Fine-tuning [WIP]

Please consider using Finetuner.