metadata
license: apache-2.0
language:
  - en
inference: false



Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications.

The text embedding suite trained by the Finetuner team at Jina AI.

Intended Usage & Model Info

jina-embedding-s-en-v1 is a language model trained on Jina AI's Linnaeus-Clean dataset. This dataset consists of 380 million sentence pairs, which include query-document pairs. The pairs were collected from various domains and selected through a thorough cleaning process. The Linnaeus-Full dataset, from which Linnaeus-Clean is derived, originally contained 1.6 billion sentence pairs.

The model has a range of use cases, including information retrieval, semantic textual similarity, text reranking, and more.

With a compact size of just 35 million parameters, the model enables fast inference while still delivering competitive performance (see the Metrics section). Additionally, we provide the following model sizes:

  • jina-embedding-s-en-v1: 35 million parameters (you are here).
  • jina-embedding-b-en-v1: 110 million parameters.
  • jina-embedding-l-en-v1: 330 million parameters.
  • jina-embedding-1b-en-v1: 1.2 billion parameters, 10x bert-base size (soon).
  • jina-embedding-6b-en-v1: 6 billion parameters, 30x bert-base size (soon).
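
To make the size differences concrete, the parameter counts above can be turned into a rough estimate of weight memory. This is a back-of-the-envelope sketch, not a measurement: it assumes weights are stored as 32-bit or 16-bit floats and ignores activations, the tokenizer, and framework overhead. The helper name `weight_megabytes` is made up for illustration.

```python
# Parameter counts as listed above (released models only).
PARAM_COUNTS = {
    "jina-embedding-s-en-v1": 35_000_000,
    "jina-embedding-b-en-v1": 110_000_000,
    "jina-embedding-l-en-v1": 330_000_000,
}


def weight_megabytes(num_params: int, bytes_per_param: int = 4) -> float:
    """Approximate size of the weights alone, in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


for name, num_params in PARAM_COUNTS.items():
    fp32 = weight_megabytes(num_params, 4)  # assumption: float32 storage
    fp16 = weight_megabytes(num_params, 2)  # assumption: float16 storage
    print(f"{name}: ~{fp32:.0f} MB (float32), ~{fp16:.0f} MB (float16)")
```

Actual memory use at inference time will be higher than these figures, since they cover the weights only.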

Data & Parameters

More information will be released together with the technical report.

Metrics

We compared the model against all-minilm-l6-v2 and all-mpnet-base-v2 from SBERT, and text-embedding-ada-002 from OpenAI:

| Name | Parameters | Context length |
|---|---|---|
| all-minilm-l6-v2 | 33m | 128 |
| all-mpnet-base-v2 | 110m | 128 |
| ada-embedding-002 | Unknown/API based | 8192 |
| jina-embedding-s-en-v1 | 35m | 512 |
| jina-embedding-b-en-v1 | 110m | 512 |
| jina-embedding-l-en-v1 | 330m | 512 |

| Name | STS12 | STS13 | STS14 | STS15 | STS16 | STS17 | TRECOVID | Quora | SciFact |
|---|---|---|---|---|---|---|---|---|---|
| all-minilm-l6-v2 | 0.724 | 0.806 | 0.756 | 0.854 | 0.79 | 0.876 | 0.473 | 0.876 | 0.645 |
| all-mpnet-base-v2 | 0.726 | 0.835 | 0.78 | 0.857 | 0.8 | 0.906 | 0.513 | 0.875 | 0.656 |
| ada-embedding-002 | 0.698 | 0.833 | 0.761 | 0.861 | 0.86 | 0.903 | 0.685 | 0.876 | 0.726 |
| jina-embedding-s-en-v1 | 0.738 | 0.781 | 0.732 | 0.833 | 0.785 | 0.859 | 0.471 | 0.852 | 0.567 |
| jina-embedding-b-en-v1 | 0.736 | 0.804 | 0.745 | 0.844 | 0.793 | 0.873 | 0.481 | 0.87 | 0.616 |
| jina-embedding-l-en-v1 | 0.735 | 0.829 | 0.759 | 0.844 | 0.8 | 0.888 | 0.465 | 0.876 | 0.645 |
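
One way to read the scores above is to average them per task family. The sketch below does this with the reported values copied verbatim. Grouping STS12 to STS17 as "STS" and TRECOVID, Quora, and SciFact as "retrieval" is an assumption made for illustration, as is the use of an unweighted mean; the `summarize` helper is hypothetical.

```python
from statistics import mean

TASKS = ["STS12", "STS13", "STS14", "STS15", "STS16", "STS17",
         "TRECOVID", "Quora", "SciFact"]
STS_TASKS = TASKS[:6]        # assumption: treated as one "STS" group
RETRIEVAL_TASKS = TASKS[6:]  # assumption: treated as one "retrieval" group

# Scores copied from the table above, in the order given by TASKS.
SCORES = {
    "all-minilm-l6-v2": [0.724, 0.806, 0.756, 0.854, 0.79, 0.876, 0.473, 0.876, 0.645],
    "all-mpnet-base-v2": [0.726, 0.835, 0.78, 0.857, 0.8, 0.906, 0.513, 0.875, 0.656],
    "ada-embedding-002": [0.698, 0.833, 0.761, 0.861, 0.86, 0.903, 0.685, 0.876, 0.726],
    "jina-embedding-s-en-v1": [0.738, 0.781, 0.732, 0.833, 0.785, 0.859, 0.471, 0.852, 0.567],
    "jina-embedding-b-en-v1": [0.736, 0.804, 0.745, 0.844, 0.793, 0.873, 0.481, 0.87, 0.616],
    "jina-embedding-l-en-v1": [0.735, 0.829, 0.759, 0.844, 0.8, 0.888, 0.465, 0.876, 0.645],
}


def summarize(scores):
    """Return {model: (mean STS score, mean retrieval score)}."""
    n_sts = len(STS_TASKS)
    return {
        name: (mean(values[:n_sts]), mean(values[n_sts:]))
        for name, values in scores.items()
    }


summary = summarize(SCORES)
for name, (sts_avg, retrieval_avg) in summary.items():
    print(f"{name}: STS avg {sts_avg:.3f}, retrieval avg {retrieval_avg:.3f}")
```

Averages hide per-task differences, so they are best used alongside the full table rather than in place of it.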

For more tasks and metrics, please check out the MTEB benchmark.

Usage [WIP]

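# In a notebook; from a terminal, run: pip install "finetuner[text]"
!pip install "finetuner[text]"
import finetuner

# Load the pretrained model by name
model = finetuner.get_model('jinaai/jina-embedding-s-en-v1')
# Encode a list of sentences into one embedding per sentence
embeddings = model.encode(['sentence 1', 'sentence 2'])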
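
Once embeddings are available, semantic similarity and retrieval typically come down to comparing vectors. The following is a minimal sketch of ranking documents by cosine similarity to a query. It uses small stand-in vectors rather than real model output, and it assumes each embedding can be treated as a plain sequence of floats. The function names are illustrative and not part of Finetuner.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (document index, similarity) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vec, vec))
              for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Stand-in 3-dimensional vectors; real embeddings have many more dimensions.
query = [1.0, 0.0, 0.0]
documents = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

ranking = rank_documents(query, documents)
for index, score in ranking:
    print(f"document {index}: similarity {score:.3f}")
```

Cosine similarity ignores vector length, which is why the second document ranks first even though its magnitude differs from the query's.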

Fine-tuning [WIP]

To fine-tune the model on your own data, please consider Finetuner.