jinaai/jina-embedding-s-en-v1

	---
	license: apache-2.0
	language:
	- en
	inference: false
	---

	<br><br>

	<p align="center">
	<img src="https://github.com/jina-ai/finetuner/blob/main/docs/_static/finetuner-logo-ani.svg?raw=true" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
	</p>


	<p align="center">
	<b>The text embedding suite trained by Jina AI's Finetuner team.</b>
	</p>


	## Intended Usage & Model Info

	`jina-embedding-s-en-v1` is a text embedding model trained on Jina AI's Linnaeus-Clean dataset.
	This dataset consists of 380 million sentence pairs, which include query-document pairs.
	The pairs come from a variety of domains and were selected through a thorough cleaning process.
	Linnaeus-Clean is derived from the Linnaeus-Full dataset, which originally contained 1.6 billion sentence pairs.

	The model has a range of use cases, including information retrieval, semantic textual similarity, text reranking, and more.
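For retrieval and reranking, a common approach is to embed the query and each candidate document, then sort the documents by cosine similarity to the query. The following is a minimal sketch of that ranking step only. The three-dimensional vectors are hypothetical stand-ins for real model embeddings, and `cosine_similarity` and `rank_documents` are illustrative helper names, not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings (hypothetical values).
query = [1.0, 0.0, 1.0]
documents = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 0.1, 0.9],  # nearly parallel to the query
    [0.5, 0.5, 0.5],  # partially aligned with the query
]

ranking = rank_documents(query, documents)
print([index for index, _ in ranking])
```

In practice the vectors would come from the model, and the same cosine score can serve directly as a semantic textual similarity estimate for a sentence pair.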

	With a compact size of just 35 million parameters,
	the model enables fast inference while still delivering competitive performance (see Metrics).
	The following model sizes are available:

	- `jina-embedding-s-en-v1`: 35 million parameters (you are here).
	- `jina-embedding-b-en-v1`: 110 million parameters.
	- `jina-embedding-l-en-v1`: 330 million parameters.
	- `jina-embedding-1b-en-v1`: 1.2 billion parameters, 10× bert-base size (coming soon).
	- `jina-embedding-6b-en-v1`: 6 billion parameters, 30× bert-base size (coming soon).
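To get a feel for what these parameter counts mean in practice, the sketch below estimates the size of the raw weights for the three released models. It assumes 4 bytes per parameter (float32) by default, which is an assumption and not something stated for these checkpoints. `weight_size_mb` is an illustrative helper, and the figures ignore activations and runtime overhead.

```python
# Parameter counts as listed above. Bytes per parameter are an assumption
# (4 for float32, 2 for float16), not a documented property of the checkpoints.
PARAM_COUNTS = {
    "jina-embedding-s-en-v1": 35_000_000,
    "jina-embedding-b-en-v1": 110_000_000,
    "jina-embedding-l-en-v1": 330_000_000,
}


def weight_size_mb(num_params, bytes_per_param=4):
    """Approximate size of the raw weights in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1_000_000


for name, count in PARAM_COUNTS.items():
    print(f"{name}: ~{weight_size_mb(count):.0f} MB in float32")
```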

	## Data & Parameters

	More information will be released together with the technical report.

	## Metrics

	We compared the model against `all-minilm-l6-v2` and `all-mpnet-base-v2` from Sentence-Transformers (SBERT) and `text-embedding-ada-002` from OpenAI:

	| Name | Parameters | Context length |
	|------------------------|-------------------|------|
	| all-minilm-l6-v2 | 33M | 128 |
	| all-mpnet-base-v2 | 110M | 128 |
	| text-embedding-ada-002 | Unknown/API based | 8192 |
	| jina-embedding-s-en-v1 | 35M | 512 |
	| jina-embedding-b-en-v1 | 110M | 512 |
	| jina-embedding-l-en-v1 | 330M | 512 |


	| Name | STS12 | STS13 | STS14 | STS15 | STS16 | STS17 | TRECOVID | Quora | SciFact |
	|------------------------|-------|-------|-------|-------|-------|-------|----------|-------|---------|
	| all-minilm-l6-v2 | 0.724 | 0.806 | 0.756 | 0.854 | 0.79 | 0.876 | 0.473 | 0.876 | 0.645 |
	| all-mpnet-base-v2 | 0.726 | 0.835 | 0.78 | 0.857 | 0.8 | 0.906 | 0.513 | 0.875 | 0.656 |
	| text-embedding-ada-002 | 0.698 | 0.833 | 0.761 | 0.861 | 0.86 | 0.903 | 0.685 | 0.876 | 0.726 |
	| jina-embedding-s-en-v1 | 0.738 | 0.781 | 0.732 | 0.833 | 0.785 | 0.859 | 0.471 | 0.852 | 0.567 |
	| jina-embedding-b-en-v1 | 0.736 | 0.804 | 0.745 | 0.844 | 0.793 | 0.873 | 0.481 | 0.87 | 0.616 |
	| jina-embedding-l-en-v1 | 0.735 | 0.829 | 0.759 | 0.844 | 0.8 | 0.888 | 0.465 | 0.876 | 0.645 |

	For more tasks and metrics, please check out the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) benchmark.
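As a rough way to compare models across the six STS columns, the sketch below averages the STS12 to STS17 scores copied from the table above. The unweighted mean is an illustrative summary chosen here, not a metric reported by the authors, and the dictionary keys are shorthand labels.

```python
from statistics import mean

# STS12 to STS17 scores copied from the table above, in column order.
STS_SCORES = {
    "all-minilm-l6-v2": [0.724, 0.806, 0.756, 0.854, 0.79, 0.876],
    "all-mpnet-base-v2": [0.726, 0.835, 0.78, 0.857, 0.8, 0.906],
    "OpenAI ada-002": [0.698, 0.833, 0.761, 0.861, 0.86, 0.903],
    "jina-embedding-s-en-v1": [0.738, 0.781, 0.732, 0.833, 0.785, 0.859],
    "jina-embedding-b-en-v1": [0.736, 0.804, 0.745, 0.844, 0.793, 0.873],
    "jina-embedding-l-en-v1": [0.735, 0.829, 0.759, 0.844, 0.8, 0.888],
}

# Unweighted mean over the six STS tasks (an illustrative summary only).
sts_means = {name: mean(scores) for name, scores in STS_SCORES.items()}

# Print models from highest to lowest mean STS score.
for name, value in sorted(sts_means.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {value:.3f}")
```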

	## Usage [WIP]

	```python
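	# Install first (shell or notebook cell): pip install "finetuner[text]"
	import finetuner

	# Load the pre-trained model by its Hugging Face name.
	model = finetuner.build_model('jinaai/jina-embedding-s-en-v1')
	# Encode a list of sentences into one embedding vector per sentence.
	embeddings = finetuner.encode(model=model, data=['sentence 1', 'sentence 2'])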
	```

	## Fine-tuning [WIP]

	To fine-tune the model on your own data, please consider [Finetuner](https://github.com/jina-ai/finetuner).