Update README.md

## Intended Usage & Model Info

`jina-embedding-s-en-v2` is an English, monolingual **embedding model supporting 8192 sequence length**.
It is based on a BERT architecture (JinaBERT) that supports the symmetric bidirectional variant of [ALiBi](https://arxiv.org/abs/2108.12409) to allow longer sequence lengths.
The backbone `jina-bert-s-en-v2` is pretrained on the C4 dataset.
The model is further trained on Jina AI's collection of more than 400 million sentence pairs and hard negatives.
These pairs were obtained from various domains and were carefully selected through a thorough cleaning process.

The embedding model was trained using a 512 sequence length, but extrapolates to an 8k sequence length (or even longer) thanks to ALiBi.
This makes our model useful for a range of use cases, especially when processing long documents is needed, including long document retrieval, semantic textual similarity, text reranking, recommendation, RAG, and LLM-based generative search.
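
To give an intuition for why ALiBi extrapolates, here is a minimal sketch of the symmetric bidirectional ALiBi bias. It is illustrative only, not the model's actual implementation; the head count and slope schedule follow the ALiBi paper's defaults for power-of-two head counts:

```python
import torch

def symmetric_alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Symmetric (bidirectional) ALiBi: each head penalizes attention
    logits linearly in the distance |i - j| between positions i and j."""
    # One slope per head, decaying geometrically.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads)
                           for h in range(num_heads)])
    positions = torch.arange(seq_len)
    distances = (positions[None, :] - positions[:, None]).abs().float()
    # Shape (num_heads, seq_len, seq_len); added to the attention logits
    # before softmax, so far-apart tokens attend to each other less.
    return -slopes[:, None, None] * distances[None, :, :]

# The bias depends only on relative distance, never absolute position,
# which is why a model trained at 512 tokens can run at 8192 tokens.
print(symmetric_alibi_bias(num_heads=8, seq_len=4)[0])
```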

This model has 33 million parameters, which enables lightning-fast and memory-efficient inference, while still delivering impressive performance.
Additionally, we provide the following embedding models:

### V1 (Based on T5)

- [`jina-embedding-s-en-v1`](https://huggingface.co/jinaai/jina-embedding-s-en-v1): 35 million parameters.
- [`jina-embedding-b-en-v1`](https://huggingface.co/jinaai/jina-embedding-b-en-v1): 110 million parameters.
- [`jina-embedding-l-en-v1`](https://huggingface.co/jinaai/jina-embedding-l-en-v1): 330 million parameters.

### V2 (Based on JinaBERT)

- [`jina-embedding-s-en-v2`](https://huggingface.co/jinaai/jina-embedding-s-en-v2): 33 million parameters **(you are here)**.
- [`jina-embedding-b-en-v2`](https://huggingface.co/jinaai/jina-embedding-b-en-v2): 137 million parameters.
- [`jina-embedding-l-en-v2`](https://huggingface.co/jinaai/jina-embedding-l-en-v2): 435 million parameters.

## Data & Parameters

The Jina Embedding V2 technical report is coming soon.
The Jina Embedding V1 technical report is available [on arXiv](https://arxiv.org/abs/2307.11224).

## Usage
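
A minimal sketch of typical usage via `sentence-transformers`; the exact integration details, including whether `trust_remote_code=True` is required for the custom JinaBERT modeling code, are assumptions here:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Loading the custom JinaBERT architecture is assumed to require
# trust_remote_code=True, as is common for models shipping remote code.
model = SentenceTransformer("jinaai/jina-embedding-s-en-v2",
                            trust_remote_code=True)

sentences = [
    "How is the weather today?",
    "What is the current weather like today?",
]
embeddings = model.encode(sentences)

# Cosine similarity between the two sentence embeddings.
print(cos_sim(embeddings[0], embeddings[1]))
```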

If you want to fine-tune the model on your own data, please consider [Finetuner](https://github.com/jina-ai/finetuner).

## Plans

The development of new bilingual models is currently underway. We will mainly be targeting German and Spanish. The upcoming models will be called `jina-embedding-b-de/es-v2`.

## Contact