sbintuitions/sarashina-embedding-v1-1b

+---
+language:
+- ja
+- en
+license_name: sarashina-non-commercial-license
+license_link: LICENSE
+tags:
+- transformers
+- sentence-similarity
+- feature-extraction
+- sentence-transformers
+pipeline_tag: sentence-similarity
+inference: false
+datasets:
+  - hpprc/emb
+  - hpprc/mqa-ja
+  - sentence-transformers/NQ-retrieval
+  - izumi-lab/llm-japanese-dataset
+  - shunk031/JGLUE
+  - cl-nagoya/ruri-dataset-ft
+---
+# sarashina-embedding-v1-1b
+**[日本語のREADME/Japanese README](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/README_JA.md)**
+"sarashina-embedding-v1-1b" is a Japanese text embedding model based on the 1.2B-parameter Japanese LLM "Sarashina".
+We trained this model with multi-stage contrastive learning. As of 2024/12/1, it achieved the state-of-the-art average score across the 16 datasets of [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB) (Japanese Massive Text Embedding Benchmark).
+This model maps sentences & paragraphs to a 1792-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+## Model Details
+### Model Description
+- **Model Type:** Sentence Transformer
+- **Maximum Sequence Length:** 8192 tokens
+- **Output Dimensionality:** 1792 dimensions
+- **Similarity Function:** Cosine Similarity
+- **Language:**  Japanese
+- **License:** [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE)
+### Full Model Architecture
+```
+SentenceTransformer(
+  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: LlamaModel
+  (1): Pooling({'word_embedding_dimension': 1792, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': False})
+)
+```
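The `pooling_mode_lasttoken: True` setting in the architecture above means the sentence embedding is taken from the hidden state of the last non-padding token. The following is a minimal pure-Python sketch of that idea, not the library's implementation: the function name `last_token_pool` is hypothetical, the values are toy data, and right-padded inputs are assumed.

```python
def last_token_pool(hidden_states, attention_mask):
    """Pick the hidden state of the last real (non-padding) token per sequence.

    hidden_states: list of sequences, each a list of per-token vectors.
    attention_mask: list of sequences of 1 (real token) / 0 (padding).
    Assumes right padding, i.e. all 1s come before all 0s.
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        last_index = sum(mask) - 1  # index of the last real token
        pooled.append(states[last_index])
    return pooled


# Toy batch: 2 sequences, 3 token positions, hidden size 2 (the real model uses 1792).
hidden = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # 3 real tokens
    [[1.0, 1.1], [1.2, 1.3], [0.0, 0.0]],  # 2 real tokens + 1 padding position
]
mask = [[1, 1, 1], [1, 1, 0]]

print(last_token_pool(hidden, mask))
# [[0.5, 0.6], [1.2, 1.3]]
```

With a decoder-only backbone such as `LlamaModel`, the last token is the only position that has attended to the whole input, which is the usual motivation for this pooling mode.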
+## Usage
+### Direct Usage (Sentence Transformers)
+First install the Sentence Transformers library:
+```bash
+pip install -U sentence-transformers
+```
+Then you can load this model and run inference.
+```python
+from sentence_transformers import SentenceTransformer
+# Download from the 🤗 Hub
+model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")
+# Run inference
+sentences = [
+    '更級日記は、平安時代中期に菅原孝標女によって書かれた回想録です。',
+    'Sarashinaは、SB Intuitionsが開発した日本語大規模言語モデルです。これまでに7B, 13B, 70B, 8x70Bのモデルが公開されています。',
+    '更科蕎麦とはなんですか?'
+]
+embeddings = model.encode(sentences)
+print(embeddings.shape)
+# (3, 1792)
+# Get the similarity scores for the embeddings
+similarities = model.similarity(embeddings, embeddings)
+print(similarities.shape)
+# torch.Size([3, 3])
+```
+**Note**
+- You do not need to add prefixes such as "Query: " and "Document: " at the beginning of the input sentence.
+- This model is licensed under the [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE), which has restrictions on commercial use. If you are interested in utilizing this model for your business, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/#contact).
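Once embeddings are available, semantic search reduces to ranking documents by cosine similarity with the query embedding. The sketch below illustrates this with tiny hand-written vectors standing in for real `model.encode` output; the helper names `cosine_similarity` and `rank_documents` are hypothetical and not part of Sentence Transformers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy 3-dimensional "embeddings" (the real model produces 1792 dimensions).
query = [1.0, 0.0, 1.0]
docs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

print(rank_documents(query, docs))
# [1, 2, 0]
```

Because cosine similarity ignores vector length, the second document ranks first even though its norm differs from the query's.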
+## Training
+sarashina-embedding-v1-1b was trained through the following two-stage process:
+### Stage 1: Weakly-supervised Learning
+To achieve generic text embedding performance across a wide range of domains, we performed contrastive training on weakly-supervised data consisting of our own web-crawled data and open data.
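This README does not state which contrastive objective was used. As an illustration only, the sketch below shows one common choice, the InfoNCE loss with in-batch negatives, in pure Python. The function name `info_nce_loss`, the temperature of 0.05, and the toy vectors are all assumptions and should not be read as the authors' training setup.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(queries, documents, temperature=0.05):
    """Mean InfoNCE loss where documents[i] is the positive for queries[i]
    and every other document in the batch acts as a negative."""
    q = [normalize(v) for v in queries]
    d = [normalize(v) for v in documents]
    total = 0.0
    for i, qi in enumerate(q):
        # Scaled cosine similarities between query i and every document.
        logits = [dot(qi, dj) / temperature for dj in d]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # negative log-probability of the positive
    return total / len(q)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(queries, [[1.0, 0.1], [0.1, 1.0]])   # positives match
shuffled = info_nce_loss(queries, [[0.1, 1.0], [1.0, 0.1]])  # positives swapped
print(aligned < shuffled)
# True
```

The loss is low when each query is closest to its own paired text and high otherwise, which is what pushes paired texts together in the embedding space.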
+#### Dataset
+|dataset|counts|
+|:-:|:-:|
+|AutoWikiQA|50,521,135|
+|web-crawled data|47,370,649|
+|MQA|12,941,472|
+|llm-japanese-dataset|9,074,340|
+|wikipedia|5,555,212|
+|Quiz dataset|988,478|
+|Natural Questions|132,796|
+|JSQuAD|62,859|
+|snow|62,758|
+|JaQuAD|31,746|
+|mkqa|3,318|
+|||
+|**total**|**126,744,763**|
+### Stage 2: Supervised Fine-tuning
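+To enable the model to learn more accurate query-document similarity, we performed supervised fine-tuning on the following datasets.
+#### Dataset
+|dataset|counts|
+|:-:|:-:|
+|JSNLI|141,388|
+|NU-MNLI|67,987|
+|Mr. TyDi (only Japanese subset)|3,697|
+|Natural Questions (sampled)|20,000|
+|||
+|**total**|**233,072**|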
+## Benchmarks
+### [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB)
+| Model                                         |Max Tokens|Avg.      | Retrieval   | STS       | Classification   | Reranking   | Clustering   | PairClassification   |
+|:----------------------------------------------|:----------|:----------|:------------|:----------|:-----------------|:------------|:-------------|:---------------------|
+| OpenAI/text-embedding-3-large                 | 8191 |74.05 | 74.48   | 82.52     | 77.58        | 93.58   | 53.32        | 62.35                |
+| [cl-nagoya/ruri-large](https://huggingface.co/cl-nagoya/ruri-large)    | 512 |73.31     | 73.02       | **83.13**     | 77.43            | 92.99       | 51.82        | 62.29                |
+| [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2)    | 512 |72.23     | 73.36       | 82.96     | 74.21            | 93.01       | 48.65        | **62.37**                |
+| [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja)     |1024 |72.04     | 73.21       | 81.39     | 72.41            | 92.69       | 53.23        | 61.74                |
+| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)          | 512|70.90     | 70.98       | 79.70     | 72.89            | 92.96       | 51.24        | 62.15                |
+|||
+|[**sarashina-embedding-v1-1b**](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b)(This model)|**8192**|**75.50**|**77.61**|82.71|**78.37**|**93.74**|**53.86**|62.00|
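The Avg. column appears to be an average over the 16 individual datasets rather than a plain mean of the six category scores. The sketch below reproduces it as a weighted mean. The per-category dataset counts (6/2/4/1/2/1) are an assumption taken from JMTEB's composition and are not stated in this README; the names `DATASETS_PER_CATEGORY` and `weighted_average` are hypothetical.

```python
# Assumed number of JMTEB datasets in each task category (sums to 16).
DATASETS_PER_CATEGORY = {
    "Retrieval": 6,
    "STS": 2,
    "Classification": 4,
    "Reranking": 1,
    "Clustering": 2,
    "PairClassification": 1,
}


def weighted_average(category_scores):
    """Average category scores, weighting each by its number of datasets."""
    total_datasets = sum(DATASETS_PER_CATEGORY.values())
    weighted_sum = sum(
        category_scores[name] * count
        for name, count in DATASETS_PER_CATEGORY.items()
    )
    return weighted_sum / total_datasets


# Category scores for sarashina-embedding-v1-1b, copied from the table above.
sarashina = {
    "Retrieval": 77.61,
    "STS": 82.71,
    "Classification": 78.37,
    "Reranking": 93.74,
    "Clustering": 53.86,
    "PairClassification": 62.00,
}

print(round(weighted_average(sarashina), 2))
# 75.5
```

Under these assumed counts the result matches the reported 75.50, so retrieval, with the most datasets, has the largest influence on the overall score.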
+## License
+This model is licensed under the [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE).
+**If you are interested in using this model for commercial purposes, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/#contact).**

README_JA.md ADDED

	@@ -0,0 +1,140 @@

+---
+language:
+- ja
+license_name: sarashina-non-commercial-license
+license_link: LICENSE
+tags:
+- transformers
+- sentence-similarity
+- feature-extraction
+- sentence-transformers
+inference: false
+---
+# sarashina-embedding-v1-1b
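+"sarashina-embedding-v1-1b" is a Japanese text embedding model based on the 1.2B-parameter Japanese LLM "Sarashina".
+We trained this model with multi-stage contrastive learning. As of 2024/12/1, it achieved the state-of-the-art average score across the 16 datasets of [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB) (Japanese Massive Text Embedding Benchmark).
+This model maps sentences and paragraphs to a 1792-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+## Model Details
+### Model Description
+- **Model Type:** Sentence Transformer
+- **Maximum Sequence Length:** 8192 tokens
+- **Output Dimensionality:** 1792 dimensions
+- **Similarity Function:** Cosine Similarity
+- **Language:** Japanese
+- **License:** [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE)
+### Model Architecture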
+```
+SentenceTransformer(
+  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: LlamaModel
+  (1): Pooling({'word_embedding_dimension': 1792, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': False})
+)
+```
+## Usage
+### Using Sentence Transformers
+First, install the Sentence Transformers library.
+```bash
+pip install -U sentence-transformers
+```
+Next, load this model and run inference.
+```python
+from sentence_transformers import SentenceTransformer
+# Download the model from the 🤗 Hub
+model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")
+# Run inference
+sentences = [
+    '更級日記は、平安時代中期に菅原孝標女によって書かれた回想録です。',
+    'Sarashinaは、SB Intuitionsが開発した日本語大規模言語モデルです。これまでに7B, 13B, 70B, 8x70Bのモデルが公開されています。',
+    '更科蕎麦とはなんですか?'
+]
+embeddings = model.encode(sentences)
+print(embeddings.shape)
+# (3, 1792)
+# Get the similarity scores for the embeddings
+similarities = model.similarity(embeddings, embeddings)
+print(similarities.shape)
+# torch.Size([3, 3])
+```
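+**Note**
+- You do not need to add prefixes such as "Query: " and "Document: " at the beginning of the input sentence.
+- This model is licensed under the [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE), which restricts commercial use. If you are interested in using this model in your business, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/#contact).
+## Training
+sarashina-embedding-v1-1b was trained through the following two-stage process.
+### Stage 1: Weakly-supervised Learning
+To achieve general-purpose text embedding performance across a wide range of domains, we performed contrastive learning on weakly-supervised data consisting of our own web-crawled data and open data.
+#### Dataset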
+|dataset|counts|
+|:-:|:-:|
+|AutoWikiQA|50,521,135|
+|web-crawled data|47,370,649|
+|MQA|12,941,472|
+|llm-japanese-dataset|9,074,340|
+|wikipedia|5,555,212|
+|Quiz dataset|988,478|
+|Natural Questions|132,796|
+|JSQuAD|62,859|
+|snow|62,758|
+|JaQuAD|31,746|
+|mkqa|3,318|
+|||
+|**total**|**126,744,763**|
+### Stage 2: Fine-tuning
+To enable the model to learn more accurate query-document similarity, we performed fine-tuning on the following datasets.
+#### Dataset
+|dataset|counts|
+|:-:|:-:|
+|JSNLI|141,388 |
+|NU-MNLI|67,987|
+|Mr. TyDi (only Japanese subset)| 3,697 |
+|Natural Questions (sampled)| 20,000|
+|||
+|**total**|**233,072**|
+## Benchmarks
+### [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB)
+| Model                                         |Max Tokens|Avg.      | Retrieval   | STS       | Classification   | Reranking   | Clustering   | PairClassification   |
+|:----------------------------------------------|:----------|:----------|:------------|:----------|:-----------------|:------------|:-------------|:---------------------|
+| OpenAI/text-embedding-3-large                 | 8191 |74.05 | 74.48   | 82.52     | 77.58        | 93.58   | 53.32        | 62.35                |
+| [cl-nagoya/ruri-large](https://huggingface.co/cl-nagoya/ruri-large)    | 512 |73.31     | 73.02       | **83.13**     | 77.43            | 92.99       | 51.82        | 62.29                |
+| [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2)    | 512 |72.23     | 73.36       | 82.96     | 74.21            | 93.01       | 48.65        | **62.37**                |
+| [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja)     |1024 |72.04     | 73.21       | 81.39     | 72.41            | 92.69       | 53.23        | 61.74                |
+| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)          | 512|70.90     | 70.98       | 79.70     | 72.89            | 92.96       | 51.24        | 62.15                |
+|||
+|[**sarashina-embedding-v1-1b**](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b)(This model)|**8192**|**75.50**|**77.61**|82.71|**78.37**|**93.74**|**53.86**|62.00|
+## License
+This model is released under the [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE).
+**If you are interested in using this model for commercial purposes, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/#contact).**