fix readme
- README.md (+12 -1)
- README_JA.md (+1 -1)
README.md
CHANGED
@@ -27,7 +27,7 @@ datasets:
 
 **[日本語のREADME/Japanese README](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/README_JA.md)**
 
 "Sarashina-Embedding-v1-1B" is a Japanese text embedding model, based on the 1.2B-parameter Japanese LLM "[Sarashina2.1-1B](https://huggingface.co/sbintuitions/sarashina2.1-1b)".
 
-We trained this model with multi-stage contrastive learning. We achieved the state-of-the-art average score over the 16 datasets in [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB)(Japanese Massive Text Embedding Benchmark).
+We trained this model with multi-stage contrastive learning. We achieved the state-of-the-art average score over the 16 datasets in [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB) (Japanese Massive Text Embedding Benchmark).
 
 This model maps sentences & paragraphs to a 1792-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
@@ -120,6 +120,17 @@ To achieve generic text embedding performance across a wide range of domains, we
 
 To enable the model to learn a more accurate query-document similarity, we performed supervised fine-tuning using the following dataset.
 
+#### Dataset
+
+|dataset|counts|
+|:-:|:-:|
+|JSNLI|141,388|
+|NU-MNLI|67,987|
+|Mr. TyDi (only Japanese subset)|3,697|
+|Natural Question (sampled)|20,000|
+|||
+|**total**|**233,072**|
+
 # Evaluation Results with [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB)
 
 Model |Max Tokens|Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
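The README text in the diff above says the model maps text to a 1792-dimensional dense vector space that can be scored for semantic textual similarity. As a minimal sketch of that scoring step, assuming cosine similarity as the comparison metric and using random NumPy vectors as stand-ins for real model outputs (loading the actual checkpoint is out of scope here):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for model outputs: two 1792-dimensional vectors,
# matching the embedding dimension stated in the README.
rng = np.random.default_rng(0)
emb_a = rng.standard_normal(1792)
emb_b = rng.standard_normal(1792)

score = cosine_similarity(emb_a, emb_b)
assert -1.0 <= score <= 1.0                              # cosine is bounded
assert abs(cosine_similarity(emb_a, emb_a) - 1.0) < 1e-9  # self-similarity is 1
```

In practice the two vectors would come from encoding a query and a document with the model; the higher the cosine score, the closer the texts in the embedding space.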
README_JA.md
CHANGED
@@ -24,7 +24,7 @@ datasets:
 
 「Sarashina-embedding-v1-1b」は、1.2Bパラメータの日本語LLM「[Sarashina2.1-1B](https://huggingface.co/sbintuitions/sarashina2.1-1b)」をベースにした日本語テキスト埋め込みモデルです。
 
-このモデルは、マルチステージの対照学習で訓練し、[JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB)(Japanese Massive Text Embedding Benchmark)の16個のデータセットの平均で、(2024/12/1時点で)最高水準の平均スコアを達成しました。
+このモデルは、マルチステージの対照学習で訓練し、[JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB) (Japanese Massive Text Embedding Benchmark)の16個のデータセットの平均で、(2024/12/1時点で)最高水準の平均スコアを達成しました。
 
 このモデルは、文や段落を1792次元の高密度ベクトル空間にマッピングし、意味的テキスト類似度、意味的検索、paraphrase mining、テキスト分類、クラスタリングなどに使用できます。