fix readme
- README.md +7 -9
- README_JA.md +11 -11
README.md
CHANGED
@@ -109,13 +109,13 @@ To achieve generic text embedding performance across a wide range of domains, we
 |web-crawled data (ours)|47,370,649|
 |[MQA](https://huggingface.co/datasets/hpprc/mqa-ja)|12,941,472|
 |[llm-japanese-dataset](https://huggingface.co/datasets/izumi-lab/llm-japanese-dataset)|9,074,340|
-|[
+|[Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)|5,555,212|
 |Quiz dataset (ours)|988,478|
 |[Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval)|132,796|
 |[JSQuAD](https://huggingface.co/datasets/sbintuitions/JSQuAD)|62,859|
-|[
+|[SNOW(T15+T23)](https://aclanthology.org/L18-1185)|62,758|
 |[JaQuAD](https://huggingface.co/datasets/SkelterLabsInc/JaQuAD)|31,746|
-|[
+|[MKQA](https://aclanthology.org/2021.tacl-1.82)|3,318|
 |||
 |**total**|**126,744,763**|
 
@@ -130,7 +130,7 @@ To enable the model to learn a more accurate query-document similarity, we perfo
 |[JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)|141,388 |
 |[NU-MNLI](https://huggingface.co/datasets/cl-nagoya/nu-mnli)|67,987|
 |[Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi) (only Japanese subset)| 3,697 |
-|[Natural
+|[Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval) (sampled)| 20,000|
 |||
 |**total**|**233,072**|
 
@@ -138,18 +138,16 @@ To enable the model to learn a more accurate query-document similarity, we perfo
 
 Model |Max Tokens|Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
 |:----------------------------------------------|:----------|:----------|:------------|:----------|:-----------------|:------------|:-------------|:---------------------|
-| [OpenAI/text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/)[^
-| [cl-nagoya/ruri-large](https://
+| [OpenAI/text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/)[^oai] | 8191 |74.05 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
+| [cl-nagoya/ruri-large](https://arxiv.org/abs/2409.07737) | 512 |73.31 | 73.02 | **83.13** | 77.43 | 92.99 | 51.82 | 62.29 |
 | [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 512 |72.23 | 73.36 | 82.96 | 74.21 | 93.01 | 48.65 | **62.37** |
 | [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja) |1024 |72.04 | 73.21 | 81.39 | 72.41 | 92.69 | 53.23 | 61.74 |
 | [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 512|70.90 | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 |
 |||
-|[**
+|[**Sarashina-Embedding-v1-1B**](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b)(This model)|**8192**|**75.50**|**77.61**|82.71|**78.37**|**93.74**|**53.86**|62.00|
 
 ## License
 
 This model is licensed under [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE).
 
 **If you are interested in using this model for commercial purposes, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/#contact).**
-
-[^1]: Benchmarked on April 23, 2024.
README_JA.md
CHANGED
@@ -96,7 +96,7 @@ print(similarities.shape)
 
 ### Stage 1: Weakly Supervised Learning
 
-
+To build an embedding model with generic, high performance across a wide range of domains, we performed contrastive learning on weakly supervised data consisting of our own web-crawled data and open data.
 
 #### Datasets
 
@@ -106,19 +106,19 @@ print(similarities.shape)
 |web-crawled data (ours)|47,370,649|
 |[MQA](https://huggingface.co/datasets/hpprc/mqa-ja)|12,941,472|
 |[llm-japanese-dataset](https://huggingface.co/datasets/izumi-lab/llm-japanese-dataset)|9,074,340|
-|[
+|[Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)|5,555,212|
 |Quiz dataset (ours)|988,478|
 |[Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval)|132,796|
 |[JSQuAD](https://huggingface.co/datasets/sbintuitions/JSQuAD)|62,859|
-|[
+|[SNOW(T15+T23)](https://aclanthology.org/L18-1185)|62,758|
 |[JaQuAD](https://huggingface.co/datasets/SkelterLabsInc/JaQuAD)|31,746|
-|[
+|[MKQA](https://aclanthology.org/2021.tacl-1.82)|3,318|
 |||
 |**total**|**126,744,763**|
 
 ### Stage 2: Fine-tuning
 
-
+To enable the model to learn a more accurate query-document similarity, we fine-tuned it on the following datasets.
 
 #### Datasets
 
@@ -127,7 +127,7 @@ print(similarities.shape)
 |[JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)|141,388 |
 |[NU-MNLI](https://huggingface.co/datasets/cl-nagoya/nu-mnli)|67,987|
 |[Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi) (only Japanese subset)| 3,697 |
-|[Natural
+|[Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval) (sampled)| 20,000|
 |||
 |**total**|**233,072**|
 
@@ -135,18 +135,18 @@ print(similarities.shape)
 
 Model |Max Tokens|Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
 |:----------------------------------------------|:----------|:----------|:------------|:----------|:-----------------|:------------|:-------------|:---------------------|
-| [OpenAI/text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/)[^
-| [cl-nagoya/ruri-large](https://
+| [OpenAI/text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/)[^oai] | 8191 |74.05 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
+| [cl-nagoya/ruri-large](https://arxiv.org/abs/2409.07737) | 512 |73.31 | 73.02 | **83.13** | 77.43 | 92.99 | 51.82 | 62.29 |
 | [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 512 |72.23 | 73.36 | 82.96 | 74.21 | 93.01 | 48.65 | **62.37** |
 | [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja) |1024 |72.04 | 73.21 | 81.39 | 72.41 | 92.69 | 53.23 | 61.74 |
 | [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 512|70.90 | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 |
 |||
-|[**
+|[**Sarashina-Embedding-v1-1B**](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b)(This model)|**8192**|**75.50**|**77.61**|82.71|**78.37**|**93.74**|**53.86**|62.00|
 
 ## License
 
 This model is released under the [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE).
 
-
+**If you are interested in using this model for commercial purposes, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/#contact).**
 
-[^
+[^oai]: Benchmarked on April 23, 2024.
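Review note: the Stage 2 dataset table in this diff restores the Natural Questions (sampled) row while the **total** row stays unchanged, so a quick arithmetic check is possible. The sketch below assumes the four rows visible in the hunk are the complete table (rows above the hunk's first line are not shown on this page); the variable names are illustrative only. The Stage 1 table cannot be checked the same way, because its hunk starts partway through the table.

```python
# Row counts copied from the Stage 2 dataset table shown in this diff.
# Assumption: these four rows are the whole table.
stage2_rows = {
    "JSNLI": 141_388,
    "NU-MNLI": 67_987,
    "Mr. TyDi (only Japanese subset)": 3_697,
    "Natural Questions (sampled)": 20_000,
}

# Value of the **total** row in the same table.
stated_total = 233_072

computed_total = sum(stage2_rows.values())
print(f"computed={computed_total:,} stated={stated_total:,}")

# Fails loudly if the restored row and the stated total disagree.
assert computed_total == stated_total
```

Under that assumption the rows add up to the stated total, so the restored row is consistent with the rest of the table.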