akiFQCint committed on
Commit
7b3afd8
1 Parent(s): 9849de0

fix readme

Files changed (2)
  1. README.md +7 -9
  2. README_JA.md +11 -11
README.md CHANGED
@@ -109,13 +109,13 @@ To achieve generic text embedding performance across a wide range of domains, we
  |web-crawled data (ours)|47,370,649|
  |[MQA](https://huggingface.co/datasets/hpprc/mqa-ja)|12,941,472|
  |[llm-japanese-dataset](https://huggingface.co/datasets/izumi-lab/llm-japanese-dataset)|9,074,340|
- |[wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)|5,555,212|
+ |[Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)|5,555,212|
  |Quiz dataset (ours)|988,478|
  |[Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval)|132,796|
  |[JSQuAD](https://huggingface.co/datasets/sbintuitions/JSQuAD)|62,859|
- |[snow](https://huggingface.co/datasets/cl-nagoya/ruri-dataset-ft)|62,758|
+ |[SNOW(T15+T23)](https://aclanthology.org/L18-1185)|62,758|
  |[JaQuAD](https://huggingface.co/datasets/SkelterLabsInc/JaQuAD)|31,746|
- |[mkqa](https://huggingface.co/datasets/cl-nagoya/ruri-dataset-ft)|3,318|
+ |[MKQA](https://aclanthology.org/2021.tacl-1.82)|3,318|
  |||
  |**total**|**126,744,763**|
 
@@ -130,7 +130,7 @@ To enable the model to learn a more accurate query-document similarity, we perfo
  |[JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)|141,388 |
  |[NU-MNLI](https://huggingface.co/datasets/cl-nagoya/nu-mnli)|67,987|
  |[Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi) (only Japanese subset)| 3,697 |
- |[Natural Question](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval) (sampled)| 20,000|
+ |[Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval) (sampled)| 20,000|
  |||
  |**total**|**233,072**|
 
@@ -138,18 +138,16 @@ To enable the model to learn a more accurate query-document similarity, we perfo
 
  Model |Max Tokens|Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
  |:----------------------------------------------|:----------|:----------|:------------|:----------|:-----------------|:------------|:-------------|:---------------------|
- | [OpenAI/text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/)[^1] | 8191 |74.05 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
- | [cl-nagoya/ruri-large](https://huggingface.co/intfloat/multilingual-e5-large) | 512 |73.31 | 73.02 | **83.13** | 77.43 | 92.99 | 51.82 | 62.29 |
+ | [OpenAI/text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/)[^oai] | 8191 |74.05 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
+ | [cl-nagoya/ruri-large](https://arxiv.org/abs/2409.07737) | 512 |73.31 | 73.02 | **83.13** | 77.43 | 92.99 | 51.82 | 62.29 |
  | [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 512 |72.23 | 73.36 | 82.96 | 74.21 | 93.01 | 48.65 | **62.37** |
  | [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja) |1024 |72.04 | 73.21 | 81.39 | 72.41 | 92.69 | 53.23 | 61.74 |
  | [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 512|70.90 | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 |
  |||
- |[**sarashina-embedding-v1-1b**](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b)(This model)|**8192**|**75.50**|**77.61**|82.71|**78.37**|**93.74**|**53.86**|62.00|
+ |[**Sarashina-Embedding-v1-1B**](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b)(This model)|**8192**|**75.50**|**77.61**|82.71|**78.37**|**93.74**|**53.86**|62.00|
 
  ## License
 
  This model is licensed under [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE).
 
  **If you are interested in using this model for commercial purposes, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/#contact).**
-
- [^1]: Benchmarked on April 23, 2024.
 
 
README_JA.md CHANGED
@@ -96,7 +96,7 @@ print(similarities.shape)
 
  ### Stage 1: 弱教師あり学習
 
- 幅広いドメインに対して汎用的なテキスト埋め込みの性能を達成するために、私たちは、独自webクロールデータとオープンデータで構成された弱教師データによる対照学習を行いました。
+ 幅広いドメインに対して汎用的かつ高い性能を持つ埋め込みモデルを構築するため、私たちは、独自webクロールデータとオープンデータで構成された弱教師データによる対照学習を行いました。
 
  #### データセット
 
@@ -106,19 +106,19 @@ print(similarities.shape)
  |web-crawled data (ours)|47,370,649|
  |[MQA](https://huggingface.co/datasets/hpprc/mqa-ja)|12,941,472|
  |[llm-japanese-dataset](https://huggingface.co/datasets/izumi-lab/llm-japanese-dataset)|9,074,340|
- |[wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)|5,555,212|
+ |[Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)|5,555,212|
  |Quiz dataset (ours)|988,478|
  |[Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval)|132,796|
  |[JSQuAD](https://huggingface.co/datasets/sbintuitions/JSQuAD)|62,859|
- |[snow](https://huggingface.co/datasets/cl-nagoya/ruri-dataset-ft)|62,758|
+ |[SNOW(T15+T23)](https://aclanthology.org/L18-1185)|62,758|
  |[JaQuAD](https://huggingface.co/datasets/SkelterLabsInc/JaQuAD)|31,746|
- |[mkqa](https://huggingface.co/datasets/cl-nagoya/ruri-dataset-ft)|3,318|
+ |[MKQA](https://aclanthology.org/2021.tacl-1.82)|3,318|
  |||
  |**total**|**126,744,763**|
 
  ### Stage 2: ファインチューニング
 
- より正確なクエリ-ドキュメント間の類似度をモデルに学習させるために、私たちは以下のようなデータセットでファインチューニングを行いました。
+ より正確なクエリ-ドキュメント間の類似度をモデルに学習させるために、私たちは以下のデータセットでファインチューニングを行いました。
 
  #### データセット
 
@@ -127,7 +127,7 @@ print(similarities.shape)
  |[JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)|141,388 |
  |[NU-MNLI](https://huggingface.co/datasets/cl-nagoya/nu-mnli)|67,987|
  |[Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi) (only Japanese subset)| 3,697 |
- |[Natural Question](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval) (sampled)| 20,000|
+ |[Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval) (sampled)| 20,000|
  |||
  |**total**|**233,072**|
 
@@ -135,18 +135,18 @@ print(similarities.shape)
 
  Model |Max Tokens|Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
  |:----------------------------------------------|:----------|:----------|:------------|:----------|:-----------------|:------------|:-------------|:---------------------|
- | [OpenAI/text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/)[^1] | 8191 |74.05 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
- | [cl-nagoya/ruri-large](https://huggingface.co/intfloat/multilingual-e5-large) | 512 |73.31 | 73.02 | **83.13** | 77.43 | 92.99 | 51.82 | 62.29 |
+ | [OpenAI/text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/)[^oai] | 8191 |74.05 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
+ | [cl-nagoya/ruri-large](https://arxiv.org/abs/2409.07737) | 512 |73.31 | 73.02 | **83.13** | 77.43 | 92.99 | 51.82 | 62.29 |
  | [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 512 |72.23 | 73.36 | 82.96 | 74.21 | 93.01 | 48.65 | **62.37** |
  | [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja) |1024 |72.04 | 73.21 | 81.39 | 72.41 | 92.69 | 53.23 | 61.74 |
  | [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 512|70.90 | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 |
  |||
- |[**sarashina-embedding-v1-1b**](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b)(This model)|**8192**|**75.50**|**77.61**|82.71|**78.37**|**93.74**|**53.86**|62.00|
+ |[**Sarashina-Embedding-v1-1B**](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b)(This model)|**8192**|**75.50**|**77.61**|82.71|**78.37**|**93.74**|**53.86**|62.00|
 
  ## ライセンス
 
  このモデルは[Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE)に基づいて公開されています.
 
- **もしこのモデルの商用利用に興味がある場合は、気軽に[コンタクトページ](https://www.sbintuitions.co.jp/#contact)にご連絡ください。**
+ **もしこのモデルの商用利用にご興味がある場合は、お気軽に[コンタクトページ](https://www.sbintuitions.co.jp/#contact)へご連絡ください。**
 
- [^1]: Benchmarked on April 23, 2024.
+ [^oai]: Benchmarked on April 23, 2024.