sbintuitions/sarashina-embedding-v1-1b

@@ -109,13 +109,13 @@ To achieve generic text embedding performance across a wide range of domains, we
 |web-crawled data  (ours)|47,370,649|
 |[MQA](https://huggingface.co/datasets/hpprc/mqa-ja)|12,941,472|
 |[llm-japanese-dataset](https://huggingface.co/datasets/izumi-lab/llm-japanese-dataset)|9,074,340|
-|[wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)|5,555,212|
 |Quiz dataset (ours)|988,478|
 |[Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval)|132,796|
 |[JSQuAD](https://huggingface.co/datasets/sbintuitions/JSQuAD)|62,859|
-|[snow](https://huggingface.co/datasets/cl-nagoya/ruri-dataset-ft)|62,758|
 |[JaQuAD](https://huggingface.co/datasets/SkelterLabsInc/JaQuAD)|31,746|
-|[mkqa](https://huggingface.co/datasets/cl-nagoya/ruri-dataset-ft)|3,318|
 |||
 |**total**|**126,744,763**|
@@ -130,7 +130,7 @@ To enable the model to learn a more accurate query-document similarity, we perfo
 |[JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)|141,388 |
 |[NU-MNLI](https://huggingface.co/datasets/cl-nagoya/nu-mnli)|67,987|
 |[Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi) (only Japanese subset)| 3,697 |
-|[Natural Question](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval) (sampled)| 20,000|
 |||
 |**total**|**233,072**|
@@ -138,18 +138,16 @@ To enable the model to learn a more accurate query-document similarity, we perfo
  Model                                         |Max Tokens|Avg.      | Retrieval   | STS       | Classification   | Reranking   | Clustering   | PairClassification   |
 |:----------------------------------------------|:----------|:----------|:------------|:----------|:-----------------|:------------|:-------------|:---------------------|
-| [OpenAI/text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/)[^1] | 8191 |74.05 | 74.48   | 82.52     | 77.58        | 93.58   | 53.32        | 62.35                |
-| [cl-nagoya/ruri-large](https://huggingface.co/intfloat/multilingual-e5-large)    | 512 |73.31     | 73.02       | **83.13**     | 77.43            | 92.99       | 51.82        | 62.29                |
 | [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2)    | 512 |72.23     | 73.36       | 82.96     | 74.21            | 93.01       | 48.65        | **62.37**                |
 | [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja)     |1024 |72.04     | 73.21       | 81.39     | 72.41            | 92.69       | 53.23        | 61.74                |
 | [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)          | 512|70.90     | 70.98       | 79.70     | 72.89            | 92.96       | 51.24        | 62.15                |
 |||
-|[**sarashina-embedding-v1-1b**](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b)(This model)|**8192**|**75.50**|**77.61**|82.71|**78.37**|**93.74**|**53.86**|62.00|
 ## License
 This model is licensed under  [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE).
 **If you are interested in using this model for commercial purposes, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/#contact).**
-[^1]: Benchmarked on April 23, 2024.

 |web-crawled data  (ours)|47,370,649|
 |[MQA](https://huggingface.co/datasets/hpprc/mqa-ja)|12,941,472|
 |[llm-japanese-dataset](https://huggingface.co/datasets/izumi-lab/llm-japanese-dataset)|9,074,340|
+|[Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)|5,555,212|
 |Quiz dataset (ours)|988,478|
 |[Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval)|132,796|
 |[JSQuAD](https://huggingface.co/datasets/sbintuitions/JSQuAD)|62,859|
+|[SNOW(T15+T23)](https://aclanthology.org/L18-1185)|62,758|
 |[JaQuAD](https://huggingface.co/datasets/SkelterLabsInc/JaQuAD)|31,746|
+|[MKQA](https://aclanthology.org/2021.tacl-1.82)|3,318|
 |||
 |**total**|**126,744,763**|
 |[JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)|141,388 |
 |[NU-MNLI](https://huggingface.co/datasets/cl-nagoya/nu-mnli)|67,987|
 |[Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi) (only Japanese subset)| 3,697 |
+|[Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval) (sampled)| 20,000|
 |||
 |**total**|**233,072**|
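
As a quick sanity check on the updated tables, the row counts can be summed and compared with the reported totals. The sketch below copies the numbers from the rows visible in this diff; the variable names are illustrative only. The Stage 2 rows add up to the reported total, while the visible Stage 1 rows fall short of it, presumably because the hunk starts partway through that table.

```python
# Row counts copied from the rows visible in this diff (names are illustrative).
stage1_visible = {
    "web-crawled data (ours)": 47_370_649,
    "MQA": 12_941_472,
    "llm-japanese-dataset": 9_074_340,
    "Wikipedia": 5_555_212,
    "Quiz dataset (ours)": 988_478,
    "Natural Questions": 132_796,
    "JSQuAD": 62_859,
    "SNOW(T15+T23)": 62_758,
    "JaQuAD": 31_746,
    "MKQA": 3_318,
}
stage2 = {
    "JSNLI": 141_388,
    "NU-MNLI": 67_987,
    "Mr. TyDi (Japanese subset)": 3_697,
    "Natural Questions (sampled)": 20_000,
}

STAGE1_REPORTED_TOTAL = 126_744_763
STAGE2_REPORTED_TOTAL = 233_072

stage1_visible_sum = sum(stage1_visible.values())
stage2_sum = sum(stage2.values())

# Stage 2: the visible rows account for the whole reported total.
print("Stage 2 matches:", stage2_sum == STAGE2_REPORTED_TOTAL)

# Stage 1: the remainder is assumed to come from rows above the hunk.
stage1_remainder = STAGE1_REPORTED_TOTAL - stage1_visible_sum
print("Stage 1 rows not shown in this hunk:", stage1_remainder)
```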
  Model                                         |Max Tokens|Avg.      | Retrieval   | STS       | Classification   | Reranking   | Clustering   | PairClassification   |
 |:----------------------------------------------|:----------|:----------|:------------|:----------|:-----------------|:------------|:-------------|:---------------------|
+| [OpenAI/text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/)[^oai]       | 8191 |74.05 | 74.48   | 82.52     | 77.58        | 93.58   | 53.32        | 62.35                |
+| [cl-nagoya/ruri-large](https://arxiv.org/abs/2409.07737)    | 512 |73.31     | 73.02       | **83.13**     | 77.43            | 92.99       | 51.82        | 62.29                |
 | [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2)    | 512 |72.23     | 73.36       | 82.96     | 74.21            | 93.01       | 48.65        | **62.37**                |
 | [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja)     |1024 |72.04     | 73.21       | 81.39     | 72.41            | 92.69       | 53.23        | 61.74                |
 | [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)          | 512|70.90     | 70.98       | 79.70     | 72.89            | 92.96       | 51.24        | 62.15                |
 |||
+|[**Sarashina-Embedding-v1-1B**](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b)(This model)|**8192**|**75.50**|**77.61**|82.71|**78.37**|**93.74**|**53.86**|62.00|
 ## License
 This model is licensed under  [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE).
 **If you are interested in using this model for commercial purposes, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/#contact).**

README_JA.md

@@ -96,7 +96,7 @@ print(similarities.shape)
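 ### Stage 1: Weakly-supervised Learning
-To achieve general-purpose text embedding performance across a wide range of domains, we performed contrastive learning on weakly supervised data consisting of our own web-crawled data and open data.
 #### Datasets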
@@ -106,19 +106,19 @@ print(similarities.shape)
 |web-crawled data  (ours)|47,370,649|
 |[MQA](https://huggingface.co/datasets/hpprc/mqa-ja)|12,941,472|
 |[llm-japanese-dataset](https://huggingface.co/datasets/izumi-lab/llm-japanese-dataset)|9,074,340|
-|[wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)|5,555,212|
 |Quiz dataset (ours)|988,478|
 |[Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval)|132,796|
 |[JSQuAD](https://huggingface.co/datasets/sbintuitions/JSQuAD)|62,859|
-|[snow](https://huggingface.co/datasets/cl-nagoya/ruri-dataset-ft)|62,758|
 |[JaQuAD](https://huggingface.co/datasets/SkelterLabsInc/JaQuAD)|31,746|
-|[mkqa](https://huggingface.co/datasets/cl-nagoya/ruri-dataset-ft)|3,318|
 |||
 |**total**|**126,744,763**|
 ### Stage 2: Fine-tuning
-To enable the model to learn a more accurate query-document similarity, we performed fine-tuning using datasets such as the following.
 #### Datasets
@@ -127,7 +127,7 @@ print(similarities.shape)
 |[JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)|141,388 |
 |[NU-MNLI](https://huggingface.co/datasets/cl-nagoya/nu-mnli)|67,987|
 |[Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi) (only Japanese subset)| 3,697 |
-|[Natural Question](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval) (sampled)| 20,000|
 |||
 |**total**|**233,072**|
@@ -135,18 +135,18 @@ print(similarities.shape)
  Model                                         |Max Tokens|Avg.      | Retrieval   | STS       | Classification   | Reranking   | Clustering   | PairClassification   |
 |:----------------------------------------------|:----------|:----------|:------------|:----------|:-----------------|:------------|:-------------|:---------------------|
-| [OpenAI/text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/)[^1] | 8191 |74.05 | 74.48   | 82.52     | 77.58        | 93.58   | 53.32        | 62.35                |
-| [cl-nagoya/ruri-large](https://huggingface.co/intfloat/multilingual-e5-large)    | 512 |73.31     | 73.02       | **83.13**     | 77.43            | 92.99       | 51.82        | 62.29                |
 | [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2)    | 512 |72.23     | 73.36       | 82.96     | 74.21            | 93.01       | 48.65        | **62.37**                |
 | [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja)     |1024 |72.04     | 73.21       | 81.39     | 72.41            | 92.69       | 53.23        | 61.74                |
 | [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)          | 512|70.90     | 70.98       | 79.70     | 72.89            | 92.96       | 51.24        | 62.15                |
 |||
-|[**sarashina-embedding-v1-1b**](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b)(This model)|**8192**|**75.50**|**77.61**|82.71|**78.37**|**93.74**|**53.86**|62.00|
 ## License
 This model is released under the [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE).
-**If you are interested in commercial use of this model, feel free to contact us via the [contact page](https://www.sbintuitions.co.jp/#contact).**
-[^1]: Benchmarked on April 23, 2024.

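 ### Stage 1: Weakly-supervised Learning
+To build an embedding model that is both general-purpose and high-performing across a wide range of domains, we performed contrastive learning on weakly supervised data consisting of our own web-crawled data and open data.
 #### Datasets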
 |web-crawled data  (ours)|47,370,649|
 |[MQA](https://huggingface.co/datasets/hpprc/mqa-ja)|12,941,472|
 |[llm-japanese-dataset](https://huggingface.co/datasets/izumi-lab/llm-japanese-dataset)|9,074,340|
+|[Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)|5,555,212|
 |Quiz dataset (ours)|988,478|
 |[Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval)|132,796|
 |[JSQuAD](https://huggingface.co/datasets/sbintuitions/JSQuAD)|62,859|
+|[SNOW(T15+T23)](https://aclanthology.org/L18-1185)|62,758|
 |[JaQuAD](https://huggingface.co/datasets/SkelterLabsInc/JaQuAD)|31,746|
+|[MKQA](https://aclanthology.org/2021.tacl-1.82)|3,318|
 |||
 |**total**|**126,744,763**|
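
The README does not state which loss was used for the contrastive learning step described above, so the following is only a generic illustration, assuming an InfoNCE-style objective with in-batch negatives. The function name, the temperature value, and the toy similarity matrices are all hypothetical and are not taken from the model's training code.

```python
import math


def info_nce_loss(sim, temperature=0.05):
    """Mean InfoNCE loss over a batch (illustrative sketch only).

    sim[i][j] is the similarity between query i and document j.
    Document i is treated as the positive for query i; every other
    document in the batch serves as a negative.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the positive at index i.
        total += log_denom - logits[i]
    return total / len(sim)


# Toy batches: in `aligned`, each query is closest to its own document;
# in `swapped`, each query is closest to the other query's document.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

print(info_nce_loss(aligned) < info_nce_loss(swapped))
```

Under this assumed objective, the loss shrinks as each query scores its paired document higher than the other documents in the batch, which is the behavior weakly supervised text pairs are meant to encourage.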
 ### Stage 2: Fine-tuning
+To enable the model to learn a more accurate query-document similarity, we performed fine-tuning using the following datasets.
 #### Datasets
 |[JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)|141,388 |
 |[NU-MNLI](https://huggingface.co/datasets/cl-nagoya/nu-mnli)|67,987|
 |[Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi) (only Japanese subset)| 3,697 |
+|[Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval) (sampled)| 20,000|
 |||
 |**total**|**233,072**|
  Model                                         |Max Tokens|Avg.      | Retrieval   | STS       | Classification   | Reranking   | Clustering   | PairClassification   |
 |:----------------------------------------------|:----------|:----------|:------------|:----------|:-----------------|:------------|:-------------|:---------------------|
+| [OpenAI/text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/)[^oai]       | 8191 |74.05 | 74.48   | 82.52     | 77.58        | 93.58   | 53.32        | 62.35                |
+| [cl-nagoya/ruri-large](https://arxiv.org/abs/2409.07737)    | 512 |73.31     | 73.02       | **83.13**     | 77.43            | 92.99       | 51.82        | 62.29                |
 | [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2)    | 512 |72.23     | 73.36       | 82.96     | 74.21            | 93.01       | 48.65        | **62.37**                |
 | [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja)     |1024 |72.04     | 73.21       | 81.39     | 72.41            | 92.69       | 53.23        | 61.74                |
 | [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)          | 512|70.90     | 70.98       | 79.70     | 72.89            | 92.96       | 51.24        | 62.15                |
 |||
+|[**Sarashina-Embedding-v1-1B**](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b)(This model)|**8192**|**75.50**|**77.61**|82.71|**78.37**|**93.74**|**53.86**|62.00|
 ## License
 This model is released under the [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE).
+**If you are interested in using this model for commercial purposes, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/#contact).**
+[^oai]: Benchmarked on April 23, 2024.