Sentence Similarity
sentence-transformers
PyTorch
Transformers
Japanese
luke
feature-extraction
akiFQC committed on
Commit
2beb751
1 Parent(s): d05d4b2
1_Pooling/config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false
+ }
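This pooling configuration tells sentence-transformers to build 768-dimensional sentence vectors by mean pooling: token embeddings are averaged over the real (non-padding) tokens. A minimal sketch of that operation, assuming PyTorch tensors shaped like the encoder output (not the library's internal code):

```python
import torch

def mean_pooling(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over non-padding positions.

    token_embeddings: (batch, seq_len, 768); attention_mask: (batch, seq_len).
    """
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)    # zero out padding, sum real tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)         # number of real tokens per sentence
    return summed / counts                           # (batch, 768) sentence vectors
```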
README.md CHANGED
@@ -1,3 +1,106 @@
  ---
+ pipeline_tag: sentence-similarity
+ language: ja
  license: apache-2.0
+ tags:
+ - transformers
+ - sentence-similarity
+ - feature-extraction
+ - sentence-transformers
+ inference: false
+ datasets:
+ - mc4
+ - clips/mqa
+ - shunk031/JGLUE
+ - paws-x
+ - hpprc/janli
+ - MoritzLaurer/multilingual-NLI-26lang-2mil7
+ - castorini/mr-tydi
+ - hpprc/jsick
  ---
+
+ # GLuCoSE (General Luke-based COntrastive Sentence Embedding)-base-Japanese
+
+ GLuCoSE (General LUke-based COntrastive Sentence Embedding, "glucose") is a Japanese text embedding model based on [LUKE](https://github.com/studio-ousia/luke). To make it a general-purpose, easy-to-use Japanese text embedding model, GLuCoSE was trained on a mix of web data and several datasets for natural language inference and retrieval. It is suitable not only for sentence-similarity tasks but also for semantic search.
+ - Maximum token count: 512
+ - Output dimension: 768
+ - Pooling: mean pooling
+ - Supported language: Japanese
+
+ ## Usage
+ You can use this model easily with [sentence-transformers](https://www.SBERT.net).
+
+ First, install sentence-transformers with pip:
+
+ ```
+ pip install -U sentence-transformers
+ ```
+
+ Then load the model and convert sentences into dense vectors:
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ sentences = [
+     "PKSHA Technologyは機械学習/深層学習技術に関わるアルゴリズムソリューションを展開している。",
+     "この深層学習モデルはPKSHA Technologyによって学習され、公開された。",
+     "広目天は、仏教における四天王の一尊であり、サンスクリット語の「種々の眼をした者」を名前の由来とする。",
+ ]
+
+ model = SentenceTransformer('pkshatech/GLuCoSE-base-ja')
+ embeddings = model.encode(sentences)
+ print(embeddings)
+ ```
+
+ Since the training loss is based on cosine similarity, we recommend using cosine similarity as the similarity measure in downstream tasks.
+
+ This text embedding model can also be used with LangChain. Please refer to [this page](https://python.langchain.com/docs/modules/data_connection/text_embedding/integrations/sentence_transformers) for details.
+
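+ As a concrete illustration, the `cos_sim` helper bundled with sentence-transformers computes pairwise cosine similarities (a minimal sketch reusing `embeddings` from the usage snippet above):
+
+ ```python
+ from sentence_transformers import util
+
+ # Cosine similarity between every pair of sentence vectors from model.encode().
+ scores = util.cos_sim(embeddings, embeddings)
+ print(scores)  # semantically related sentences should score highest
+ ```
+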
+ ## Resources Used
+ The following resources were used to train this model.
+ ### Pre-trained model
+ - [studio-ousia/luke-japanese-base-lite](https://huggingface.co/studio-ousia/luke-japanese-base-lite)
+
+ ### Datasets
+ - [mC4](https://huggingface.co/datasets/mc4)
+ - [MQA](https://huggingface.co/datasets/clips/mqa)
+ - [JNLI](https://github.com/yahoojapan/JGLUE)
+ - [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)
+ - [JaNLI](https://github.com/verypluming/JaNLI)
+ - [PAWS-X](https://huggingface.co/datasets/paws-x)
+ - [JSeM](https://github.com/DaisukeBekki/JSeM)
+ - [MoritzLaurer/multilingual-NLI-26lang-2mil7](https://huggingface.co/datasets/MoritzLaurer/multilingual-NLI-26lang-2mil7)
+ - [MultiNLI](https://huggingface.co/datasets/multi_nli)
+ - [WANLI](https://huggingface.co/datasets/alisawuffles/WANLI)
+ - [FeverNLI](https://github.com/easonnie/combine-FEVER-NSMN/blob/master/other_resources/nli_fever.md)
+ - [LingNLI](https://arxiv.org/pdf/2104.07179.pdf)
+ - [JSICK](https://github.com/verypluming/JSICK)
+ - [Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi)
+ - [JSTS](https://github.com/yahoojapan/JGLUE) (used for validation) [^1]
+
+ ## Benchmarks
+ ### Semantic Similarity Calculation ([JSTS](https://github.com/yahoojapan/JGLUE) dev set)
+ Evaluated by Spearman's and Pearson's correlation coefficients.
+
+ | Model | Spearman | Pearson |
+ | --- | --- | --- |
+ | [text-embedding-ada-002](https://platform.openai.com/docs/guides/embeddings) | 0.837[^2] | 0.790[^2] |
+ | [pkshatech/simcse-ja-bert-base-clcmlp](https://huggingface.co/pkshatech/simcse-ja-bert-base-clcmlp)[^3] | 0.850 | 0.801 |
+ | pkshatech/GLuCoSE-base-ja | **0.865** | **0.821** |
+
+ ### Zero-shot Search ([AIO3](https://sites.google.com/view/project-aio/competition3?authuser=0) dev set)
+ Evaluated by top-k retrieval accuracy[^4] (the fraction of questions for which at least one of the top-k retrieved documents contains a correct answer).
+
+ | Model | Top-1 | Top-5 | Top-10 | Top-50 |
+ | --- | --- | --- | --- | --- |
+ | [text-embedding-ada-002](https://platform.openai.com/docs/guides/embeddings) | 33.50 | **57.80** | 65.10 | 76.60 |
+ | [pkshatech/simcse-ja-bert-base-clcmlp](https://huggingface.co/pkshatech/simcse-ja-bert-base-clcmlp)[^3] | 30.60 | 54.50 | 62.50 | 76.70 |
+ | pkshatech/GLuCoSE-base-ja | **34.20** | 57.50 | **65.20** | **77.60** |
+
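+ As an illustration of this metric (a sketch, not the evaluation script actually used here), top-k retrieval accuracy could be computed as:
+
+ ```python
+ def top_k_accuracy(relevance: list[list[bool]], k: int) -> float:
+     # relevance[i][j] is True if the j-th ranked document answers question i.
+     hits = sum(any(ranked[:k]) for ranked in relevance)
+     return hits / len(relevance)
+ ```
+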
+ ## License
+ This model is published under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
+
+ [^1]: When we trained this model, the JGLUE test data had not been released, so we used the JGLUE dev set as private evaluation data. As a knock-on effect, we selected the checkpoint on the JGLUE train set instead of its dev set.
+ [^2]: https://qiita.com/akeyhero/items/ce371bfed64399027c23
+ [^3]: This is the model that PKSHA Technology published before this one.
+ [^4]: For more details, please refer to https://arxiv.org/pdf/2004.04906.pdf.
README_JA.md ADDED
@@ -0,0 +1,95 @@
+ ---
+ pipeline_tag: sentence-similarity
+ language: ja
+ license: apache-2.0
+ tags:
+ - transformers
+ - sentence-similarity
+ - feature-extraction
+ - sentence-transformers
+ inference: false
+ ---
+
+ # GLuCoSE (General Luke-based COntrastive Sentence Embedding)-base-Japanese
+
+ GLuCoSE (General LUke-based COntrastive Sentence Embedding, "glucose") is a Japanese text embedding model based on [LUKE](https://github.com/studio-ousia/luke). Aiming for a general-purpose text embedding model that is easy to use casually, it was trained on web data combined with multiple datasets for natural language inference and retrieval. It can be used not only for sentence-similarity tasks but also for semantic search.
+ - Maximum token count: 512
+ - Output dimension: 768
+ - Pooling: mean pooling
+ - Supported language: Japanese
+
+ ## Usage
+ You can use this model easily with [sentence-transformers](https://www.SBERT.net).
+
+ First, install sentence-transformers with pip:
+
+ ```
+ pip install -U sentence-transformers
+ ```
+
+ Then load the model and convert sentences into dense vectors:
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ sentences = [
+     "PKSHA Technologyは機械学習/深層学習技術に関わるアルゴリズムソリューションを展開している。",
+     "この深層学習モデルはPKSHA Technologyによって学習され、公開された。",
+     "広目天は、仏教における四天王の一尊であり、サンスクリット語の「種々の眼をした者」を名前の由来とする。",
+ ]
+
+ model = SentenceTransformer('pkshatech/GLuCoSE-base-ja')
+ embeddings = model.encode(sentences)
+ print(embeddings)
+ ```
+
+ Since the training loss uses cosine similarity, we recommend cosine similarity for similarity calculations in downstream tasks.
+
+ This text embedding model can also be used with LangChain. Please refer to [this page](https://python.langchain.com/docs/modules/data_connection/text_embedding/integrations/sentence_transformers) for details.
+
+ ## Resources Used
+ The following resources were used to train this model.
+ ### Pre-trained model
+ - [studio-ousia/luke-japanese-base-lite](https://huggingface.co/studio-ousia/luke-japanese-base-lite)
+
+ ### Datasets
+ - [mC4](https://huggingface.co/datasets/mc4)
+ - [MQA](https://huggingface.co/datasets/clips/mqa)
+ - [JNLI](https://github.com/yahoojapan/JGLUE)
+ - [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)
+ - [JaNLI](https://github.com/verypluming/JaNLI)
+ - [PAWS-X](https://huggingface.co/datasets/paws-x)
+ - [JSeM](https://github.com/DaisukeBekki/JSeM)
+ - [MoritzLaurer/multilingual-NLI-26lang-2mil7](https://huggingface.co/datasets/MoritzLaurer/multilingual-NLI-26lang-2mil7)
+ - [MultiNLI](https://huggingface.co/datasets/multi_nli)
+ - [WANLI](https://huggingface.co/datasets/alisawuffles/WANLI)
+ - [FeverNLI](https://github.com/easonnie/combine-FEVER-NSMN/blob/master/other_resources/nli_fever.md)
+ - [LingNLI](https://arxiv.org/pdf/2104.07179.pdf)
+ - [JSICK](https://github.com/verypluming/JSICK)
+ - [Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi)
+ - [JSTS](https://github.com/yahoojapan/JGLUE) (used for validation) [^1]
+
+ ## Benchmarks
+ ### Semantic Similarity Calculation ([JSTS](https://github.com/yahoojapan/JGLUE) dev set)
+ Evaluated by Spearman's and Pearson's correlation coefficients.
+
+ | Model | Spearman | Pearson |
+ | --- | --- | --- |
+ | [text-embedding-ada-002](https://platform.openai.com/docs/guides/embeddings) | 0.837[^2] | 0.790[^2] |
+ | [pkshatech/simcse-ja-bert-base-clcmlp](https://huggingface.co/pkshatech/simcse-ja-bert-base-clcmlp)[^3] | 0.850 | 0.801 |
+ | pkshatech/GLuCoSE-base-ja | **0.865** | **0.821** |
+
+ ### Zero-shot Search ([AIO3](https://sites.google.com/view/project-aio/competition3?authuser=0) dev set)
+ Evaluated by top-k retrieval accuracy[^4] (the fraction of questions for which at least one of the top-k retrieved documents contains a correct answer).
+
+ | Model | Top-1 | Top-5 | Top-10 | Top-50 |
+ | --- | --- | --- | --- | --- |
+ | [text-embedding-ada-002](https://platform.openai.com/docs/guides/embeddings) | 33.50 | **57.80** | 65.10 | 76.60 |
+ | [pkshatech/simcse-ja-bert-base-clcmlp](https://huggingface.co/pkshatech/simcse-ja-bert-base-clcmlp)[^3] | 30.60 | 54.50 | 62.50 | 76.70 |
+ | pkshatech/GLuCoSE-base-ja | **34.20** | 57.50 | **65.20** | **77.60** |
+
+ ## License
+ This model is published under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
+
+ [^1]: When we trained this model, the JGLUE test data had not been released, so we used the JGLUE dev set as private evaluation data. As a knock-on effect, we selected the checkpoint on the JGLUE train set instead of its dev set.
+ [^2]: https://qiita.com/akeyhero/items/ce371bfed64399027c23
+ [^3]: This is the model that PKSHA Technology published before this one.
+ [^4]: For details, see https://arxiv.org/pdf/2004.04906.pdf.
added_tokens.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "<ent2>": 32771,
+   "<ent>": 32770
+ }
config.json ADDED
@@ -0,0 +1,32 @@
+ {
+   "_name_or_path": "models/luke-japanese/hf_xlm_roberta",
+   "architectures": [
+     "LukeModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bert_model_name": "models/luke-japanese/hf_xlm_roberta",
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "cls_entity_prediction": false,
+   "entity_emb_size": 256,
+   "entity_vocab_size": 4,
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "luke",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.28.1",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "use_entity_aware_attention": true,
+   "vocab_size": 32772
+ }
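This is a standard base-size LUKE configuration: 12 layers, 12 attention heads, 768-dimensional hidden states, and a 32,772-token vocabulary including the `<ent>`/`<ent2>` entity markers. If you prefer working below the sentence-transformers abstraction, the raw encoder can be loaded with transformers directly; a minimal sketch (the mean pooling shown earlier must then be applied by hand to get sentence vectors):

```python
from transformers import AutoModel, AutoTokenizer

# AutoModel resolves to LukeModel via the "architectures" field above.
tokenizer = AutoTokenizer.from_pretrained("pkshatech/GLuCoSE-base-ja")
model = AutoModel.from_pretrained("pkshatech/GLuCoSE-base-ja")

inputs = tokenizer("これはテスト文です。", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768)
```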
config_sentence_transformers.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "__version__": {
+     "sentence_transformers": "2.2.2",
+     "transformers": "4.28.1",
+     "pytorch": "2.0.0+cu117"
+   }
+ }
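These are the library versions the checkpoint was exported with. They are not a hard requirement, but if you hit loading incompatibilities, installing versions at least this recent is a reasonable first step:

```
pip install "sentence-transformers>=2.2.2" "transformers>=4.28.1"
```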
entity_vocab.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "[MASK2]": 3,
+   "[MASK]": 0,
+   "[PAD]": 2,
+   "[UNK]": 1
+ }
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   }
+ ]
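modules.json wires the two-stage sentence-transformers pipeline: module 0 is the transformer encoder at the repository root, and module 1 is the pooling layer configured in 1_Pooling/config.json. An equivalent model could be assembled by hand; a sketch, assuming the checkpoint is available locally or on the Hub:

```python
from sentence_transformers import SentenceTransformer, models

# Stage 0: the LUKE encoder; stage 1: mean pooling (see 1_Pooling/config.json).
encoder = models.Transformer("pkshatech/GLuCoSE-base-ja", max_seq_length=512)
pooling = models.Pooling(
    word_embedding_dimension=encoder.get_word_embedding_dimension(),  # 768
    pooling_mode_mean_tokens=True,
)
model = SentenceTransformer(modules=[encoder, pooling])
```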
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b3c698698882f31d4ce74494bcc357c7fe0525c4401907520dd529dffc5b61bc
+ size 532361569
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d8b73a5e054936c920cf5b7d1ec21ce9c281977078269963beb821c6c86fbff7
+ size 841889
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "additional_special_tokens": [
+     {
+       "content": "<ent>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false
+     },
+     {
+       "content": "<ent2>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false
+     }
+   ],
+   "bos_token": "<s>",
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "unk_token": "<unk>"
+ }
tokenizer_config.json ADDED
@@ -0,0 +1,94 @@
+ {
+   "additional_special_tokens": [
+     {
+       "__type": "AddedToken",
+       "content": "<ent>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false
+     },
+     {
+       "__type": "AddedToken",
+       "content": "<ent2>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false
+     },
+     {
+       "__type": "AddedToken",
+       "content": "<ent>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false
+     },
+     {
+       "__type": "AddedToken",
+       "content": "<ent2>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false
+     },
+     {
+       "__type": "AddedToken",
+       "content": "<ent>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false
+     },
+     {
+       "__type": "AddedToken",
+       "content": "<ent2>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false
+     }
+   ],
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "<s>",
+   "entity_mask2_token": "[MASK2]",
+   "entity_mask_token": "[MASK]",
+   "entity_pad_token": "[PAD]",
+   "entity_token_1": {
+     "__type": "AddedToken",
+     "content": "<ent>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "entity_token_2": {
+     "__type": "AddedToken",
+     "content": "<ent2>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "entity_unk_token": "[UNK]",
+   "eos_token": "</s>",
+   "mask_token": {
+     "__type": "AddedToken",
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "max_entity_length": 32,
+   "max_mention_length": 30,
+   "model_max_length": 512,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "sp_model_kwargs": {},
+   "task": null,
+   "tokenizer_class": "MLukeTokenizer",
+   "tokenizer_file": "models/luke-japanese/hf_luke_japanese_lite_epoch20/tokenizer.json",
+   "unk_token": "<unk>"
+ }
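The tokenizer class is MLukeTokenizer, the multilingual LUKE tokenizer built on the SentencePiece model above; the repeated `<ent>`/`<ent2>` entries in `additional_special_tokens` are preserved here exactly as they appear in the file. The tokenizer can be loaded on its own; a minimal sketch:

```python
from transformers import MLukeTokenizer

tokenizer = MLukeTokenizer.from_pretrained("pkshatech/GLuCoSE-base-ja")
enc = tokenizer("広目天は四天王の一尊である。", return_tensors="pt")
print(enc["input_ids"].shape)  # token ids, capped at model_max_length=512
```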