bclavie committed on
Commit
872d105
1 Parent(s): 0adf2f1

Initial commit

1_Pooling/config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": true,
+   "pooling_mode_mean_tokens": false,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false
+ }
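Only `pooling_mode_cls_token` is enabled in this file, so the sentence embedding is the 768-dimensional vector of the `[CLS]` token. The following is a minimal sketch of how the flags could be inspected; `active_pooling_modes` is a hypothetical helper written for illustration and is not part of sentence-transformers.

```python
import json

# Contents of 1_Pooling/config.json, copied from the diff above.
POOLING_CONFIG = """
{
  "word_embedding_dimension": 768,
  "pooling_mode_cls_token": true,
  "pooling_mode_mean_tokens": false,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false
}
"""


def active_pooling_modes(config: dict) -> list[str]:
    """Return the names of the pooling flags that are switched on (hypothetical helper)."""
    return [
        key
        for key, value in config.items()
        if key.startswith("pooling_mode_") and value is True
    ]


config = json.loads(POOLING_CONFIG)
print(active_pooling_modes(config))
```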
README.md ADDED
@@ -0,0 +1,136 @@
+ ---
+ language:
+ - ja
+ pipeline_tag: sentence-similarity
+ tags:
+ - sentence-transformers
+ - feature-extraction
+ - sentence-similarity
+ - transformers
+
+ ---
+
+ # fio-base-japanese-v0.1
+
+ A Japanese version of this page is coming soon (I am still learning Japanese, so please forgive any mistakes!)
+
+ fio-base-japanese-v0.1 is a proof of concept and the first release of the Fio family of Japanese embeddings. It is based on [cl-tohoku/bert-base-japanese-v3](https://huggingface.co/cl-tohoku/bert-base-japanese-v3) and was trained on a limited volume of data on a single GPU.
+
+ For more information, please refer to [my notes on Fio](https://ben.clavie.eu/fio).
+
+ #### Datasets
+
+ Similarity/Entailment:
+ - JSTS (train)
+ - JSNLI (train)
+ - JNLI (train)
+ - JSICK (train)
+
+ Retrieval:
+ - mMARCO (multilingual MS MARCO) (train, 124k sentence pairs, under 2% of the full data)
+ - Mr. TyDi (train)
+ - MIRACL (train, 50% sample)
+ - ~~JSQuAD (train, 50% sample, no LLM enhancement)~~ JSQuAD is not used in the released version, so that it can serve as an unseen test set.
+
+ #### Results
+
+ This table is adapted and truncated (to keep only the most popular models) from [oshizo's benchmarking GitHub repo](https://github.com/oshizo/JapaneseEmbeddingEval). Please check it out for more information and give it a star, as it was very useful!
+
+ Italic denotes the best model for its size (base/large | 768/1024); bold denotes the best overall.
+
+ | Model | JSTS valid-v1.1 | JSICK test | MIRACL dev | Average |
+ |-------------------------------------------------|-----------------|------------|------------|---------|
+ | bclavie/fio-base-japanese-v0.1 | **_0.863_** | **_0.894_** | 0.715 | _0.824_ |
+ | cl-nagoya/sup-simcse-ja-base | 0.809 | 0.827 | 0.527 | 0.721 |
+ | cl-nagoya/sup-simcse-ja-large | 0.831 | 0.831 | 0.507 | 0.723 |
+ | colorfulscoop/sbert-base-ja | 0.742 | 0.657 | 0.254 | 0.551 |
+ | intfloat/multilingual-e5-base | 0.796 | 0.806 | _0.845_ | 0.816 |
+ | intfloat/multilingual-e5-large | 0.819 | 0.794 | **_0.883_** | **_0.832_** |
+ | pkshatech/GLuCoSE-base-ja | 0.818 | 0.757 | 0.692 | 0.755 |
+ | text-embedding-ada-002 | 0.790 | 0.789 | 0.723 | 0.768 |
+
+
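The README does not say how the Average column is computed. It appears to be the unweighted mean of the three task scores, which matches every row to within rounding of the published three-decimal values. A quick check for the fio row, under that assumption:

```python
# Scores for bclavie/fio-base-japanese-v0.1, copied from the table above.
scores = {"JSTS valid-v1.1": 0.863, "JSICK test": 0.894, "MIRACL dev": 0.715}

# Assumption: "Average" is the plain mean over the three benchmarks.
average = sum(scores.values()) / len(scores)
print(f"{average:.3f}")
```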
+ ## Usage (Sentence-Transformers)
+
+ This model is best used through [sentence-transformers](https://www.SBERT.net). If you don't have it, it's easy to install:
+
+ ```
+ pip install -U sentence-transformers
+ ```
+
+ Then you can use the model like this:
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+ sentences = ["こんにちは、世界!", "文埋め込み最高!文埋め込み最高と叫びなさい", "極度乾燥しなさい"]
+
+ model = SentenceTransformer('bclavie/fio-base-japanese-v0.1')
+ embeddings = model.encode(sentences)
+ print(embeddings)
+ ```
+
+
+ ## Usage
+
+ If you use the model for a retrieval task, you must prefix each query with `"関連記事を取得するために使用できるこの文の表現を生成します: "`.
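The prefix translates roughly to "Generate a representation of this sentence that can be used to retrieve related articles: ". A minimal sketch of applying it follows; `prepare_queries` is a hypothetical helper, the sample query is invented, and the sketch assumes that documents are encoded without a prefix, which the README does not state explicitly.

```python
# Prefix quoted in the README; it is prepended to every retrieval query.
QUERY_PREFIX = "関連記事を取得するために使用できるこの文の表現を生成します: "


def prepare_queries(queries: list[str]) -> list[str]:
    """Prepend the retrieval prefix to each query (hypothetical helper)."""
    return [QUERY_PREFIX + query for query in queries]


# Example query, made up for illustration ("What is the capital of Japan?").
queries = prepare_queries(["日本の首都はどこですか"])
print(queries[0])

# With sentence-transformers installed, the prefixed queries would then be
# encoded as usual, e.g. model.encode(queries), where `model` is the
# SentenceTransformer instance loaded in the section above.
```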
+
+ ### Usage (HuggingFace Transformers)
+ Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+ import torch
+
+
+ def cls_pooling(model_output, attention_mask):
+     return model_output[0][:, 0]  # [CLS] token vector of the last hidden state
+
+
+ # Sentences we want sentence embeddings for
+ sentences = ["こんにちは、世界!", "文埋め込み最高!文埋め込み最高と叫びなさい"]
+
+ # Load model from HuggingFace Hub
+ tokenizer = AutoTokenizer.from_pretrained('bclavie/fio-base-japanese-v0.1')
+ model = AutoModel.from_pretrained('bclavie/fio-base-japanese-v0.1')
+
+ # Tokenize sentences
+ encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
+
+ # Compute token embeddings
+ with torch.no_grad():
+     model_output = model(**encoded_input)
+
+ # Perform pooling. In this case, cls pooling.
+ sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])
+
+ print("Sentence embeddings:")
+ print(sentence_embeddings)
+ ```
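In the block above, `model_output[0]` is the last hidden state, with shape (batch, sequence length, hidden size), so indexing `[:, 0]` keeps the first token vector, `[CLS]`, of each sentence. A toy illustration with nested lists in place of tensors; `cls_pooling_lists` is a hypothetical helper, and the hidden size of 3 and all values are made up for the example.

```python
def cls_pooling_lists(last_hidden_state):
    """Pure-Python analogue of last_hidden_state[:, 0] (hypothetical helper)."""
    return [sequence[0] for sequence in last_hidden_state]


# Two "sentences", three token vectors each, hidden size 3 (made-up numbers).
toy_hidden_state = [
    [[0.1, 0.2, 0.3], [1.0, 1.0, 1.0], [2.0, 2.0, 2.0]],
    [[0.4, 0.5, 0.6], [3.0, 3.0, 3.0], [4.0, 4.0, 4.0]],
]

sentence_embeddings = cls_pooling_lists(toy_hidden_state)
print(sentence_embeddings)
```

Because only the first position is read, padding tokens at the end of shorter sentences do not affect the result, which is why `attention_mask` goes unused in `cls_pooling`.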
+
+
+
+ ### Evaluation Results
+
+ <!--- Describe how your model was evaluated -->
+
+ For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=bclavie/fio-base-japanese-v0.1)
+
+
+
+ ## Full Model Architecture
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
+   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
+ )
+ ```
128
+ ## Citing & Authors
129
+
130
+ @misc{
131
+ bclavie-fio-embeddings,
132
+ author = {Benjamin Clavié},
133
+ title = {Fio Japanese Embeddings},
134
+ year = {2023},
135
+ howpublished = {\url{https://ben.clavie.eu/fio}}
136
+ }
config.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "_name_or_path": "fio-base-japanese-v0.1",
+   "architectures": ["BertModel"],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.36.1",
+   "type_vocab_size": 2,
+   "use_cache": false,
+   "vocab_size": 32768
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "__version__": {
+     "sentence_transformers": "2.2.2",
+     "transformers": "4.36.1",
+     "pytorch": "2.1.0+cu118"
+   }
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d19517586fff72065b7577fc715463f433876b7b40b96fd5e4eb3a16f626f663
+ size 444851048
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   }
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer_config.json ADDED
@@ -0,0 +1,63 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_lower_case": false,
+   "do_subword_tokenize": true,
+   "do_word_tokenize": true,
+   "jumanpp_kwargs": null,
+   "mask_token": "[MASK]",
+   "mecab_kwargs": {
+     "mecab_dic": "unidic_lite"
+   },
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "subword_tokenizer_type": "wordpiece",
+   "sudachi_kwargs": null,
+   "tokenizer_class": "BertJapaneseTokenizer",
+   "unk_token": "[UNK]",
+   "word_tokenizer_type": "mecab"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff