noriyukipy committed on
Commit
00587e4
1 Parent(s): 5f9a2b4

Add models and model card

1_Pooling/config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false
+ }
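The pooling config above enables only `pooling_mode_mean_tokens`, so a sentence vector is the average of its token vectors. A minimal pure-Python sketch of that idea follows; the function name `mean_pool` and the toy 2-dimensional vectors are hypothetical, and the library's own implementation works on batched tensors rather than lists.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    count = max(count, 1)  # avoid division by zero for an all-padding input
    return [t / count for t in total]


# Toy example: two real tokens and one padding token (2 dimensions instead of 768).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector does not affect the result because its mask entry is 0.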
CHANGELOG.md ADDED
@@ -0,0 +1,7 @@
+ # Changelog
+
+ ## [Unreleased]
+
+ ### Added
+
+ - models and model card
README.md ADDED
@@ -0,0 +1,81 @@
+ ---
+ language: ja
+ pipeline_tag: sentence-similarity
+ tags:
+ - sentence-transformers
+ - feature-extraction
+ - sentence-similarity
+ widget:
+ - text: "検索文"
+ license: cc-by-sa-4.0
+ ---
+
+ # Sentence BERT base Japanese model
+
+ This repository contains a Sentence BERT base model for Japanese.
+
+ ## Pretrained model
+
+ The pretrained BERT model [colorfulscoop/bert-base-ja](https://huggingface.co/colorfulscoop/bert-base-ja) v1.0 is used.
+
+ This model is trained on Japanese Wikipedia data and released under [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/).
+
+ ## Training data
+
+ The [Japanese SNLI dataset](https://nlp.ist.i.kyoto-u.ac.jp/index.php?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88), released under [Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/), is used for training.
+
+ The original training dataset is split into train and valid datasets. The resulting data is as follows.
+
+ * Train data: 523,005 samples
+ * Valid data: 10,000 samples
+ * Test data: 3,916 samples
+
+ ## Model description
+
+ The `SentenceTransformer` model from the [sentence-transformers](https://github.com/UKPLab/sentence-transformers) library is used for training.
+ The model details are shown below.
+
+ ```py
+ >>> import sentence_transformers
+ >>> sentence_transformers.SentenceTransformer("colorfulscoop/sbert-base-ja")
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
+   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
+ )
+ ```
+
+ ## Training
+
+ This model fine-tunes [colorfulscoop/bert-base-ja](https://huggingface.co/colorfulscoop/bert-base-ja) with a softmax classifier over the 3 SNLI labels. The AdamW optimizer was used with a learning rate of 2e-05, linearly warmed up over the first 10% of the training data. The model was trained for 1 epoch with a batch size of 8.
+
+ Note: in the original [Sentence BERT](https://arxiv.org/abs/1908.10084) paper, the batch size of the model trained on SNLI and Multi-Genre NLI was 16. The dataset used for this model is around half the size of the original one, so the batch size was set to 8, half of the original 16.
+
+ Training was conducted on Ubuntu 18.04.5 LTS with one RTX 2080 Ti.
+
+ After training, test set accuracy reached 0.8529.
+
+ Training code is available in [a GitHub repository](https://github.com/colorfulscoop/sbert-ja).
+
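The training figures above (523,005 samples, batch size 8, 1 epoch, 10% warm-up, learning rate 2e-05) imply a concrete number of optimizer steps. The sketch below works that arithmetic out. It assumes the last partial batch is kept and that the warm-up length is rounded up. The README does not say how the rate behaves after warm-up, so `lr_at` (a hypothetical helper) simply holds it constant.

```python
import math

TRAIN_SAMPLES = 523_005   # from the README
BATCH_SIZE = 8            # from the README
EPOCHS = 1                # from the README
BASE_LR = 2e-05           # from the README
WARMUP_FRACTION = 0.1     # "10% of train data"

# Assumption: the last, smaller batch is not dropped.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
# Assumption: the warm-up length is rounded up to a whole step.
warmup_steps = math.ceil(total_steps * WARMUP_FRACTION)


def lr_at(step):
    """Learning rate at a given step: linear ramp, then constant (assumed)."""
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    return BASE_LR


print(steps_per_epoch, warmup_steps)  # 65376 6538
print(lr_at(warmup_steps // 2))       # about 1e-05, halfway through the ramp
```

Under these assumptions, roughly the first 6,500 of about 65,000 updates run at a reduced learning rate.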
+ ## Usage
+
+ First, install the dependencies.
+
+ ```sh
+ $ pip install sentence-transformers==2.0.0
+ ```
+
+ Then initialize a `SentenceTransformer` model and use its `encode` method to convert sentences to vectors.
+
+ ```py
+ >>> from sentence_transformers import SentenceTransformer
+ >>> model = SentenceTransformer("colorfulscoop/sbert-base-ja")
+ >>> sentences = ["外をランニングするのが好きです", "海外旅行に行くのが趣味です"]
+ >>> model.encode(sentences)
+ ```
+
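Since the model card tags this model for sentence similarity, a common next step is to compare the vectors returned by `encode` with cosine similarity. The following is a minimal sketch of that comparison in plain Python. The helper `cosine_similarity` and the 3-dimensional toy vectors are hypothetical stand-ins for the 768-dimensional embeddings the model would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for two sentence embeddings.
vec_a = [1.0, 0.0, 1.0]
vec_b = [1.0, 1.0, 0.0]
score = cosine_similarity(vec_a, vec_b)
print(round(score, 3))  # 0.5
```

Scores close to 1 indicate vectors pointing in nearly the same direction. How well that tracks semantic similarity for this particular model is not stated in the README.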
+ ## License
+
+ All the models included in this repository are licensed under [Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/).
+
+ **Disclaimer:** Use of this model is at your sole risk. Colorful Scoop makes no warranty or guarantee of any outputs from the model. Colorful Scoop is not liable for any trouble, loss, or damage arising from the model output.
+
+ **Author:** Colorful Scoop
added_tokens.json ADDED
@@ -0,0 +1 @@
+ {"[PAD]": 32000}
config.json ADDED
@@ -0,0 +1,32 @@
+ {
+   "_name_or_path": "/root/.cache/torch/sentence_transformers/colorfulscoop_bert-base-ja",
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 2,
+   "cls_token_id": 2,
+   "eos_token_id": 3,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "mask_token_id": 4,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "sep_token_id": 3,
+   "tokenizer_class": "DebertaV2Tokenizer",
+   "torch_dtype": "float32",
+   "transformers_version": "4.9.1",
+   "type_vocab_size": 2,
+   "unk_token_id": 1,
+   "use_cache": true,
+   "vocab_size": 32000
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "__version__": {
+     "sentence_transformers": "2.0.0",
+     "transformers": "4.9.1",
+     "pytorch": "1.8.1+cu111"
+   }
+ }
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   }
+ ]
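`modules.json` lists the pipeline stages in order, each with a class name and a path relative to the repository root. The sketch below parses that file content and resolves each module's directory, to illustrate how the Transformer weights at the root and the `1_Pooling` config fit together. The helper `resolve_modules` and the local directory name `sbert-base-ja` are hypothetical, and this is not the library's own loading code.

```python
import json
import os

# Content of modules.json as added in this commit.
MODULES_JSON = """
[
  {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"}
]
"""


def resolve_modules(modules_json, model_dir):
    """Return (class name, directory) pairs in pipeline order."""
    modules = sorted(json.loads(modules_json), key=lambda m: m["idx"])
    plan = []
    for module in modules:
        class_name = module["type"].rsplit(".", 1)[-1]
        directory = os.path.normpath(os.path.join(model_dir, module["path"]))
        plan.append((class_name, directory))
    return plan


# "sbert-base-ja" is a hypothetical local checkout of this repository.
plan = resolve_modules(MODULES_JSON, "sbert-base-ja")
for class_name, directory in plan:
    print(class_name, "->", directory)
```

The empty `path` of module 0 resolves to the repository root, which is where `config.json` and `pytorch_model.bin` live.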
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:083c11ec4ffd1b66fccf3e8eb596ac93b0828f18b2ed8cf5a1ff940146fc3a12
+ size 442553545
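The LFS pointer above records a weight file of 442,553,545 bytes. As a rough cross-check, the sketch below estimates the parameter count from the values in `config.json`, assuming the standard `BertModel` layout (embeddings, 12 encoder layers, and a pooler) stored as float32. The function `bert_param_count` is a hypothetical helper, and the remaining bytes are assumed to be serialization overhead and non-parameter buffers.

```python
# Values copied from config.json in this commit.
config = {
    "vocab_size": 32000,
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_hidden_layers": 12,
    "max_position_embeddings": 512,
    "type_vocab_size": 2,
}


def bert_param_count(cfg):
    """Parameter count for a standard BertModel with a pooler (assumed layout)."""
    h = cfg["hidden_size"]
    ff = cfg["intermediate_size"]
    layer_norm = 2 * h  # weight and bias
    embeddings = (
        cfg["vocab_size"] * h
        + cfg["max_position_embeddings"] * h
        + cfg["type_vocab_size"] * h
        + layer_norm
    )
    attention = 4 * (h * h + h) + layer_norm       # Q, K, V, output projections
    feed_forward = (h * ff + ff) + (ff * h + h) + layer_norm
    pooler = h * h + h
    return embeddings + cfg["num_hidden_layers"] * (attention + feed_forward) + pooler


params = bert_param_count(config)
weight_bytes = params * 4  # float32, per "torch_dtype" in config.json
print(params)                    # 110617344
print(442553545 - weight_bytes)  # 84169 bytes not accounted for by weights
```

The estimate lands within about 0.02% of the recorded file size, which is consistent with a float32 BERT-base checkpoint of roughly 110M parameters.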
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"bos_token": "[CLS]", "eos_token": "[SEP]", "unk_token": "<unk>", "sep_token": "[SEP]", "pad_token": "<pad>", "cls_token": "[CLS]", "mask_token": "[MASK]"}
spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d6467857b4b0c77ded9bac7ad2fb5c16eb64e17e417ce46624dacac2bbb404fc
+ size 802713
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": false, "bos_token": "[CLS]", "eos_token": "[SEP]", "unk_token": "<unk>", "sep_token": "[SEP]", "pad_token": "<pad>", "cls_token": "[CLS]", "mask_token": "[MASK]", "split_by_punct": false, "sp_model_kwargs": {}, "special_tokens_map_file": "/root/.cache/torch/sentence_transformers/colorfulscoop_bert-base-ja/special_tokens_map.json", "tokenizer_file": null, "name_or_path": "/root/.cache/torch/sentence_transformers/colorfulscoop_bert-base-ja", "tokenizer_class": "DebertaV2Tokenizer"}