noriyukipy committed
Commit 8629ef0
1 parent: 39eee93

Add models and model card
CHANGELOG.md ADDED
# Changelog

## [Unreleased]

### Added

- models and model card
README.md ADDED
---
language: ja
datasets: wikipedia
widget:
- text: "得意な科目は[MASK]です"
license: cc-by-sa-3.0
---

# BERT base Japanese model

This repository contains a BERT base model trained on the Japanese Wikipedia dataset.

## Training data

The [Japanese Wikipedia](https://ja.wikipedia.org/wiki/Wikipedia:データベースダウンロード) dataset as of June 20, 2021, which is released under [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/), is used for training.
The dataset is split into three subsets: train, valid, and test. Both the tokenizer and the model are trained on the train split.

## Model description

The model architecture is the same as the BERT base model (hidden_size: 768, num_hidden_layers: 12, num_attention_heads: 12, max_position_embeddings: 512) except for the vocabulary size, which is set to 32,000 instead of the original 30,522.

For the model, `transformers.BertForPreTraining` is used.
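As a rough cross-check of these hyperparameters, the parameter count of a `BertForPreTraining` model of this shape can be estimated by hand and compared against the size of the float32 `pytorch_model.bin` shipped in this commit (a sketch: the per-component breakdown assumes the standard BERT architecture and is not stated in this card):

```python
# Shape hyperparameters from the model card:
# hidden 768, 12 layers, 512 positions, 2 token types, FFN 3072, vocab 32,000
H, LAYERS, V, P, T, FFN = 768, 12, 32000, 512, 2, 3072

embeddings = V * H + P * H + T * H + 2 * H  # word/position/type tables + LayerNorm
per_layer = (
    4 * (H * H + H)    # query/key/value/output projections (weights + biases)
    + 2 * H            # attention output LayerNorm
    + (H * FFN + FFN)  # feed-forward intermediate dense
    + (FFN * H + H)    # feed-forward output dense
    + 2 * H            # feed-forward output LayerNorm
)
pooler = H * H + H
mlm_head = (H * H + H) + 2 * H + V  # transform dense + LayerNorm + tied-decoder bias
nsp_head = 2 * H + 2                # next-sentence-prediction classifier

total = embeddings + LAYERS * per_layer + pooler + mlm_head + nsp_head
print(total)      # ≈ 111M parameters
print(total * 4)  # ≈ 445 MB as float32
```

This lands within about 86 KB of the 445,058,268-byte `pytorch_model.bin` below; the small remainder is serialization overhead.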

## Tokenizer description

A [SentencePiece](https://github.com/google/sentencepiece) tokenizer is used for this model.

The tokenizer model was trained on 1,000,000 samples extracted from the train split.
The vocabulary size is set to 32,000. The `add_dummy_prefix` option is set to `True` because words are not separated by whitespace in Japanese.

After training, the tokenizer model is loaded with `transformers.DebertaV2Tokenizer` because it supports SentencePiece models and behaves consistently whether the `use_fast` option is set to `True` or `False`.

**Note:**
The meaning of "consistent" here is as follows.
For example, ALBERT provides both AlbertTokenizer and AlbertTokenizerFast, and the fast version is used by default. However, their tokenization behavior differs, and this model expects the behavior of the non-fast version.
Although passing `use_fast=False` to AutoTokenizer or pipeline forces the non-fast tokenizer, this option cannot be specified in config.json or the model card.
Therefore, unexpected behavior occurs when the model is used through the Inference API. To avoid this kind of problem, `transformers.DebertaV2Tokenizer` is used for this model.

## Training

Training details are as follows.

* Gradients are updated every 256 samples (batch size: 8, accumulate_grad_batches: 32)
* The gradient clipping norm is 1.0
* The learning rate starts from 0 and increases linearly to 0.0001 over the first 10,000 steps
* The training set contains around 20M samples. Because 80k * 256 ≈ 20M, one epoch has around 80k steps.

Training was conducted on Ubuntu 18.04.5 LTS with one RTX 2080 Ti.

Training continued until the validation loss stopped improving; in total, around 214k training steps were run.
The test set loss was 2.80.

Training code is available in [a GitHub repository](https://github.com/colorfulscoop/bert-ja).
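The schedule and epoch arithmetic above can be sketched as plain Python (an illustration, not the actual training code; keeping the learning rate constant after warmup is an assumption, since the card only specifies the warmup phase):

```python
def learning_rate(step, peak_lr=0.0001, warmup_steps=10_000):
    """Linear warmup from 0 to peak_lr over the first warmup_steps updates."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr  # assumed constant afterwards; the card does not say

# One gradient update covers batch_size * accumulate_grad_batches samples
samples_per_update = 8 * 32                         # 256
steps_per_epoch = 20_000_000 // samples_per_update  # 78,125, i.e. "around 80k"
```

At this rate, the ~214k total steps correspond to roughly 2.7 passes over the train split.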

## Usage

First, install dependencies.

```sh
$ pip install torch==1.8.0 transformers==4.8.2 sentencepiece==0.1.95 tensorflow==2.5.0
```

Then use `transformers.pipeline` to try the fill-mask task.

```python
>>> import transformers
>>> pipeline = transformers.pipeline("fill-mask", "colorfulscoop/bert-base-ja")
>>> pipeline("専門として[MASK]を専攻しています")
[{'sequence': '専門として工学を専攻しています', 'score': 0.03630176931619644, 'token': 3988, 'token_str': '工学'},
 {'sequence': '専門として政治学を専攻しています', 'score': 0.03547220677137375, 'token': 22307, 'token_str': '政治学'},
 {'sequence': '専門として教育を専攻しています', 'score': 0.03162326663732529, 'token': 414, 'token_str': '教育'},
 {'sequence': '専門として経済学を専攻しています', 'score': 0.026036914438009262, 'token': 6814, 'token_str': '経済学'},
 {'sequence': '専門として法学を専攻しています', 'score': 0.02561848610639572, 'token': 10810, 'token_str': '法学'}]
```

## License

All the models included in this repository are licensed under [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/).

**Disclaimer:** The model may generate texts that are similar to the training data, that are untrue, or that are biased. Use of the model is at your sole risk. Colorful Scoop makes no warranty or guarantee of any outputs from the model. Colorful Scoop is not liable for any trouble, loss, or damage arising from the model output.

**Author:** Colorful Scoop
added_tokens.json ADDED
{"[PAD]": 32000}
config.json ADDED
{
  "_name_or_path": "release/bert-base-ja",
  "architectures": [
    "BertForPreTraining"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 2,
  "cls_token_id": 2,
  "eos_token_id": 3,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "mask_token_id": 4,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "sep_token_id": 3,
  "tokenizer_class": "DebertaV2Tokenizer",
  "transformers_version": "4.8.2",
  "type_vocab_size": 2,
  "unk_token_id": 1,
  "use_cache": true,
  "vocab_size": 32000
}
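As a quick consistency check, a few fields of this config can be inspected with Python's standard `json` module (a sketch; the snippet below copies only a subset of the fields from the file above):

```python
import json

# Subset of fields copied from the config.json above
config = json.loads("""{
  "bos_token_id": 2, "cls_token_id": 2,
  "eos_token_id": 3, "sep_token_id": 3,
  "mask_token_id": 4, "pad_token_id": 0, "unk_token_id": 1,
  "hidden_size": 768, "num_attention_heads": 12,
  "vocab_size": 32000
}""")

# [CLS] doubles as BOS and [SEP] doubles as EOS, as in standard BERT configs
assert config["bos_token_id"] == config["cls_token_id"]
assert config["eos_token_id"] == config["sep_token_id"]
# Hidden size divides evenly across attention heads: 768 / 12 = 64 per head
assert config["hidden_size"] % config["num_attention_heads"] == 0
```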
pytorch_model.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:7853719a98bc0e610bacf68bc4880226ed9f4212934ecc02fbafd50bb8201c27
size 445058268
special_tokens_map.json ADDED
{"unk_token": "<unk>", "sep_token": "[SEP]", "pad_token": "<pad>", "cls_token": "[CLS]", "mask_token": "[MASK]"}
spm.model ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:d6467857b4b0c77ded9bac7ad2fb5c16eb64e17e417ce46624dacac2bbb404fc
size 802713
tf_model.h5 ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:dcf53193021064196db5b46bc478b2eeace103aee0b68e75213efc1abbcc141b
size 545150280
tokenizer_config.json ADDED
{"do_lower_case": false, "unk_token": "<unk>", "sep_token": "[SEP]", "pad_token": "<pad>", "cls_token": "[CLS]", "mask_token": "[MASK]", "split_by_punct": false, "sp_model_kwargs": {}, "tokenizer_class": "DebertaV2Tokenizer"}