# cl-tohoku/bert-base-japanese-v2

---
language: ja
license: cc-by-sa-4.0
datasets:
- wikipedia
widget:
- text: 東北大学で[MASK]の研究をしています。
---

# BERT base Japanese (unidic-lite with whole word masking, jawiki-20200831)

This is a [BERT](https://github.com/google-research/bert) model pretrained on Japanese-language text.

This version of the model processes input texts with word-level tokenization based on the Unidic 2.1.2 dictionary (available in the [unidic-lite](https://pypi.org/project/unidic-lite/) package), followed by WordPiece subword tokenization.
Additionally, the model is trained with whole word masking enabled for the masked language modeling (MLM) objective.

The pretraining code is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v2.0).

## Model architecture

The model architecture is the same as the original BERT base model: 12 layers, 768 dimensions of hidden states, and 12 attention heads.

## Training Data

The models are trained on the Japanese version of Wikipedia.
The training corpus is generated from the Wikipedia CirrusSearch dump file as of August 31, 2020.

The generated corpus files are 4.0GB in total, containing approximately 30M sentences.
We used the [MeCab](https://taku910.github.io/mecab/) morphological parser with the [mecab-ipadic-NEologd](https://github.com/neologd/mecab-ipadic-neologd) dictionary to split texts into sentences.

## Tokenization

The texts are first tokenized by MeCab with the Unidic 2.1.2 dictionary and then split into subwords by the WordPiece algorithm.
The vocabulary size is 32768.

We used the [fugashi](https://github.com/polm/fugashi) and [unidic-lite](https://github.com/polm/unidic-lite) packages for the tokenization.

## Training

The models are trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.
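
As a rough sanity check on the configuration above, the per-batch and total token counts follow from simple arithmetic. The sketch below uses only the three figures stated here; the variable names are illustrative, and the totals count token positions (padding included), not unique corpus tokens.

```python
# Figures taken from the training configuration stated above.
TOKENS_PER_INSTANCE = 512
INSTANCES_PER_BATCH = 256
TRAINING_STEPS = 1_000_000

# Token positions processed in a single batch.
tokens_per_batch = TOKENS_PER_INSTANCE * INSTANCES_PER_BATCH

# Token positions processed over the whole training run.
total_token_positions = tokens_per_batch * TRAINING_STEPS

print(f"token positions per batch: {tokens_per_batch:,}")
print(f"token positions over training: {total_token_positions:,}")
```
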
For training of the MLM (masked language modeling) objective, we introduced whole word masking, in which all of the subword tokens corresponding to a single word (as tokenized by MeCab) are masked at once.

For training of each model, we used a v3-8 instance of Cloud TPUs provided by the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc/) program.
The training took about 5 days to finish.

## Licenses

The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/).

## Acknowledgments

This model was trained with Cloud TPUs provided by the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc/) program.
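
## Whole word masking sketch

The whole word masking idea described in the Training section can be illustrated with a short, self-contained sketch. It is not the pretraining code from the repository: it assumes continuation subwords carry the usual WordPiece `##` prefix, selects each word independently with a fixed probability, and always substitutes `[MASK]`, whereas the original BERT procedure also limits the number of predictions per sequence and sometimes keeps or randomizes the selected tokens. The function name `whole_word_mask`, its parameters, and the example token list are hypothetical; the tokens are not actual output of this model's tokenizer.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Return a copy of `tokens` in which whole words are masked together.

    Assumption: a token starting with "##" continues the preceding word.
    """
    rng = random.Random(seed)

    # Group token indices into words: a "##" piece joins the previous group.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    masked = list(tokens)
    for word in words:
        # Never mask special tokens such as [CLS] and [SEP].
        if tokens[word[0]] in SPECIAL_TOKENS:
            continue
        # Decide once per word, then mask every subword of that word.
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked


# Hypothetical segmentation, for illustration only.
example = ["[CLS]", "東北", "##大学", "で", "研究", "##者", "[SEP]"]
print(whole_word_mask(example, mask_prob=0.5, seed=0))
```

Because the decision is made once per word, the pieces of a multi-subword word are either all replaced or all left intact, which is the property that distinguishes whole word masking from masking subword tokens independently.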