# BERT base Japanese (character tokenization)

This is a BERT model pretrained on Japanese text.

This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by character-level tokenization.

The pretraining code is available at cl-tohoku/bert-japanese.

## Model architecture

The model architecture is the same as the original BERT base model: 12 layers, 768 dimensions of hidden states, and 12 attention heads.
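As a rough check on model size, the following sketch counts the parameters of an encoder with the architecture described above. The 12 layers and 768 hidden dimensions come from this section, and the vocabulary size of 4000 from the Tokenization section. The feed-forward size of 3072, the 512 position embeddings, the 2 segment types, and the pooler layer are assumptions taken from the original BERT base configuration, and the helper name `count_bert_params` is hypothetical.

```python
def count_bert_params(vocab_size, hidden=768, layers=12, intermediate=3072,
                      max_positions=512, type_vocab=2):
    """Count weights and biases of a BERT-style encoder (no pretraining heads)."""
    layer_norm = 2 * hidden  # scale and bias
    # Token, position and segment embedding tables, followed by a layer norm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, each hidden x hidden plus bias.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> intermediate -> hidden) plus a layer norm.
    feed_forward = (hidden * intermediate + intermediate
                    + intermediate * hidden + hidden + layer_norm)
    # Assumed pooler: one dense layer applied to the first token.
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


print(count_bert_params(vocab_size=4000))
```

Under these assumptions the count comes to roughly 89 million parameters. The number of attention heads does not change the total, because the heads only partition the 768-dimensional projections.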

## Training Data

The model is trained on Japanese Wikipedia as of September 1, 2019. To generate the training corpus, WikiExtractor is used to extract plain text from a dump file of Wikipedia articles. The text files used for training are 2.6GB in size and contain approximately 17M sentences.

## Tokenization

The texts are first tokenized with the MeCab morphological parser using the IPA dictionary, and the resulting words are then split into characters. The vocabulary size is 4000.
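The second stage of this pipeline can be illustrated with a minimal sketch. It is not the project's tokenizer: the MeCab step is replaced by a hand-written list of words, the vocabulary is a toy set rather than the real 4000 entries, and the NFKC normalization and the `[UNK]` token are assumptions. The function name `char_tokenize` is hypothetical.

```python
import unicodedata


def char_tokenize(words, vocab, unk_token="[UNK]"):
    """Split pre-segmented words into characters, replacing unknown ones."""
    tokens = []
    for word in words:
        # Assumption: normalize each word with NFKC before splitting it.
        for char in unicodedata.normalize("NFKC", word):
            tokens.append(char if char in vocab else unk_token)
    return tokens


# Hand-written segmentation standing in for MeCab output (assumption).
words = ["東京", "大学", "で", "学ぶ"]
# Toy vocabulary for illustration; the real vocabulary has 4000 entries.
vocab = {"東", "京", "大", "学", "で"}

print(char_tokenize(words, vocab))
```

Because every token is a single character, a vocabulary of a few thousand entries is enough to cover most Japanese text, and characters outside it fall back to the unknown token.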

## Training

The model is trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.
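The scale implied by these settings can be worked out directly. The sketch below uses only the three numbers stated above; the variable names are illustrative, and the token totals are upper bounds because instances shorter than 512 tokens would be padded.

```python
tokens_per_instance = 512
instances_per_batch = 256
training_steps = 1_000_000

# Upper bound on tokens in one batch (instances may contain padding).
tokens_per_batch = tokens_per_instance * instances_per_batch
# Number of instances seen over the whole run.
total_instances = instances_per_batch * training_steps
# Upper bound on token positions processed over the whole run.
max_total_tokens = tokens_per_batch * training_steps

print(tokens_per_batch, total_instances, max_total_tokens)
```

This gives at most 131,072 tokens per batch and 256 million instances over the full run.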

## Acknowledgments

For training models, we used Cloud TPUs provided by the TensorFlow Research Cloud program.
