language: ja
license: cc-by-sa-4.0
datasets:
- wikipedia
- cc100
mask_token: '[MASK]'
widget:
- text: 京都大学で自然言語処理を[MASK]する。
ku-nlp/roberta-base-japanese-char-wwm
Model description
This is a Japanese RoBERTa base model pre-trained on Japanese Wikipedia and the Japanese portion of CC-100. It was trained with character-level tokenization and whole word masking.
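To make the combination of character-level tokens and whole word masking concrete, here is a minimal sketch. It assumes the text has already been segmented into words; the segmentation, the masked word 研究, and the helper `whole_word_mask` are all hypothetical and chosen only for illustration. It does not reproduce the actual pre-training code, including how words are selected for masking.

```python
MASK = "[MASK]"


def whole_word_mask(words, masked_word_indices):
    """Split words into characters, masking every character of the selected words."""
    tokens = []
    for index, word in enumerate(words):
        if index in masked_word_indices:
            # Whole word masking: all characters of the word are masked together.
            tokens.extend([MASK] * len(word))
        else:
            # Character-level tokenization: one token per character.
            tokens.extend(word)
    return tokens


# Hand-written word segmentation (an assumption, not the output of a real segmenter).
words = ["京都", "大学", "で", "自然", "言語", "処理", "を", "研究", "する", "。"]
masked = whole_word_mask(words, {7})
print(masked)
```

Masking the two-character word at index 7 yields two consecutive `[MASK]` tokens, rather than masking a single character in isolation.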
How to use
You can use this model for masked language modeling as follows:
from transformers import AutoTokenizer, AutoModelForMaskedLM
# Load the tokenizer and the masked language model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained('ku-nlp/roberta-base-japanese-char-wwm')
model = AutoModelForMaskedLM.from_pretrained('ku-nlp/roberta-base-japanese-char-wwm')
# Raw text can be passed directly; [MASK] marks the token to predict.
sentence = '京都大学で自然言語処理を[MASK]する。'
encoding = tokenizer(sentence, return_tensors='pt')
...
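The snippet above stops after encoding the sentence. One possible way to continue is sketched below; it is not taken from the original card. The helper names `mask_positions` and `predict_masked` and the `top_k` parameter are hypothetical. Because the tokenizer is character-level, each `[MASK]` stands for a single character-level token.

```python
def mask_positions(input_ids, mask_token_id):
    """Return the indices at which the mask token occurs in a list of token ids."""
    return [i for i, token_id in enumerate(input_ids) if token_id == mask_token_id]


def predict_masked(sentence, model_name="ku-nlp/roberta-base-japanese-char-wwm", top_k=5):
    """Return the top_k candidate tokens for every [MASK] in the sentence."""
    # Imported lazily so that mask_positions works without the heavy dependencies.
    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    model.eval()

    encoding = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(**encoding).logits  # shape: (1, sequence_length, vocab_size)

    input_ids = encoding["input_ids"][0].tolist()
    results = []
    for position in mask_positions(input_ids, tokenizer.mask_token_id):
        # Highest-scoring vocabulary entries at this masked position.
        top_ids = logits[0, position].topk(top_k).indices.tolist()
        results.append(tokenizer.convert_ids_to_tokens(top_ids))
    return results
```

Calling `predict_masked('京都大学で自然言語処理を[MASK]する。')` downloads the weights on first use and returns one list of candidate tokens per `[MASK]`.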
You can fine-tune this model on downstream tasks.
Tokenization
You do not need to tokenize text in advance; pass raw text directly to the tokenizer. The text is split into character-level tokens by SentencePiece.
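As a rough illustration of what character-level tokenization produces, the following pure-Python sketch splits raw text into single characters while keeping `[MASK]` intact. It only approximates the idea: the function `char_tokenize` is hypothetical, whitespace is simply dropped, and the real tokenizer may add special tokens or treat some characters differently.

```python
import re

SPECIAL_TOKENS = ("[MASK]",)


def char_tokenize(text, special_tokens=SPECIAL_TOKENS):
    """Split text into one token per character, keeping special tokens whole."""
    # The capturing group makes re.split keep the special tokens in its output.
    pattern = "(" + "|".join(re.escape(token) for token in special_tokens) + ")"
    tokens = []
    for piece in re.split(pattern, text):
        if piece in special_tokens:
            tokens.append(piece)
        else:
            # Simplification: whitespace is discarded instead of being encoded.
            tokens.extend(ch for ch in piece if not ch.isspace())
    return tokens


print(char_tokenize("京都大学で自然言語処理を[MASK]する。"))
```

For actual use, rely on the tokenizer loaded with `AutoTokenizer.from_pretrained`, which applies the model's own vocabulary.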
Vocabulary
The vocabulary consists of 18,377 tokens, including all characters that appear in the training corpus.
Training procedure
This model was trained on Japanese Wikipedia (as of 20220220) and the Japanese portion of CC-100. Training took two weeks on 8 NVIDIA A100 GPUs.
The following hyperparameters were used during pre-training:
- learning_rate: 1e-4
- per_device_train_batch_size: 62
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 8
- total_train_batch_size: 3968
- max_seq_length: 512
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear schedule with warmup
- training_steps: 330000
- warmup_steps: 10000
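The batch size and schedule settings listed above can be checked with a little arithmetic. The sketch below recomputes the total batch size and evaluates a linear warmup schedule. It assumes the learning rate decays linearly to zero at the final step, as in the Transformers helper `get_linear_schedule_with_warmup`; the card itself states only "linear schedule with warmup". The function names are hypothetical.

```python
PEAK_LR = 1e-4
WARMUP_STEPS = 10_000
TRAINING_STEPS = 330_000


def effective_batch_size(per_device, num_devices, accumulation_steps):
    """Number of sequences that contribute to one optimizer update."""
    return per_device * num_devices * accumulation_steps


def learning_rate(step):
    """Linear warmup to PEAK_LR, then linear decay to zero (assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step / WARMUP_STEPS)
    remaining = max(0, TRAINING_STEPS - step)
    return PEAK_LR * (remaining / (TRAINING_STEPS - WARMUP_STEPS))


# 62 sequences per device x 8 devices x 8 accumulation steps
print(effective_batch_size(62, 8, 8))
for step in (0, 5_000, 10_000, 170_000, 330_000):
    print(step, learning_rate(step))
```

The product 62 × 8 × 8 matches the reported `total_train_batch_size` of 3968.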
Acknowledgments
This work was supported by the Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures (JHPCN) through General Collaboration Project no. jh221004, "Developing a Platform for Constructing and Sharing of Large-Scale Japanese Language Models". We trained the models on mdx: a platform for the data-driven future.