This is a Japanese RoBERTa base model pretrained on Japanese Wikipedia and the Japanese portion of CC-100.
You can use this model for masked language modeling as follows:
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-base-japanese")
    model = AutoModelForMaskedLM.from_pretrained("nlp-waseda/roberta-base-japanese")

    sentence = '早稲田 大学 で 自然 言語 処理 を [MASK] する 。'  # input should be segmented into words by Juman++ in advance
    encoding = tokenizer(sentence, return_tensors='pt')
    ...
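One possible way to continue from the encoded input is sketched below. It is not taken from the original card: `top_k` and `fill_mask` are hypothetical helper names, and the sketch assumes the tokenizer exposes its mask token through `mask_token_id` and that the sentence contains at least one `[MASK]`.

```python
from typing import List, Sequence, Tuple


def top_k(scores: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Return the k (index, score) pairs with the highest scores, best first."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def fill_mask(sentence: str, k: int = 5,
              name: str = "nlp-waseda/roberta-base-japanese") -> List[str]:
    """Hypothetical helper: top-k token candidates for the first [MASK]."""
    # Heavy dependencies are imported lazily so top_k stays usable on its own.
    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForMaskedLM.from_pretrained(name)
    model.eval()

    encoding = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(**encoding).logits  # shape: (1, seq_len, vocab_size)

    # Position of the first mask token in the single input sequence.
    input_ids = encoding["input_ids"][0].tolist()
    mask_pos = input_ids.index(tokenizer.mask_token_id)

    # Rank the vocabulary scores at the mask position and map ids to tokens.
    best = top_k(logits[0, mask_pos].tolist(), k)
    return tokenizer.convert_ids_to_tokens([idx for idx, _ in best])
```

Calling `fill_mask` with the segmented sentence above downloads the checkpoint on first use and returns the candidate tokens as strings.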
You can fine-tune this model on downstream tasks.
BertJapaneseTokenizer now supports automatic SentencepieceTokenizer. You can use this model without any data preprocessing.
This model was trained on Japanese Wikipedia (as of 20210920) and the Japanese portion of CC-100. Pretraining took a week on eight NVIDIA A100 GPUs.
The following hyperparameters were used during pretraining:
- learning_rate: 1e-4
- per_device_train_batch_size: 256
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 4096
- max_seq_length: 128
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- training_steps: 700000
- warmup_steps: 10000
- mixed_precision_training: Native AMP
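The hyperparameters above fit together as follows; this is a rough sketch rather than the training script. The batch-size arithmetic follows directly from the listed values. The schedule shape (linear warmup from zero, then linear decay to zero) is an assumption: the list only says `linear` with 10000 warmup steps. `linear_schedule_lr` is a hypothetical helper name.

```python
# Values copied from the hyperparameter list above.
per_device_train_batch_size = 256
num_devices = 8
gradient_accumulation_steps = 2
max_seq_length = 128
learning_rate = 1e-4
training_steps = 700_000
warmup_steps = 10_000

# Sequences contributing to one optimizer update.
total_train_batch_size = (
    per_device_train_batch_size * num_devices * gradient_accumulation_steps
)

# Upper bound on tokens per update (assumes every sequence is full length).
max_tokens_per_update = total_train_batch_size * max_seq_length


def linear_schedule_lr(step: int) -> float:
    """Assumed schedule: linear warmup from 0, then linear decay to 0."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    remaining = max(0, training_steps - step)
    return learning_rate * remaining / (training_steps - warmup_steps)


print(total_train_batch_size)  # 4096, matching total_train_batch_size above
print(max_tokens_per_update)   # 524288
```

Under this assumed schedule the learning rate peaks at 1e-4 at step 10000 and reaches zero at step 700000.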
See the Baseline Scores of JGLUE.