---
language: ja
tags:
- exbert
license: cc-by-sa-4.0
datasets:
- wikipedia
- cc100
mask_token: '[MASK]'
widget:
- text: 早稲田 大学 で 自然 言語 処理 を [MASK] する 。
---
# nlp-waseda/roberta-base-japanese
## Model description
This is a Japanese RoBERTa base model pretrained on Japanese Wikipedia and the Japanese portion of CC-100.
## How to use
You can use this model for masked language modeling as follows:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-base-japanese")
model = AutoModelForMaskedLM.from_pretrained("nlp-waseda/roberta-base-japanese")

sentence = '早稲田 大学 で 自然 言語 処理 を [MASK] する 。'  # input should be segmented into words by Juman++ in advance
encoding = tokenizer(sentence, return_tensors='pt')
...
```
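The elided step could, for example, read candidates for the masked position directly off the logits. A minimal sketch, assuming PyTorch and standard `transformers` behavior (variable names below are illustrative, not from the original card):

```python
import torch

with torch.no_grad():
    output = model(**encoding)

# position of the [MASK] token in the input sequence
mask_positions = (encoding.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

# top-5 subword candidates for the first masked position
top_ids = output.logits[0, mask_positions[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```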
You can use this model for fine-tuning on downstream tasks.
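As a sketch of how a downstream task might start, the pretrained weights can be loaded into a task-specific head; the two-label classification setup below is only a placeholder, not part of this card:

```python
from transformers import AutoModelForSequenceClassification

# num_labels=2 is a placeholder; set it to match your downstream task
classifier = AutoModelForSequenceClassification.from_pretrained(
    "nlp-waseda/roberta-base-japanese", num_labels=2
)
# Fine-tune with transformers.Trainer or a custom loop; downstream inputs
# must also be segmented into words by Juman++ before tokenization.
```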
## Tokenization
The input text should be segmented into words by Juman++ in advance. Each word is tokenized into subwords by sentencepiece.
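A minimal sketch of the pre-segmentation step, assuming Juman++ is installed and the pyknp binding is used (pyknp is an assumption; any wrapper that produces whitespace-separated words works):

```python
from pyknp import Juman

jumanpp = Juman()  # assumes the jumanpp binary is on PATH
text = "早稲田大学で自然言語処理を研究する。"
segmented = " ".join(m.midasi for m in jumanpp.analysis(text).mrph_list())
print(segmented)  # words separated by spaces, ready for the tokenizer
```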
## Vocabulary
The vocabulary consists of 32000 subwords induced by the unigram language model of sentencepiece.
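This can be checked from the loaded tokenizer (the reported count may differ slightly once special tokens are accounted for):

```python
print(tokenizer.vocab_size)  # nominally 32000 subwords
```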
## Training procedure
This model was trained on Japanese Wikipedia and the Japanese portion of CC-100. It took a week using eight NVIDIA A100 GPUs.
The following hyperparameters were used during pretraining:
- learning_rate: 1e-4
- per_device_train_batch_size: 256
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 4096
- max_seq_length: 128
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- training_steps: 700000
- mixed_precision_training: Native AMP
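The pretraining script itself is not part of this card, so as a hedged sketch only, the list above maps onto `transformers` `TrainingArguments` roughly as follows (the output directory is a placeholder):

```python
from transformers import TrainingArguments

# Approximate mapping of the hyperparameters above; treat as illustrative only.
training_args = TrainingArguments(
    output_dir="roberta-base-japanese-pretraining",  # placeholder path
    learning_rate=1e-4,
    per_device_train_batch_size=256,
    gradient_accumulation_steps=2,  # 256 * 8 GPUs * 2 = 4096 total batch size
    max_steps=700_000,
    lr_scheduler_type="linear",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    fp16=True,  # "Native AMP" mixed precision
)
```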
## Performance on JGLUE
Coming soon.