This is a Japanese RoBERTa large model pretrained on Japanese Wikipedia and the Japanese portion of CC-100 with a maximum sequence length of 512.
How to use
You can use this model for masked language modeling as follows:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-large-japanese-seq512-with-auto-jumanpp")
model = AutoModelForMaskedLM.from_pretrained("nlp-waseda/roberta-large-japanese-seq512-with-auto-jumanpp")

sentence = '早稲田大学で自然言語処理を[MASK]する。'
encoding = tokenizer(sentence, return_tensors='pt')
...
```
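The snippet above stops at the encoding step. A minimal sketch of one way to continue it and read off the top predictions for the `[MASK]` position (the variable names and the choice of top-5 are illustrative, not part of the original example):

```python
import torch

with torch.no_grad():
    output = model(**encoding)

# Locate the [MASK] position and take the highest-scoring vocabulary entries.
mask_index = (encoding.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = output.logits[0, mask_index].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```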
You can fine-tune this model on downstream tasks.
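For example, a sketch of loading the checkpoint for a hypothetical binary classification task (the task and `num_labels` are assumptions for illustration):

```python
from transformers import AutoModelForSequenceClassification

# Hypothetical downstream task: binary text classification.
model = AutoModelForSequenceClassification.from_pretrained(
    "nlp-waseda/roberta-large-japanese-seq512-with-auto-jumanpp",
    num_labels=2,
)
# From here, the usual transformers fine-tuning workflow (e.g. Trainer) applies.
```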
BertJapaneseTokenizer now supports automatic tokenization for Juman++. However, if your dataset is large, tokenization may take a long time, since
BertJapaneseTokenizer does not yet support fast tokenization. You can still run the Juman++ tokenization yourself and use the old model nlp-waseda/roberta-large-japanese-seq512.
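If you pre-tokenize yourself, one possible approach (assuming a local Juman++ installation and the pyknp wrapper, neither of which is prescribed by this model card) is to segment each sentence into whitespace-separated words before passing it to the old model's tokenizer:

```python
from pyknp import Juman

jumanpp = Juman()  # wraps a locally installed Juman++

sentence = '早稲田大学で自然言語処理を研究する。'
result = jumanpp.analysis(sentence)
segmented = " ".join(mrph.midasi for mrph in result.mrph_list())
# `segmented` can then be fed to the tokenizer of nlp-waseda/roberta-large-japanese-seq512.
```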
Juman++ 2.0.0-rc3 was used for pretraining. Each word is then split into subword tokens by SentencePiece.
The vocabulary consists of 32,000 tokens, including words (from JumanDIC) and subwords induced by the unigram language model of SentencePiece.
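As an illustration of this two-step tokenization (the example sentence is arbitrary, and the resulting subword segmentation depends on the trained vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-large-japanese-seq512-with-auto-jumanpp")

# The tokenizer first segments the sentence into words with Juman++,
# then splits each word into subwords with the SentencePiece model.
print(tokenizer.tokenize('早稲田大学で自然言語処理を研究する。'))
```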
This model was trained on Japanese Wikipedia (as of 20210920) and the Japanese portion of CC-100, starting from the checkpoint of nlp-waseda/roberta-large-japanese. Training took a week on eight NVIDIA A100 GPUs.
The following hyperparameters were used during pretraining:
- learning_rate: 6e-5
- distributed_type: multi-GPU
- num_devices: 8
- total_train_batch_size: 4120 (max_seq_length=128), 4032 (max_seq_length=512)
- max_seq_length: 512
- optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-6
- lr_scheduler_type: linear
- training_steps: 670000 (max_seq_length=128) + 70000 (max_seq_length=512)
- warmup_steps: 10000
- mixed_precision_training: Native AMP
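The original pretraining scripts are not included here. Purely as an illustration, a few of the settings above could be expressed with `transformers.TrainingArguments` roughly as follows; the output directory, per-device batch size, and gradient accumulation steps are assumptions chosen to approximate the reported total batch size for max_seq_length=512:

```python
from transformers import TrainingArguments

# Illustrative only: per_device_train_batch_size and gradient_accumulation_steps
# are assumptions (63 * 8 devices * 8 accumulation ≈ 4032 total batch size).
training_args = TrainingArguments(
    output_dir="roberta-large-japanese-seq512",
    learning_rate=6e-5,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    lr_scheduler_type="linear",
    warmup_steps=10000,
    max_steps=70000,
    per_device_train_batch_size=63,
    gradient_accumulation_steps=8,
    fp16=True,
)
```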