ku-nlp/roberta-large-japanese-char-wwm


Nobuhiro Ueda committed on Sep 18, 2022

Commit 6935ff9 • 1 Parent(s): 579dc96

add README.md

Files changed (1)

README.md +60 -0

README.md ADDED

@@ -0,0 +1,60 @@

+---
+language: ja
+license: cc-by-sa-4.0
+datasets:
+- wikipedia
+- cc100
+mask_token: "[MASK]"
+widget:
+- text: "京都大学で自然言語処理を [MASK] する。"
+---
+# ku-nlp/roberta-large-japanese-char-wwm
+## Model description
+This is a Japanese RoBERTa large model pre-trained on Japanese Wikipedia and the Japanese portion of CC-100.
+This model was trained with character-level tokenization and whole word masking.
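To make the combination of character-level tokens and whole word masking concrete, here is a minimal sketch in plain Python. It assumes the text has already been segmented into words (the README does not say which segmenter was used). The function name `whole_word_mask`, the default `mask_prob=0.15`, and the simplified rule of always substituting `[MASK]` are illustrative assumptions, not the authors' pre-training code.

```python
import random

MASK = "[MASK]"


def whole_word_mask(words, mask_prob=0.15, rng=None):
    """Mask whole words in a character-level token sequence.

    `words` is a list of already-segmented words (segmentation is assumed
    to happen elsewhere). Each word is split into characters; when a word
    is selected, every one of its characters is replaced by [MASK].
    """
    rng = rng or random.Random()
    tokens, labels = [], []
    for word in words:
        chars = list(word)
        if rng.random() < mask_prob:
            # Hide all characters of the word together.
            tokens.extend([MASK] * len(chars))
            labels.extend(chars)  # original characters to predict
        else:
            tokens.extend(chars)
            labels.extend([None] * len(chars))  # not a prediction target
    return tokens, labels


# With mask_prob=1.0 every word is selected, so every character is masked.
tokens, labels = whole_word_mask(["京都", "大学", "で"], mask_prob=1.0)
# tokens -> ['[MASK]', '[MASK]', '[MASK]', '[MASK]', '[MASK]']
# labels -> ['京', '都', '大', '学', 'で']
```

The point of the sketch is that the masking decision is made once per word, while the prediction targets remain individual characters.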
+## How to use
+You can use this model for masked language modeling as follows:
+```python
+from transformers import AutoTokenizer, AutoModelForMaskedLM
+# Load the tokenizer and the masked language model from the Hub.
+tokenizer = AutoTokenizer.from_pretrained("ku-nlp/roberta-large-japanese-char-wwm")
+model = AutoModelForMaskedLM.from_pretrained("ku-nlp/roberta-large-japanese-char-wwm")
+# Raw text can be passed directly; [MASK] marks the position to predict.
+sentence = '京都大学で自然言語処理を [MASK] する。'
+encoding = tokenizer(sentence, return_tensors='pt')
+...
+```
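The snippet above stops after encoding the sentence. One possible way to go from there to actual predictions is sketched below; it is an assumption about typical `transformers` usage, not necessarily what the elided part of the README contained. The helper names `mask_positions` and `predict_masked_tokens` and the `top_k` parameter are hypothetical. Because tokenization is character-level, each `[MASK]` is expected to stand for a single character token.

```python
def mask_positions(input_ids, mask_token_id):
    """Return the indices in input_ids that hold the mask token id."""
    return [i for i, tok in enumerate(input_ids) if tok == mask_token_id]


def predict_masked_tokens(sentence, top_k=5,
                          model_name="ku-nlp/roberta-large-japanese-char-wwm"):
    """Return {position: [top_k candidate tokens]} for each [MASK] in sentence."""
    # Imported lazily so the pure helper above works without these packages.
    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    model.eval()

    encoding = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(**encoding).logits  # shape: (1, seq_len, vocab_size)

    input_ids = encoding["input_ids"][0].tolist()
    results = {}
    for pos in mask_positions(input_ids, tokenizer.mask_token_id):
        # Highest-scoring vocabulary entries at the masked position.
        top_ids = torch.topk(logits[0, pos], k=top_k).indices.tolist()
        results[pos] = tokenizer.convert_ids_to_tokens(top_ids)
    return results
```

Calling `predict_masked_tokens('京都大学で自然言語処理を [MASK] する。')` downloads the model weights on first use and returns the candidate characters for the masked position.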
+You can fine-tune this model on downstream tasks.
+## Tokenization
+There is no need to tokenize texts in advance; you can pass raw text directly to the tokenizer.
+Texts are split into character-level tokens by [sentencepiece](https://github.com/google/sentencepiece).
+## Vocabulary
+The vocabulary consists of 18,377 tokens, including all characters that appear in the training corpus.
+## Training procedure
+This model was trained on Japanese Wikipedia (as of 20220220) and the Japanese portion of CC-100. Training took a month using 8-16 NVIDIA A100 GPUs.
+The following hyperparameters were used during pre-training:
+- learning_rate: 5e-5
+- per_device_train_batch_size: 38
+- distributed_type: multi-GPU
+- num_devices: 16
+- gradient_accumulation_steps: 8
+- total_train_batch_size: 4864
+- max_seq_length: 512
+- optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-06
+- lr_scheduler_type: linear schedule with warmup
+- training_steps: 500000
+- warmup_steps: 10000
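As a quick check on how the hyperparameters above fit together, the sketch below recomputes the total batch size and evaluates a linear warmup schedule at a few steps. It assumes the learning rate rises linearly from 0 to the peak during warmup and then decays linearly to 0 at the final step, which is the usual meaning of "linear schedule with warmup"; the README does not spell this out. The function names are hypothetical.

```python
def total_train_batch_size(per_device, num_devices, grad_accum_steps):
    """Sequences contributing to one optimizer update."""
    return per_device * num_devices * grad_accum_steps


def linear_warmup_lr(step, peak_lr=5e-5, warmup_steps=10_000,
                     training_steps=500_000):
    """Learning rate at `step`, assuming linear warmup then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, training_steps - step)
    return peak_lr * remaining / (training_steps - warmup_steps)


print(total_train_batch_size(38, 16, 8))  # 4864, matching total_train_batch_size
for step in (0, 5_000, 10_000, 255_000, 500_000):
    print(step, linear_warmup_lr(step))
```

Under these assumptions the learning rate peaks at step 10,000 and is back to half of the peak at step 255,000, the midpoint of the decay phase.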