Update README.md
README.md
@@ -63,7 +63,7 @@ def preprocess(text):
 
 from transformers import BertTokenizer, RobertaModel
 tokenizer = BertTokenizer.from_pretrained('ku-accms/roberta-base-japanese-ssuw')
-model =
+model = RobertaModel.from_pretrained("ku-accms/roberta-base-japanese-ssuw")
 text = "京都大学で自然言語処理を専攻する。"
 encoded_input = tokenizer(preprocess(text), return_tensors='pt')
 output = model(**encoded_input)
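The line added in this hunk completes the usage snippet. As a small follow-up (a sketch that assumes the code above has just been run, and nothing specific to this checkpoint), the returned encoder outputs can be inspected like this:

```python
# Sketch: inspect the outputs returned by RobertaModel in the snippet above.
# last_hidden_state holds one contextual vector per input token;
# pooler_output is a single vector derived from the first ([CLS]) token.
print(output.last_hidden_state.shape)  # (1, number_of_tokens, hidden_size)
print(output.pooler_output.shape)      # (1, hidden_size)
```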
@@ -73,7 +73,7 @@ output = model(**encoded_input)
 We used a Japanese Wikipedia dump (as of 20230101, 3.3GB) and a Japanese portion of CC100 (70GB).
 
 ## Training procedure
-We first segmented the texts into words by KyTea and then tokenized the words into subwords using WordPiece with a vocabulary size of 32,000. We pre-trained the
+We first segmented the texts into words by KyTea and then tokenized the words into subwords using WordPiece with a vocabulary size of 32,000. We pre-trained the RoBERTa model using the [transformers](https://github.com/huggingface/transformers) library. The training took about 7 days using 4 NVIDIA A100-SXM4-80GB GPUs.
 
 The following hyperparameters were used for the pre-training.
 
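The sentence completed by this hunk summarizes the tokenization and pre-training pipeline: KyTea word segmentation, a 32,000-subword WordPiece vocabulary, and pre-training with the transformers library. As an illustration of the tokenization step only (a sketch, not the authors' actual script; the corpus file name, the use of the Hugging Face `tokenizers` library, and the special-token choices are assumptions), such a vocabulary could be built roughly as follows:

```python
# Sketch: build a 32,000-subword WordPiece vocabulary on word-segmented text,
# mirroring the procedure described in the README. The input file is assumed
# to contain KyTea output, one sentence per line with words separated by spaces.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()  # keep the KyTea word boundaries

trainer = trainers.WordPieceTrainer(
    vocab_size=32000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["kytea_segmented_corpus.txt"], trainer=trainer)  # hypothetical path
tokenizer.save("wordpiece-ja-32k.json")
```

Since the checkpoint uses a WordPiece vocabulary rather than RoBERTa's usual byte-level BPE, loading it through `BertTokenizer`, as in the usage example above, is the natural pairing with `RobertaModel`.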