conan1024hao committed
Commit: e3ec7a1
Parent: 40948ff

Update README.md

Files changed (1):
  1. README.md +2 -0
README.md CHANGED
@@ -34,6 +34,8 @@ You can fine-tune this model on downstream tasks.
 
 The input text should be segmented into words by [Juman++](https://github.com/ku-nlp/jumanpp) in advance. Juman++ 2.0.0-rc3 was used for pretraining. Each word is tokenized into tokens by [sentencepiece](https://github.com/google/sentencepiece).
 
+`BertJapaneseTokenizer` now supports automatic `JumanppTokenizer` and `SentencepieceTokenizer`. You can use [this model](https://huggingface.co/nlp-waseda/roberta-large-japanese-seq512-with-auto-jumanpp) without any data preprocessing.
+
 ## Vocabulary
 
 The vocabulary consists of 32000 tokens including words ([JumanDIC](https://github.com/ku-nlp/JumanDIC)) and subwords induced by the unigram language model of [sentencepiece](https://github.com/google/sentencepiece).
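
For context, a minimal sketch of the no-preprocessing usage that the added lines describe. The `AutoTokenizer`/`AutoModelForMaskedLM` entry points, dependency notes, and example sentence are assumptions, not part of this commit:

```python
# Assumed usage sketch: loading the auto-Jumanpp variant so that raw,
# unsegmented Japanese can be fed in directly. Requires `transformers`
# with Japanese extras (e.g. `pyknp`) and a Juman++ binary on PATH.
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "nlp-waseda/roberta-large-japanese-seq512-with-auto-jumanpp"

# Loads a BertJapaneseTokenizer configured with JumanppTokenizer for
# word segmentation and SentencepieceTokenizer for subword tokenization.
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

# Raw Japanese text goes straight in; Juman++ segmentation runs internally.
inputs = tokenizer("早稲田大学で自然言語処理を勉強する。", return_tensors="pt")
outputs = model(**inputs)
```

Checkpoints without the auto-Jumanpp tokenizer still expect input pre-segmented by Juman++ 2.0.0-rc3, as described above.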