Matttttttt committed on
Commit 963405b
1 Parent(s): 9644c4c

update README, fixed a minor bug in tokenizer

README.md CHANGED
@@ -7,20 +7,20 @@ datasets:
 - wikipedia
 ---

-# Model Card for Japanese BART V2 large
+# Model Card for Japanese BART large

 ## Model description

-This is a Japanese BART V2 large model pre-trained on Japanese Wikipedia.
+This is a Japanese BART large model pre-trained on Japanese Wikipedia.

 ## How to use

 You can use this model as follows:

 ```python
-from transformers import XLMRobertaTokenizer, MBartForConditionalGeneration
-tokenizer = XLMRobertaTokenizer.from_pretrained('ku-nlp/bart-v2-large-japanese')
-model = MBartForConditionalGeneration.from_pretrained('ku-nlp/bart-v2-large-japanese')
+from transformers import AutoTokenizer, MBartForConditionalGeneration
+tokenizer = AutoTokenizer.from_pretrained('ku-nlp/bart-large-japanese')
+model = MBartForConditionalGeneration.from_pretrained('ku-nlp/bart-large-japanese')
 sentence = '京都 大学 で 自然 言語 処理 を 専攻 する 。' # input should be segmented into words by Juman++ in advance
 encoding = tokenizer(sentence, return_tensors='pt')
 ...
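
The README snippet above stops at the encoding step (`...`). A minimal sketch of how generation could continue from there, assuming the `ku-nlp/bart-large-japanese` checkpoint named in the updated snippet (the `max_length` and `num_beams` values are illustrative assumptions, not taken from the README):

```python
import torch
from transformers import AutoTokenizer, MBartForConditionalGeneration

# Checkpoint id as it appears in the updated README
tokenizer = AutoTokenizer.from_pretrained('ku-nlp/bart-large-japanese')
model = MBartForConditionalGeneration.from_pretrained('ku-nlp/bart-large-japanese')

# Input should be segmented into words by Juman++ in advance, per the README
sentence = '京都 大学 で 自然 言語 処理 を 専攻 する 。'
encoding = tokenizer(sentence, return_tensors='pt')

# Illustrative generation settings (assumed, not specified in the README)
with torch.no_grad():
    output_ids = model.generate(**encoding, max_length=128, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```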
sentencepiece.bpe.model CHANGED
Binary files a/sentencepiece.bpe.model and b/sentencepiece.bpe.model differ
 
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json CHANGED
@@ -1,5 +1,6 @@
 {
   "bos_token": "<s>",
+  "clean_up_tokenization_spaces": true,
   "cls_token": "<s>",
   "eos_token": "</s>",
   "mask_token": {
@@ -10,6 +11,7 @@
     "rstrip": false,
     "single_word": false
   },
+  "model_max_length": 1000000000000000019884624838656,
   "pad_token": "<pad>",
   "sep_token": "</s>",
   "sp_model_kwargs": {},