Matttttttt committed on
Commit 963405b
1 Parent(s): 9644c4c

update README, fixed a minor bug in tokenizer

README.md CHANGED
@@ -7,20 +7,20 @@ datasets:
 - wikipedia
 ---

-# Model Card for Japanese BART V2 large
+# Model Card for Japanese BART large

 ## Model description

-This is a Japanese BART V2 large model pre-trained on Japanese Wikipedia.
+This is a Japanese BART large model pre-trained on Japanese Wikipedia.

 ## How to use

 You can use this model as follows:

 ```python
-from transformers import XLMRobertaTokenizer, MBartForConditionalGeneration
-tokenizer = XLMRobertaTokenizer.from_pretrained('ku-nlp/bart-v2-large-japanese')
-model = MBartForConditionalGeneration.from_pretrained('ku-nlp/bart-v2-large-japanese')
+from transformers import AutoTokenizer, MBartForConditionalGeneration
+tokenizer = AutoTokenizer.from_pretrained('ku-nlp/bart-large-japanese')
+model = MBartForConditionalGeneration.from_pretrained('ku-nlp/bart-large-japanese')
 sentence = '京都 大学 で 自然 言語 処理 を 専攻 する 。' # input should be segmented into words by Juman++ in advance
 encoding = tokenizer(sentence, return_tensors='pt')
 ...
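
The README snippet above stops at the encoding step (`...`). A minimal sketch of how generation could continue from there, assuming the `ku-nlp/bart-large-japanese` checkpoint named in the updated snippet (the `max_length` and `num_beams` values are illustrative assumptions, not taken from the README):

```python
import torch
from transformers import AutoTokenizer, MBartForConditionalGeneration

# Checkpoint id as it appears in the updated README
tokenizer = AutoTokenizer.from_pretrained('ku-nlp/bart-large-japanese')
model = MBartForConditionalGeneration.from_pretrained('ku-nlp/bart-large-japanese')

# Input should be segmented into words by Juman++ in advance, per the README
sentence = '京都 大学 で 自然 言語 処理 を 専攻 する 。'
encoding = tokenizer(sentence, return_tensors='pt')

# Illustrative generation settings (assumed, not specified in the README)
with torch.no_grad():
    output_ids = model.generate(**encoding, max_length=128, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```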
sentencepiece.bpe.model CHANGED
Binary files a/sentencepiece.bpe.model and b/sentencepiece.bpe.model differ
 
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json CHANGED
@@ -1,5 +1,6 @@
 {
   "bos_token": "<s>",
+  "clean_up_tokenization_spaces": true,
   "cls_token": "<s>",
   "eos_token": "</s>",
   "mask_token": {
@@ -10,6 +11,7 @@
     "rstrip": false,
     "single_word": false
   },
+  "model_max_length": 1000000000000000019884624838656,
   "pad_token": "<pad>",
   "sep_token": "</s>",
   "sp_model_kwargs": {},