Commit dbaae2b by tealgreen0503 · 1 parent: ac868f8

update: README.md

Files changed (1)
  1. README.md +6 -4
README.md CHANGED
@@ -14,7 +14,7 @@ metrics:
  - accuracy
  mask_token: "[MASK]"
  widget:
- - text: "京都 大学 で 自然 言語 処理 を [MASK] する 。"
+ - text: "京都大学で自然言語処理を[MASK]する。"
  ---
 
  # Model Card for Japanese DeBERTa V2 base
@@ -29,10 +29,10 @@ You can use this model for masked language modeling as follows:
 
  ```python
  from transformers import AutoTokenizer, AutoModelForMaskedLM
- tokenizer = AutoTokenizer.from_pretrained('ku-nlp/deberta-v2-base-japanese')
+ tokenizer = AutoTokenizer.from_pretrained('ku-nlp/deberta-v2-base-japanese', trust_remote_code=True)
  model = AutoModelForMaskedLM.from_pretrained('ku-nlp/deberta-v2-base-japanese')
 
- sentence = '京都 大学 で 自然 言語 処理 を [MASK] する 。' # input should be segmented into words by Juman++ in advance
+ sentence = '京都大学で自然言語処理を[MASK]する。'
  encoding = tokenizer(sentence, return_tensors='pt')
  ...
  ```
@@ -41,7 +41,9 @@ You can also fine-tune this model on downstream tasks.
 
  ## Tokenization
 
- The input text should be segmented into words by [Juman++](https://github.com/ku-nlp/jumanpp) in advance. [Juman++ 2.0.0-rc3](https://github.com/ku-nlp/jumanpp/releases/tag/v2.0.0-rc3) was used for pre-training. Each word is tokenized into subwords by [sentencepiece](https://github.com/google/sentencepiece).
+ ~~The input text should be segmented into words by [Juman++](https://github.com/ku-nlp/jumanpp) in advance. [Juman++ 2.0.0-rc3](https://github.com/ku-nlp/jumanpp/releases/tag/v2.0.0-rc3) was used for pre-training. Each word is tokenized into subwords by [sentencepiece](https://github.com/google/sentencepiece).~~
+
+ UPDATE: The input text is internally segmented by [Juman++](https://github.com/ku-nlp/jumanpp) within `DebertaV2JumanppTokenizer(Fast)`, so there's no need to segment it in advance. To use `DebertaV2JumanppTokenizer(Fast)`, you need to install [Juman++ 2.0.0-rc3](https://github.com/ku-nlp/jumanpp/releases/tag/v2.0.0-rc3) and [rhoknp](https://github.com/ku-nlp/rhoknp).
 
  ## Training data
 
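
Reviewer note: this commit changes the expected input from pre-segmented text to raw text. For callers that still hold pre-segmented strings, a minimal migration sketch might look like the following. The helper name `unsegment` is hypothetical and not part of the model repository, and it assumes that every whitespace character in the old input was inserted by manual Juman++ segmentation.

```python
def unsegment(pre_segmented: str) -> str:
    """Remove all whitespace from a manually pre-segmented sentence.

    Assumption: every whitespace character marks a word boundary added by
    segmentation. This holds for the README example, but not for text that
    contains meaningful spaces (for example, embedded English phrases).
    """
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces yields the sentence with no spaces at all.
    return "".join(pre_segmented.split())


# Old-style input taken from the removed README line.
old_style = "京都 大学 で 自然 言語 処理 を [MASK] する 。"
new_style = unsegment(old_style)
print(new_style)
```

The result can then be passed to the tokenizer loaded with `trust_remote_code=True`, as in the updated README snippet, provided Juman++ and rhoknp are installed.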