tealgreen0503 committed
Commit: dbaae2b · Parent: ac868f8
update: README.md

README.md CHANGED
@@ -14,7 +14,7 @@ metrics:
 - accuracy
 mask_token: "[MASK]"
 widget:
-- text: "京都 大学 で 自然 言語 処理 を [MASK] する 。"
+- text: "京都大学で自然言語処理を[MASK]する。"
 ---
 
 # Model Card for Japanese DeBERTa V2 base
@@ -29,10 +29,10 @@ You can use this model for masked language modeling as follows:
 
 ```python
 from transformers import AutoTokenizer, AutoModelForMaskedLM
-tokenizer = AutoTokenizer.from_pretrained('ku-nlp/deberta-v2-base-japanese')
+tokenizer = AutoTokenizer.from_pretrained('ku-nlp/deberta-v2-base-japanese', trust_remote_code=True)
 model = AutoModelForMaskedLM.from_pretrained('ku-nlp/deberta-v2-base-japanese')
 
-sentence = '京都 大学 で 自然 言語 処理 を [MASK] する 。'
+sentence = '京都大学で自然言語処理を[MASK]する。'
 encoding = tokenizer(sentence, return_tensors='pt')
 ...
 ```
@@ -41,7 +41,9 @@ You can also fine-tune this model on downstream tasks.
 
 ## Tokenization
 
-The input text should be segmented into words by [Juman++](https://github.com/ku-nlp/jumanpp) in advance. [Juman++ 2.0.0-rc3](https://github.com/ku-nlp/jumanpp/releases/tag/v2.0.0-rc3) was used for pre-training. Each word is tokenized into subwords by [sentencepiece](https://github.com/google/sentencepiece)
+~~The input text should be segmented into words by [Juman++](https://github.com/ku-nlp/jumanpp) in advance. [Juman++ 2.0.0-rc3](https://github.com/ku-nlp/jumanpp/releases/tag/v2.0.0-rc3) was used for pre-training. Each word is tokenized into subwords by [sentencepiece](https://github.com/google/sentencepiece).~~
+
+UPDATE: The input text is internally segmented by [Juman++](https://github.com/ku-nlp/jumanpp) within `DebertaV2JumanppTokenizer(Fast)`, so there's no need to segment it in advance. To use `DebertaV2JumanppTokenizer(Fast)`, you need to install [Juman++ 2.0.0-rc3](https://github.com/ku-nlp/jumanpp/releases/tag/v2.0.0-rc3) and [rhoknp](https://github.com/ku-nlp/rhoknp).
 
 ## Training data
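The usage snippet in the README ends with `...`; the commit itself does not show the remaining step. A minimal sketch of how that elided step could be completed, assuming PyTorch is installed alongside transformers and that Juman++ 2.0.0-rc3 and rhoknp are available for the custom tokenizer (the decoding logic below is illustrative, not part of the model card):

```python
# Sketch only: run the masked LM and read out the token predicted at the
# [MASK] position. Assumes torch, transformers, Juman++ 2.0.0-rc3 and rhoknp.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('ku-nlp/deberta-v2-base-japanese', trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained('ku-nlp/deberta-v2-base-japanese')

sentence = '京都大学で自然言語処理を[MASK]する。'
encoding = tokenizer(sentence, return_tensors='pt')

with torch.no_grad():
    output = model(**encoding)

# Locate the [MASK] position and take the highest-scoring vocabulary entry.
mask_index = (encoding['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = output.logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```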
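The UPDATE note in the Tokenization section only lists the prerequisites. As a rough sanity check of the new flow, under the assumption that Juman++ 2.0.0-rc3 is on PATH and rhoknp has been installed (for example with `pip install rhoknp`, an assumed command not taken from the diff), raw unsegmented text can be passed straight to the tokenizer loaded with `trust_remote_code=True`:

```python
# Sketch of the updated tokenization flow: the remote-code tokenizer runs
# Juman++ internally, then applies the sentencepiece subword model, so no
# manual pre-segmentation step is needed.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ku-nlp/deberta-v2-base-japanese', trust_remote_code=True)

raw = '京都大学で自然言語処理を[MASK]する。'
print(tokenizer.tokenize(raw))      # subword pieces produced from the raw sentence
print(type(tokenizer).__name__)     # shows which tokenizer class the remote code loaded
```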