Commit c250815 (1 parent: a01762f), committed by nobu-g

Update README.md

Files changed (1): README.md (+28 -16)

README.md CHANGED
 
metrics:
- accuracy
mask_token: "[MASK]"
widget:
- text: "京都 大学 で 自然 言語 処理 を [MASK] する 。"
---

# Model Card for Japanese DeBERTa V2 large

## Model description

This is a Japanese DeBERTa V2 large model pre-trained on Japanese Wikipedia, the Japanese portion of CC-100, and the Japanese portion of OSCAR.

## How to use

You can use this model for masked language modeling as follows:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('ku-nlp/deberta-v2-large-japanese')
model = AutoModelForMaskedLM.from_pretrained('ku-nlp/deberta-v2-large-japanese')
```
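
The snippet above only loads the tokenizer and the model. As a minimal usage sketch (not part of the original card), the `[MASK]` in the widget example can be filled in as follows, continuing from the snippet above; the input is assumed to be pre-segmented by Juman++ (see the Tokenization section below):

```python
# A minimal sketch, continuing from the snippet above (tokenizer and model are
# already loaded). The sentence must be pre-segmented into words by Juman++.
import torch

sentence = "京都 大学 で 自然 言語 処理 を [MASK] する 。"
encoding = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**encoding).logits

# Find the [MASK] position and take the highest-scoring vocabulary entry.
mask_positions = (encoding["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_ids = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```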
 
 
You can also fine-tune this model on downstream tasks.

## Tokenization

The input text should be segmented into words by [Juman++](https://github.com/ku-nlp/jumanpp) in advance. [Juman++ 2.0.0-rc3](https://github.com/ku-nlp/jumanpp/releases/tag/v2.0.0-rc3) was used for pre-training. Each word is tokenized into subwords by [sentencepiece](https://github.com/google/sentencepiece).
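
Since the model expects whitespace-separated words, raw text has to be run through Juman++ first. The sketch below uses the pyknp wrapper, which is an assumption on our part (the card does not prescribe a particular interface), and the `jumanpp` binary must be installed separately.

```python
# A sketch of pre-segmenting raw text with Juman++ via pyknp (an assumption;
# any interface to Juman++ would do). Requires the jumanpp binary on PATH.
from pyknp import Juman
from transformers import AutoTokenizer

jumanpp = Juman()
raw_text = "京都大学で自然言語処理を研究する。"

# Word segmentation with Juman++, joined with spaces as this model expects.
words = [morpheme.midasi for morpheme in jumanpp.analysis(raw_text).mrph_list()]
segmented = " ".join(words)

# Subword tokenization with the model's sentencepiece-based tokenizer.
tokenizer = AutoTokenizer.from_pretrained("ku-nlp/deberta-v2-large-japanese")
print(tokenizer.tokenize(segmented))
```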

## Training data

We used the following corpora for pre-training:

- Japanese Wikipedia
- Japanese portion of CC-100
- Japanese portion of OSCAR (54GB, 326M sentences, 25M documents)

Note that we filtered out documents annotated with "header", "footer", or "noisy" tags in OSCAR.
Also note that Japanese Wikipedia was duplicated 10 times to make the total size of the corpus comparable to that of CC-100 and OSCAR. As a result, the total size of the training data is 171GB.

## Training procedure

We first segmented texts in the corpora into words using [Juman++](https://github.com/ku-nlp/jumanpp).
Then, we built a sentencepiece model with 32000 tokens including words ([JumanDIC](https://github.com/ku-nlp/JumanDIC)) and subwords induced by the unigram language model of [sentencepiece](https://github.com/google/sentencepiece).
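
For concreteness, here is a rough sketch of how such a sentencepiece model could be trained with the Python API. The corpus path and the extra options are placeholders; the card only specifies the 32000-token vocabulary and the unigram model, and does not say exactly how the JumanDIC words were added.

```python
# Hypothetical sketch of building the 32000-token unigram sentencepiece model.
# "segmented_corpus.txt" is a placeholder for the Juman++-segmented corpus
# (one sentence per line); the authors' exact training options are not given.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="segmented_corpus.txt",
    model_prefix="spm",          # writes spm.model and spm.vocab
    vocab_size=32000,
    model_type="unigram",        # subwords induced by the unigram language model
    character_coverage=0.9995,   # a common choice for Japanese; an assumption here
)
```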

We tokenized the segmented corpora into subwords using the sentencepiece model and trained the Japanese DeBERTa model using the [transformers](https://github.com/huggingface/transformers) library.
The training took 36 days using 8 NVIDIA A100-SXM4-40GB GPUs.
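
A heavily simplified sketch of this masked-language-modeling step with the transformers Trainer is shown below. It is not the authors' training script: the toy corpus, the configuration values, and the hyperparameters are placeholders, and the sentencepiece model path refers to the sketch above.

```python
# Hypothetical sketch of MLM pre-training with transformers; all paths, config
# values, and hyperparameters are illustrative placeholders.
from datasets import Dataset
from transformers import (
    DataCollatorForLanguageModeling,
    DebertaV2Config,
    DebertaV2ForMaskedLM,
    DebertaV2Tokenizer,
    Trainer,
    TrainingArguments,
)

# Tokenizer built from the sentencepiece model sketched above (path is a placeholder).
tokenizer = DebertaV2Tokenizer("spm.model")

# Toy stand-in for the Juman++-segmented training corpus.
corpus = Dataset.from_dict({"text": ["京都 大学 で 自然 言語 処理 を 研究 する 。"]})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# A configuration roughly in the "large" range; the released config may differ.
config = DebertaV2Config(
    vocab_size=32000,
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
    intermediate_size=4096,
)
model = DebertaV2ForMaskedLM(config)

# Standard dynamic masking of 15% of the input tokens.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="deberta-v2-large-japanese", per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```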

The following hyperparameters were used during pre-training:

## Fine-tuning on NLU tasks

We fine-tuned the following models and evaluated them on the dev set of JGLUE.
We tuned the learning rate and the number of training epochs for each model and task, following [the JGLUE paper](https://www.jstage.jst.go.jp/article/jnlp/30/1/30_63/_pdf/-char/ja).

| Model                         | MARC-ja/acc | JSTS/pearson | JSTS/spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
|-------------------------------|-------------|--------------|---------------|----------|-----------|-----------|------------|
| Waseda RoBERTa base           | 0.965       | 0.913        | 0.876         | 0.905    | 0.853     | 0.916     | 0.853      |
| Waseda RoBERTa large (seq512) | 0.969       | 0.925        | 0.890         | 0.928    | 0.910     | 0.955     | 0.900      |
| LUKE Japanese base*           | 0.965       | 0.916        | 0.877         | 0.912    | -         | -         | 0.842      |
| LUKE Japanese large*          | 0.965       | 0.932        | 0.902         | 0.927    | -         | -         | 0.893      |
| DeBERTaV2 base                | 0.970       | 0.922        | 0.886         | 0.922    | 0.899     | 0.951     | 0.873      |
| DeBERTaV2 large               | 0.968       | 0.925        | 0.892         | 0.924    | 0.912     | 0.959     | 0.890      |

*The scores of LUKE are from [the official repository](https://github.com/studio-ousia/luke).
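
To illustrate the fine-tuning setup, here is a minimal sketch for a JNLI-style sentence-pair classification task with the Trainer API. The toy examples, label count, and hyperparameters are placeholders (JGLUE itself is not bundled here), and, as above, inputs are assumed to be pre-segmented with Juman++.

```python
# Hypothetical fine-tuning sketch for a JNLI-style sentence-pair task; the toy
# data and hyperparameters are placeholders, not the actual JGLUE setup.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("ku-nlp/deberta-v2-large-japanese")
model = AutoModelForSequenceClassification.from_pretrained(
    "ku-nlp/deberta-v2-large-japanese",
    num_labels=3,  # e.g. entailment / contradiction / neutral
)

# Toy premise-hypothesis pairs, already segmented into words by Juman++.
train = Dataset.from_dict(
    {
        "sentence1": ["犬 が 公園 を 走って いる 。"],
        "sentence2": ["動物 が 屋外 に いる 。"],
        "label": [0],
    }
)
train = train.map(
    lambda example: tokenizer(example["sentence1"], example["sentence2"], truncation=True, max_length=128)
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="deberta-v2-large-japanese-jnli", learning_rate=2e-5, num_train_epochs=3),
    train_dataset=train,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```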
 
## Acknowledgments

This work was supported by the Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures (JHPCN) through General Collaboration Project no. jh221004, "Developing a Platform for Constructing and Sharing of Large-Scale Japanese Language Models".
For training the models, we used mdx, a platform for the data-driven future.