retarfi committed
Commit 5f961ce
1 Parent(s): df5e5f2
Files changed (1): README.md (+89 −1)
README.md CHANGED
---
language: ja
license: cc-by-sa-4.0
library_name: transformers
datasets:
- cc100
- mc4
- oscar
- wikipedia
- izumi-lab/cc100-ja
- izumi-lab/mc4-ja-filter-ja-normal
- izumi-lab/oscar2301-ja-filter-ja-normal
- izumi-lab/wikipedia-ja-20230720
- izumi-lab/wikinews-ja-20230728

widget:
- text: 東京大学で[MASK]の研究をしています。

---

# DeBERTa V2 small Japanese

This is a [DeBERTaV2](https://github.com/microsoft/DeBERTa) model pretrained on Japanese texts.
The code used for pretraining is available at [retarfi/language-pretraining](https://github.com/retarfi/language-pretraining/releases/tag/v2.2.1).


## How to use

You can use this model for masked language modeling as follows:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and the pretrained masked language model
tokenizer = AutoTokenizer.from_pretrained("izumi-lab/deberta-v2-small-japanese")
model = AutoModelForMaskedLM.from_pretrained("izumi-lab/deberta-v2-small-japanese")
...
```

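For a quick check of the masked-language-modeling head, the standard `fill-mask` pipeline from `transformers` can be used with the widget example above. This is a minimal sketch; the model id is assumed to be the repository this card belongs to.

```python
from transformers import pipeline

# Minimal sketch: fill-mask inference with the widget example from this card.
# The model id below is an assumption based on the card title.
fill_mask = pipeline("fill-mask", model="izumi-lab/deberta-v2-small-japanese")

for prediction in fill_mask("東京大学で[MASK]の研究をしています。", top_k=5):
    print(prediction["token_str"], round(prediction["score"], 3))
```

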
## Tokenization

The model uses a SentencePiece-based tokenizer; the vocabulary was trained on Japanese Wikipedia with [sentencepiece](https://github.com/google/sentencepiece).

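Assuming the tokenizer is applied directly to raw text, as is typical for SentencePiece vocabularies, the sketch below shows how a sentence is split into subword pieces and mapped to ids (model id assumed as above).

```python
from transformers import AutoTokenizer

# Assumed model id; the SentencePiece vocabulary splits raw Japanese text into
# subword pieces, which are then mapped to ids with special tokens added.
tokenizer = AutoTokenizer.from_pretrained("izumi-lab/deberta-v2-small-japanese")

text = "東京大学で自然言語処理の研究をしています。"
print(tokenizer.tokenize(text))                         # subword pieces
print(tokenizer(text)["input_ids"])                     # ids with special tokens added
print(tokenizer.decode(tokenizer(text)["input_ids"]))   # round-trip back to text
```

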
## Training Data

We used the following corpora for pre-training:

- [Japanese portion of CC-100](https://huggingface.co/datasets/izumi-lab/cc100-ja)
- [Japanese portion of mC4](https://huggingface.co/datasets/izumi-lab/mc4-ja-filter-ja-normal)
- [Japanese portion of OSCAR2301](https://huggingface.co/datasets/izumi-lab/oscar2301-ja-filter-ja-normal)
- [Japanese Wikipedia as of July 20, 2023](https://huggingface.co/datasets/izumi-lab/wikipedia-ja-20230720)
- [Japanese Wikinews as of July 28, 2023](https://huggingface.co/datasets/izumi-lab/wikinews-ja-20230728)

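The filtered corpora are published as datasets on the Hugging Face Hub, so they can be inspected with the `datasets` library. A sketch only; split and configuration names should be confirmed on each dataset page.

```python
from datasets import load_dataset

# Hypothetical inspection of one of the pretraining corpora; the larger corpora
# (mC4, OSCAR) are best read in streaming mode to avoid a full download.
wiki = load_dataset("izumi-lab/wikipedia-ja-20230720", split="train", streaming=True)
for example in wiki.take(3):
    print(example)
```

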
## Training Parameters

- learning_rate: 6e-4
- total_train_batch_size: 2,016
- max_seq_length: 128
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-06
- lr_scheduler_type: linear schedule with warmup
- training_steps: 1,000,000
- warmup_steps: 100,000
- precision: BF16

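For readers who want to reproduce the optimization setup, the parameters above map onto a standard PyTorch/`transformers` optimizer and scheduler roughly as sketched below; the actual pretraining was run with the retarfi/language-pretraining code base, so that repository is authoritative.

```python
import torch
from transformers import AutoConfig, AutoModelForMaskedLM, get_linear_schedule_with_warmup

# Illustrative mapping of the listed hyperparameters; model id assumed.
config = AutoConfig.from_pretrained("izumi-lab/deberta-v2-small-japanese")
model = AutoModelForMaskedLM.from_config(config)  # randomly initialized for pretraining

optimizer = torch.optim.Adam(model.parameters(), lr=6e-4, betas=(0.9, 0.999), eps=1e-6)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100_000,      # warmup_steps
    num_training_steps=1_000_000,  # training_steps
)
# Each step would see an effective batch of 2,016 sequences of up to 128 tokens,
# with computation in bfloat16.
```

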
## Fine-tuning on General NLU tasks

We fine-tuned and evaluated the model on general NLU tasks; each score below is the average over five random seeds.

| Model                                                                     | JSTS (Pearson/Spearman) | JNLI (acc) | JCommonsenseQA (acc) |
|---------------------------------------------------------------------------|-------------------------|------------|----------------------|
| **DeBERTaV2 small**                                                       | **0.890/0.846**         | **0.880**  | **0.737**            |
| [UTokyo BERT small](https://huggingface.co/izumi-lab/bert-small-japanese) | 0.889/0.841             | 0.841      | 0.715                |

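For fine-tuning on such tasks, the pretrained encoder can be combined with a sequence-classification head in the usual `transformers` way. A minimal sketch, assuming the model id and using a JNLI-style three-way sentence-pair task as the example; the sentence pair is illustrative only.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical setup for a three-way sentence-pair task such as JNLI
# (entailment / contradiction / neutral); model id assumed.
model_id = "izumi-lab/deberta-v2-small-japanese"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)

# Illustrative pair: "A man is playing a guitar." / "A person is playing an instrument."
enc = tokenizer(
    "男性がギターを弾いている。",
    "人が楽器を演奏している。",
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
print(model(**enc).logits.shape)  # torch.Size([1, 3]) before any fine-tuning
```

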
## Citation

TBA


## Licenses

The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 4.0 License](https://creativecommons.org/licenses/by-sa/4.0/).


## Acknowledgments

This work was supported in part by JSPS KAKENHI Grant Number JP21K12010 and the JST-Mirai Program Grant Number JPMJMI20B1, Japan.