Fill-Mask
Transformers
PyTorch
Japanese
deberta-v2
Inference Endpoints
retarfi committed on
Commit
a93c61f
1 Parent(s): 7fa9d3c
README.md CHANGED
@@ -1,3 +1,100 @@
  ---
- license: cc-by-4.0
+ language: ja
+ license: cc-by-sa-4.0
+ library_name: transformers
+ datasets:
+ - cc100
+ - mc4
+ - oscar
+ - wikipedia
+ - izumi-lab/cc100-ja
+ - izumi-lab/mc4-ja-filter-ja-normal
+ - izumi-lab/oscar2301-ja-filter-ja-normal
+ - izumi-lab/wikipedia-ja-20230720
+ - izumi-lab/wikinews-ja-20230728
+
+ widget:
+ - text: 東京大学で[MASK]の研究をしています。
+
  ---
+
+ # DeBERTa V2 base Japanese
+
+ This is a [DeBERTaV2](https://github.com/microsoft/DeBERTa) model pretrained on Japanese texts.
+ The code for pretraining is available at [retarfi/language-pretraining](https://github.com/retarfi/language-pretraining/releases/tag/v2.2.1).
+
+
+ ## How to use
+
+ You can use this model for masked language modeling as follows:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+ tokenizer = AutoTokenizer.from_pretrained("izumi-lab/deberta-v2-base-japanese")
+ model = AutoModelForMaskedLM.from_pretrained("izumi-lab/deberta-v2-base-japanese")
+ ...
+ ```
+
+
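+ For quick inference, the checkpoint can also be loaded into a `fill-mask` pipeline. The snippet below is a minimal sketch based on the widget example above; the predicted tokens and scores are not reproduced here.
+
+ ```python
+ from transformers import pipeline
+
+ # Load the checkpoint into a fill-mask pipeline (downloads the model on first use).
+ fill_mask = pipeline("fill-mask", model="izumi-lab/deberta-v2-base-japanese")
+
+ # [MASK] is the mask token defined in special_tokens_map.json.
+ for prediction in fill_mask("東京大学で[MASK]の研究をしています。"):
+     print(prediction["token_str"], prediction["score"])
+ ```
+
+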
+ ## Tokenization
+
+ The model uses a SentencePiece-based tokenizer; the vocabulary was trained on Japanese Wikipedia using [sentencepiece](https://github.com/google/sentencepiece).
+
+
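+ To see how the SentencePiece vocabulary segments Japanese text, the tokenizer can be loaded on its own. The following is a minimal sketch; the exact subword pieces depend on the trained vocabulary and are not shown here.
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("izumi-lab/deberta-v2-base-japanese")
+
+ # SentencePiece subword segmentation of a Japanese sentence.
+ print(tokenizer.tokenize("東京大学で自然言語処理の研究をしています。"))
+
+ # Special tokens follow special_tokens_map.json ([CLS], [SEP], [MASK], ...).
+ print(tokenizer.convert_tokens_to_ids(["[CLS]", "[MASK]", "[SEP]"]))
+ ```
+
+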
+ ## Training Data
+
+ We used the following corpora for pre-training (a loading sketch is shown after the lists):
+
+ - [Japanese portion of CC-100](https://huggingface.co/datasets/izumi-lab/cc100-ja)
+ - [Japanese portion of mC4](https://huggingface.co/datasets/izumi-lab/mc4-ja-filter-ja-normal)
+ - [Japanese portion of OSCAR2301](https://huggingface.co/datasets/izumi-lab/oscar2301-ja-filter-ja-normal)
+ - [Japanese Wikipedia as of July 20, 2023](https://huggingface.co/datasets/izumi-lab/wikipedia-ja-20230720)
+ - [Japanese Wikinews as of July 28, 2023](https://huggingface.co/datasets/izumi-lab/wikinews-ja-20230728)
+
+
+ We pretrained on the corpora above for 900k steps, and then continued pretraining on the following financial corpora for an additional 100k steps:
+ - Summaries of financial results from October 9, 2012, to December 31, 2022
+ - Securities reports from February 8, 2018, to December 31, 2022
+ - News articles
+
+
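+ The izumi-lab corpora above are hosted as datasets on the Hugging Face Hub. A minimal loading sketch, assuming the default configuration of `izumi-lab/wikipedia-ja-20230720` and a standard `train` split (streaming is used only to avoid a full download):
+
+ ```python
+ from datasets import load_dataset
+
+ # Stream the Wikipedia snapshot instead of downloading it entirely.
+ wiki = load_dataset("izumi-lab/wikipedia-ja-20230720", split="train", streaming=True)
+
+ # Print a few examples to inspect the schema.
+ for example in wiki.take(3):
+     print(example)
+ ```
+
+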
+ ## Training Parameters
+
+ Values in parentheses indicate the learning rate used for the additional pre-training on the financial corpora. An illustrative optimizer and schedule setup is sketched after the list.
+ - learning_rate: 2.4e-4 (6e-5)
+ - total_train_batch_size: 2,016
+ - max_seq_length: 512
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-06
+ - lr_scheduler_type: linear schedule with warmup
+ - training_steps: 1,000,000
+ - warmup_steps: 100,000
+ - precision: FP16
+
+
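+ The pretraining itself was run with the code in [retarfi/language-pretraining](https://github.com/retarfi/language-pretraining/releases/tag/v2.2.1). The sketch below only illustrates how the optimizer and schedule listed above map onto standard PyTorch/transformers utilities; the freshly initialized model is purely for demonstration.
+
+ ```python
+ import torch
+ from transformers import AutoConfig, DebertaV2ForMaskedLM, get_linear_schedule_with_warmup
+
+ # Build an untrained model with the same architecture, for illustration only.
+ config = AutoConfig.from_pretrained("izumi-lab/deberta-v2-base-japanese")
+ model = DebertaV2ForMaskedLM(config)
+
+ # Adam with the betas/epsilon above; 2.4e-4 for the main run, 6e-5 for the financial continuation.
+ optimizer = torch.optim.Adam(model.parameters(), lr=2.4e-4, betas=(0.9, 0.999), eps=1e-6)
+
+ # Linear decay with 100k warmup steps over 1M total training steps.
+ scheduler = get_linear_schedule_with_warmup(
+     optimizer, num_warmup_steps=100_000, num_training_steps=1_000_000
+ )
+ ```
+
+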
+ ## Fine-tuning on General NLU tasks
+
+ We report the average over five seeds for our model; results for the other models are taken from the [JGLUE repository](https://github.com/yahoojapan/JGLUE). A minimal fine-tuning sketch follows the table.
+
+
+ | Model               | JSTS (Pearson/Spearman) | JNLI (acc) | JCommonsenseQA (acc) |
+ |---------------------|-------------------------|------------|----------------------|
+ | **DeBERTaV2 base**  | **0.890/0.846**         | **0.xxx**  | **0.859**            |
+ | Waseda RoBERTa base | 0.913/0.873             | 0.895      | 0.840                |
+ | Tohoku BERT base    | 0.909/0.868             | 0.899      | 0.808                |
+
+
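+ As an illustration only (not the exact recipe behind the scores above), a sentence-pair classification head can be fine-tuned from this checkpoint roughly as follows. The column names, label count, and hyperparameters are assumptions for a JNLI-style task.
+
+ ```python
+ from transformers import (
+     AutoModelForSequenceClassification,
+     AutoTokenizer,
+     Trainer,
+     TrainingArguments,
+ )
+
+ tokenizer = AutoTokenizer.from_pretrained("izumi-lab/deberta-v2-base-japanese")
+ # Three labels for an NLI-style task (entailment / contradiction / neutral).
+ model = AutoModelForSequenceClassification.from_pretrained(
+     "izumi-lab/deberta-v2-base-japanese", num_labels=3
+ )
+
+ def preprocess(batch):
+     # Column names are hypothetical; adapt them to the actual dataset schema.
+     return tokenizer(batch["sentence1"], batch["sentence2"], truncation=True, max_length=512)
+
+ args = TrainingArguments(
+     output_dir="deberta-v2-base-japanese-jnli",
+     learning_rate=2e-5,                # illustrative value
+     per_device_train_batch_size=32,
+     num_train_epochs=4,
+ )
+ # trainer = Trainer(model=model, args=args, train_dataset=..., tokenizer=tokenizer)
+ # trainer.train()
+ ```
+
+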
+ ## Citation
+
+ TBA
+
+
+ ## Licenses
+
+ The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/).
+
+
+ ## Acknowledgments
+
+ This work was supported in part by JSPS KAKENHI Grant Number JP21K12010 and JST-Mirai Program Grant Number JPMJMI20B1, Japan.
config.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "architectures": [
+     "DebertaV2ForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-07,
+   "max_position_embeddings": 512,
+   "max_relative_positions": -1,
+   "model_type": "deberta-v2",
+   "norm_rel_ebd": "layer_norm",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 1,
+   "pooler_dropout": 0,
+   "pooler_hidden_act": "gelu",
+   "pooler_hidden_size": 768,
+   "pos_att_type": "p2c|c2p",
+   "position_biased_input": false,
+   "position_buckets": 256,
+   "relative_attention": true,
+   "share_att_key": true,
+   "torch_dtype": "float16",
+   "transformers_version": "4.31.0",
+   "type_vocab_size": 0,
+   "vocab_size": 32000
+ }
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:596337893657556b383e9813945f6cb3a990f5e0e90530d64b1d49941cd2ca37
+ size 542676485
special_tokens_map.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "bos_token": "[CLS]",
+   "cls_token": "[CLS]",
+   "eos_token": "[SEP]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f8e9cbe24bc1bb25ef87a4371c222666539011d1a749cd4858a88a64771acc1a
+ size 804800
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "bos_token": "[CLS]",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_lower_case": false,
+   "eos_token": "[SEP]",
+   "mask_token": "[MASK]",
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "sp_model_kwargs": {},
+   "split_by_punct": false,
+   "tokenizer_class": "DebertaV2Tokenizer",
+   "unk_token": "[UNK]"
+ }