clee84 commited on
Commit
42d1e45
โ€ข
1 Parent(s): a2275d3

[init] Add model files

Browse files
Files changed (6) hide show
  1. README.md +117 -0
  2. config.json +22 -0
  3. pytorch_model.bin +3 -0
  4. special_tokens_map.json +1 -0
  5. spm.model +3 -0
  6. tokenizer_config.json +1 -0
README.md ADDED
@@ -0,0 +1,117 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ language:
3
+ - multilingual
4
+ - en
5
+ - ko
6
+ - ar
7
+ - bg
8
+ - de
9
+ - el
10
+ - es
11
+ - fr
12
+ - hi
13
+ - ru
14
+ - sw
15
+ - th
16
+ - tr
17
+ - ur
18
+ - vi
19
+ - zh
20
+ tags:
21
+ - deberta
22
+ - deberta-v3
23
+ - mdeberta
24
+ - korean
25
+ - pretraining
26
+ license: mit
27
+ ---
28
+
29
+ # mDeBERTa-v3-base-kor-further
30
+
31
+ > ๐Ÿ’ก ์•„๋ž˜ ํ”„๋กœ์ ํŠธ๋Š”ย KPMG Lighthouse Korea์—์„œ ์ง„ํ–‰ํ•˜์˜€์Šต๋‹ˆ๋‹ค.
32
+ > KPMG Lighthouse Korea์—์„œ๋Š”, Financial area์˜ ๋‹ค์–‘ํ•œ ๋ฌธ์ œ๋“ค์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด Edge Technology์˜ NLP/Vision AI๋ฅผ ๋ชจ๋ธ๋งํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.
33
+ > https://kpmgkr.notion.site/
34
+
35
+ ## What is DeBERTa?
36
+ - [DeBERTa](https://arxiv.org/abs/2006.03654)๋Š” `Disentangled Attention` + `Enhanced Mask Decoder` ๋ฅผ ์ ์šฉํ•˜์—ฌ ๋‹จ์–ด์˜ positional information์„ ํšจ๊ณผ์ ์œผ๋กœ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค. ์ด์™€ ๊ฐ™์€ ์•„์ด๋””์–ด๋ฅผ ํ†ตํ•ด, ๊ธฐ์กด์˜ BERT, RoBERTa์—์„œ ์‚ฌ์šฉํ–ˆ๋˜ absolute position embedding๊ณผ๋Š” ๋‹ฌ๋ฆฌ DeBERTa๋Š” ๋‹จ์–ด์˜ ์ƒ๋Œ€์ ์ธ ์œ„์น˜ ์ •๋ณด๋ฅผ ํ•™์Šต ๊ฐ€๋Šฅํ•œ ๋ฒกํ„ฐ๋กœ ํ‘œํ˜„ํ•˜์—ฌ ๋ชจ๋ธ์„ ํ•™์Šตํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ๊ฒฐ๊ณผ์ ์œผ๋กœ, BERT, RoBERTA ์™€ ๋น„๊ตํ–ˆ์„ ๋•Œ ๋” ์ค€์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ฃผ์—ˆ์Šต๋‹ˆ๋‹ค.
37
+ - [DeBERTa-v3](https://arxiv.org/abs/2111.09543)์—์„œ๋Š”, ์ด์ „ ๋ฒ„์ „์—์„œ ์‚ฌ์šฉํ–ˆ๋˜ MLM (Masked Language Model) ์„ RTD (Replaced Token Detection) Task ๋กœ ๋Œ€์ฒดํ•œ ELECTRA ์Šคํƒ€์ผ์˜ ์‚ฌ์ „ํ•™์Šต ๋ฐฉ๋ฒ•๊ณผ, Gradient-Disentangled Embedding Sharing ์„ ์ ์šฉํ•˜์—ฌ ๋ชจ๋ธ ํ•™์Šต์˜ ํšจ์œจ์„ฑ์„ ๊ฐœ์„ ํ•˜์˜€์Šต๋‹ˆ๋‹ค.
38
+ - DeBERTa์˜ ์•„ํ‚คํ…์ฒ˜๋กœ ํ’๋ถ€ํ•œ ํ•œ๊ตญ์–ด ๋ฐ์ดํ„ฐ๋ฅผ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•ด์„œ, `mDeBERTa-v3-base-kor-further` ๋Š” microsoft ๊ฐ€ ๋ฐœํ‘œํ•œ `mDeBERTa-v3-base` ๋ฅผ ์•ฝ 40GB์˜ ํ•œ๊ตญ์–ด ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ **์ถ”๊ฐ€์ ์ธ ์‚ฌ์ „ํ•™์Šต**์„ ์ง„ํ–‰ํ•œ ์–ธ์–ด ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.
39
+
40
+ ## How to Use
41
+ - Requirements
42
+ ```
43
+ pip install transformers
44
+ pip install sentencepiece
45
+ ```
46
+ - Huggingface Hub
47
+ ```python
48
+ from transformers import AutoModel, AutoTokenizer
49
+
50
+ model = AutoModel.from_pretrained("lighthouse/mdeberta-v3-base-kor-further") # DebertaV2ForModel
51
+ tokenizer = AutoTokenizer.from_pretrained("lighthouse/mdeberta-v3-base-kor-further") # DebertaV2Tokenizer (SentencePiece)
52
+ ```
53
+
54
+ ## Pre-trained Models
55
+ - ๋ชจ๋ธ์˜ ์•„ํ‚คํ…์ฒ˜๋Š” ๊ธฐ์กด microsoft์—์„œ ๋ฐœํ‘œํ•œ `mdeberta-v3-base`์™€ ๋™์ผํ•œ ๊ตฌ์กฐ์ž…๋‹ˆ๋‹ค.
56
+
57
+ | | Vocabulary(K) | Backbone Parameters(M) | Hidden Size | Layers | Note |
58
+ | --- | --- | --- | --- | --- | --- |
59
+ | mdeberta-v3-base-kor-further (mdeberta-v3-base์™€ ๋™์ผ) | 250 | 86 | 768 | 12 | 250K new SPM vocab |
60
+
61
+ ## Further Pretraing Details (MLM Task)
62
+ - `mDeBERTa-v3-base-kor-further` ๋Š” `microsoft/mDeBERTa-v3-base` ๋ฅผ ์•ฝ 40GB์˜ ํ•œ๊ตญ์–ด ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ MLM Task๋ฅผ ์ ์šฉํ•˜์—ฌ ์ถ”๊ฐ€์ ์ธ ์‚ฌ์ „ ํ•™์Šต์„ ์ง„ํ–‰ํ•˜์˜€์Šต๋‹ˆ๋‹ค.
63
+
64
+ | | Max length | Learning Rate | Batch Size | Train Steps | Warm-up Steps |
65
+ | --- | --- | --- | --- | --- | --- |
66
+ | mdeberta-v3-base-kor-further | 512 | 2e-5 | 8 | 5M | 50k |
67
+
68
+
69
+ ## Datasets
70
+ - ๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜(์‹ ๋ฌธ, ๊ตฌ์–ด, ๋ฌธ์–ด), ํ•œ๊ตญ์–ด Wiki, ๊ตญ๋ฏผ์ฒญ์› ๋“ฑ ์•ฝ 40 GB ์˜ ํ•œ๊ตญ์–ด ๋ฐ์ดํ„ฐ์…‹์ด ์ถ”๊ฐ€์ ์ธ ์‚ฌ์ „ํ•™์Šต์— ์‚ฌ์šฉ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
71
+ - Train: 10M lines, 5B tokens
72
+ - Valid: 2M lines, 1B tokens
73
+ - cf) ๊ธฐ์กด mDeBERTa-v3์€ XLM-R ๊ณผ ๊ฐ™์ด [cc-100 ๋ฐ์ดํ„ฐ์…‹](https://data.statmt.org/cc-100/)์œผ๋กœ ํ•™์Šต๋˜์—ˆ์œผ๋ฉฐ, ๊ทธ ์ค‘ ํ•œ๊ตญ์–ด ๋ฐ์ดํ„ฐ์…‹์˜ ํฌ๊ธฐ๋Š” 54GB์ž…๋‹ˆ๋‹ค.
74
+
75
+
76
+ ## Fine-tuning on NLU Tasks - Base Model
77
+ | Model | Size | NSMC(acc) | Naver NER(F1) | PAWS (acc) | KorNLI (acc) | KorSTS (spearman) | Question Pair (acc) | KorQuaD (Dev) (EM/F1) | Korean-Hate-Speech (Dev) (F1) |
78
+ | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
79
+ | XLM-Roberta-Base | 1.03G | 89.03 | 86.65 | 82.80 | 80.23 | 78.45 | 93.80 | 64.70 / 88.94 | 64.06 |
80
+ | mdeberta-base | 534M | 90.01 | 87.43 | 85.55 | 80.41 | **82.65** | 94.06 | 65.48 / 89.74 | 62.91 |
81
+ | mdeberta-base-kor-further (Ours) | 534M | **90.52** | **87.87** | **85.85** | **80.65** | 81.90 | **94.98** | **66.07 / 90.35** | **68.16** |
82
+
83
+
84
+ ## KPMG Lighthouse KR
85
+ https://kpmgkr.notion.site/
86
+
87
+
88
+ ## Citation
89
+ ```
90
+ @misc{he2021debertav3,
91
+ title={DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing},
92
+ author={Pengcheng He and Jianfeng Gao and Weizhu Chen},
93
+ year={2021},
94
+ eprint={2111.09543},
95
+ archivePrefix={arXiv},
96
+ primaryClass={cs.CL}
97
+ }
98
+ ```
99
+
100
+ ```
101
+ @inproceedings{
102
+ he2021deberta,
103
+ title={DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION},
104
+ author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
105
+ booktitle={International Conference on Learning Representations},
106
+ year={2021},
107
+ url={https://openreview.net/forum?id=XPZIaotutsD}
108
+ }
109
+ ```
110
+
111
+ ## Reference
112
+ - [mDeBERTa-v3-base-kor-further](https://github.com/kpmg-kr/mDeBERTa-v3-base-kor-further)
113
+ - [DeBERTa](https://github.com/microsoft/DeBERTa)
114
+ - [Huggingface Transformers](https://github.com/huggingface/transformers)
115
+ - [๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜](https://corpus.korean.go.kr/)
116
+ - [Korpora: Korean Corpora Archives](https://github.com/ko-nlp/Korpora)
117
+ - [sooftware/Korean PLM](https://github.com/sooftware/Korean-PLM)
config.json ADDED
@@ -0,0 +1,22 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "model_type": "deberta-v2",
3
+ "attention_probs_dropout_prob": 0.1,
4
+ "hidden_act": "gelu",
5
+ "hidden_dropout_prob": 0.1,
6
+ "hidden_size": 768,
7
+ "initializer_range": 0.02,
8
+ "intermediate_size": 3072,
9
+ "max_position_embeddings": 512,
10
+ "relative_attention": true,
11
+ "position_buckets": 256,
12
+ "norm_rel_ebd": "layer_norm",
13
+ "share_att_key": true,
14
+ "pos_att_type": "p2c|c2p",
15
+ "layer_norm_eps": 1e-7,
16
+ "max_relative_positions": -1,
17
+ "position_biased_input": false,
18
+ "num_attention_heads": 12,
19
+ "num_hidden_layers": 12,
20
+ "type_vocab_size": 0,
21
+ "vocab_size": 251000
22
+ }
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:38fba5483de669ada10c9f5fcb70bc225ac56c64b606267899bc81c20d7825a6
3
+ size 558939619
special_tokens_map.json ADDED
@@ -0,0 +1 @@
 
 
1
+ {"bos_token": "[CLS]", "eos_token": "[SEP]", "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
spm.model ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:13c8d666d62a7bc4ac8f040aab68e942c861f93303156cc28f5c7e885d86d6e3
3
+ size 4305025
tokenizer_config.json ADDED
@@ -0,0 +1 @@
 
 
1
+ {"do_lower_case": false, "bos_token": "[CLS]", "eos_token": "[SEP]", "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "split_by_punct": false, "sp_model_kwargs": {}, "vocab_type": "spm", "special_tokens_map_file": null, "tokenizer_file": null, "name_or_path": "/home/ml/data2/hyesu/lm-deberta/mdeberta_further_kor_base", "tokenizer_class": "DebertaV2Tokenizer"}