initial commit

- LICENSE +21 -0
- README.md +103 -0
- config.json +36 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +9 -0
- tokenizer.json +0 -0
- tokenizer_config.json +16 -0
- vocab.txt +0 -0
LICENSE
ADDED

MIT License

Copyright (c) 2023 kakaobank

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md
CHANGED

---
license: mit
language:
- ko
pipeline_tag: fill-mask
---

# KF-DeBERTa

We release KF-DeBERTa, a finance-domain-specialized Korean language model trained by KakaoBank & FnGuide.

## Model description
* KF-DeBERTa is a language model trained jointly on a general-domain corpus and a financial-domain corpus.
* The architecture is based on [DeBERTa-v2](https://github.com/microsoft/DeBERTa#whats-new-in-v2).
* DeBERTa-v3, which uses ELECTRA's RTD as its training objective, showed considerably lower performance on some tasks (KLUE-RE, WoS, Retrieval), so DeBERTa-v2 was chosen as the final architecture.
* The model performs strongly on downstream tasks in both the general and financial domains.
* To validate financial-domain performance rigorously, we evaluated on a wide variety of financial datasets.
* It outperforms existing language models in both domains; on the KLUE benchmark in particular, it even surpasses RoBERTa-Large.

## Usage
```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("kakaobank/kf-deberta-base")
tokenizer = AutoTokenizer.from_pretrained("kakaobank/kf-deberta-base")

text = "카카오뱅크와 에프엔가이드가 금융특화 언어모델을 공개합니다."
tokens = tokenizer.tokenize(text)
print(tokens)

inputs = tokenizer(text, return_tensors="pt")
model_output = model(**inputs)
print(model_output)
```
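Because the card's `pipeline_tag` is `fill-mask` and the checkpoint ships as `DebertaV2ForMaskedLM` (see `config.json` below), the model can also be exercised through the transformers fill-mask pipeline. A minimal sketch; the masked sentence is our own illustrative input, not from the original card:

```python
from transformers import pipeline

# Loads the masked-LM head; [MASK] is the mask token from special_tokens_map.json.
fill_mask = pipeline("fill-mask", model="kakaobank/kf-deberta-base")

# "KakaoBank released a finance-specialized [MASK] model." (illustrative example)
for pred in fill_mask("카카오뱅크는 금융 특화 [MASK] 모델을 공개했다."):
    print(f"{pred['token_str']}\t{pred['score']:.4f}")
```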

## Benchmark
* For every task we performed only a basic hyperparameter search over the grid below (enumerated in the sketch after this list):
  * batch size: {16, 32}
  * learning_rate: {1e-5, 3e-5, 5e-5}
  * weight_decay: {0, 0.01}
  * warmup_proportion: {0, 0.1}
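
For reference, that grid amounts to 2 × 3 × 2 × 2 = 24 configurations per task. A minimal sketch of enumerating it; `train_and_evaluate` is a hypothetical placeholder, not part of any released code:

```python
from itertools import product

# The 24 hyperparameter configurations searched per task, per the card.
grid = {
    "batch_size": [16, 32],
    "learning_rate": [1e-5, 3e-5, 5e-5],
    "weight_decay": [0.0, 0.01],
    "warmup_proportion": [0.0, 0.1],
}

for values in product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    print(config)  # train_and_evaluate(config) would run one fine-tuning trial here
```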

**KLUE Benchmark**

| Model | YNAT | KLUE-ST | KLUE-NLI | KLUE-NER | KLUE-RE | KLUE-DP | KLUE-MRC | WoS | AVG |
|:--------------------:|:----------------:|:----------------------:|:------------:|:---------------------------------:|:-----------------------------:|:----------------------:|:-------------------------:|:----------------------:|:----------------:|
| | F1 | Pearson's r/F1 | ACC | F1-Entity/F1-Char | F1-micro/AUC | UAS/LAS | EM/ROUGE | JGA/F1-S | |
| mBERT (Base) | 82.64 | 82.97/75.93 | 72.90 | 75.56/88.81 | 58.39/56.41 | 88.53/86.04 | 49.96/55.57 | 35.27/88.60 | 71.26 |
| XLM-R (Base) | 84.52 | 88.88/81.20 | 78.23 | 80.48/92.14 | 57.62/57.05 | 93.12/87.23 | 26.76/53.36 | 41.54/89.81 | 72.28 |
| XLM-R (Large) | 87.30 | 93.08/87.17 | 86.40 | 82.18/93.20 | 58.75/63.53 | 92.87/87.82 | 35.23/66.55 | 42.44/89.88 | 76.17 |
| KR-BERT (Base) | 85.36 | 87.50/77.92 | 77.10 | 74.97/90.46 | 62.83/65.42 | 92.87/87.13 | 48.95/58.38 | 45.60/90.82 | 74.67 |
| KoELECTRA (Base) | 85.99 | 93.14/85.89 | 86.87 | 86.06/92.75 | 62.67/57.46 | 90.93/87.07 | 59.54/65.64 | 39.83/88.91 | 77.34 |
| KLUE-BERT (Base) | 86.95 | 91.01/83.44 | 79.87 | 83.71/91.17 | 65.58/68.11 | 93.07/87.25 | 62.42/68.15 | 46.72/91.59 | 78.50 |
| KLUE-RoBERTa (Small) | 85.95 | 91.70/85.42 | 81.00 | 83.55/91.20 | 61.26/60.89 | 93.47/87.50 | 58.25/63.56 | 46.65/91.50 | 77.28 |
| KLUE-RoBERTa (Base) | 86.19 | 92.91/86.78 | 86.30 | 83.81/91.09 | 66.73/68.11 | 93.75/87.77 | 69.56/74.64 | 47.41/91.60 | 80.48 |
| KLUE-RoBERTa (Large) | 85.88 | 93.20/86.13 | **89.50** | 84.54/91.45 | **71.06**/73.33 | 93.84/87.93 | **75.26**/**80.30** | 49.39/92.19 | 82.43 |
| KF-DeBERTa (Base) | **<u>87.51</u>** | **<u>93.24/87.73</u>** | <u>88.37</u> | **<u>89.17</u>**/**<u>93.30</u>** | <u>69.70</u>/**<u>75.07</u>** | **<u>94.05/87.97</u>** | <u>72.59</u>/<u>78.08</u> | **<u>50.21/92.59</u>** | **<u>82.83</u>** |

* Bold marks the best score among all models; underline marks the best score among base-size models.

**Financial-domain benchmark**

| Model | FN-Sentiment (v1) | FN-Sentiment (v2) | FN-Adnews | FN-NER | KorFPB | KorFiQA-SA | KorHeadline | Avg (excl. KorFiQA-SA) |
|:-------------------:|:-----------------:|:-----------------:|:---------:|:---------:|:---------:|:----------:|:-----------:|:----------------------:|
| | ACC | ACC | ACC | F1-micro | ACC | MSE | Mean F1 | |
| KLUE-RoBERTa (Base) | 98.26 | 91.21 | 96.34 | 90.31 | 90.97 | 0.0589 | 81.11 | 94.03 |
| KoELECTRA (Base) | 98.26 | 90.56 | 96.98 | 89.81 | 92.36 | 0.0652 | 80.69 | 93.90 |
| KF-DeBERTa (Base) | **99.36** | **92.29** | **97.63** | **91.80** | **93.47** | **0.0553** | **82.12** | **95.27** |

* **FN-Sentiment**: financial-domain sentiment analysis
* **FN-Adnews**: financial-domain advertorial-article classification
* **FN-NER**: financial-domain named-entity recognition
* **KorFPB**: Korean translation of FinancialPhraseBank
  * Cite: ```Malo, Pekka, et al. "Good debt or bad debt: Detecting semantic orientations in economic texts." Journal of the Association for Information Science and Technology 65.4 (2014): 782-796.```
* **KorFiQA-SA**: Korean translation of FiQA-SA
  * Cite: ```Maia, Macedo & Handschuh, Siegfried & Freitas, Andre & Davis, Brian & McDermott, Ross & Zarrouk, Manel & Balahur, Alexandra. (2018). WWW'18 Open Challenge: Financial Opinion Mining and Question Answering. WWW '18: Companion Proceedings of The Web Conference 2018. 1941-1942. 10.1145/3184558.3192301.```
* **KorHeadline**: Korean translation of Gold Commodity News and Dimensions
  * Cite: ```Sinha, A., & Khandait, T. (2021, April). Impact of News on the Commodity Market: Dataset and Results. In Future of Information and Communication Conference (pp. 589-601). Springer, Cham.```

**General-domain benchmark**

| Model | NSMC | PAWS | KorNLI | KorSTS | KorQuAD | Avg (excl. KorQuAD) |
|:-------------------:|:---------:|:---------:|:---------:|:---------:|:---------------:|:-------------------:|
| | ACC | ACC | ACC | spearman | EM/F1 | |
| KLUE-RoBERTa (Base) | 90.47 | 84.79 | 81.65 | 84.40 | 86.34/94.40 | 85.33 |
| KoELECTRA (Base) | 90.63 | 84.45 | 82.24 | 85.53 | 84.83/93.45 | 85.71 |
| KF-DeBERTa (Base) | **91.36** | **86.14** | **84.54** | **85.99** | **86.60/95.07** | **87.01** |

## License
The source code and model of KF-DeBERTa are released under the MIT license.
The full license text is available in the [LICENSE](LICENSE) file.
We accept no liability for any damages arising from use of the model.

## Citation
```
@inproceedings{jeon-etal-2023-kfdeberta,
  title = {KF-DeBERTa: Financial Domain-specific Pre-trained Language Model},
  author = {Eunkwang Jeon and Jungdae Kim and Minsang Song and Joohyun Ryu},
  booktitle = {Proceedings of the 35th Annual Conference on Human and Cognitive Language Technology},
  month = {oct},
  year = {2023},
  publisher = {Korean Institute of Information Scientists and Engineers},
  url = {http://www.hclt.kr/symp/?lnb=conference},
  pages = {143--148},
}
```
config.json
ADDED

{
  "architectures": [
    "DebertaV2ForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "conv_act": "gelu",
  "conv_kernel_size": 0,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-07,
  "max_position_embeddings": 512,
  "max_relative_positions": -1,
  "model_type": "deberta-v2",
  "norm_rel_ebd": "layer_norm",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_dropout": 0,
  "pooler_hidden_act": "gelu",
  "pooler_hidden_size": 768,
  "pos_att_type": [
    "p2c",
    "c2p"
  ],
  "position_biased_input": false,
  "position_buckets": 256,
  "relative_attention": true,
  "share_att_key": true,
  "torch_dtype": "float32",
  "transformers_version": "4.27.4",
  "type_vocab_size": 0,
  "vocab_size": 130000
}
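A quick way to sanity-check these values after downloading the checkpoint is to load them through transformers; a minimal sketch:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("kakaobank/kf-deberta-base")
# Should match config.json above: deberta-v2, 12 layers, hidden size 768, 130k vocab.
print(config.model_type, config.num_hidden_layers, config.hidden_size, config.vocab_size)
```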
pytorch_model.bin
ADDED

version https://git-lfs.github.com/spec/v1
oid sha256:3cd6cd7811b3c9190e97cae7eb41571c2bc0076431baae7d41d449a8c1c18c6c
size 745694717
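The weights are stored as a Git LFS pointer; the sha256 in the pointer can be used to verify a downloaded copy. A minimal sketch, assuming the file sits in the current directory:

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so the ~746 MB checkpoint never sits in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            digest.update(block)
    return digest.hexdigest()

expected = "3cd6cd7811b3c9190e97cae7eb41571c2bc0076431baae7d41d449a8c1c18c6c"
print(sha256_of("pytorch_model.bin") == expected)
```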
special_tokens_map.json
ADDED

{
  "bos_token": "[CLS]",
  "cls_token": "[CLS]",
  "eos_token": "[SEP]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
tokenizer.json
ADDED

The diff for this file is too large to render.
tokenizer_config.json
ADDED

{
  "do_lower_case": false,
  "do_basic_tokenize": true,
  "never_split": null,
  "unk_token": "[UNK]",
  "sep_token": "[SEP]",
  "pad_token": "[PAD]",
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "bos_token": "[CLS]",
  "eos_token": "[SEP]",
  "tokenize_chinese_chars": true,
  "strip_accents": null,
  "model_max_length": 512,
  "tokenizer_class": "BertTokenizer"
}
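Taken together, special_tokens_map.json and tokenizer_config.json describe a BERT-style WordPiece tokenizer with [CLS]/[SEP] doubling as BOS/EOS and a 512-token limit. A minimal sketch to confirm the loaded tokenizer matches them:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("kakaobank/kf-deberta-base")
# Expect [CLS] [SEP] [MASK] [PAD] [UNK], per special_tokens_map.json.
print(tokenizer.cls_token, tokenizer.sep_token, tokenizer.mask_token,
      tokenizer.pad_token, tokenizer.unk_token)
print(tokenizer.model_max_length)  # 512, per tokenizer_config.json
```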
vocab.txt
ADDED

The diff for this file is too large to render.