---
language:
- ko

---

# KR-FinBert & KR-FinBert-SC

Much progress has been made in NLP (Natural Language Processing), and numerous studies have shown that domain adaptation with a small-scale corpus, followed by fine-tuning on labeled data, is effective for improving overall performance. We propose KR-FinBert for the financial domain, built by further pre-training a Korean BERT model on a financial corpus and fine-tuning it for sentiment analysis. As in prior work, the performance improvement from domain adaptation was clear in our downstream-task experiments as well.

![KR-FinBert](https://huggingface.co/snunlp/KR-FinBert/resolve/main/images/KR-FinBert.png)

## Data

The training data for this model is expanded from that of **[KR-BERT-MEDIUM](https://huggingface.co/snunlp/KR-Medium)**: texts from Korean Wikipedia, general news articles, legal texts crawled from the National Law Information Center, and the [Korean Comments dataset](https://www.kaggle.com/junbumlee/kcbert-pretraining-corpus-korean-news-comments). For the transfer learning, **corporate-related economic news articles from 72 media sources** (e.g., the Financial Times and The Korean Economy Daily) and **analyst reports from 16 securities companies** (e.g., Kiwoom Securities and Samsung Securities) were added. The dataset includes 440,067 news articles (titles with their content) and 11,237 analyst reports; **the total data size is about 13.22 GB.** For MLM training, the data was split line by line, for **a total of 6,379,315 lines.**
KR-FinBert was trained for 5.5M steps with a maximum sequence length of 512, a training batch size of 32, and a learning rate of 5e-5; training took 67.48 hours on an NVIDIA TITAN Xp.
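
For reference, here is a minimal sketch of what this further pre-training setup could look like with the Hugging Face `Trainer`. The quoted hyperparameters come from the paragraph above; the file name `train.txt` and everything else are illustrative assumptions, not the exact training script.

```
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the base checkpoint and continue masked-language-model training.
tokenizer = AutoTokenizer.from_pretrained("snunlp/KR-Medium")
model = AutoModelForMaskedLM.from_pretrained("snunlp/KR-Medium")

# One text per line, as described above; "train.txt" is a placeholder name.
corpus = load_dataset("text", data_files={"train": "train.txt"})["train"]
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="kr-finbert-mlm",
        per_device_train_batch_size=32,  # batch size reported above
        learning_rate=5e-5,              # learning rate reported above
        max_steps=5_500_000,             # 5.5M steps reported above
    ),
    train_dataset=corpus,
    # Standard 15% random token masking for the MLM objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```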


## Downstream tasks
### Sentiment Classification model

Downstream task performance, fine-tuned with 50,000 labeled examples.

|Model|Accuracy|
|-|-|
|KR-FinBert|0.963|
|KR-BERT-MEDIUM|0.958|
|KcBert-large|0.955|
|KcBert-base|0.953|
|KoBert|0.817|
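
The fine-tuning step itself is a standard sequence-classification setup, sketched below. The file `finance_sentiment.csv` and its `text`/`label` columns are hypothetical placeholders; the 50,000-example labeled dataset is not distributed with this card.

```
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("snunlp/KR-FinBert")
# Two labels (positive/negative), matching the sample table below; the
# released KR-FinBert-SC checkpoint defines its own label set.
model = AutoModelForSequenceClassification.from_pretrained(
    "snunlp/KR-FinBert", num_labels=2
)

# Hypothetical labeled data: a CSV with "text" and "label" columns.
dataset = load_dataset("csv", data_files="finance_sentiment.csv")["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="kr-finbert-sc",
                           per_device_train_batch_size=32),
    train_dataset=dataset,
)
trainer.train()
```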

### Inference samples

|Positive|Negative|
|-|-|
|현대바이오, '폴리탁셀' 코로나19 치료 가능성에 19% 급등 | 영화관株 '코로나 빙하기' 언제 끝나나…"CJ CGV 올 4000억 손실 날수도" |
|이수화학, 3분기 영업익 176억…전년比 80%↑ | C쇼크에 멈춘 흑자비행…대한항공 1분기 영업적자 566억 |
|"GKL, 7년 만에 두 자릿수 매출성장 예상" | '1000억대 횡령·배임' 최신원 회장 구속… SK네트웍스 "경영 공백 방지 최선" |
|위지윅스튜디오, 콘텐츠 활약에 사상 첫 매출 1000억원 돌파 | 부품 공급 차질에…기아차 광주공장 전면 가동 중단 |
|삼성전자, 2년 만에 인도 스마트폰 시장 점유율 1위 '왕좌 탈환' | 현대제철, 지난해 영업익 3,313억원···전년比 67.7% 감소 |

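The fine-tuned checkpoint is available as `snunlp/KR-FinBert-SC` (see the citation below), so predictions like those above can be reproduced with the `transformers` pipeline. The exact label strings and scores depend on the checkpoint's configuration:

```
from transformers import pipeline

classifier = pipeline("text-classification", model="snunlp/KR-FinBert-SC")

# First positive sample above, roughly: "Hyundai Bioscience jumps 19% on
# prospects of 'Polytaxel' as a COVID-19 treatment".
print(classifier("현대바이오, '폴리탁셀' 코로나19 치료 가능성에 19% 급등"))
# -> [{'label': ..., 'score': ...}]
```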

### Citation

```
@misc{kr-FinBert-SC,
  author = {Kim, Eunhee and Shin, Hyopil},
  title = {KR-FinBert: Fine-tuning KR-FinBert for Sentiment Analysis},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://huggingface.co/snunlp/KR-FinBert-SC}}
}
```