---
tags:
- krx
- finance
language:
- ko
---

# krx-llm-competition Model Card

๋ชจ๋ธ์€ [KRX LLM ๊ฒฝ์ง„๋Œ€ํšŒ ๋ฆฌ๋”๋ณด๋“œ](https://krxbench.koscom.co.kr/)์—์„œ ์ตœ์ข… 3์œ„๋ฅผ ํ•œ shibainu24 ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค. ๋ชจ๋ธ์€ ๊ธˆ์œต, ํšŒ๊ณ„ ๋“ฑ ๊ธˆ์œต๊ด€๋ จ ์ง€์‹์— ๋Œ€ํ•œ Text Generation์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.  

+ Base model: [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
  
๋ฐ์ดํ„ฐ์…‹ ์ˆ˜์ง‘ ๋ฐ ํ•™์Šต์— ๊ด€๋ จ๋œ ์ฝ”๋“œ๋Š” [https://github.com/aiqwe/krx-llm-competition](https://github.com/aiqwe/krx-llm-competition)์— ์ž์„ธํ•˜๊ฒŒ ๊ณต๊ฐœ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.
์ž์„ธํ•œ ๋‚ด์šฉ์€ [krx_model_card.pdf](krx_model_card.pdf)๋ฅผ ์ฐธ์กฐํ•ด์ฃผ์„ธ์š”.

# Usage
The examples in [https://github.com/aiqwe/krx-llm-competition](https://github.com/aiqwe/krx-llm-competition) make it easy to try inference.
Most inference runs on a single GPU, RTX 3090 or better.

```shell
pip install vllm
```

```python
from vllm import LLM, SamplingParams

# Korean finance prompts (the model targets Korean); English glosses are in the comments.
inputs = [
    # The yen/dollar rate differs slightly between two FX markets. Which trading strategy earns a risk-free profit?
    "외환시장에서 일본 엔화와 미국 달러의 환율이 두 시장에서 약간의 차이를 보이고 있다. 이때 무위험 이익을 얻기 위한 적절한 거래 전략은 무엇인가?",
    # What happens if the creditor of a bond with warrants (BW) does not exercise the warrants?
    "신주인수권부사채(BW)에서 채권자가 신주인수권을 행사하지 않을 경우 어떤 일이 발생하는가?",
    # Which of the following statements about short selling is incorrect?
    "공매도(Short Selling)에 대한 설명으로 옳지 않은 것은 무엇입니까?",
]

# Load the model on a single GPU and generate up to 128 new tokens per prompt.
llm = LLM(model="aiqwe/krx-llm-competition", tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(inputs, sampling_params)
for o in outputs:
    print(o.prompt)
    print(o.outputs[0].text)
    print("*" * 100)
```

# Model Card
| Contents                       | Spec                                |
|--------------------------------|-------------------------------------|
| Base model                     | Qwen2.5-7B-Instruct                |
| Machine                        | A100 SXM 80GB × 2                  |
| dtype                          | bfloat16                           |
| PEFT                           | LoRA (r=8, alpha=64)               |
| Learning Rate                  | 1e-5 (varies by further training)  |
| LRScheduler                    | Cosine (warm-up: 0.05%)            |
| Optimizer                      | AdamW                              |
| Distributed / Efficient Tuning | DeepSpeed v3, Flash Attention      |
| Global Batch Size              | 128                                |

# Dataset Card
Because some sources are under copyright, the Reference datasets are provided as links.
The MCQA and QA datasets are published at [https://huggingface.co/datasets/aiqwe/krx-llm-competition](https://huggingface.co/datasets/aiqwe/krx-llm-competition).  
That Hugging Face dataset repository also provides additional MCQA and QA data that was not used in training.  
In addition, [https://github.com/aiqwe/krx-llm-competition](https://github.com/aiqwe/krx-llm-competition) offers various utility functions and documents the data sourcing pipeline.  

## References
| ๋ฐ์ดํ„ฐ๋ช…                          | url                                                                                      |
|-----------------------------------|------------------------------------------------------------------------------------------|
| ํ•œ๊ตญ์€ํ–‰ ๊ฒฝ์ œ๊ธˆ์œต ์šฉ์–ด 700์„       | [Link](https://www.bok.or.kr/portal/bbs/B0000249/view.do?nttId=235017&menuNo=200765) |
| ์žฌ๋ฌดํšŒ๊ณ„ ํ•ฉ์„ฑ ๋ฐ์ดํ„ฐ              | ์ž์ฒด ์ œ์ž‘                                                                                        |
| ๊ธˆ์œต๊ฐ๋…์šฉ์–ด์‚ฌ์ „                  | [Link](https://terms.naver.com/list.naver?cid=42088&categoryId=42088) |
| web-text.synthetic.dataset-50k    | [Link](https://huggingface.co/datasets/Cartinoe5930/web_text_synthetic_dataset_50k) |
| ์ง€์‹๊ฒฝ์ œ์šฉ์–ด์‚ฌ์ „                  | [Link](https://terms.naver.com/list.naver?cid=43668&categoryId=43668) |
| ํ•œ๊ตญ๊ฑฐ๋ž˜์†Œ ๋น„์ •๊ธฐ ๊ฐ„ํ–‰๋ฌผ          | [Link](http://open.krx.co.kr/contents/OPN04/04020000/OPN04020000.jsp#b8943a5f87282cde0d653d1ae73431c9=1) |
| ํ•œ๊ตญ๊ฑฐ๋ž˜์†Œ๊ทœ์ •                    | [Link](https://law.krx.co.kr/las/TopFrame.jsp&KRX) |
| ์ดˆ๋ณดํˆฌ์ž์ž ์ฆ๊ถŒ๋”ฐ๋ผ์žก๊ธฐ           | [Link](https://main.krxverse.co.kr/_contents/ACA/02010200/file/220104_beginner.pdf) |
| ์ฒญ์†Œ๋…„์„ ์œ„ํ•œ ์ฆ๊ถŒํˆฌ์ž            | [Link](https://main.krxverse.co.kr/_contents/ACA/02010200/file/220104_teen.pdf) |
| ๊ธฐ์—…์‚ฌ์—…๋ณด๊ณ ์„œ ๊ณต์‹œ์ž๋ฃŒ           | [Link](https://opendart.fss.or.kr/)                              |
| ์‹œ์‚ฌ๊ฒฝ์ œ์šฉ์–ด์‚ฌ์ „                  | [Link](https://terms.naver.com/list.naver?cid=43668&categoryId=43668) |

## MCQA
The MCQA data is a set of multiple-choice questions generated from the References. Reasoning text was generated along with each question and answer and added to the training data.  
About 45,000 examples were used for training, totaling about 20 million tokens as counted with tiktoken's o200k_base encoding (the gpt-4o and gpt-4o-mini tokenizer).
| ๋ฐ์ดํ„ฐ๋ช…                             | ๋ฐ์ดํ„ฐ ์ˆ˜ | ํ† ํฐ ์ˆ˜      |
|--------------------------------------|-----------|--------------|
| ํ•œ๊ตญ์€ํ–‰ ๊ฒฝ์ œ๊ธˆ์œต ์šฉ์–ด 700์„          | 1,203     | 277,114      |
| ์žฌ๋ฌดํšŒ๊ณ„ ๋ชฉ์ฐจ๋ฅผ ์ด์šฉํ•œ ํ•ฉ์„ฑ๋ฐ์ดํ„ฐ    | 451       | 99,770       |
| ๊ธˆ์œต๊ฐ๋…์šฉ์–ด์‚ฌ์ „                     | 827       | 214,297      |
| hf_web_text_synthetic_dataset_50k    | 25,461    | 7,563,529    |
| ์ง€์‹๊ฒฝ์ œ์šฉ์–ด์‚ฌ์ „                     | 2,314     | 589,763      |
| ํ•œ๊ตญ๊ฑฐ๋ž˜์†Œ ๋น„์ •๊ธฐ ๊ฐ„ํ–‰๋ฌผ             | 1,183     | 230,148      |
| ํ•œ๊ตญ๊ฑฐ๋ž˜์†Œ๊ทœ์ •                       | 3,015     | 580,556      |
| ์ดˆ๋ณดํˆฌ์ž์ž ์ฆ๊ถŒ๋”ฐ๋ผ์žก๊ธฐ              | 599       | 116,472      |
| ์ฒญ์†Œ๋…„์„ ์œ„ํ•œ ์ฆ๊ถŒ ํˆฌ์ž              | 408       | 77,037       |
| ๊ธฐ์—…์‚ฌ์—…๋ณด๊ณ ์„œ ๊ณต์‹œ์ž๋ฃŒ              | 3,574     | 629,807      |
| ์‹œ์‚ฌ๊ฒฝ์ œ์šฉ์–ด์‚ฌ์ „                     | 7,410     | 1,545,842    |
| **ํ•ฉ๊ณ„**                             | **46,445**| **19,998,931**|
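
For readers who want to produce comparable token counts on their own data, here is a minimal sketch of counting tokens with tiktoken's `o200k_base` encoding. The helper names are hypothetical, and which text fields the author actually counted (question, choices, reasoning, answer) is not stated here, so treat the input list as an assumption.

```python
from typing import Callable, Iterable, List


def load_o200k_encoder() -> Callable[[str], List[int]]:
    """Return the encode function of tiktoken's o200k_base encoding.

    Requires `pip install tiktoken`; the import is lazy so that
    count_tokens below can also be used with any other encoder.
    """
    import tiktoken

    return tiktoken.get_encoding("o200k_base").encode


def count_tokens(texts: Iterable[str], encode: Callable[[str], List[int]]) -> int:
    """Sum the number of tokens produced by `encode` over all texts."""
    return sum(len(encode(text)) for text in texts)


if __name__ == "__main__":
    # Hypothetical sample; replace with the text of each training example.
    samples = ["공매도(Short Selling)에 대한 설명으로 옳지 않은 것은 무엇입니까?"]
    print(count_tokens(samples, load_o200k_encoder()))
```

Passing the encoder in as an argument keeps the counting logic independent of the tokenizer, so the same function can report counts under the base model's own tokenizer if needed.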

## QA
The QA data consists of two kinds of answers: those generated with both the Reference and the question as input, and those generated from the question alone.  
When given a Reference, the model answers more accurately, but it draws less on its own knowledge, so answers tend to be somewhat shorter or less varied.
In total, training used about 48,000 examples and about 200 million tokens.
| ๋ฐ์ดํ„ฐ๋ช…                             | ๋ฐ์ดํ„ฐ ์ˆ˜ | ํ† ํฐ ์ˆ˜      |
|--------------------------------------|-----------|--------------|
| ํ•œ๊ตญ์€ํ–‰ ๊ฒฝ์ œ๊ธˆ์œต ์šฉ์–ด 700์„          | 1,023     | 846,970      |
| ๊ธˆ์œต๊ฐ๋…์šฉ์–ด์‚ฌ์ „                     | 4,128     | 3,181,831    |
| ์ง€์‹๊ฒฝ์ œ์šฉ์–ด์‚ฌ์ „                     | 6,526     | 5,311,890    |
| ํ•œ๊ตญ๊ฑฐ๋ž˜์†Œ ๋น„์ •๊ธฐ ๊ฐ„ํ–‰๋ฌผ             | 1,510     | 1,089,342    |
| ํ•œ๊ตญ๊ฑฐ๋ž˜์†Œ๊ทœ์ •                       | 4,858     | 3,587,059    |
| ๊ธฐ์—…์‚ฌ์—…๋ณด๊ณ ์„œ ๊ณต์‹œ์ž๋ฃŒ              | 3,574     | 629,807      |
| ์‹œ์‚ฌ๊ฒฝ์ œ์šฉ์–ด์‚ฌ์ „                     | 29,920    | 5,981,839    |
| **ํ•ฉ๊ณ„**                             | **47,965**| **199,998,931**|
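
The two input variants described above can be sketched as follows. This is only an illustration: the template wording, field names, and function names are hypothetical, and the actual generation pipeline is the one in the GitHub repository.

```python
from typing import Dict, List, Optional


def build_qa_input(question: str, reference: Optional[str] = None) -> str:
    """Build a generation input, optionally grounded in a reference passage."""
    if reference:
        return f"Reference:\n{reference.strip()}\n\nQuestion:\n{question.strip()}"
    return f"Question:\n{question.strip()}"


def build_both_variants(question: str, reference: str) -> List[Dict[str, str]]:
    """Return the reference-grounded input and the question-only input."""
    return [
        {"variant": "with_reference", "input": build_qa_input(question, reference)},
        {"variant": "question_only", "input": build_qa_input(question)},
    ]


if __name__ == "__main__":
    # Hypothetical question and reference text.
    for item in build_both_variants(
        "What is short selling?",
        "Short selling is the sale of a security the seller has borrowed.",
    ):
        print(item["variant"])
        print(item["input"])
        print("-" * 40)
```

Generating an answer for each variant gives one grounded, more precise answer and one freer answer per question, which matches the trade-off between accuracy and variety noted above.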

# Citation
If you use or cite this model, please credit the source.
```bibtex
@misc{jaylee2024krxllmcompetition,
  author = {jay lee},
  title = {shibainu24: krx llm competition llm model},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  url = {https://github.com/aiqwe/krx-llm-competition}
}
```