Data Asset Inventory: Measured Results
Survey date: 2026-02-27 | Total disk usage: 195GB
1. Pretrain data (.bin files): ready to use
| File | Size | Estimated tokens | Notes |
|---|---|---|---|
| korean_train.bin | 17GB | 8.9B | Combined (c4 + wiki + namuwiki merge) |
| korean_val.bin | 35MB | 17.9M | Combined val |
| korean_c4_train.bin | 15GB | 7.5B | Korean C4 |
| korean_c4_val.bin | 29MB | 15.2M | |
| korean_namuwiki_train.bin | 2.1GB | 1.1B | Namuwiki |
| korean_namuwiki_val.bin | 4.2MB | 2.2M | |
| korean_wiki_train.bin | 500MB | 261.8M | Korean Wikipedia |
| korean_wiki_val.bin | 1.1MB | 524K | |
| train.bin | 1.2GB | 605M | English wiki (Shakespeare, etc.) |
| val.bin | 5.8MB | 3.0M | |
Pretrain token totals
- korean_train.bin (combined): 8.9B tokens, the merge of C4 + Wiki + Namuwiki
- Sum of the individual files (c4 7.5B + wiki 0.26B + namuwiki 1.1B = 8.86B) ≈ matches the combined file
- English train.bin: 605M tokens
- ⚠️ korean_train.bin is a merge of the individual .bin files, so take care not to double count
- Non-duplicated pretrain total: ~9.5B tokens (Korean 8.9B + English 0.6B)
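The sizes and token counts in the table work out to about 2 bytes per token (for example, 35MB for 17.9M tokens), which would fit uint16 token IDs. That storage format is inferred from the numbers, not stated in the survey. Under that assumption, a minimal sketch of estimating tokens from file size and summing without double counting the merged sources (function names are illustrative):

```python
from pathlib import Path

# Assumption: each token is stored as a 2-byte (uint16) ID.
BYTES_PER_TOKEN = 2


def estimate_tokens(path, bytes_per_token=BYTES_PER_TOKEN):
    """Estimate the token count of a .bin file from its size on disk."""
    return Path(path).stat().st_size // bytes_per_token


def non_duplicated_total(token_counts, already_merged):
    """Sum token counts, skipping files whose content lives in a merged file."""
    return sum(n for name, n in token_counts.items() if name not in already_merged)


# Token counts as reported in the survey.
counts = {
    "korean_train.bin": 8_900_000_000,
    "korean_c4_train.bin": 7_500_000_000,
    "korean_namuwiki_train.bin": 1_100_000_000,
    "korean_wiki_train.bin": 261_800_000,
    "train.bin": 605_000_000,
}
merged_into_korean_train = {
    "korean_c4_train.bin",
    "korean_namuwiki_train.bin",
    "korean_wiki_train.bin",
}

total = non_duplicated_total(counts, merged_into_korean_train)
print(f"non-duplicated pretrain total: {total:,} tokens")
```

Only korean_train.bin and train.bin contribute to the sum, which reproduces the ~9.5B figure.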
2. korean_extra (HuggingFace downloads): needs processing
| Directory | Size | Format | Estimated tokens |
|---|---|---|---|
| culturax_ko | 60GB | parquet | ~15B+ |
| hplt_ko | 23GB | parquet | ~6B |
| cc100_ko | 14GB | parquet/txt | ~3.5B |
| oscar_ko | 9.2GB | parquet | ~2.3B |
| korean_textbooks | 6.4GB | parquet | ~1.6B |
| korean_webtext | 4.2GB | parquet | ~1B |
| finepdfs_edu_ko | 2.9GB | parquet | ~700M |
| namuwiki_extracted | 2.2GB | parquet | ~550M |
| wikipedia_korean | 1.7GB | parquet | ~400M |
| kovast | 449MB | parquet | ~110M |
| evol_instruct_ko | 144MB | parquet/json | ~35M (for SFT) |
| korean_safe_conv | 51MB | parquet/json | ~12M (for SFT) |

korean_extra total: ~123GB, estimated ~30B+ tokens (not yet tokenized; estimated from the raw text)
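The per-directory estimates above are consistent with roughly 4 bytes of on-disk data per token (60GB to ~15B, 14GB to ~3.5B). That ratio is inferred from the table, not stated in the survey. Parquet is compressed, so real token counts after tokenization may differ noticeably. A rough sketch of the heuristic, with illustrative names:

```python
import os

# Assumption: ~4 bytes on disk per token, the ratio implied by the table.
RAW_BYTES_PER_TOKEN = 4.0


def directory_size_bytes(root):
    """Total size in bytes of all files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total


def estimate_raw_tokens(size_bytes, bytes_per_token=RAW_BYTES_PER_TOKEN):
    """Very rough token estimate for untokenized data of the given size."""
    return int(size_bytes / bytes_per_token)
```

For instance, `estimate_raw_tokens(directory_size_bytes("culturax_ko"))` would give a first-order figure for one directory, assuming that path exists relative to the working directory.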
3. SFT data: ready to use
| File | Size | Samples |
|---|---|---|
| sft/train.jsonl | 276MB | 161,848 |
| sft/val.jsonl | 15MB | 8,518 |

- Total SFT samples: 170,366
- Format: instruction/output pairs, Korean-translated data
- Quality: good (natural Korean, diverse topics)
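The sample counts can be re-checked by scanning the JSONL files line by line. The sketch below assumes each line is a JSON object with string fields named `instruction` and `output`. Those key names follow the format description but have not been checked against the files.

```python
import json

# Assumption: field names match the "instruction/output pairs" description.
REQUIRED_KEYS = ("instruction", "output")


def count_samples(path, required_keys=REQUIRED_KEYS):
    """Return (valid, invalid) record counts for a JSONL file."""
    valid = invalid = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                invalid += 1
                continue
            ok = isinstance(record, dict) and all(
                isinstance(record.get(key), str) and record[key].strip()
                for key in required_keys
            )
            if ok:
                valid += 1
            else:
                invalid += 1
    return valid, invalid
```

Running it on sft/train.jsonl and sft/val.jsonl and adding the valid counts should give the 170,366 total if every record is well formed.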
4. Raw text data: already converted to .bin
| Directory | Size | Files | Notes |
|---|---|---|---|
| raw/c4_ko/ | 30GB | 50 txt | → converted to korean_c4_train.bin |
| raw/namuwiki_ko/ | 5.7GB | 6 txt | → converted to korean_namuwiki_train.bin |
| raw/ko_wiki_*.txt | 1.2GB | 5 txt | → converted to korean_wiki_train.bin |
| raw/en_wiki_*.txt | 1.2GB | 3 txt | → converted to train.bin |
| raw total | 38GB | 64 | Can be deleted (saves disk space) |
5. Overall summary
Ready to use
| Purpose | Data | Scale |
|---|---|---|
| Pretrain | korean_train.bin + train.bin | 9.5B tokens |
| SFT | sft/train.jsonl | 161,848 samples |
Available after processing
| Source | Estimated scale | Required work |
|---|---|---|
| korean_extra (all) | ~30B+ tokens | Tokenize → convert to .bin |
| evol_instruct_ko + korean_safe_conv | ~47M tokens (SFT) | Convert to JSONL |
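The "tokenize → convert to .bin" step can be sketched as below. It assumes the target format is a flat stream of uint16 token IDs, which is inferred from the existing file sizes. The `encode` callable and the optional `eot_id` separator are placeholders for the project's tokenizer, not its actual API. Reading the parquet files is left out.

```python
from array import array


def write_token_bin(texts, encode, out_path, eot_id=None):
    """Encode each text and write the IDs to out_path as native-endian uint16.

    `encode` is any callable mapping str -> iterable of int token IDs
    (hypothetical; substitute the real tokenizer). Returns the token count.
    """
    total = 0
    with open(out_path, "wb") as f:
        for text in texts:
            ids = list(encode(text))
            if eot_id is not None:
                ids.append(eot_id)  # optional document separator
            if any(i < 0 or i > 0xFFFF for i in ids):
                raise ValueError("token id does not fit in uint16")
            array("H", ids).tofile(f)
            total += len(ids)
    return total
```

A vocabulary larger than 65,536 entries would not fit in this layout and would need a wider integer type.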
Possible disk savings
- raw/ (38GB): already converted to .bin, can be deleted
- Individual .bin files (c4/wiki/namuwiki): redundant after the korean_train.bin merge, can be deleted (~18GB)
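Before deleting anything, it is worth confirming that each converted target actually exists and is non-empty. The sketch below only reports candidates and never deletes. The source-to-target mapping follows the tables in this survey. The `data_root` layout is an assumption.

```python
from pathlib import Path

# Source -> converted/merged target, as listed in this survey.
CONVERTED_TO = {
    "raw/c4_ko": "korean_c4_train.bin",
    "raw/namuwiki_ko": "korean_namuwiki_train.bin",
    "raw/ko_wiki_*.txt": "korean_wiki_train.bin",
    "raw/en_wiki_*.txt": "train.bin",
    "korean_c4_train.bin": "korean_train.bin",
    "korean_namuwiki_train.bin": "korean_train.bin",
    "korean_wiki_train.bin": "korean_train.bin",
}


def deletion_candidates(data_root, mapping=CONVERTED_TO):
    """List sources whose target file exists and is non-empty under data_root."""
    root = Path(data_root)
    candidates = []
    for source, target in mapping.items():
        target_path = root / target
        if target_path.is_file() and target_path.stat().st_size > 0:
            candidates.append(source)
    return candidates
```

An existence check does not prove the conversion was complete, so comparing token counts against the tables before removing data would be a sensible extra step.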
Final potential
- Pretrain: current 9.5B + korean_extra 30B+ = ~40B tokens attainable
- SFT: current 162K + additional conversions = ~200K+ samples possible