
Full Data Inventory: Measured Survey Results

Survey date: 2026-02-27 | Total disk usage: 195GB


1. Pretrain data (.bin files): ready to use

| File | Size | Estimated tokens | Notes |
|---|---|---|---|
| korean_train.bin | 17GB | 8.9B | Merged (c4 + wiki + namuwiki) |
| korean_val.bin | 35MB | 17.9M | Merged val |
| korean_c4_train.bin | 15GB | 7.5B | Korean C4 |
| korean_c4_val.bin | 29MB | 15.2M | |
| korean_namuwiki_train.bin | 2.1GB | 1.1B | Namuwiki |
| korean_namuwiki_val.bin | 4.2MB | 2.2M | |
| korean_wiki_train.bin | 500MB | 261.8M | Korean Wikipedia |
| korean_wiki_val.bin | 1.1MB | 524K | |
| train.bin | 1.2GB | 605M | English wiki (Shakespeare, etc.) |
| val.bin | 5.8MB | 3.0M | |
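The sizes and token counts in the table are consistent with roughly 2 bytes per token (for example, 261.8M tokens at 2 bytes each is about 500MB). The sketch below estimates a token count from file size under that assumption. The 16-bit storage format is inferred from the numbers, not stated in this report, and `estimate_tokens` is a hypothetical helper name.

```python
import os

# Assumption: each token ID is stored as an unsigned 16-bit integer (2 bytes).
BYTES_PER_TOKEN = 2


def estimate_tokens(path: str, bytes_per_token: int = BYTES_PER_TOKEN) -> int:
    """Estimate how many tokens a flat binary token file holds, from its size."""
    return os.path.getsize(path) // bytes_per_token
```

If the files turn out to use a wider integer type, pass `bytes_per_token=4` instead.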

Pretrain token totals

  • korean_train.bin (merged): 8.9B tokens ← merge of C4 + Wiki + Namuwiki
  • Sum of the individual files (c4 7.5B + wiki 0.26B + namuwiki 1.1B = 8.86B) → matches the merged file
  • English train.bin: 605M tokens
  • ⚠️ korean_train.bin is a merge of the individual .bin files, so take care not to count them twice
  • Non-duplicated Pretrain total: ~9.5B tokens (Korean 8.9B + English 0.6B)
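The double-counting caveat can be checked with a few lines of arithmetic. This is a minimal sketch using the rounded figures from this section; the variable names are illustrative only.

```python
# Token counts in billions, copied from the figures listed in this section.
parts = {"c4": 7.5, "wiki": 0.26, "namuwiki": 1.1}
merged_korean = 8.9
english = 0.605

# The individual Korean files should add up to roughly the merged file.
parts_sum = sum(parts.values())
relative_gap = abs(parts_sum - merged_korean) / merged_korean

# Count the merged file once; do not add the individual files on top of it.
non_dup_total = merged_korean + english
double_counted = non_dup_total + parts_sum

print(f"sum of parts: {parts_sum:.2f}B (gap vs merged: {relative_gap:.1%})")
print(f"non-duplicated total: {non_dup_total:.2f}B")
print(f"if double counted: {double_counted:.2f}B")
```

The small gap between the sum of the parts and the merged figure is within the rounding of the listed numbers.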

2. korean_extra (HuggingFace downloads): processing required

| Directory | Size | Format | Estimated tokens |
|---|---|---|---|
| culturax_ko | 60GB | parquet | ~15B+ |
| hplt_ko | 23GB | parquet | ~6B |
| cc100_ko | 14GB | parquet/txt | ~3.5B |
| oscar_ko | 9.2GB | parquet | ~2.3B |
| korean_textbooks | 6.4GB | parquet | ~1.6B |
| korean_webtext | 4.2GB | parquet | ~1B |
| finepdfs_edu_ko | 2.9GB | parquet | ~700M |
| namuwiki_extracted | 2.2GB | parquet | ~550M |
| wikipedia_korean | 1.7GB | parquet | ~400M |
| kovast | 449MB | parquet | ~110M |
| evol_instruct_ko | 144MB | parquet/json | ~35M (for SFT) |
| korean_safe_conv | 51MB | parquet/json | ~12M (for SFT) |

korean_extra total: ~123GB, estimated ~30B+ tokens (before tokenization, based on the raw text)
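As a cross-check on the ~30B+ figure, the per-directory estimates can be summed and split into pretrain and SFT sources. This sketch only restates the table's estimates; "~15B+" is treated as 15, so the result is a lower bound.

```python
# Estimated tokens in billions, copied from the korean_extra table.
# "~15B+" for culturax_ko is treated as 15.0, so totals are lower bounds.
extra_tokens_b = {
    "culturax_ko": 15.0,
    "hplt_ko": 6.0,
    "cc100_ko": 3.5,
    "oscar_ko": 2.3,
    "korean_textbooks": 1.6,
    "korean_webtext": 1.0,
    "finepdfs_edu_ko": 0.7,
    "namuwiki_extracted": 0.55,
    "wikipedia_korean": 0.4,
    "kovast": 0.11,
    "evol_instruct_ko": 0.035,
    "korean_safe_conv": 0.012,
}
SFT_SETS = {"evol_instruct_ko", "korean_safe_conv"}

sft_b = sum(v for k, v in extra_tokens_b.items() if k in SFT_SETS)
pretrain_b = sum(v for k, v in extra_tokens_b.items() if k not in SFT_SETS)
total_b = sft_b + pretrain_b

print(f"pretrain sources: {pretrain_b:.2f}B, SFT sources: {sft_b * 1000:.0f}M")
print(f"total: {total_b:.2f}B")
```

Since these are estimates on untokenized text, the actual counts will depend on the tokenizer used.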


3. SFT data: ready to use

| File | Size | Samples |
|---|---|---|
| sft/train.jsonl | 276MB | 161,848 |
| sft/val.jsonl | 15MB | 8,518 |
  • Total SFT samples: 170,366
  • Format: instruction/output pairs, Korean-translated data
  • Quality: good (natural Korean, diverse topics)
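The sample counts and the instruction/output format can be re-verified with a short script. This is a minimal sketch that assumes each JSONL line is an object with literal `instruction` and `output` keys. The report names the fields but does not show a record, so the key names are an assumption, and `count_sft_samples` is a hypothetical helper.

```python
import json


def count_sft_samples(path, required=("instruction", "output")):
    """Count JSONL records and how many are malformed or lack a required field.

    Assumption: the required key names match the format described in the report.
    """
    total = 0
    invalid = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines instead of counting them
            total += 1
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                invalid += 1
                continue
            # Empty or missing fields count as invalid.
            if not isinstance(record, dict) or any(not record.get(k) for k in required):
                invalid += 1
    return total, invalid
```

Calling `count_sft_samples("sft/train.jsonl")` should return a total that matches the count listed for that file if the file is unchanged.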

4. Raw text data: already converted to .bin

| Directory | Size | Files | Notes |
|---|---|---|---|
| raw/c4_ko/ | 30GB | 50 txt | → converted to korean_c4_train.bin |
| raw/namuwiki_ko/ | 5.7GB | 6 txt | → converted to korean_namuwiki_train.bin |
| raw/ko_wiki_*.txt | 1.2GB | 5 txt | → converted to korean_wiki_train.bin |
| raw/en_wiki_*.txt | 1.2GB | 3 txt | → converted to train.bin |
| raw total | 38GB | 64 | Can be deleted (saves disk space) |

5. Overall summary

Ready to use

| Purpose | Data | Scale |
|---|---|---|
| Pretrain | korean_train.bin + train.bin | 9.5B tokens |
| SFT | sft/train.jsonl | 161,848 samples |

Available after processing

| Source | Estimated scale | Required work |
|---|---|---|
| korean_extra (all) | ~30B+ tokens | Tokenize → convert to .bin |
| evol_instruct_ko + korean_safe_conv | ~47M tokens (SFT) | Convert to JSONL |
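The "tokenize → convert to .bin" step could look roughly like the sketch below. It makes two loud assumptions: the .bin layout is a flat stream of 16-bit unsigned token IDs in native byte order (inferred from the size-to-token ratios in this report, not confirmed), and `toy_tokenize` is a hypothetical stand-in, not the project's tokenizer. Reading the parquet sources is left out.

```python
from array import array


def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: one ID per UTF-8 byte (0 to 255)."""
    return list(text.encode("utf-8"))


def write_token_bin(texts, out_path, tokenize=toy_tokenize):
    """Tokenize each text and write its IDs to out_path as 16-bit unsigned ints.

    Returns the total number of tokens written. IDs above 65535 raise
    OverflowError, so a larger vocabulary would need a wider integer type.
    """
    total = 0
    with open(out_path, "wb") as f:
        for text in texts:
            ids = array("H", tokenize(text))  # "H" = unsigned short
            ids.tofile(f)  # streamed per document, so memory use stays small
            total += len(ids)
    return total
```

In practice `tokenize` would be replaced by the encode function of the tokenizer used for the existing .bin files, so that old and new data share one vocabulary.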

Possible disk savings

  • raw/ 38GB → already converted to .bin, can be deleted
  • Individual .bin files (c4/wiki/namuwiki) → redundant after the merge into korean_train.bin, can be deleted (~18GB)
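Before deleting anything, it may be worth confirming that each converted .bin actually exists and is non-empty. The sketch below uses the raw-to-.bin mapping from section 4. The `data_dir` layout (with the .bin files directly under one directory) and the helper name `deletable_raw_dirs` are assumptions, and the glob-style wiki entries are omitted for brevity.

```python
import os

# Raw source -> converted .bin, taken from the raw data table in section 4.
# The glob entries (raw/ko_wiki_*.txt, raw/en_wiki_*.txt) are omitted here.
RAW_TO_BIN = {
    "raw/c4_ko": "korean_c4_train.bin",
    "raw/namuwiki_ko": "korean_namuwiki_train.bin",
}


def deletable_raw_dirs(data_dir, mapping=RAW_TO_BIN):
    """Return raw sources whose converted .bin exists and is non-empty.

    Assumption: the .bin files sit directly under data_dir.
    """
    ok = []
    for raw, bin_name in mapping.items():
        bin_path = os.path.join(data_dir, bin_name)
        if os.path.isfile(bin_path) and os.path.getsize(bin_path) > 0:
            ok.append(raw)
    return ok
```

The function only reports candidates and deletes nothing, so the final removal stays a manual decision.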

Final potential

  • Pretrain: current 9.5B + korean_extra 30B+ = ~40B tokens obtainable
  • SFT: current 162K + additional conversion = ~200K+ samples possible