Tiny-Ko-Stories-35M

Tiny-Ko-Stories-35M is a 35M-class Korean continuation LM trained on the Tiny-Ko-Stories corpus.

This is not an instruction-tuned assistant. It continues a story from an opening sentence.

Tiny-Ko-Stories is not a translation of the English TinyStories dataset. The stories were generated and refined directly in Korean so that the corpus carries natural Korean names, sentence rhythm, onomatopoeic and mimetic words (의성어, 의태어), color terms, and small problem-and-resolution story structures.

This repository provides a custom PyTorch checkpoint, model definition, tokenizer, and generation script.

Model Summary

| Field | Value |
|---|---|
| Parameters | 34,217,856 |
| Architecture | Decoder-only Transformer |
| Layers | 10 |
| Hidden size | 384 |
| Attention heads | 6 |
| KV heads | 2 |
| FFN dimension | 1,536 |
| Context length | 512 tokens |
| Vocabulary size | 32,768 |
| Embeddings | Tied input/output embeddings |
| Language | Korean |
| License | MIT |
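
As a sanity check, the parameter count can be reproduced from the other values in the model summary. The sketch below is a reconstruction under assumptions, not a reading of model.py: it assumes bias-free linear layers, grouped-query attention in which the 2 KV heads use the same 64-dimensional head size as the 6 query heads (384 / 6), a gated FFN with three weight matrices, scale-only normalization (two per layer plus one final norm), and no learned positional embeddings. The function name `count_parameters` is hypothetical.

```python
def count_parameters(vocab=32_768, hidden=384, layers=10,
                     heads=6, kv_heads=2, ffn=1_536):
    """Estimate the parameter count under the assumptions stated above."""
    head_dim = hidden // heads            # 64 dimensions per head
    kv_dim = kv_heads * head_dim          # 128: K and V are narrower than Q

    embedding = vocab * hidden            # counted once: input/output are tied
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim   # Q, O + K, V
    feed_forward = 3 * hidden * ffn       # assumed gated FFN: gate, up, down
    norms = 2 * hidden                    # assumed scale-only norms per layer
    per_layer = attention + feed_forward + norms
    final_norm = hidden

    return embedding + layers * per_layer + final_norm


print(f"{count_parameters():,}")
```

Under these assumptions the total comes to 34,217,856, matching the reported figure. The tied embedding matrix alone accounts for 12,582,912 of those parameters, about 37% of the model.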

Training Summary

| Field | Value |
|---|---|
| Training corpus | Tiny-Ko-Stories |
| Stories | 2,003,542 |
| Tokenizer | 32K Korean tokenizer |
| Training tokens | about 794.8M |
| Training epochs | about 4.0 |
| Selected checkpoint | validation-best checkpoint |
| Best validation loss | 2.1168 |
| Best validation perplexity | 8.3042 |
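
The training figures are related by simple arithmetic, which the short sketch below works through. It assumes the validation loss is a mean per-token cross-entropy in nats, so that perplexity is its exponential, and it treats the token and epoch counts as the approximate values given in the training summary. The per-story estimate is rough: the story count may include held-out validation stories.

```python
import math

val_loss = 2.1168          # best validation loss
total_tokens = 794.8e6     # approximate tokens seen during training
epochs = 4.0               # approximate number of passes over the corpus
stories = 2_003_542        # stories in the corpus

# Perplexity is exp(loss) when the loss is per-token cross-entropy in nats.
perplexity = math.exp(val_loss)

# Tokens in one pass over the corpus, and the implied average story length.
tokens_per_epoch = total_tokens / epochs
tokens_per_story = tokens_per_epoch / stories

print(f"perplexity       ~ {perplexity:.2f}")
print(f"tokens per epoch ~ {tokens_per_epoch / 1e6:.1f}M")
print(f"tokens per story ~ {tokens_per_story:.0f}")
```

The computed perplexity agrees with the reported 8.3042 to within the rounding of the loss, and the implied average story length of roughly 100 tokens fits comfortably inside the 512-token context.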

Files

  • tiny-ko-stories-35m.pt: model checkpoint
  • model.py: model definition
  • generate.py: generation script
  • tokenizer.json: tokenizer
  • vocab.json: tokenizer vocabulary
  • model_config.json: model architecture summary
  • training_summary.json: compact training summary
  • tokenizer_summary.json: compact tokenizer summary
  • generation_config.json: sample generation presets
  • requirements.txt: minimal Python requirements
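
If the files are downloaded individually, a quick presence check can catch a missing artifact before generate.py fails on it. This is a minimal sketch using only the standard library; the helper name `missing_files` is hypothetical and the file list is copied from the list above.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = [
    "tiny-ko-stories-35m.pt",
    "model.py",
    "generate.py",
    "tokenizer.json",
    "vocab.json",
    "model_config.json",
    "training_summary.json",
    "tokenizer_summary.json",
    "generation_config.json",
    "requirements.txt",
]


def missing_files(repo_dir="."):
    """Return the expected file names that are not present in repo_dir."""
    root = Path(repo_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    print("all files present" if not missing else f"missing: {missing}")
```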

Usage

The command below uses the balanced preset from generation_config.json. It is a reasonable starting point for manual testing, not a proven optimal decoding setting.

python generate.py \
  --checkpoint tiny-ko-stories-35m.pt \
  --tokenizer tokenizer.json \
  --device cpu \
  --prompt "작은 마을에 조용한 아침이 찾아왔어요." \
  --temperature 0.55 \
  --top-p 0.9 \
  --top-k 40 \
  --repetition-penalty 1.08 \
  --max-new-tokens 180 \
  --seed 42

Other example prompts:

  • 민지는 아침에 작은 노란 우산을 들고 마당으로 나갔어요. (Minji went out to the yard in the morning holding a small yellow umbrella.)
  • 서준이는 울긋불긋한 잎사귀 하나를 찾았어요. (Seojun found a brightly colored leaf.)

Example output:

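작은 마을에 조용한 아침이 찾아왔어요. 서준이는 오늘 친구들과 함께 놀기로 했어요. 그런데 아침에 신은 양말 색깔이 서로 달라서 조금 부끄러웠어요.

서준이는 양말을 숨기려고 발을 높이 들어 올렸어요. 하지만 친구들이 다가오자 가슴이 콩닥콩닥 뛰었어요. 서준이는 용기를 내어 양말이 다르다고 솔직하게 말했어요.

친구들은 오히려 알록달록한 양말이 멋지다며 칭찬해 주었어요. 서준이는 기분이 좋아져서 친구들에게 맛있는 간식을 나누어 주었어요. 모두 함께 웃으며 즐거운 아침을 보냈어요.

이제 서준이는 짝짝이 양말이 정말 마음에 들어요. 친구들과 손을 잡고 밖으로 나가 신나게 뛰놀았어요. 서준이의 얼굴에 자신감 넘치는 미소가 가득해요.

English translation of the output, for reference:

A quiet morning came to the small village. Seojun had planned to play with his friends today. But the socks he had put on that morning were different colors, so he felt a little embarrassed.

Seojun lifted his feet up high to hide his socks. But as his friends came closer, his heart went pit-a-pat. Seojun gathered his courage and said honestly that his socks did not match.

His friends praised the colorful socks instead, saying they looked great. Feeling better, Seojun shared tasty snacks with his friends. They all laughed together and had a happy morning.

Now Seojun really likes his mismatched socks. He held hands with his friends and went outside to run and play. A confident smile fills Seojun's face.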
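
The decoding flags in the usage command (--temperature, --top-k, --top-p, --repetition-penalty) correspond to standard sampling controls. The sketch below shows one common way these controls combine, in plain Python over a list of logits. It is an illustration under assumptions, not the code in generate.py: the function names `filter_logits` and `sample_next` are hypothetical, and the order of operations (penalty, temperature, top-k, top-p) is one widespread convention that the actual script may not follow.

```python
import math
import random


def filter_logits(logits, generated, temperature=0.55, top_k=40,
                  top_p=0.9, repetition_penalty=1.08):
    """Return {token_id: probability} after applying the four controls."""
    scores = list(logits)

    # Repetition penalty: push down tokens that were already generated.
    for tok in set(generated):
        s = scores[tok]
        scores[tok] = s / repetition_penalty if s > 0 else s * repetition_penalty

    # Temperature below 1 sharpens the distribution.
    scores = [s / temperature for s in scores]

    # Top-k: keep only the k highest-scoring tokens, best first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = ranked[:top_k]

    # Softmax over the survivors (shifted by the max for stability).
    peak = scores[ranked[0]]
    weights = [math.exp(scores[i] - peak) for i in ranked]
    total = sum(weights)

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i, w in zip(ranked, weights):
        p = w / total
        kept.append((i, p))
        cumulative += p
        if cumulative >= top_p:
            break

    norm = sum(p for _, p in kept)
    return {i: p / norm for i, p in kept}


def sample_next(distribution, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(distribution)
    return rng.choices(ids, weights=[distribution[i] for i in ids], k=1)[0]


# Toy example with a five-token vocabulary; token 0 was already generated.
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
dist = filter_logits(toy_logits, generated=[0])
print(sample_next(dist, random.Random(42)) in dist)  # True
```

Lower temperature and smaller top-k or top-p make continuations more predictable, while a repetition penalty slightly above 1 discourages loops.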

Intended Uses

  • Small Korean language model experiments
  • Short Korean story generation experiments
  • Continuation generation in a child-friendly story domain
  • Comparative studies on Korean tokenizers and small LMs
  • Educational NLP demos

Out-of-Scope Uses

  • Factual question answering
  • General-purpose Korean knowledge modeling
  • Direct use in safety-critical services
  • Direct use as children's reading material without human review

Known Limitations

  • Some outputs may have simple structures or similar moral endings.
  • Awkward names, unusual proper nouns, weak causal links, or repeated phrases may appear.
  • Some outputs may contain unnatural Korean or awkward sentences.
  • Outputs should be reviewed depending on the use case.
  • The included generation presets are starting points, not optimal settings.
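
Because repeated phrases are one of the listed failure modes, a rough automatic signal can help decide which outputs to review first. The sketch below computes the share of word trigrams that occur more than once in a text. The helper name `repeated_ngram_ratio` is hypothetical, whitespace splitting is a simplification that treats each space-delimited Korean word unit as one token, and any threshold for flagging an output would need to be chosen by inspection. It does not replace human review.

```python
from collections import Counter


def repeated_ngram_ratio(text, n=3):
    """Fraction of word n-grams in text that appear more than once."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


# "a b c" occurs twice among the four trigrams of this toy string.
print(repeated_ngram_ratio("a b c a b c"))  # 0.5
```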

License

The model files and accompanying code are released under the MIT License.
