Tiny-Ko-Stories-35M
Tiny-Ko-Stories-35M is a 35M-class Korean continuation LM trained on the Tiny-Ko-Stories corpus.
This is not an instruction-tuned assistant. It continues a story from an opening sentence.
Tiny-Ko-Stories is not a translation of the English TinyStories dataset. It was generated and refined directly in Korean to include Korean names, sentence rhythm, mimetic and ideophonic words, color expressions, and small problem-resolution story structures.
This repository provides a custom PyTorch checkpoint, model definition, tokenizer, and generation script.
Model Summary
| Field | Value |
|---|---|
| Parameters | 34,217,856 |
| Architecture | Decoder-only Transformer |
| Layers | 10 |
| Hidden size | 384 |
| Attention heads | 6 |
| KV heads | 2 |
| FFN dimension | 1,536 |
| Context length | 512 tokens |
| Vocabulary size | 32,768 |
| Embedding | tied input/output embeddings |
| Language | Korean |
| License | MIT |
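The parameter count can be cross-checked from the other rows of the table. The sketch below is not taken from model.py. It assumes a layout the card does not state: no bias terms, a three-matrix gated feed-forward block, two gain-only normalization vectors per layer plus one final norm, and no learned position embeddings. Under those assumptions the arithmetic reproduces the reported figure exactly, which suggests, but does not prove, that the model is built this way.

```python
# Assumed layout (not stated in the model card): bias-free linear layers,
# a gated FFN with three weight matrices, gain-only norms, and no learned
# position-embedding table.
VOCAB, HIDDEN, LAYERS = 32_768, 384, 10
HEADS, KV_HEADS, FFN = 6, 2, 1_536

head_dim = HIDDEN // HEADS              # 64 dimensions per attention head
kv_dim = KV_HEADS * head_dim            # 128: each KV head serves 3 query heads

embedding = VOCAB * HIDDEN              # counted once because input/output are tied
attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim   # (Wq, Wo) + (Wk, Wv)
ffn = 3 * HIDDEN * FFN                  # gate, up and down projections (assumed)
norms = 2 * HIDDEN                      # two gain vectors per layer (assumed)

per_layer = attention + ffn + norms
total = embedding + LAYERS * per_layer + HIDDEN   # + final norm gain (assumed)

print(f"{total:,}")  # 34,217,856
```

With 6 attention heads and 2 KV heads, the key and value projections are a third of the width of the query projection, which is where most of the attention-side savings come from.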
Training Summary
| Field | Value |
|---|---|
| Training corpus | Tiny-Ko-Stories |
| Stories | 2,003,542 |
| Tokenizer | 32K Korean tokenizer |
| Training tokens | about 794.8M tokens |
| Training epochs | about 4.0 epochs |
| Selected checkpoint | validation-best checkpoint |
| Best validation loss | 2.1168 |
| Best validation perplexity | 8.3042 |
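The figures in this table are related by simple arithmetic. The sketch below checks that the reported perplexity is the exponential of the reported loss, and derives two rough ratios. It assumes the loss is a mean per-token cross-entropy in nats and that one epoch is one full pass over all stories; neither assumption is stated in the card.

```python
import math

val_loss = 2.1168            # reported best validation loss (assumed to be in nats)
train_tokens = 794.8e6       # reported, approximate
epochs = 4.0                 # reported, approximate
stories = 2_003_542
params = 34_217_856

# Perplexity is the exponential of the mean per-token cross-entropy.
perplexity = math.exp(val_loss)

tokens_per_epoch = train_tokens / epochs
tokens_per_story = tokens_per_epoch / stories    # assumes one epoch = all stories once
tokens_per_param = train_tokens / params

print(f"perplexity       ~ {perplexity:.2f}")
print(f"tokens per story ~ {tokens_per_story:.0f}")
print(f"tokens per param ~ {tokens_per_param:.0f}")
```

The recomputed perplexity agrees with the reported 8.3042 to about three decimal places; the small remaining gap is consistent with the loss having been rounded to four decimals. The other two ratios come out at roughly 99 tokens per story and roughly 23 training tokens per parameter.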
Files
- tiny-ko-stories-35m.pt: model checkpoint
- model.py: model definition
- generate.py: generation script
- tokenizer.json: tokenizer
- vocab.json: tokenizer vocabulary
- model_config.json: model architecture summary
- training_summary.json: compact training summary
- tokenizer_summary.json: compact tokenizer summary
- generation_config.json: sample generation presets
- requirements.txt: minimal Python requirements
Usage
The command below uses the balanced preset from generation_config.json. It is a reasonable starting point for manual testing, not a proven optimal decoding setting.
python generate.py \
--checkpoint tiny-ko-stories-35m.pt \
--tokenizer tokenizer.json \
--device cpu \
--prompt "작은 마을에 조용한 아침이 찾아왔어요." \
--temperature 0.55 \
--top-p 0.9 \
--top-k 40 \
--repetition-penalty 1.08 \
--max-new-tokens 180 \
--seed 42
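For readers unfamiliar with the decoding flags, the sketch below shows one common way that a repetition penalty, temperature, top-k and top-p are combined when picking the next token. It is not the code in generate.py: the function name `sample_next`, the order of the filters, and the toy logits are all assumptions made for illustration, and the real script may apply them differently.

```python
import math
import random

def sample_next(logits, generated, temperature=0.55, top_k=40, top_p=0.9,
                repetition_penalty=1.08, rng=random):
    """Pick one token id from raw logits (illustrative filter order, assumed)."""
    logits = list(logits)
    # Repetition penalty: push down the score of tokens already generated.
    for tok in set(generated):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty
    # Temperature below 1 sharpens the distribution.
    logits = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring token ids.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = logits[ranked[0]]
    weights = [math.exp(logits[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in zip(ranked, probs):
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    tokens, kept_probs = zip(*kept)
    # random.choices renormalizes the remaining weights before sampling.
    return rng.choices(tokens, weights=kept_probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]   # hypothetical 5-token vocabulary
print(sample_next(toy_logits, generated=[0], rng=random.Random(42)))
```

Lower temperature, smaller top-k and smaller top-p all make output more predictable; a repetition penalty slightly above 1 discourages loops without forbidding repeated words outright.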
Other example prompts:
- 민지는 아침에 작은 노란 우산을 들고 마당으로 나갔어요.
- 서준이는 울긋불긋한 잎사귀 하나를 찾았어요.
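To try several opening sentences with the same settings, the command line can be assembled programmatically. The sketch below only builds and prints the commands; `build_command` is a hypothetical helper, and the flag names and values are copied from the usage command in this section rather than read from generation_config.json.

```python
import shlex

def build_command(prompt, seed=42):
    """Return the argv list for one generate.py run (hypothetical helper)."""
    return [
        "python", "generate.py",
        "--checkpoint", "tiny-ko-stories-35m.pt",
        "--tokenizer", "tokenizer.json",
        "--device", "cpu",
        "--prompt", prompt,
        "--temperature", "0.55",
        "--top-p", "0.9",
        "--top-k", "40",
        "--repetition-penalty", "1.08",
        "--max-new-tokens", "180",
        "--seed", str(seed),
    ]

prompts = [
    "작은 마을에 조용한 아침이 찾아왔어요.",
    "민지는 아침에 작은 노란 우산을 들고 마당으로 나갔어요.",
]

for prompt in prompts:
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(build_command(prompt)))
```

Passing the list to `subprocess.run(cmd, check=True)` from the repository directory would execute each run; building the argument list first avoids shell-quoting problems with Korean text.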
Example output:
작은 마을에 조용한 아침이 찾아왔어요. 서준이는 오늘 친구들과 함께 놀기로 했어요. 그런데 아침에 신은 양말 색깔이 서로 달라서 조금 부끄러웠어요.
서준이는 양말을 숨기려고 발을 높이 들어 올렸어요. 하지만 친구들이 다가오자 가슴이 콩닥콩닥 뛰었어요. 서준이는 용기를 내어 양말이 다르다고 솔직하게 말했어요.
친구들은 오히려 알록달록한 양말이 멋지다며 칭찬해 주었어요. 서준이는 기분이 좋아져서 친구들에게 맛있는 간식을 나누어 주었어요. 모두 함께 웃으며 즐거운 아침을 보냈어요.
이제 서준이는 짝짝이 양말이 정말 마음에 들어요. 친구들과 손을 잡고 밖으로 나가 신나게 뛰놀았어요. 서준이의 얼굴에 자신감 넘치는 미소가 가득해요.
Intended Uses
- Small Korean language model experiments
- Short Korean story generation experiments
- Continuation generation in a child-friendly story domain
- Comparative studies on Korean tokenizers and small LMs
- Educational NLP demos
Out-of-Scope Uses
- Factual question answering
- Use as a general-purpose Korean knowledge model
- Direct use in safety-critical services
- Direct use as children's reading material without human review
Known Limitations
- Some outputs have simple story structures or end on similar morals.
- Awkward names, unusual proper nouns, weak causal links, or repeated phrases may appear.
- Some outputs may contain unnatural Korean or awkward sentences.
- Review outputs before use, to the degree the use case requires.
- The included generation presets are starting points, not optimal settings.
License
The model files and accompanying code are released under the MIT License.