Tiny-Ko-Stories-35M

Tiny-Ko-Stories-35M is a Korean continuation language model in the 35M-parameter class, trained on the Tiny-Ko-Stories corpus.

This is not an instruction-tuned assistant. It continues a story from an opening sentence.

Tiny-Ko-Stories is not a translation of the English TinyStories dataset. The corpus was generated and refined in Korean from the start so that it captures natural Korean names, sentence rhythm, onomatopoeia and mimetic words, color terms, and small problem-and-resolution story structures.

This repository provides a custom PyTorch checkpoint, model definition, tokenizer, and generation script.

Model Summary

Field	Value
Parameters	34,217,856
Architecture	Decoder-only Transformer
Layers	10
Hidden size	384
Attention heads	6
KV heads	2
FFN dimension	1,536
Context length	512 tokens
Vocabulary size	32,768
Embedding	tied input/output embeddings
Language	Korean
License	MIT
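
As a sanity check on the table above, the parameter total can be rebuilt from the listed dimensions. The table does not state the layer internals, so this sketch assumes bias-free projections, grouped-query attention with a head size of 384 / 6 = 64, a gated feed-forward block with three weight matrices, two weight-only normalization layers per block plus one final norm, and no learned position embeddings. The function name `count_parameters` is hypothetical. Under these assumptions the count matches the reported 34,217,856, which is consistent with such a layout but does not confirm it; model.py remains the authoritative definition.

```python
def count_parameters(vocab=32_768, hidden=384, layers=10, heads=6,
                     kv_heads=2, ffn=1_536, tied_embeddings=True):
    """Estimate the parameter count under the assumptions stated above."""
    head_dim = hidden // heads                 # 64
    kv_dim = kv_heads * head_dim               # 128: K and V are narrower than Q

    # Q and O projections are hidden x hidden; K and V are hidden x kv_dim.
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim
    # ASSUMPTION: gated FFN with gate, up, and down matrices, no biases.
    feed_forward = 3 * hidden * ffn
    # ASSUMPTION: two weight-only norms per block.
    norms = 2 * hidden

    per_layer = attention + feed_forward + norms
    embedding = vocab * hidden
    # Tied embeddings reuse the input matrix as the output head.
    output_head = 0 if tied_embeddings else vocab * hidden
    final_norm = hidden                        # ASSUMPTION: one final norm

    return embedding + layers * per_layer + final_norm + output_head


print(count_parameters())  # 34217856
```

The tied embedding matrix alone accounts for 12,582,912 of these parameters, a little over a third of the model.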

Training Summary

Field	Value
Training corpus	Tiny-Ko-Stories
Stories	2,003,542
Tokenizer	32K Korean tokenizer
Training tokens	about 794.8M tokens
Training epochs	about 4.0 epochs
Selected checkpoint	validation-best checkpoint
Best validation loss	2.1168
Best validation perplexity	8.3042
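
The figures in the training summary can be related to each other with a few lines of arithmetic. This sketch assumes the validation loss is a mean per-token cross-entropy in nats, so perplexity is its exponential. The tokens-per-story figure is only a rough estimate: it assumes every story is seen once per epoch and ignores any validation hold-out, padding, or special tokens. The helper names are hypothetical.

```python
import math


def perplexity_from_loss(loss_nats):
    """Perplexity is exp(mean per-token cross-entropy in nats)."""
    return math.exp(loss_nats)


def tokens_per_story(total_tokens, epochs, stories):
    """Rough average story length implied by the training totals."""
    return total_tokens / epochs / stories


ppl = perplexity_from_loss(2.1168)
avg_len = tokens_per_story(794.8e6, 4.0, 2_003_542)

# The small gap to the reported 8.3042 is expected because the loss is rounded.
print(f"perplexity from loss: {ppl:.3f}")
print(f"approx. tokens per story: {avg_len:.0f}")
```

An average on the order of 100 tokens per story sits well inside the 512-token context length.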

Files

tiny-ko-stories-35m.pt: model checkpoint
model.py: model definition
generate.py: generation script
tokenizer.json: tokenizer
vocab.json: tokenizer vocabulary
model_config.json: model architecture summary
training_summary.json: compact training summary
tokenizer_summary.json: compact tokenizer summary
generation_config.json: sample generation presets
requirements.txt: minimal Python requirements
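
After downloading the repository, it can be useful to confirm that every file listed above is present before running anything. The following is a minimal sketch using only the standard library; the helper name `missing_files` is hypothetical and is not part of the repository.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = [
    "tiny-ko-stories-35m.pt",
    "model.py",
    "generate.py",
    "tokenizer.json",
    "vocab.json",
    "model_config.json",
    "training_summary.json",
    "tokenizer_summary.json",
    "generation_config.json",
    "requirements.txt",
]


def missing_files(repo_dir="."):
    """Return the expected file names that are absent from repo_dir."""
    root = Path(repo_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    print("all files present" if not missing else f"missing: {missing}")
```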

Usage

The command below uses the balanced preset from generation_config.json. It is a reasonable starting point for manual testing, not a proven optimal decoding setting.

python generate.py \
  --checkpoint tiny-ko-stories-35m.pt \
  --tokenizer tokenizer.json \
  --device cpu \
  --prompt "작은 마을에 조용한 아침이 찾아왔어요." \
  --temperature 0.55 \
  --top-p 0.9 \
  --top-k 40 \
  --repetition-penalty 1.08 \
  --max-new-tokens 180 \
  --seed 42
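
To make the decoding flags above more concrete, here is a minimal pure-Python sketch of how temperature, top-k, top-p, and a repetition penalty are commonly combined when choosing one next token. It is not taken from generate.py, whose implementation and order of operations may differ. The function name `sample_next_token` and the CTRL-style penalty (divide positive logits, multiply negative ones) are assumptions made for illustration.

```python
import math
import random


def sample_next_token(logits, generated, temperature=0.55, top_k=40,
                      top_p=0.9, repetition_penalty=1.08, rng=None):
    """Pick one token id from raw logits (illustrative, not generate.py)."""
    rng = rng or random.Random()
    scores = list(logits)

    # Repetition penalty: lower the score of tokens that already appeared.
    for tok in set(generated):
        if scores[tok] > 0:
            scores[tok] /= repetition_penalty
        else:
            scores[tok] *= repetition_penalty

    # Temperature below 1 sharpens the distribution.
    scores = [s / temperature for s in scores]

    # Top-k: keep only the k highest-scoring token ids.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    order = order[:top_k]

    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scores[order[0]]
    weights = [math.exp(scores[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative probability >= top_p.
    kept_ids, kept_probs, cumulative = [], [], 0.0
    for tok, p in zip(order, probs):
        kept_ids.append(tok)
        kept_probs.append(p)
        cumulative += p
        if cumulative >= top_p:
            break

    # random.choices renormalizes the remaining weights.
    return rng.choices(kept_ids, weights=kept_probs, k=1)[0]


toy_logits = [5.0, 1.0, 0.0, -1.0]
print(sample_next_token(toy_logits, generated=[], rng=random.Random(42)))
```

Lower temperature and smaller top-k or top-p values make continuations more predictable; higher values add variety and raise the risk of incoherent text.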

Other example prompts:

민지는 아침에 작은 노란 우산을 들고 마당으로 나갔어요.
서준이는 울긋불긋한 잎사귀 하나를 찾았어요.

Example output:

작은 마을에 조용한 아침이 찾아왔어요. 서준이는 오늘 친구들과 함께 놀기로 했어요. 그런데 아침에 신은 양말 색깔이 서로 달라서 조금 부끄러웠어요.

서준이는 양말을 숨기려고 발을 높이 들어 올렸어요. 하지만 친구들이 다가오자 가슴이 콩닥콩닥 뛰었어요. 서준이는 용기를 내어 양말이 다르다고 솔직하게 말했어요.

친구들은 오히려 알록달록한 양말이 멋지다며 칭찬해 주었어요. 서준이는 기분이 좋아져서 친구들에게 맛있는 간식을 나누어 주었어요. 모두 함께 웃으며 즐거운 아침을 보냈어요.

이제 서준이는 짝짝이 양말이 정말 마음에 들어요. 친구들과 손을 잡고 밖으로 나가 신나게 뛰놀았어요. 서준이의 얼굴에 자신감 넘치는 미소가 가득해요.

Intended Uses

Small Korean language model experiments
Short Korean story generation experiments
Continuation generation in a child-friendly story domain
Comparative studies on Korean tokenizers and small LMs
Educational NLP demos

Out-of-Scope Uses

Factual question answering
General-purpose Korean knowledge modeling
Direct use in safety-critical services
Direct use as children's reading material without human review

Known Limitations

Some outputs may have a simple story structure or end with similar morals.
Awkward names, unusual proper nouns, weak causal links, or repeated phrases may appear.
Some outputs may contain unnatural Korean or awkward sentences.
Review outputs as appropriate for the intended use.
The included generation presets are starting points, not optimal settings.
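
When reviewing outputs for the repetition issue noted above, a crude automatic signal can help decide which samples to read first. The following sketch measures the share of repeated word n-grams in a text, splitting on whitespace. The helper name `repeated_ngram_ratio` and the choice of n = 3 are assumptions for illustration; this is a screening aid, not a quality metric used by the authors, and it does not replace human review.

```python
def repeated_ngram_ratio(text, n=3):
    """Fraction of whitespace-token n-grams that repeat an earlier n-gram."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n tokens
    return 1.0 - len(set(ngrams)) / len(ngrams)


sample = "서준이는 양말을 숨겼어요. 서준이는 양말을 숨겼어요."
print(f"{repeated_ngram_ratio(sample):.2f}")  # 0.25
```

A value near 0 means almost no repeated phrases; samples with higher values are candidates for closer inspection or for regenerating with a different seed or repetition penalty.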

License

The model files and accompanying code are released under the MIT License.
