TinyStories Small Language Model (RoPE + SwiGLU)

A ~30M-parameter language model trained from scratch on the TinyStories dataset.

Architecture

Component          Choice                          Why
Position encoding  RoPE                            Used in LLaMA, Mistral, Gemma; better length generalization
Activation         SwiGLU                          Used in LLaMA, PaLM; better gradient flow than GELU
Normalization      RMSNorm                         Used in LLaMA; faster than LayerNorm (no mean-centering)
Attention          Flash Attention (PyTorch 2.0+)  Memory-efficient causal attention
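
To make two rows of the table concrete, here is a minimal pure-Python sketch of RoPE and RMSNorm. It is an illustration under stated assumptions, not the code used to train this model: the function names, the interleaved pairing of dimensions, the frequency base of 10000 and the epsilon value are all assumptions, since the card does not show the implementation.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle pos * base**(-i/d).

    Assumption: interleaved pairs and base=10000; real implementations may
    pair dimensions differently.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def rms_norm(x, gain, eps=1e-6):
    """Scale x by its root mean square; unlike LayerNorm, no mean is subtracted."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query/key vectors, purely for illustration.
q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.5, -0.75, 1.0]

# Both pairs of positions are 2 apart, so the attention scores should agree.
score_near = dot(rope(q, 3), rope(k, 1))
score_far = dot(rope(q, 103), rope(k, 101))
print(abs(score_near - score_far) < 1e-9)
```

The two scores agree because the rotated dot product depends only on the offset between the query and key positions, which is the property usually credited for RoPE's length generalization.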

Parameters

  • Total: 29.92M
  • Layers: 6 | Heads: 6 | d_model: 384
  • Context window: 256 tokens
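
As a rough sanity check on the total, the count can be reproduced from the dimensions above. The sketch below assumes several things the card does not state: a GPT-2 vocabulary of 50,257 tokens, input and output embeddings that share weights, bias-free linear layers, and a SwiGLU hidden width of 1024. RoPE adds no learned parameters.

```python
def count_params(vocab_size=50257, d_model=384, n_layers=6, ffn_hidden=1024):
    # vocab_size and ffn_hidden are assumptions, not values from the card.
    embedding = vocab_size * d_model     # shared with the output head (assumed tied)
    attention = 4 * d_model * d_model    # Wq, Wk, Wv, Wo, no biases (assumed)
    swiglu = 3 * d_model * ffn_hidden    # gate, up and down projections
    norms = 2 * d_model                  # two RMSNorm gain vectors per block
    per_layer = attention + swiglu + norms
    return embedding + n_layers * per_layer + d_model  # plus the final RMSNorm gain

total = count_params()
print(f"{total:,} parameters ({total / 1e6:.2f}M)")
# prints: 29,920,512 parameters (29.92M)
```

Under these assumed values the count lines up with the reported 29.92M, though the card itself does not confirm them. Note that roughly two thirds of the parameters would sit in the embedding table.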

Training

  • Dataset: TinyStories (~2.1M short stories)
  • Optimizer: AdamW (β₁=0.9, β₂=0.95, weight_decay=0.1)
  • LR Schedule: Linear warmup + Cosine decay
  • Mixed precision: bfloat16
  • Best validation loss: 1.6472 | Perplexity: 5.2 (exp of the loss, ≈ 5.19)
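
The warmup-plus-cosine schedule listed above can be sketched as a plain function of the step number. The peak and minimum learning rates and the step counts below are hypothetical placeholders, since the card does not report them, and the function name `lr_at` is made up for this example.

```python
import math

def lr_at(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=1000, total_steps=20000):
    """Learning rate at a given step; all default values are assumed, not from the card."""
    if step < warmup_steps:
        # Linear warmup from near zero up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        # Hold at the floor once the decay is finished.
        return min_lr
    # Cosine decay from max_lr down to min_lr.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + cosine * (max_lr - min_lr)

for step in (0, 500, 1000, 10500, 20000):
    print(step, lr_at(step))
```

In a PyTorch training loop, a function like this would typically be evaluated once per step and written into each of the optimizer's parameter groups before calling `optimizer.step()`.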

Sample Output

Prompt: "Once upon a time there was a little girl named Lily"

[Add sample output after training]

Usage

import torch
import tiktoken
from huggingface_hub import hf_hub_download

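# Download the checkpoint from the Hub (cached locally after the first call)
weights_path = hf_hub_download(repo_id="Manushi0304/tinystories-slm-rope", filename="pytorch_model.bin")

# Load the raw state dict onto the CPU
state_dict = torch.load(weights_path, map_location="cpu")

# Rebuild the model from its config using the model class from the training code
# (not included in this card), then call model.load_state_dict(state_dict)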