TinyBuddy-80K

🏆 RECORD ATTEMPT: The smallest functional English-speaking language model on Hugging Face. At 83,856 parameters (~84K), it takes on the NaA-IA/Small-ever record by being both tiny AND coherent.

Mission: Prove that under 100K parameters, a language model can still learn English patterns and generate recognizable text. The goal is not just to be small; it is to be the smallest that works.


Model Details

| Property | Value |
| --- | --- |
| Parameters | 83,856 (~84K) |
| Layers | 1 |
| Hidden size | 48 |
| Attention heads | 4 query / 2 key-value (GQA) |
| FF intermediate size | 192 |
| Context length | 128 |
| Vocabulary | 1,024 tokens (BPE) |
| Architecture | Llama-style: RMSNorm, RoPE, SiLU/SwiGLU, tied embeddings |
| Precision | float32 |

Parameter Breakdown

| Component | Parameters |
| --- | --- |
| Token Embedding (tied) | 49,152 |
| Attention (Q/K/V/O) | 6,912 |
| FeedForward (Gate/Up/Down) | 27,648 |
| Normalization (3× RMSNorm) | 144 |
| Total | 83,856 |
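
The breakdown can be reproduced from the values listed under Model Details. The sketch below is only a sanity check, not code from this repository: it assumes bias-free linear layers, a head dimension of hidden size / query heads (48 / 4 = 12), and one gain vector per RMSNorm. The function name `count_params` is hypothetical. Under those assumptions the attention block comes to 6,912 weights and the four components sum to the stated total of 83,856.

```python
def count_params(vocab=1024, hidden=48, n_heads=4, n_kv_heads=2, ffn=192, n_norms=3):
    """Count weights of a bias-free, single-block Llama-style model with tied embeddings.

    Assumption: no bias terms anywhere, and head_dim = hidden // n_heads.
    """
    head_dim = hidden // n_heads        # 48 // 4 = 12
    kv_dim = n_kv_heads * head_dim      # 2 * 12 = 24 (K and V are narrower under GQA)
    return {
        "embedding": vocab * hidden,                              # shared with the LM head
        "attention": 2 * hidden * hidden + 2 * hidden * kv_dim,   # Q and O, then K and V
        "feed_forward": 3 * hidden * ffn,                         # gate, up, down
        "norms": n_norms * hidden,                                # one gain vector per RMSNorm
    }


counts = count_params()
for name, n in counts.items():
    print(f"{name:>12}: {n:,}")
print(f"{'total':>12}: {sum(counts.values()):,}")
```

Because the embedding matrix doubles as the output projection, it is counted once; an untied LM head would add another 49,152 weights.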

Architecture

TinyBuddy-80K uses a single transformer block with:

  • RMSNorm (pre-norm): efficient normalization
  • Grouped Query Attention: 4 query heads, 2 KV heads (saves params)
  • RoPE (Rotary Position Embeddings): relative position encoding
  • SwiGLU (SiLU-gated MLP): modern activation
  • Tied embeddings: input and output share weights (saves ~49K params!)

Input → Embedding → [RMSNorm → GQA Attention → +] → [RMSNorm → SwiGLU FFN → +] → RMSNorm → LM Head → Output
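
To illustrate why RoPE counts as a relative position encoding, here is a small pure-Python sketch. It is not taken from this repository's `model.py`: the interleaved pairing of dimensions and the base of 10000 are assumptions (other implementations pair dimension i with i + d/2), and the helper names `rope` and `dot` are hypothetical. It rotates a query and a key for one 12-dimensional head (48 / 4) and checks that their dot product depends only on the distance between positions.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position-dependent angles.

    Assumption: interleaved pairing and base 10000; the real model may differ.
    """
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)   # earlier pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


head_dim = 48 // 4  # 12 dimensions per head
q = [0.1 * (i + 1) for i in range(head_dim)]
k = [0.05 * (head_dim - i) for i in range(head_dim)]

# Same offset (2 positions apart) at two very different absolute positions.
near = dot(rope(q, 5), rope(k, 3))
far = dot(rope(q, 105), rope(k, 103))
print(abs(near - far) < 1e-9)
```

The two scores agree up to floating-point rounding, which is the property attention relies on: shifting both tokens by the same amount leaves their interaction unchanged.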

Training

  • Dataset: TinyStories (~5,000 stories)
  • Tokenizer: Byte-level BPE, 1,024 vocabulary (trained from scratch)
  • Optimizer: AdamW (lr=5e-3, weight_decay=0.1)
  • Schedule: Warmup (50 steps) + Cosine decay
  • Steps: 1,000 on CPU
  • Hardware: Single CPU core (the challenge!)
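
The schedule described above can be sketched as a plain function of the step index. This is an illustration under assumptions, not the training script itself: it assumes linear warmup, a decay floor of zero (`min_lr`), and that warmup reaches the peak rate on its last step. The function name `lr_at` is hypothetical.

```python
import math


def lr_at(step, peak_lr=5e-3, warmup_steps=50, total_steps=1000, min_lr=0.0):
    """Learning rate at `step` for linear warmup followed by cosine decay.

    Assumptions: linear warmup shape and a final rate of `min_lr` (0.0 here).
    """
    if step < warmup_steps:
        # Ramp linearly from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # clamp so steps past the end stay at min_lr
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 49, 50, 525, 1000):
    print(step, f"{lr_at(step):.6f}")
```

With PyTorch, a function like this is typically wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning the ratio `lr_at(step) / peak_lr` as the multiplier.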

Usage

import json

import torch
from tokenizers import Tokenizer

from model import create_model

# Load the model configuration
with open("config.json") as f:
    config = json.load(f)

# Build the model and load the trained weights on CPU
model = create_model(config)
model.load_state_dict(torch.load("output/model.pt", map_location="cpu"))
model.eval()

# Load the tokenizer
tokenizer = Tokenizer.from_file("data/tokenizer.json")

# Encode the prompt and prepend the BOS token (id 1)
prompt = "Once upon a time,"
encoded = tokenizer.encode(prompt)
ids = [1] + encoded.ids
input_ids = torch.tensor([ids], dtype=torch.long)

# Generate up to 60 new tokens without tracking gradients, then decode
with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=60, temperature=0.8, top_k=40)
print(tokenizer.decode(output_ids[0].tolist(), skip_special_tokens=True))
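
The `temperature` and `top_k` arguments control how each next token is drawn. The sketch below shows the usual meaning of those two settings in pure Python; it is an assumption about what `model.generate` does internally, not a copy of it, and the function name `sample_top_k` is hypothetical. Temperature must be greater than zero.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=40, rng=random):
    """Draw one token id from `logits` using top-k filtering and temperature scaling."""
    # Keep only the indices of the k highest-scoring tokens.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Lower temperature sharpens the distribution, higher temperature flattens it.
    scaled = [logits[i] / temperature for i in top]
    # Softmax numerators over the survivors (subtract the max for stability).
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the demo is repeatable
logits = [2.0, 1.0, 0.5, -1.0, -3.0]
picks = [sample_top_k(logits, top_k=2, rng=rng) for _ in range(200)]
print(sorted(set(picks)))  # only ids 0 and 1 can appear with top_k=2
```

With a 1,024-token vocabulary, `top_k=40` discards roughly 96% of the candidates at every step, which helps keep a model this small from wandering into very unlikely tokens.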

Limitations

This model is extremely small: it has about as many parameters as a 290×290 grayscale image has pixels.

What works:

  • Basic word patterns and short phrases
  • Recognizable English-like structure
  • Story-like opening sentences

What's broken:

  • Very limited coherence (1–2 sentences max)
  • High repetition
  • No factual knowledge or reasoning
  • Limited vocabulary diversity

This model exists purely to explore the lower bounds of language modeling. It shows that even at ~84K parameters, a neural network can capture statistical patterns in English text.


The Record

| Model | Parameters | Speaks English? |
| --- | --- | --- |
| NaA-IA/Small-ever | 112 | ❌ No |
| TinyBuddy-80K | 83,856 | ✅ Yes |

TinyBuddy-80K may not be the absolute smallest model ever, but it's the smallest that actually generates recognizable English text. That's the real achievement.


Citation

@misc{tinybuddy100k,
  title  = {TinyBuddy-80K: An 84K parameter Llama-style model that speaks English},
  year   = {2026},
  note   = {Record attempt: smallest functional English text generator.}
}

LONG LIVE TINYBUDDY-80K 🚀
