⚡ BabyKilat

A Lightweight MoE Language Model with Hybrid Attention

140M parameters | Trained on ~1B tokens | 6,000 steps | Free to use

Model Description

BabyKilat is a small language model featuring:

Hybrid Attention — Combines efficient global decay heads with precise latent MLA heads
Mixture-of-Experts (MoE) — 4 experts (1 shared + 3 routed), 2 active per token
No Positional Encoding (NoPE) — Relies on attention structure rather than explicit position embeddings
Lightweight Design — Runs on consumer GPUs with just 6GB VRAM (RTX 4050 compatible)
Built from Scratch — Trained using the custom Kilat framework
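
The expert layout in the list above (one always-on shared expert plus the best-scoring of three routed experts, giving two active experts per token) can be sketched in plain Python. This is an illustration under assumptions, not Kilat's implementation: the router is assumed to apply a softmax over the three routed experts and pick the top one, and `moe_forward`, the toy experts, and the router scores are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, shared_expert, routed_experts, router_scores):
    """Combine the shared expert with the top-1 routed expert.

    Assumption: top-1 routing, with the softmax probability used as the gate.
    Returns the layer output and the index of the chosen routed expert.
    """
    probs = softmax(router_scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    out = shared_expert(x) + probs[best] * routed_experts[best](x)
    return out, best

# Toy scalar "experts" standing in for feed-forward networks (hypothetical).
shared = lambda x: 0.5 * x
routed = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x]

y, chosen = moe_forward(3.0, shared, routed, router_scores=[0.1, 2.0, -1.0])
active = 1 + 1  # shared expert + one routed expert
print(chosen, active, round(y, 3))
```

Only the chosen routed expert is evaluated, so per-token compute stays at two experts even though four sets of expert weights are stored.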

Architecture Details

Parameter	Value
Embedding Dimension	512
Number of Layers	8
Attention Heads	8
Head Dimension	64
Latent Dimension	128
Sequence Length	2048
Vocabulary Size	50,257 (GPT-2)
Total Parameters	140.9M
MoE Experts	4 (2 active, 1 shared)
Recall Ratio	0.5 (balanced)
FF Multiplier	2.667
Positional Encoding	None (NoPE)
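
Several values in the table are tied together by simple arithmetic. The sketch below checks those relationships; `BabyKilatConfig` is a hypothetical container (not a Kilat class), the 4x compression is assumed to be measured against the 512-dim model width, and the memory figures cover weights only at 4 bytes (FP32) or 2 bytes (BF16) per parameter.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BabyKilatConfig:
    """Values copied from the architecture table; the class itself is hypothetical."""
    embed_dim: int = 512
    num_layers: int = 8
    num_heads: int = 8
    head_dim: int = 64
    latent_dim: int = 128
    recall_ratio: float = 0.5
    ff_multiplier: float = 2.667
    total_params: float = 140.9e6

cfg = BabyKilatConfig()

# Heads tile the embedding exactly: 8 heads x 64 dims = 512.
assert cfg.num_heads * cfg.head_dim == cfg.embed_dim

# A recall ratio of 0.5 sends half of the heads down each path.
# Which path the ratio counts is an assumption; at 0.5 it makes no difference.
mla_heads = int(cfg.num_heads * cfg.recall_ratio)
decay_heads = cfg.num_heads - mla_heads

# Caching a 128-dim latent instead of a 512-dim vector is a 4x reduction.
kv_compression = cfg.embed_dim // cfg.latent_dim

# Expert hidden width before any rounding to a hardware-friendly size.
ff_hidden = cfg.embed_dim * cfg.ff_multiplier

# Weight storage only (no activations, optimizer state, or KV-cache).
weights_mb_fp32 = cfg.total_params * 4 / 1e6
weights_mb_bf16 = cfg.total_params * 2 / 1e6

print(mla_heads, decay_heads, kv_compression, ff_hidden)
print(f"{weights_mb_fp32:.1f} MB in FP32, {weights_mb_bf16:.1f} MB in BF16")
```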

Hybrid Attention Mechanism

Each layer splits heads into two parallel paths:

                    Input x [B, N, D]
                           │
           ┌───────────────┴────────────────┐
           │                                │
           ▼                                ▼
   ╔═══════════════════╗           ╔════════════════════╗
   ║   PATH 1          ║           ║   PATH 2           ║
   ║   Global Decay    ║           ║   Latent MLA       ║
   ║   (O(N) linear)   ║           ║   (O(N²) exact)    ║
   ╚═══════════════════╝           ╚════════════════════╝
           │                                │
           ▼                                ▼
    Efficient recurrent                  High-precision
    state update                         full attention
           │                                │
           └───────────────┬────────────────┘
                           │
                           ▼
                    Learned gate fusion
                    (per-head, per-token)

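Global Decay Path — Linear-time attention with exponential decay; it carries a fixed-size recurrent state, so memory per decoding step is O(1) rather than a growing KV-cache
Latent MLA Path — Standard softmax attention over a latent-compressed KV-cache (4x smaller)
Recall Ratio — 0.5 means half of the heads use each path (4 of 8 per layer), balancing quality against efficiency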
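
To make the constant-memory claim for the Global Decay path concrete, here is a minimal single-head sketch in plain Python. It assumes an unnormalized linear-attention form with a scalar decay `gamma`; Kilat's actual kernel may differ (feature maps, normalization, per-head decay), and the function names are hypothetical. The recurrent version keeps only a `d_k x d_v` state between steps, and the quadratic version is included purely to check that both give the same outputs.

```python
def decay_attention_recurrent(qs, ks, vs, gamma):
    """Linear attention with exponential decay, computed as a recurrence.

    The state S (d_k x d_v) is the only thing carried between steps,
    so memory per step does not grow with sequence length.
    """
    d_k, d_v = len(ks[0]), len(vs[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        # State update: S <- gamma * S + k v^T
        S = [[gamma * S[i][j] + k[i] * v[j] for j in range(d_v)]
             for i in range(d_k)]
        # Readout: o = q^T S
        outputs.append([sum(q[i] * S[i][j] for i in range(d_k))
                        for j in range(d_v)])
    return outputs

def decay_attention_quadratic(qs, ks, vs, gamma):
    """Same result via the explicit O(N^2) causal sum, used as a reference."""
    d_v = len(vs[0])
    outputs = []
    for t, q in enumerate(qs):
        o = [0.0] * d_v
        for s in range(t + 1):
            # Older positions are down-weighted by gamma^(t - s).
            w = gamma ** (t - s) * sum(qi * ki for qi, ki in zip(q, ks[s]))
            o = [oj + w * vj for oj, vj in zip(o, vs[s])]
        outputs.append(o)
    return outputs

# Toy inputs: 3 tokens, d_k = d_v = 2.
qs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
ks = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0]]
vs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

rec = decay_attention_recurrent(qs, ks, vs, gamma=0.9)
quad = decay_attention_quadratic(qs, ks, vs, gamma=0.9)
max_diff = max(abs(a - b) for ra, rb in zip(rec, quad) for a, b in zip(ra, rb))
print(max_diff)
```

The Latent MLA path, by contrast, keeps one cached entry per past token, which is why its cost grows with sequence length.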

Quick Start

Option 1: Using Kilat (Recommended)

pip install git+https://github.com/Airukua/kilat.git

from kilat import KilatTransformer
from kilat.data import AutoTokenizer
import torch

# Load the pretrained model and its tokenizer
model = KilatTransformer.from_pretrained("AiRukua/BabyKilat")
tokenizer = AutoTokenizer.from_pretrained("AiRukua/BabyKilat")

# Use the GPU when available and switch to evaluation mode
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

prompt = "The future of artificial intelligence"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Sample up to 100 new tokens with temperature, top-k, and top-p filtering
outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    repetition_penalty=1.3,
    top_k=50,
    top_p=0.95,
)

# Decode the generated token IDs back to text
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Option 2: Using TextGenerator Wrapper

from kilat import KilatTransformer, TextGenerator
from kilat.data import AutoTokenizer

model = KilatTransformer.from_pretrained("AiRukua/BabyKilat")
tokenizer = AutoTokenizer.from_pretrained("AiRukua/BabyKilat")
generator = TextGenerator(model, tokenizer)

text = generator.generate(
    "The future of artificial intelligence",
    max_new_tokens=80,
    temperature=0.9,
    do_sample=True,
)
print(text)

Training Details

Want to train it yourself? A Google Colab notebook is available: 👉 https://colab.research.google.com/drive/1dI8NjMjxzkBBnlkzurE-OlGtKhA14vuo?usp=sharing

Dataset

Source: FineWeb-Edu (educational content)
Format: Memory-mapped numpy arrays
Sequence Length: 2048 tokens
Tokens Seen: ~3.1B tokens (6,000 steps × 32 batch × 8 grad accum × 2,048 seq len)
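
The formula in parentheses can be evaluated directly. A minimal sketch, assuming each of the 6,000 steps is one optimizer update over the full effective batch of 256 sequences; if the step count refers to micro-batches instead, the total is 8 times smaller. The helper name `tokens_seen` is hypothetical.

```python
def tokens_seen(steps, micro_batch, grad_accum, seq_len):
    """Tokens processed, assuming each step is one optimizer update."""
    return steps * micro_batch * grad_accum * seq_len

total = tokens_seen(steps=6_000, micro_batch=32, grad_accum=8, seq_len=2_048)
per_step = total // 6_000

print(f"{per_step:,} tokens per optimizer step")          # 524,288
print(f"{total:,} tokens in total ({total / 1e9:.2f}B)")  # 3,145,728,000 (3.15B)
```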

Training Configuration

Hyperparameter	Value
Optimizer	AdamW (β1=0.9, β2=0.95)
Peak Learning Rate	6e-3
Weight Decay	0.1
Gradient Clipping	1.0
Batch Size	32 × 8 grad accum = 256 effective
Sequence Length	2048
Warmup Steps	1,000
Scheduler	WSDLR (5,000 decay steps)
Precision	BF16
Steps Trained	6,000 / 50,000
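
The schedule in the table can be sketched as a plain function. This assumes that WSDLR denotes a warmup-stable-decay schedule, that warmup and decay are both linear, and that this run has no stable phase, so 1,000 warmup steps plus 5,000 decay steps cover the 6,000 steps trained. None of these shapes are stated in the table, and `wsd_lr` is a hypothetical name rather than Kilat's scheduler API.

```python
def wsd_lr(step, peak_lr=6e-3, warmup_steps=1_000, stable_steps=0,
           decay_steps=5_000, min_lr=0.0):
    """Warmup-stable-decay learning rate at a given optimizer step.

    Assumptions: linear warmup to peak_lr, optional constant phase,
    then linear decay to min_lr.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        return peak_lr
    progress = (step - warmup_steps - stable_steps) / decay_steps
    progress = min(progress, 1.0)
    return peak_lr + (min_lr - peak_lr) * progress

for s in (0, 999, 1_000, 3_500, 6_000):
    print(s, wsd_lr(s))
```

Setting `stable_steps` to a positive value inserts a constant phase at the peak rate before the decay starts, which is how a longer run of the same schedule family would typically be configured.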

Training Results (at step 6,000)

Metric	Value
Train Loss	~3.33
Eval Loss	3.348
Perplexity	28.44
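
Perplexity is the exponential of the mean cross-entropy loss, so the two eval rows can be cross-checked. The sketch assumes the loss is reported in nats per token, which is the default for cross-entropy in PyTorch.

```python
import math

eval_loss = 3.348                  # mean cross-entropy, nats per token
perplexity = math.exp(eval_loss)   # perplexity = e^loss

print(f"{perplexity:.2f}")
```

The result agrees with the reported 28.44 to within the rounding of the loss value.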

Hardware & Estimated Training Time

GPU	VRAM	Estimated Time
NVIDIA RTX 6000 Blackwell	96GB	~7 hours
NVIDIA RTX 4050	6GB	~3–4 days
NVIDIA T4 (Google Colab)	16GB	~1–2 days

Generation Strategies

Strategy	Parameters	Use Case
Greedy	`do_sample=False`	Deterministic, factual tasks
Sampling	`do_sample=True, temperature=0.8`	Creative writing
Top-K	`top_k=50`	Focused diversity
Top-P	`top_p=0.95`	Nucleus sampling
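
The strategies in the table compose: temperature rescales the logits, top-k keeps the k most likely tokens, and top-p then keeps the smallest set of those tokens whose cumulative probability reaches p. The sketch below shows that pipeline on a toy distribution. The order of operations follows common practice and is an assumption about Kilat's `generate`; `filter_probs` and `greedy` are hypothetical helpers.

```python
import math

def filter_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return renormalized probabilities after temperature, top-k, and top-p.

    top_k=0 and top_p=1.0 disable the respective filters.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token indices from most to least likely, truncated to top_k if set.
    kept = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        kept = kept[:top_k]
    mass = sum(probs[i] for i in kept)

    # Nucleus: smallest prefix whose renormalized cumulative mass reaches top_p.
    nucleus, cumulative = [], 0.0
    for i in kept:
        nucleus.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in nucleus)
    return {i: probs[i] / mass for i in nucleus}

def greedy(logits):
    """Greedy decoding picks the highest-scoring token (do_sample=False)."""
    return max(range(len(logits)), key=logits.__getitem__)

logits = [2.0, 1.0, 0.5, -1.0]
print(greedy(logits))
print(filter_probs(logits, temperature=0.8, top_k=3, top_p=0.95))
```

Sampling then draws the next token from the returned distribution, for example with `random.choices(list(p), weights=list(p.values()))`.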

Requirements

pip install git+https://github.com/Airukua/kilat.git
pip install torch transformers sentencepiece safetensors

Python 3.10+
PyTorch 2.0+

Repository Files

BabyKilat/
├── config.json              # Model configuration
├── config.yaml              # Human-readable config
├── model.safetensors        # Model weights (140M params)
├── tokenizer_config.json    # Tokenizer config
├── tokenizer.json           # GPT-2 tokenizer
├── vocab.json               # Vocabulary
└── merges.txt               # BPE merges

Limitations

English only. Trained exclusively on English text.
No instruction tuning. Base language model only, not aligned for chat or task completion.
Partially trained. This checkpoint represents 6,000 of 50,000 planned steps (12% of full training). Quality is expected to improve with further training.
Limited capacity. At 140M parameters, the model is expected to underperform larger models on knowledge-intensive tasks. Outputs may be factually incorrect.
Context window. Inputs are limited to 2,048 tokens.
No formal evaluation. Benchmark results have not been reported for this release.

Citation

@software{kilat2026,
  author = {Abdul Wahid Rukua},
  title  = {Kilat: Kernelized Lightweight Transformer Training Framework},
  year   = {2026},
  url    = {https://github.com/Airukua/kilat}
}

@misc{babykilat2026,
  author = {Abdul Wahid Rukua},
  title  = {BabyKilat: A Lightweight MoE Language Model},
  year   = {2026},
  publisher = {Hugging Face},
  url    = {https://huggingface.co/AiRukua/BabyKilat}
}

Author

Abdul Wahid Rukua
AI/ML Engineer & Researcher

License

MIT License — see LICENSE for details.

Built with ⚡ Kilat — Fast, Lightweight, Transparent
