⚡ BabyKilat
A Lightweight MoE Language Model with Hybrid Attention
140M parameters | Trained on ~1B tokens | 6,000 steps | Free to use
Model Description
BabyKilat is a small language model featuring:
- Hybrid Attention: combines efficient global decay heads with precise latent MLA heads
- Mixture-of-Experts (MoE): 4 experts (1 shared + 3 routed), 2 active per token
- No Positional Encoding (NoPE): relies on attention structure rather than explicit position embeddings
- Lightweight Design: runs on consumer GPUs with just 6GB VRAM (RTX 4050 compatible)
- Built from Scratch: trained using the custom Kilat framework
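The expert layout above can be illustrated with a small NumPy sketch. This is not Kilat's implementation: it assumes that "2 active per token" means the shared expert plus the top-1 routed expert, and it reduces each expert to a single linear map for brevity. The function name `moe_forward` and all weight names are hypothetical.

```python
import numpy as np

def moe_forward(x, shared_w, routed_w, router_w):
    """Toy MoE layer: 1 shared expert + top-1 of E routed experts (assumed layout).

    x: [T, D] tokens, shared_w: [D, D], routed_w: [E, D, D], router_w: [D, E].
    Returns the layer output [T, D] and the chosen routed expert per token [T].
    """
    scores = x @ router_w                                   # [T, E] router logits
    scores = scores - scores.max(axis=-1, keepdims=True)    # stabilise the softmax
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    choice = probs.argmax(axis=-1)                          # top-1 routed expert
    gate = probs[np.arange(x.shape[0]), choice]             # its router probability
    routed_out = np.einsum("td,tdh->th", x, routed_w[choice])
    # The shared expert sees every token; the routed expert is scaled by its gate.
    return x @ shared_w + gate[:, None] * routed_out, choice

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))
out, choice = moe_forward(
    tokens,
    rng.normal(size=(8, 8)),       # shared expert
    rng.normal(size=(3, 8, 8)),    # 3 routed experts
    rng.normal(size=(8, 3)),       # router
)
print(out.shape, choice)
```

Only one routed expert runs per token in this sketch, so per-token compute stays at two experts regardless of how many routed experts exist.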
Architecture Details
| Parameter | Value |
|---|---|
| Embedding Dimension | 512 |
| Number of Layers | 8 |
| Attention Heads | 8 |
| Head Dimension | 64 |
| Latent Dimension | 128 |
| Sequence Length | 2048 |
| Vocabulary Size | 50,257 (GPT-2) |
| Total Parameters | 140.9M |
| MoE Experts | 4 (2 active, 1 shared) |
| Recall Ratio | 0.5 (balanced) |
| FF Multiplier | 2.667 |
| Positional Encoding | None (NoPE) |
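A few quantities follow from the table by simple arithmetic. The sketch below derives them under stated assumptions; how Kilat actually rounds the feed-forward width or measures cache compression is not documented here, so treat the derived names (`mla_heads`, `kv_compression`, `ff_hidden`) as illustrative.

```python
embed_dim = 512
num_heads = 8
head_dim = 64
latent_dim = 128
ff_multiplier = 2.667
recall_ratio = 0.5

# The heads tile the embedding width exactly: 8 * 64 = 512.
assert num_heads * head_dim == embed_dim

# Assumption: recall_ratio is the fraction of heads on the latent MLA path.
mla_heads = int(num_heads * recall_ratio)
decay_heads = num_heads - mla_heads

# Assumption: compression is measured as full head width over latent width.
kv_compression = (num_heads * head_dim) / latent_dim

# Assumption: hidden width = multiplier * embed_dim, before any rounding.
ff_hidden = ff_multiplier * embed_dim

print(mla_heads, decay_heads, kv_compression, ff_hidden)
```

Under these assumptions the latent width of 128 gives a 4x reduction relative to the 512-wide heads.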
Hybrid Attention Mechanism
Each layer splits heads into two parallel paths:
                     Input x [B, N, D]
                             │
              ┌──────────────┴──────────────┐
              │                             │
              ▼                             ▼
    ┌───────────────────┐        ┌────────────────────┐
    │      PATH 1       │        │       PATH 2       │
    │   Global Decay    │        │     Latent MLA     │
    │   (O(N) linear)   │        │   (O(N²) exact)    │
    └───────────────────┘        └────────────────────┘
              │                             │
              ▼                             ▼
     Efficient recurrent             High-precision
        state update                 full attention
              │                             │
              └──────────────┬──────────────┘
                             │
                             ▼
                    Learned gate fusion
                   (per-head, per-token)
- Global Decay Path: linear-time attention with exponential decay, O(1) KV-cache per step
- Latent MLA Path: standard softmax attention with a 4x compressed KV-cache
- Recall Ratio: 0.5 means half of the heads use each path, balancing quality and efficiency
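To make the O(1)-state claim for a decay path concrete, here is a minimal NumPy sketch of causal linear attention with exponential decay. It is not Kilat's kernel: it assumes a single head, no feature map or normalisation, and a hypothetical fixed scalar decay `gamma`. The recurrent form keeps one fixed-size state matrix, and the quadratic form is included only to check that both compute the same output.

```python
import numpy as np

def decay_attention_recurrent(q, k, v, gamma):
    """O(N) time, O(1) state: S_t = gamma * S_{t-1} + k_t v_t^T, o_t = q_t S_t.

    q, k: [N, d_k], v: [N, d_v]. Returns [N, d_v].
    """
    n, d_k = q.shape
    state = np.zeros((d_k, v.shape[1]))    # fixed size, independent of N
    out = np.empty_like(v)
    for t in range(n):
        state = gamma * state + np.outer(k[t], v[t])
        out[t] = q[t] @ state
    return out

def decay_attention_quadratic(q, k, v, gamma):
    """Reference O(N^2) form: o_t = sum_{s<=t} gamma^(t-s) (q_t . k_s) v_s."""
    idx = np.arange(q.shape[0])
    diff = idx[:, None] - idx[None, :]     # t - s
    weights = np.where(diff >= 0, gamma ** np.clip(diff, 0, None), 0.0)
    return ((q @ k.T) * weights) @ v

rng = np.random.default_rng(0)
q, k = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
v = rng.normal(size=(6, 3))
same = np.allclose(
    decay_attention_recurrent(q, k, v, 0.9),
    decay_attention_quadratic(q, k, v, 0.9),
)
print(same)
```

During generation only `state` needs to be carried between steps, which is why such a path does not need a cache that grows with sequence length.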
Quick Start
Option 1: Using Kilat (Recommended)
pip install git+https://github.com/Airukua/kilat.git
from kilat import KilatTransformer
from kilat.data import AutoTokenizer
import torch

# Load the pretrained weights and the matching tokenizer
model = KilatTransformer.from_pretrained("AiRukua/BabyKilat")
tokenizer = AutoTokenizer.from_pretrained("AiRukua/BabyKilat")

# Use the GPU when available and switch to inference mode
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

prompt = "The future of artificial intelligence"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Sample up to 100 new tokens
outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    repetition_penalty=1.3,
    top_k=50,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Option 2: Using TextGenerator Wrapper
from kilat import KilatTransformer, TextGenerator
from kilat.data import AutoTokenizer

model = KilatTransformer.from_pretrained("AiRukua/BabyKilat")
tokenizer = AutoTokenizer.from_pretrained("AiRukua/BabyKilat")

# The wrapper handles tokenization and decoding
generator = TextGenerator(model, tokenizer)
text = generator.generate(
    "The future of artificial intelligence",
    max_new_tokens=80,
    temperature=0.9,
    do_sample=True,
)
print(text)
Training Details
Want to train it yourself? A Google Colab notebook is available: https://colab.research.google.com/drive/1dI8NjMjxzkBBnlkzurE-OlGtKhA14vuo?usp=sharing
Dataset
- Source: FineWeb-Edu (educational content)
- Format: Memory-mapped numpy arrays
- Sequence Length: 2048 tokens
- Tokens Seen: 6,000 steps × 32 batch × 8 grad accum × 2,048 seq len ≈ 3.15B tokens if each step is a full optimizer update (≈ 393M if the 6,000 steps count micro-batches)
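The number of tokens seen is the product of the step count, batch size, gradient accumulation, and sequence length. The short check below evaluates that product for the values listed in this card under two readings of "step". Which reading matches the training loop is an assumption, since the card does not say whether a step is an optimizer update or a micro-batch.

```python
steps = 6_000
micro_batch = 32
grad_accum = 8
seq_len = 2_048

tokens_per_micro_batch = micro_batch * seq_len            # 65,536
tokens_per_update = tokens_per_micro_batch * grad_accum   # 524,288

# Reading 1 (assumption): one step = one optimizer update over 256 sequences.
tokens_if_optimizer_steps = steps * tokens_per_update

# Reading 2 (assumption): one step = one micro-batch of 32 sequences.
tokens_if_micro_steps = steps * tokens_per_micro_batch

print(f"{tokens_if_optimizer_steps:,}")   # 3,145,728,000
print(f"{tokens_if_micro_steps:,}")       # 393,216,000
```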
Training Configuration
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW (β1=0.9, β2=0.95) |
| Peak Learning Rate | 6e-3 |
| Weight Decay | 0.1 |
| Gradient Clipping | 1.0 |
| Batch Size | 32 × 8 grad accum = 256 effective |
| Sequence Length | 2048 |
| Warmup Steps | 1,000 |
| Scheduler | WSDLR (5,000 decay steps) |
| Precision | BF16 |
| Steps Trained | 6,000 / 50,000 |
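For readers unfamiliar with warmup-stable-decay schedules, the sketch below shows one common shape using the warmup, decay, and peak values from the table. The exact behaviour of Kilat's WSDLR scheduler is not described in this card, so linear warmup, a constant plateau, a linear decay to `min_lr`, and placing the decay in the final 5,000 of the 50,000 planned steps are all assumptions. `wsd_lr` is a hypothetical helper, not a Kilat API.

```python
def wsd_lr(step, peak_lr=6e-3, warmup_steps=1_000, decay_steps=5_000,
           total_steps=50_000, min_lr=0.0):
    """Warmup-stable-decay learning rate for a 0-indexed step (assumed shape)."""
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Linear warmup from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase at the peak learning rate.
        return peak_lr
    # Linear decay from peak_lr down to min_lr over decay_steps.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress

for s in (0, 999, 6_000, 45_000, 47_500, 50_000):
    print(s, wsd_lr(s))
```

Under this reading, a checkpoint at step 6,000 would still sit on the plateau, and the decay phase would not yet have begun.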
Training Results (at step 6,000)
| Metric | Value |
|---|---|
| Train Loss | ~3.33 |
| Eval Loss | 3.348 |
| Perplexity | 28.44 |
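Perplexity is the exponential of the mean cross-entropy loss in nats, so the two eval numbers above can be checked against each other. A minimal check, assuming the reported loss is a natural-log token-level average:

```python
import math

eval_loss = 3.348
perplexity = math.exp(eval_loss)   # roughly 28.4
print(f"{perplexity:.2f}")
```

The small difference from the reported 28.44 is consistent with the loss having been rounded to three decimals.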
Hardware & Estimated Training Time
| GPU | VRAM | Estimated Time |
|---|---|---|
| NVIDIA RTX 6000 Blackwell | 96GB | ~7 hours |
| RTX 4050 | 6GB | ~3–4 days |
| T4 Google Colab | 16GB | ~1–2 days |
Generation Strategies
| Strategy | Parameters | Use Case |
|---|---|---|
| Greedy | do_sample=False | Deterministic, factual tasks |
| Sampling | do_sample=True, temperature=0.8 | Creative writing |
| Top-K | top_k=50 | Focused diversity |
| Top-P | top_p=0.95 | Nucleus sampling |
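The sketch below illustrates what these strategies do to the next-token distribution. It is a generic NumPy illustration, not the code behind Kilat's `generate`: `filter_logits` is a hypothetical helper, the order of operations (temperature, then top-k, then top-p) is an assumption, and `repetition_penalty` is left out.

```python
import numpy as np

def filter_logits(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return next-token probabilities after temperature, top-k and top-p."""
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if top_k > 0:
        # Keep the k largest logits (ties at the threshold are also kept).
        kth = np.sort(logits)[-min(top_k, logits.size)]
        logits = np.where(logits < kth, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()
    if top_p < 1.0:
        order = np.argsort(-probs)
        cumulative = np.cumsum(probs[order])
        # Keep the smallest prefix whose mass reaches top_p (at least one token).
        keep = cumulative - probs[order] < top_p
        mask = np.zeros(probs.shape, dtype=bool)
        mask[order[keep]] = True
        probs = np.where(mask, probs, 0.0)
        probs = probs / probs.sum()
    return probs

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.0, -1.0]

greedy_token = int(np.argmax(logits))                        # do_sample=False
probs = filter_logits(logits, temperature=0.8, top_k=50, top_p=0.95)
sampled_token = int(rng.choice(len(probs), p=probs))         # do_sample=True
print(greedy_token, sampled_token, probs.round(3))
```

Lower temperatures sharpen the distribution toward the greedy choice, while top-k and top-p remove the low-probability tail before sampling.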
Requirements
pip install git+https://github.com/Airukua/kilat.git
pip install torch transformers sentencepiece safetensors
- Python 3.10+
- PyTorch 2.0+
Repository Files
BabyKilat/
├── config.json            # Model configuration
├── config.yaml            # Human-readable config
├── model.safetensors      # Model weights (140M params)
├── tokenizer_config.json  # Tokenizer config
├── tokenizer.json         # GPT-2 tokenizer
├── vocab.json             # Vocabulary
└── merges.txt             # BPE merges
Limitations
- English only. Trained exclusively on English text.
- No instruction tuning. Base language model only, not aligned for chat or task completion.
- Partially trained. This checkpoint represents 6,000 of 50,000 planned steps (12% of full training). Quality is expected to improve with further training.
- Limited capacity. At 140M parameters, the model underperforms larger models on knowledge-intensive benchmarks. Outputs may be factually incorrect.
- Context window. 2,048-token context window.
- No formal evaluation. Benchmark results have not been reported for this release.
Citation
@software{kilat2026,
author = {Abdul Wahid Rukua},
title = {Kilat: Kernelized Lightweight Transformer Training Framework},
year = {2026},
url = {https://github.com/Airukua/kilat}
}
@misc{babykilat2026,
author = {Abdul Wahid Rukua},
title = {BabyKilat: A Lightweight MoE Language Model},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/AiRukua/BabyKilat}
}
Author
Abdul Wahid Rukua
AI/ML Engineer & Researcher
License
MIT License. See LICENSE for details.
Built with ⚡ Kilat: Fast, Lightweight, Transparent