⚑ BabyKilat

A Lightweight MoE Language Model with Hybrid Attention

Python 3.10+ | PyTorch 2.0+ | License: MIT

140M parameters | Trained on ~1B tokens | 6,000 steps | Free to use


Model Description

BabyKilat is a small language model featuring:

  • Hybrid Attention: Combines efficient global decay heads with precise latent MLA heads
  • Mixture-of-Experts (MoE): 4 experts (1 shared + 3 routed), 2 active per token
  • No Positional Encoding (NoPE): Relies on attention structure rather than explicit position embeddings
  • Lightweight Design: Runs on consumer GPUs with just 6GB VRAM (RTX 4050 compatible)
  • Built from Scratch: Trained using the custom Kilat framework
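
One way to read "1 shared + 3 routed, 2 active per token" is that the shared expert always runs and a router picks one of the three routed experts. The sketch below illustrates that reading in plain Python. The function names, the softmax gate, and the toy scalar experts are assumptions made for illustration, not Kilat's implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_logits, shared_expert, routed_experts):
    """Shared expert plus top-1 routed expert (hypothetical layout).

    x             : token activation (a single float here, for brevity)
    router_logits : one logit per routed expert
    Returns (output, index of the chosen routed expert).
    """
    probs = softmax(router_logits)
    # Top-1 routing: only the highest-scoring routed expert is evaluated.
    chosen = max(range(len(probs)), key=probs.__getitem__)
    # Two experts are active: the shared one and the chosen routed one.
    out = shared_expert(x) + probs[chosen] * routed_experts[chosen](x)
    return out, chosen

# Toy experts: scalar functions standing in for feed-forward blocks.
shared = lambda x: 0.5 * x
routed = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x]

y, idx = moe_forward(3.0, [0.1, 2.0, -1.0], shared, routed)
```

Only the chosen routed expert is evaluated, so per-token compute stays at two expert calls even though four experts hold parameters.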

Architecture Details

| Parameter | Value |
|---|---|
| Embedding Dimension | 512 |
| Number of Layers | 8 |
| Attention Heads | 8 |
| Head Dimension | 64 |
| Latent Dimension | 128 |
| Sequence Length | 2048 |
| Vocabulary Size | 50,257 (GPT-2) |
| Total Parameters | 140.9M |
| MoE Experts | 4 (2 active, 1 shared) |
| Recall Ratio | 0.5 (balanced) |
| FF Multiplier | 2.667 |
| Positional Encoding | None (NoPE) |
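
Several of the table values are related by simple arithmetic. The short check below derives a few quantities from them. How the recall ratio, the 4x compression, and the FF multiplier are interpreted here are assumptions, flagged in the comments; the card does not spell them out.

```python
# Values copied from the architecture table.
embed_dim = 512
num_heads = 8
head_dim = 64
latent_dim = 128
recall_ratio = 0.5
ff_multiplier = 2.667

# Heads times head dimension should reproduce the embedding dimension.
assert num_heads * head_dim == embed_dim

# Assumption: the recall ratio is the fraction of heads sent to one path.
mla_heads = round(num_heads * recall_ratio)
decay_heads = num_heads - mla_heads

# Assumption: the "4x" KV-cache compression is embed_dim / latent_dim.
kv_compression = embed_dim / latent_dim

# Assumption: expert hidden width is embed_dim * ff_multiplier before rounding.
ff_hidden = embed_dim * ff_multiplier

print(mla_heads, decay_heads, kv_compression, round(ff_hidden, 1))
```

With a recall ratio of 0.5 the split is symmetric, so the result is four heads per path whichever path the ratio refers to.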

Hybrid Attention Mechanism

Each layer splits heads into two parallel paths:

                    Input x [B, N, D]
                           │
           ┌───────────────┴────────────────┐
           │                                │
           ▼                                ▼
   ╔═══════════════════╗           ╔════════════════════╗
   ║   PATH 1          ║           ║   PATH 2           ║
   ║   Global Decay    ║           ║   Latent MLA       ║
   ║   (O(N) linear)   ║           ║   (O(N²) exact)    ║
   ╚═══════════════════╝           ╚════════════════════╝
           │                                │
           ▼                                ▼
    Efficient recurrent                  High-precision
    state update                         full attention
           │                                │
           └───────────────┬────────────────┘
                           │
                           ▼
                    Learned gate fusion
                    (per-head, per-token)

  • Global Decay Path: Linear-time attention with exponential decay, O(1) KV-cache per step
  • Latent MLA Path: Standard softmax attention with 4x compressed KV-cache
  • Recall Ratio: 0.5 means half of heads use each path, balanced for quality/efficiency
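
To make the "linear-time attention with exponential decay" idea concrete, here is a minimal single-head sketch in plain Python. It uses the generic decayed linear-attention recurrence (state = gamma * state + k vᵀ, output = q · state) with no normalization. The fixed scalar `gamma` and both function names are assumptions; Kilat's actual kernel may differ. The second function is the quadratic reference that the recurrence should reproduce.

```python
def decay_attention_recurrent(q, k, v, gamma):
    """O(N) form: carry a d_k x d_v state, decay it, add the new outer product."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q_t, k_t, v_t in zip(q, k, v):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = gamma * state[i][j] + k_t[i] * v_t[j]
        outputs.append([sum(q_t[i] * state[i][j] for i in range(d_k))
                        for j in range(d_v)])
    return outputs

def decay_attention_explicit(q, k, v, gamma):
    """O(N^2) reference: weight each past token by gamma ** distance."""
    d_v = len(v[0])
    outputs = []
    for t, q_t in enumerate(q):
        out = [0.0] * d_v
        for s in range(t + 1):
            score = gamma ** (t - s) * sum(a * b for a, b in zip(q_t, k[s]))
            for j in range(d_v):
                out[j] += score * v[s][j]
        outputs.append(out)
    return outputs

# Three tokens, key/query width 2, value width 2.
q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
k = [[1.0, 1.0], [0.0, 2.0], [1.0, -1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
fast = decay_attention_recurrent(q, k, v, gamma=0.9)
slow = decay_attention_explicit(q, k, v, gamma=0.9)
```

The recurrent state has a fixed d_k × d_v size regardless of sequence length, which is what allows a constant-size cache per decoding step on this path.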

Quick Start

Option 1: Using Kilat (Recommended)

pip install git+https://github.com/Airukua/kilat.git

import torch
from kilat import KilatTransformer
from kilat.data import AutoTokenizer

# Load the pretrained weights and the matching tokenizer.
model = KilatTransformer.from_pretrained("AiRukua/BabyKilat")
tokenizer = AutoTokenizer.from_pretrained("AiRukua/BabyKilat")

# Use the GPU when available and switch to evaluation mode.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

prompt = "The future of artificial intelligence"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Sample up to 100 new tokens; gradients are not needed for generation.
with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.8,
        repetition_penalty=1.3,
        top_k=50,
        top_p=0.95,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Option 2: Using TextGenerator Wrapper

from kilat import KilatTransformer, TextGenerator
from kilat.data import AutoTokenizer

model = KilatTransformer.from_pretrained("AiRukua/BabyKilat")
tokenizer = AutoTokenizer.from_pretrained("AiRukua/BabyKilat")
generator = TextGenerator(model, tokenizer)

text = generator.generate(
    "The future of artificial intelligence",
    max_new_tokens=80,
    temperature=0.9,
    do_sample=True,
)
print(text)

Training Details

Want to train it yourself? A Google Colab notebook is available: https://colab.research.google.com/drive/1dI8NjMjxzkBBnlkzurE-OlGtKhA14vuo?usp=sharing

Dataset

  • Source: FineWeb-Edu (educational content)
  • Format: Memory-mapped numpy arrays
  • Sequence Length: 2048 tokens
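  • Tokens Seen: the stated configuration implies ~3.1B tokens (6,000 steps × 32 batch × 8 grad accum × 2,048 seq len ≈ 3,146M), assuming each step is one optimizer update; this does not match the ~1B quoted in the summary line, so treat the count as approximate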
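
The token count in the last bullet is a product of four factors. The snippet below multiplies them out as a quick check. It assumes each of the 6,000 steps is a full optimizer update over the effective batch; if steps count micro-batches, or the real batch differs, the total shrinks accordingly.

```python
steps = 6_000
micro_batch = 32
grad_accum = 8
seq_len = 2_048

# Assumption: one "step" is one optimizer update over the full effective batch.
tokens_per_step = micro_batch * grad_accum * seq_len
tokens_seen = steps * tokens_per_step

print(f"{tokens_per_step:,} tokens per step")
print(f"{tokens_seen / 1e6:,.0f}M tokens in total")
```

Under that assumption the product comes to roughly 3.1B tokens.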

Training Configuration

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW (β1=0.9, β2=0.95) |
| Peak Learning Rate | 6e-3 |
| Weight Decay | 0.1 |
| Gradient Clipping | 1.0 |
| Batch Size | 32 × 8 grad accum = 256 effective |
| Sequence Length | 2048 |
| Warmup Steps | 1,000 |
| Scheduler | WSDLR (5,000 decay steps) |
| Precision | BF16 |
| Steps Trained | 6,000 / 50,000 |
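
The warmup and decay numbers in the table can be pictured with a small schedule function. This is a sketch only: it assumes "WSDLR" is a warmup-stable-decay schedule, that warmup and decay are both linear, that the decay occupies the last 5,000 of the 50,000 planned steps, and that the floor is zero. None of these details are stated in the card, and `wsd_lr` is a hypothetical name.

```python
def wsd_lr(step, peak_lr=6e-3, warmup_steps=1_000,
           decay_steps=5_000, total_steps=50_000, min_lr=0.0):
    """Warmup-stable-decay learning rate for a given step (illustrative)."""
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Linear warmup from 0 to the peak.
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # Stable phase: hold the peak.
        return peak_lr
    # Linear decay to min_lr over the last decay_steps steps.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress

for s in (0, 500, 1_000, 6_000, 45_000, 47_500, 50_000):
    print(s, wsd_lr(s))
```

If the decay phase instead starts right after warmup, only `decay_start` needs to change.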

Training Results (at step 6,000)

| Metric | Value |
|---|---|
| Train Loss | ~3.33 |
| Eval Loss | 3.348 |
| Perplexity | 28.44 |
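
Perplexity is the exponential of the mean cross-entropy loss, so the two eval figures can be checked against each other. The check below assumes the loss is measured in nats per token, which is the usual convention for PyTorch cross-entropy.

```python
import math

eval_loss = 3.348  # mean cross-entropy per token, from the results table
perplexity = math.exp(eval_loss)
print(f"{perplexity:.2f}")
```

The result agrees with the reported perplexity to within rounding of the loss value.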

Hardware & Estimated Training Time

| GPU | VRAM | Estimated Time |
|---|---|---|
| NVIDIA RTX 6000 Blackwell | 96GB | ~7 hours |
| RTX 4050 | 6GB | ~3–4 days |
| T4 (Google Colab) | 16GB | ~1–2 days |

Generation Strategies

| Strategy | Parameters | Use Case |
|---|---|---|
| Greedy | do_sample=False | Deterministic, factual tasks |
| Sampling | do_sample=True, temperature=0.8 | Creative writing |
| Top-K | top_k=50 | Focused diversity |
| Top-P | top_p=0.95 | Nucleus sampling |
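
The sampling strategies in the table can be combined. As a rough illustration of what temperature, top-k, and top-p do to a next-token distribution, here is a plain-Python sketch of the generic technique. The helper names are hypothetical, and this is not the code behind `model.generate`.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax of logits / temperature; lower values sharpen the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_top_p(probs, top_k=50, top_p=0.95):
    """Keep the top_k most likely tokens, then the smallest prefix whose
    cumulative probability reaches top_p, and renormalize what is left."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Four-token toy vocabulary.
probs = apply_temperature([2.0, 1.0, 0.5, -1.0], temperature=0.8)
filtered = top_k_top_p(probs, top_k=3, top_p=0.85)
```

A token would then be drawn from `filtered`, for example with `random.choices(list(filtered), weights=list(filtered.values()))`. Greedy decoding corresponds to always taking the most likely token.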

Requirements

pip install git+https://github.com/Airukua/kilat.git
pip install torch transformers sentencepiece safetensors

  • Python 3.10+
  • PyTorch 2.0+

Repository Files

BabyKilat/
├── config.json              # Model configuration
├── config.yaml              # Human-readable config
├── model.safetensors        # Model weights (140M params)
├── tokenizer_config.json    # Tokenizer config
├── tokenizer.json           # GPT-2 tokenizer
├── vocab.json               # Vocabulary
└── merges.txt               # BPE merges
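
The parameter count noted next to model.safetensors can be checked without loading any weights, because a safetensors file begins with an 8-byte little-endian header length followed by a JSON header listing every tensor's shape. The helpers below are a standard-library sketch; the function names are made up for this example, and the local file path in the usage comment is an assumption.

```python
import json
import struct
from math import prod

def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))

def count_parameters(header):
    """Sum the element counts of every tensor listed in the header."""
    return sum(prod(entry["shape"])
               for name, entry in header.items()
               if name != "__metadata__")

# Usage, assuming model.safetensors has been downloaded locally:
# header = read_safetensors_header("model.safetensors")
# print(f"{count_parameters(header) / 1e6:.1f}M parameters")
```

Each header entry also carries a `dtype` field, so the same loop can report the tensor precision.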

Limitations

  • English only. Trained exclusively on English text.
  • No instruction tuning. Base language model only, not aligned for chat or task completion.
  • Partially trained. This checkpoint covers 6,000 of 50,000 planned steps (12% of full training). Quality is expected to improve with further training.
  • Limited capacity. At 140M parameters, the model underperforms larger models on knowledge-intensive benchmarks. Outputs may be factually incorrect.
  • Context window. 2,048-token context window.
  • No formal evaluation. Benchmark results have not been reported for this release.

Citation

@software{kilat2026,
  author = {Abdul Wahid Rukua},
  title  = {Kilat: Kernelized Lightweight Transformer Training Framework},
  year   = {2026},
  url    = {https://github.com/Airukua/kilat}
}

@misc{babykilat2026,
  author = {Abdul Wahid Rukua},
  title  = {BabyKilat: A Lightweight MoE Language Model},
  year   = {2026},
  publisher = {Hugging Face},
  url    = {https://huggingface.co/AiRukua/BabyKilat}
}

Author

Abdul Wahid Rukua
AI/ML Engineer & Researcher

GitHub LinkedIn HuggingFace


License

MIT License. See LICENSE for details.


Built with ⚑ Kilat: Fast, Lightweight, Transparent
