Brújula-18M (G_stack)

An 18M-parameter decoder-only LM created by depth-growing Brújula-15M with the G_stack operator (Du et al., "Stacking Your Transformers", NeurIPS 2024): copy the trained 15M model's 4 layers into an 8-layer model, then continue pre-training. The whole pipeline runs on a single consumer GPU (Intel Arc B580). Brújula ("compass" in Spanish) uses a minimal DeepSeek-style architecture (MLA + RoPE + SquaredReLU, tied embeddings) and is trained with the Muon optimizer.

The result: doubling the depth of the 15M champion cut perplexity by 41% on FineWeb-Edu and 43% on WikiText-103, for ~3h of extra local compute.

Results

Perplexity (lower is better), fixed local harness at context length 1024:

| Model | FineWeb-Edu val PPL | WikiText-103 PPL |
|---|---|---|
| Brújula-15M (the base, 4 layers) | 78.05 | 190.74 |
| Brújula-18M (this model, 8 layers) | 46.26 | 108.72 |
| improvement | −41% | −43% |
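
For reference, perplexity is the exponential of the mean per-token negative log-likelihood, and the improvement row is the relative change between the two models. The snippet below is a minimal sketch of both calculations; the function names are illustrative and are not taken from the evaluation harness, whose code is not shown here.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def relative_change(base_ppl, new_ppl):
    """Signed fractional change from base_ppl to new_ppl (negative = better)."""
    return (new_ppl - base_ppl) / base_ppl

# Values copied from the results table above.
fineweb = relative_change(78.05, 46.26)
wikitext = relative_change(190.74, 108.72)
print(f"FineWeb-Edu: {fineweb:+.0%}, WikiText-103: {wikitext:+.0%}")
# prints: FineWeb-Edu: -41%, WikiText-103: -43%
```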

Honest note: the gain comes from added depth + warm-start (the grown model effectively saw ~2× the cumulative tokens of the base), not depth alone — no from-scratch-8-layer control was run. Either way, it's the best sub-50M model in the family.

The Brújula family

| Model | Params | FineWeb-Edu val PPL | WikiText-103 PPL | Notes |
|---|---|---|---|---|
| Brújula-15M | 15.5M | 78.05 | 190.74 | tiny champion, from scratch on one Arc B580 |
| Brújula-18M | 18M | 46.26 | 108.72 | this model: Brújula-15M grown with G_stack (4→8 layers) |
| Brújula-150M | 153.6M | 21.44 | 36.08 | the flagship |

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Sakatepon/Brujula-18M"
tok = AutoTokenizer.from_pretrained(repo)
# trust_remote_code=True loads the custom model code shipped with the repo
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True).eval()

# Prompt with a continuation cue so the model completes the sentence
ids = tok("The mitochondria is the", return_tensors="pt").input_ids
# Sample instead of decoding greedily; the repetition penalty discourages loops
out = model.generate(ids, max_new_tokens=64, do_sample=True, temperature=0.8, top_p=0.95, repetition_penalty=1.2)
print(tok.decode(out[0], skip_special_tokens=True))

This is a small base model: use sampling and a prompt that cues a continuation. Greedy decoding tends to fall into repetition loops.
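
To illustrate what temperature and top_p do in the call above, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. It illustrates the general technique and is not the transformers implementation; all helper names are hypothetical.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus(probs, top_p):
    """Return indices of the smallest high-probability set with mass >= top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept

def sample_top_p(logits, temperature=0.8, top_p=0.95, rng=random):
    """Sample one token index from the nucleus (weights are renormalized)."""
    probs = softmax(logits, temperature)
    kept = nucleus(probs, top_p)
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

Greedy decoding would always pick the single most likely index. Sampling from the nucleus keeps some randomness while discarding the low-probability tail.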

Architecture

| Property | Value |
|---|---|
| Type | decoder-only, causal LM |
| Hidden / Layers / Heads | n_embd=256 / n_layer=8 (grown from 4 via G_stack) / n_head=4 |
| Context length | 1024 |
| Attention | Multi-head Latent Attention (MLA), kv-compress 32 / q-compress 64 |
| Position / FFN / Norm | RoPE / SquaredReLU / RMSNorm (pre-norm), tied embeddings |
| Vocab | 50257 (GPT-2 BPE) |
| Unique params | 18.0M |
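
As a rough sanity check on the parameter count, the size of the tied embedding table follows directly from the vocabulary and hidden sizes above. The sketch below assumes that everything outside the embedding table sits in the 8 transformer layers (it ignores the small final-norm weight), so the per-layer figure is an estimate and not a number reported for the model.

```python
VOCAB = 50257          # GPT-2 BPE vocabulary size
N_EMBD = 256           # hidden size
N_LAYER = 8            # layers after G_stack growth
TOTAL_PARAMS = 18.0e6  # "Unique params" from the table (rounded)

# Tied embeddings: the input embedding and output head share one matrix.
embedding_params = VOCAB * N_EMBD  # 12,865,792

# Assumption: the remainder is split evenly across the transformer layers.
per_layer = (TOTAL_PARAMS - embedding_params) / N_LAYER

print(f"embedding: {embedding_params / 1e6:.2f}M "
      f"({embedding_params / TOTAL_PARAMS:.0%} of total)")
print(f"approx. per layer: {per_layer / 1e6:.2f}M")
```

Under these assumptions most of the parameters sit in the embedding table, which is why doubling the depth adds only a few million parameters.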

How it was made (G_stack)

  1. Train Brújula-15M from scratch (4 layers).
  2. Stack: copy the 4 trained layers into an 8-layer model ([0,1,2,3] → [0,1,2,3,0,1,2,3]); keep the embedding / final-norm / tied head.
  3. Continue pre-training 1 epoch on FineWeb-Edu (~1.4B tokens), batch 32, peak LR 1.2e-3, bf16, ~3h14m on one Intel Arc B580.
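
The stacking in step 2 amounts to remapping layer indices in the checkpoint. Below is a minimal sketch that works on a plain state dict. It assumes per-layer weights are stored under keys of the form `layers.<i>.<name>`, which is a hypothetical naming scheme and may not match the actual Brújula checkpoint; the function name is illustrative as well.

```python
import re

# Hypothetical key layout: "<prefix>layers.<index>.<parameter name>"
LAYER_KEY = re.compile(r"^(?P<prefix>.*?layers\.)(?P<idx>\d+)(?P<rest>\..*)$")

def g_stack_state_dict(state, n_src_layers, growth_factor=2):
    """Build a deeper state dict by repeating the source layers.

    Source layer i is copied to layers i, i + n_src_layers, and so on.
    Non-layer entries (embedding, final norm, tied head) are kept as-is.
    """
    grown = {}
    for key, value in state.items():
        match = LAYER_KEY.match(key)
        if match is None:
            grown[key] = value  # shared, non-layer parameter
            continue
        idx = int(match["idx"])
        for rep in range(growth_factor):
            new_idx = idx + rep * n_src_layers
            # With real tensors, clone each copy (e.g. value.clone()) so the
            # duplicated layers do not share storage during training.
            grown[f"{match['prefix']}{new_idx}{match['rest']}"] = value
    return grown

# Toy checkpoint: strings stand in for weight tensors.
small = {"embed.weight": "E", "final_norm.weight": "N"}
small.update({f"layers.{i}.attn.weight": f"W{i}" for i in range(4)})

big = g_stack_state_dict(small, n_src_layers=4)
print([big[f"layers.{i}.attn.weight"] for i in range(8)])
# ['W0', 'W1', 'W2', 'W3', 'W0', 'W1', 'W2', 'W3']
```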

The post-stack loss spikes (copy-init isn't function-preserving), then recovers and surpasses the 15M base within the first ~12% of the epoch — consistent with G_stack's finding that violating function preservation is fine and even preferable.
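
To put "the first ~12% of the epoch" in concrete terms, the figures from step 3 can be converted to steps, tokens, and wall-clock time. This is back-of-the-envelope arithmetic: it assumes that batch 32 means 32 sequences of 1024 tokens per optimizer step (no gradient accumulation) and that throughput was constant, neither of which is stated above.

```python
EPOCH_TOKENS = 1.4e9               # ~1.4B tokens in the continued epoch
BATCH_SEQS = 32                    # assumed: sequences per optimizer step
CONTEXT = 1024                     # tokens per sequence
WALL_SECONDS = 3 * 3600 + 14 * 60  # ~3h14m
RECOVERY_FRACTION = 0.12           # "first ~12% of the epoch"

tokens_per_step = BATCH_SEQS * CONTEXT
total_steps = EPOCH_TOKENS / tokens_per_step
throughput = EPOCH_TOKENS / WALL_SECONDS

recovery_tokens = RECOVERY_FRACTION * EPOCH_TOKENS
recovery_steps = RECOVERY_FRACTION * total_steps
recovery_minutes = RECOVERY_FRACTION * WALL_SECONDS / 60

print(f"tokens/step: {tokens_per_step}")
print(f"total steps: ~{total_steps:,.0f}")
print(f"throughput: ~{throughput:,.0f} tokens/s")
print(f"recovery after ~{recovery_tokens / 1e6:.0f}M tokens, "
      f"~{recovery_steps:,.0f} steps, ~{recovery_minutes:.0f} min")
```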

Limitations

  • Base completion model — not instruction-tuned, no safety tuning.
  • English only, educational-web distribution (FineWeb-Edu); weaker out-of-distribution.
  • ~18M params: plausible prose, unreliable facts; best on cued, definitional prompts.
  • Short context (1024); no KV-cache in this reference implementation.

License & attribution

  • Model + code: Apache-2.0. Training data: FineWeb-Edu (ODC-BY).
  • Methods: G_stack (Du et al., 2024), DeepSeek-V2 (MLA), Muon, Primer (SquaredReLU), GPT-2 (BPE).