Brújula-18M (G_stack)

An 18M-parameter decoder-only LM created by depth-growing Brújula-15M with the G_stack operator (Du et al., "Stacking Your Transformers", NeurIPS 2024): copy the trained 15M's 4 layers into an 8-layer model, then continue pre-training. The whole thing runs on a single consumer GPU (Intel Arc B580). Brújula ("compass" in Spanish) uses a minimal DeepSeek-style architecture (MLA + RoPE + SquaredReLU, tied embeddings) and is trained with the Muon optimizer.

The result: depth-doubling the 15M champion cut perplexity by 41% on FineWeb-Edu and 43% on WikiText-103, for ~3h14m of extra local compute.

Results

Perplexity (lower is better), fixed local harness at context length 1024:

Model	FineWeb-Edu val PPL	WikiText-103 PPL
Brújula-15M (the base, 4 layers)	78.05	190.74
Brújula-18M (this model, 8 layers)	46.26	108.72
improvement	−41%	−43%

Honest note: the gain comes from added depth + warm-start (the grown model effectively saw ~2× the cumulative tokens of the base), not depth alone — no from-scratch-8-layer control was run. Either way, it's the best sub-50M model in the family.
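
The percentages in the table can be re-derived from the raw perplexities. The sketch below does that, and also converts the gain into a cross-entropy difference, assuming the harness reports perplexity as the exponential of the mean per-token loss in nats (the usual definition, but not confirmed for this harness). The function and variable names are illustrative.

```python
import math

def relative_change(before, after):
    """Percentage change from `before` to `after` (negative means lower)."""
    return (after - before) / before * 100.0

# (Brújula-15M PPL, Brújula-18M PPL), copied from the results table.
results = {
    "FineWeb-Edu val": (78.05, 46.26),
    "WikiText-103": (190.74, 108.72),
}

changes = {name: relative_change(b, a) for name, (b, a) in results.items()}

# Assumption: PPL = exp(mean loss), so the loss gap is log(before) - log(after).
loss_drop = {name: math.log(b) - math.log(a) for name, (b, a) in results.items()}

for name in results:
    print(f"{name}: {changes[name]:+.1f}% PPL, {loss_drop[name]:.2f} nats lower loss")
```

Under that assumption, both benchmarks correspond to a loss reduction of a bit more than half a nat per token.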

The Brújula family

Model	Params	FineWeb val	WikiText	Notes
Brújula-15M	15.5M	78.05	190.74	tiny champion, from scratch on one Arc B580
Brújula-18M	18M	46.26	108.72	this model — Brújula-15M G_stack-grown (4→8 layers)
Brújula-150M	153.6M	21.44	36.08	the flagship

Usage

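import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Sakatepon/Brujula-18M"
tok = AutoTokenizer.from_pretrained(repo)
# trust_remote_code=True is needed because the architecture ships as custom code in the repo.
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True).eval()

# Prompt with a continuation cue: the model completes text, it does not follow instructions.
ids = tok("The mitochondria is the", return_tensors="pt").input_ids
# Sample rather than decode greedily; the repetition penalty discourages loops.
out = model.generate(ids, max_new_tokens=64, do_sample=True, temperature=0.8, top_p=0.95, repetition_penalty=1.2)
print(tok.decode(out[0], skip_special_tokens=True))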

This is a small base model: use sampling and a continuation-style prompt; greedy decoding tends to fall into repetition loops.
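
To make the `temperature` and `top_p` settings above concrete, here is a conceptual sketch of what they do to a next-token distribution. It is a pure-Python illustration with made-up logits, not the `transformers` implementation, and `top_p_filter` is a hypothetical helper name.

```python
import math

def top_p_filter(logits, temperature=0.8, top_p=0.95):
    """Return {token_index: probability} for the tokens kept by nucleus sampling."""
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalise so the kept probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}

# Illustrative logits for a 4-token vocabulary (not real model output).
nucleus = top_p_filter([2.0, 1.0, 0.5, -3.0])
print(nucleus)
```

With these example logits the very unlikely fourth token is dropped and sampling happens among the remaining three. Greedy decoding would always pick the first token, which is how loops get started.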

Architecture


Property	Value
Type	decoder-only, causal LM
Hidden / Layers / Heads	`n_embd=256` / `n_layer=8` (grown from 4 via G_stack) / `n_head=4`
Context length	1024
Attention	Multi-head Latent Attention (MLA), kv-compress 32 / q-compress 64
Position / FFN / Norm	RoPE / SquaredReLU / RMSNorm (pre-norm), tied embeddings
Vocab	50257 (GPT-2 BPE)
Unique params	18.0M
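
A quick back-of-the-envelope check shows why doubling the depth adds only about 2.5M parameters: most of the budget sits in the tied embedding matrix. The sketch below uses only figures from this card (vocab 50257, `n_embd=256`, totals of 15.5M and 18.0M). It treats those totals as exact although they are rounded, and it ignores the tiny final-norm weights, so the per-layer numbers are rough estimates.

```python
vocab_size, n_embd = 50257, 256

# Tied embeddings: the input embedding and the output head share one matrix,
# so it is counted once.
embedding_params = vocab_size * n_embd

def per_layer_budget(total_params, n_layer):
    """Rough parameters per transformer layer once the embedding is removed."""
    return (total_params - embedding_params) / n_layer

base_per_layer = per_layer_budget(15.5e6, 4)   # Brújula-15M, 4 layers
grown_per_layer = per_layer_budget(18.0e6, 8)  # Brújula-18M, 8 layers

print(f"embedding: {embedding_params / 1e6:.2f}M")
print(f"per layer (15M): {base_per_layer / 1e6:.2f}M")
print(f"per layer (18M): {grown_per_layer / 1e6:.2f}M")
```

Both models come out at roughly 0.65M parameters per layer, which is consistent with the 8-layer model being the same block repeated twice on top of a shared embedding of about 12.9M.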

How it was made (G_stack)

1. Train Brújula-15M from scratch (4 layers).
2. Stack: copy the 4 trained layers into an 8-layer model ([0,1,2,3] → [0,1,2,3,0,1,2,3]); keep the embedding / final-norm / tied head.
3. Continue pre-training 1 epoch on FineWeb-Edu (~1.4B tokens), batch 32, peak LR 1.2e-3, bf16, ~3h14m on one Intel Arc B580.
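
The stacking step can be sketched as a remapping of checkpoint keys. This is a minimal illustration, not the code used for this model: it assumes layer parameters are stored under keys of the form `layers.<i>.<name>`, which is a hypothetical naming scheme, and it uses plain Python lists in place of tensors so it runs without PyTorch.

```python
import copy
import re

def g_stack(state_dict, n_layers_src=4, growth=2, layer_prefix="layers."):
    """Build a state dict for a model `growth` times deeper by repeating layers.

    Layer i of the source is copied to layers i, i + n_layers_src, ...
    so [0,1,2,3] becomes [0,1,2,3,0,1,2,3] when growth=2.
    """
    pattern = re.compile(r"^" + re.escape(layer_prefix) + r"(\d+)\.(.+)$")
    stacked = {}
    for key, value in state_dict.items():
        match = pattern.match(key)
        if match is None:
            # Embedding, final norm, tied head: carried over unchanged.
            stacked[key] = value
            continue
        src_idx, rest = int(match.group(1)), match.group(2)
        for block in range(growth):
            dst_idx = src_idx + block * n_layers_src
            # deepcopy so the repeated layers can diverge during further training.
            stacked[f"{layer_prefix}{dst_idx}.{rest}"] = copy.deepcopy(value)
    return stacked

# Toy 4-layer checkpoint with hypothetical key names and list "weights".
small = {"embed.weight": [0.1, 0.2]}
small.update({f"layers.{i}.attn.weight": [float(i)] for i in range(4)})

deep = g_stack(small)
print(sorted(deep))
```

With real checkpoints the same remapping would be applied to the tensors before loading them into the 8-layer model; `copy.deepcopy` also works on tensors, but the actual key names would need to match the model's own.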

The post-stack loss spikes (copy-init isn't function-preserving), then recovers and surpasses the 15M base within the first ~12% of the epoch — consistent with G_stack's finding that violating function preservation is fine and even preferable.
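
For a sense of scale, the continued pre-training run can be turned into rough throughput and step counts. The figures below come from this card (~1.4B tokens, ~3h14m, batch 32, context 1024, recovery within ~12% of the epoch). Treating "batch 32" as 32 full-length sequences per optimizer step with no gradient accumulation is an assumption, so the step count is only indicative.

```python
tokens_total = 1.4e9                 # ~1 epoch of FineWeb-Edu, as stated above
wall_seconds = 3 * 3600 + 14 * 60    # ~3h14m

tokens_per_second = tokens_total / wall_seconds

# Assumption: each step processes 32 sequences of 1024 tokens.
tokens_per_step = 32 * 1024
steps = tokens_total / tokens_per_step

# Tokens seen by the time the grown model passes the 15M base (~12% of the epoch).
tokens_to_recover = 0.12 * tokens_total

print(f"throughput: {tokens_per_second / 1e3:.0f}k tokens/s")
print(f"steps: {steps / 1e3:.1f}k")
print(f"recovery after: {tokens_to_recover / 1e6:.0f}M tokens")
```

Under these assumptions the post-stack loss spike is paid back after well under 0.2B tokens, a small fraction of the epoch.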

Limitations

Base completion model — not instruction-tuned, no safety tuning.
English only, educational-web distribution (FineWeb-Edu); weaker out-of-distribution.
~18M params: plausible prose, unreliable facts; best on cued, definitional prompts.
Short context (1024); no KV-cache in this reference implementation.

License & attribution

Model + code: Apache-2.0. Training data: FineWeb-Edu (ODC-BY).
Methods: G_stack (Du et al., 2024), DeepSeek-V2 (MLA), Muon, Primer (SquaredReLU), GPT-2 (BPE).
