Brújula-150M-32K-chat

A ~150M-parameter (157M trainable, tied embeddings) DeepSeek-style decoder that runs a 32,768-token context window on a single consumer GPU, fine-tuned for chat and long-context retrieval. I'm not aware of a smaller model with a working 32K context — comparably-tiny models like SmolLM2-135M cap out at 8K, and the smallest mainstream 32K model (Qwen2.5-0.5B) is ~3× larger. (I haven't audited every model on the Hub, so this is not a claim to an absolute record — just that no smaller one is known to me.) It was built and trained end-to-end by one hobbyist on a single Intel Arc B580 (12 GB).

This is v1.0. It is good enough to be useful and, more importantly, small and cheap enough to fine-tune further yourself — that is the point of shipping it.

Honesty first: this is a 150M hobby model, not a Phi/SmolLM competitor. It chats, follows simple instructions, and retrieves facts from very long inputs, but it has the factual ceiling you'd expect at this size (see Limitations). Compare it to other consumer-GPU hobby projects, not to lab models.

What's interesting about it

  • 32K context at 150M. Models this small usually cap out at 2–8K (SmolLM2-135M: 8K; Pythia-160M: 2K), and the smallest mainstream models that actually do 32K are ~0.5B and up (Qwen2.5-0.5B). This one reads and retrieves across 32K tokens.
  • Architecture. Multi-head Latent Attention (MLA, DeepSeek-V2-style low-rank KV/Q — cheap KV at long context) + RoPE + SquaredReLU FFN + tied embeddings, with v3 Block Attention-Residuals (softmax attention over windowed block-sums of earlier layers instead of a plain residual sum; arXiv:2603.15031).
  • 32K via YaRN, baked in. Pre-trained at 1024 tokens, then extended to 32K without further training by a YaRN (NTK-by-parts) RoPE warp, then fine-tuned at 16K with answer-masked SFT. The scaling ships inside the config, so no manual RoPE surgery is needed: the 32K behaviour is there on load.
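
To make the YaRN step concrete, here is a minimal sketch of an NTK-by-parts frequency warp plus the attention temperature factor. It is not the code shipped in this repo. The 1024 and 32,768 context lengths and β=16 come from this card; the rotary dimension of 64, the RoPE base of 10000 and α=1 are assumptions chosen only for illustration, and the function names are hypothetical.

```python
import math

def yarn_inv_freq(rope_dim, base=10000.0, orig_ctx=1024, scale=32.0,
                  alpha=1.0, beta=16.0):
    """Per-pair rotary frequencies after an NTK-by-parts warp (illustrative)."""
    warped = []
    for i in range(rope_dim // 2):
        theta = base ** (-2.0 * i / rope_dim)            # original frequency of pair i
        rotations = orig_ctx * theta / (2.0 * math.pi)   # full turns over the original context
        # Ramp: 0 below alpha (fully interpolated), 1 above beta (left untouched).
        gamma = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        warped.append((1.0 - gamma) * theta / scale + gamma * theta)
    return warped

def yarn_mscale(scale):
    """Attention temperature factor 0.1 * ln(scale) + 1."""
    return 0.1 * math.log(scale) + 1.0

freqs = yarn_inv_freq(rope_dim=64)        # rope_dim=64 is an assumed value
mscale = yarn_mscale(32768 / 1024)        # scale factor 32
```

High-frequency pairs (many rotations inside the original 1024 tokens) keep their frequency, low-frequency pairs are divided by the scale factor, and the pairs in between are blended by the ramp.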

Usage

This model uses custom modeling code, so pass trust_remote_code=True:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Sakatepon/Brujula-150M-32K-chat"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, dtype=torch.bfloat16
).to("cuda").eval()   # "xpu" for Intel Arc; "cpu" works too

prompt = "User: Explain why the sky is blue in one paragraph.\nAssistant:"
ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
out = model.generate(
    ids, max_new_tokens=256,
    do_sample=True, temperature=0.8, top_p=0.95,
    repetition_penalty=1.3,          # important — see Limitations
    use_cache=False,                 # this reference impl has no KV cache
    pad_token_id=50256, eos_token_id=50256,
)
print(tok.decode(out[0], skip_special_tokens=True))

Chat format (plain, no special chat template):

User: <your message>
Assistant:

The model emits <|endoftext|> (id 50256) to end a turn; generation stops there.
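
A small sketch of how the plain format could be wrapped in helpers. The helper names are hypothetical, and cutting the reply at a following "User:" line is an assumption about how to handle a model that starts a new turn by itself, not documented behaviour.

```python
EOS = "<|endoftext|>"

def build_prompt(message: str) -> str:
    """Wrap a single user message in the plain User/Assistant format."""
    return f"User: {message.strip()}\nAssistant:"

def extract_reply(decoded: str, prompt: str) -> str:
    """Pull the assistant's reply out of decoded generation output."""
    # Drop the echoed prompt if the decoded text starts with it.
    text = decoded[len(prompt):] if decoded.startswith(prompt) else decoded
    text = text.split(EOS, 1)[0]          # stop at the end-of-turn marker
    text = text.split("\nUser:", 1)[0]    # assumption: cut a self-started next turn
    return text.strip()
```

When decoding with skip_special_tokens=True the end-of-text marker is already removed, so the first split is then a no-op.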

Long-context (retrieval) use

Put the long document first, then ask at the end:

from pathlib import Path

context = Path("long_document.txt").read_text(encoding="utf-8")  # explicit encoding; no dangling file handle
prompt = f"{context}\n\nUser: According to the document, who founded the company?\nAssistant:"

It accepts up to 32,768 tokens. Generation has no KV cache (the AttnRes aggregators mix across layers, which makes a per-step cache tricky to implement), so it recomputes the whole context at every step: correct, but slow at long context. Short chats are fast.
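
If a document might exceed the window, a guard along these lines can trim it before generation. This is a sketch under two assumptions: that generated tokens count against the same 32,768-token limit, and that dropping the tail of the document is acceptable. The function name is hypothetical.

```python
MAX_CONTEXT = 32_768

def fit_to_window(doc_ids, question_ids, max_new_tokens=256,
                  max_context=MAX_CONTEXT):
    """Trim document token ids so document + question + reply fit the window."""
    budget = max_context - len(question_ids) - max_new_tokens
    if budget <= 0:
        raise ValueError("question plus max_new_tokens already fills the window")
    # Keep the start of the document; the question always stays at the end.
    return list(doc_ids[:budget]) + list(question_ids)
```

Here doc_ids and question_ids would be the input_ids lists produced by the tokenizer for the document text and for the trailing "User: ...\nAssistant:" part.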

Limitations (please read)

  • Factual ceiling. At 150M it confabulates. On document-QA it reliably copies facts placed in a long context, but its answers to open-ended factual questions are often wrong or invented. On a real research paper it answered ~2–3 of 5 fact questions correctly; on a synthetic article, ~3 of 5. It can also slip into continuing a document's voice instead of answering.
  • Repetition. Use repetition_penalty ≈ 1.3. Without it, it loops. This single knob was the biggest lever on chat quality in testing.
  • Speed at 32K. No KV cache → long-context generation is slow. Fine for retrieval/needle queries; not meant for long free-form generation at full context.
  • English only, GPT-2 BPE tokenizer (vocab 50257).
  • 32K quality is strongest for retrieval/needle tasks (find a fact in a haystack), not for long-form reasoning across the whole window.
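
To see why the missing KV cache matters, this back-of-the-envelope sketch counts how many token positions go through the model during generation. It counts positions only and ignores that attention cost per position also grows with length, so it understates the real gap. The function name is hypothetical.

```python
def positions_processed(prompt_len, new_tokens, kv_cache=False):
    """Token positions pushed through the model while generating new_tokens."""
    if kv_cache:
        # Prompt once, then one new position per remaining step.
        return prompt_len + (new_tokens - 1)
    # No cache: step i re-reads the prompt plus the i tokens generated so far.
    return sum(prompt_len + i for i in range(new_tokens))

no_cache = positions_processed(32_000, 256)                    # 8,224,640
with_cache = positions_processed(32_000, 256, kv_cache=True)   # 32,255
```

For a 32,000-token prompt and 256 new tokens that is roughly 255 times more positions without a cache, which is why short answers to retrieval queries are the practical use at full context.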

Training

  • Backbone: v3 Block-AttnRes, 816 hidden / 18 layers / 6 heads, MLA (kv=64, q=192), SquaredReLU FFN, tied embeddings. Pre-trained on FineWeb-Edu at 1024 ctx.
  • Context extension: YaRN yarn_aggr (NTK-by-parts ramp β=16 + attention temperature mscale = 0.1·ln(32768/1024)+1), applied training-free, then answer-masked SFT at 16K (loss only on completion tokens) on a passkey/needle retrieval set to restore retrieval at the extended length.
  • Chat: answer-masked SFT on smol-smoltalk, mixed with retrieval windows so the 32K skill survives the chat tuning.
  • Hardware: a single Intel Arc B580 (12 GB). The whole project is a part-time hobby effort.
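
As an illustration of "loss only on completion tokens", a minimal sketch of answer masking: prompt positions get the label -100, which PyTorch's cross-entropy loss ignores by default. This is a generic recipe with a hypothetical function name, not the trainer used for this model.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def mask_prompt_labels(prompt_ids, completion_ids):
    """Build (input_ids, labels) so that only completion tokens are scored."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels

# Toy ids: three prompt tokens, two answer tokens followed by EOS (50256).
input_ids, labels = mask_prompt_labels([11, 12, 13], [21, 22, 50256])
```

For a retrieval example the prompt part would be the long document plus the question, so the gradient comes only from the short answer at the end.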

Reported numbers (own harness; not directly comparable to public leaderboards): the pretrained backbone scores ≈ 19.5 validation perplexity on FineWeb-Edu and 32.7 on WikiText-103, roughly GPT-2-small territory on this harness (GPT-2 small ≈ 29.3 on WikiText-103 here). Chat+retrieval SFT validation perplexity is 4.91 on the SFT mixture (a different distribution, not an LM benchmark, so don't compare it to the backbone numbers).
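
For readers converting between perplexity and loss, a short sketch of the standard relation (perplexity is the exponential of the mean negative log-likelihood in nats). How the harness normalises, per token or per word, is not stated in this card, so treat the conversion as per scored unit. The function names are hypothetical.

```python
import math

def perplexity(nlls):
    """Perplexity from per-unit negative log-likelihoods (in nats)."""
    return math.exp(sum(nlls) / len(nlls))

def mean_nll(ppl):
    """Mean negative log-likelihood (nats) implied by a perplexity."""
    return math.log(ppl)

backbone_loss = mean_nll(19.5)   # about 2.97 nats
sft_loss = mean_nll(4.91)        # about 1.59 nats
```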

Want to make it better?

That's the idea. It's small enough to fine-tune on a single GPU. Continue SFT on your own chat or domain data, or do a document-QA round to push past the confabulation ceiling. The full training / eval / context-extension code (the faro/Brújula stack) is a separate project; the modeling code in this repo (modeling_brujula_v2.py, configuration_brujula_v2.py) is self-contained and reproduces that trainer's logits to within 1e-5 (parity-checked at build).

License

Apache-2.0.
