Escarda-86M

Escarda-86M is a ~86M-parameter, from-scratch decoder-only language model built for the community as a general-purpose small chat model. It packs a number of recent architecture ideas — Multi-head Latent Attention, an n-gram "engram" memory, hyper-connections, a hierarchical reasoning refinement step, and JEPA / multi-token-prediction auxiliary objectives — into a model small enough to run on a laptop or a free CPU tier.

It was trained using Modal's credits as part of the Small Models, Big Adventures Hackathon, and was selected as the best chat checkpoint after a seed-controlled bake-off across 28 candidate checkpoints plus a head-to-head battle test (chosen for coherence, instruction-following, and resistance to repetition collapse).

Live demo: Quazim0t0/Escarda-86M-Chat

Related models: base checkpoint → Quazim0t0/Escarda-86M-Base (the better starting point for a fresh SFT run).

Benchmarks below (jump to Evaluation).

Model summary


Parameters	~85.7M (`tie_word_embeddings=True`)
Type	Decoder-only autoregressive LM (`SpikeWhaleLM`, `model_type: spike_whale`)
Hidden size	640
Layers	16
Attention heads	10 (`head_dim=64`), 1 KV head (multi-query)
Context length	4096 tokens
Vocab size	16,512 (custom ChatML-aware tokenizer)
Positional encoding	Decoupled RoPE (`theta=10000`) + NoPE split
Precision	trained in float32
License	Apache-2.0

Architecture

Escarda is a dense decoder Transformer whose blocks are assembled from several non-standard components. Flags below reflect the released checkpoint's config.json.

Attention — Multi-head Latent Attention (MLA) + XSA

use_xsa=True, use_qk_norm=True

MLA-style low-rank projections: queries and the output projection are LoRA-compressed (q_lora_rank=128, o_lora_rank=128), which keeps the attention parameter count small.
Decoupled position encoding: each head splits into a RoPE part (qk_rope_head_dim=16) and a NoPE part (nope_head_dim=48), so some of the head dimension carries explicit rotary position while the rest stays position-agnostic.
Multi-query attention: num_key_value_heads=1 — all query heads share a single KV head, shrinking the KV cache for cheap inference.
QK-norm stabilizes attention logits.
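To put the multi-query saving in numbers, here is a rough back-of-the-envelope sketch. It assumes the cache stores one full-width key and one value tensor per KV head per layer in float32; the actual SpikeWhale cache layout may differ, and `kv_cache_bytes` is a hypothetical helper, not part of the repo.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=4):
    """Bytes needed to cache keys and values for one sequence (assumed layout)."""
    # The factor 2 covers one tensor for keys plus one for values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Values from the model summary: 16 layers, head_dim 64, 4096-token context.
mqa_bytes = kv_cache_bytes(num_layers=16, num_kv_heads=1, head_dim=64, seq_len=4096)
mha_bytes = kv_cache_bytes(num_layers=16, num_kv_heads=10, head_dim=64, seq_len=4096)

print(f"1 KV head  : {mqa_bytes / 2**20:.0f} MiB")
print(f"10 KV heads: {mha_bytes / 2**20:.0f} MiB")
```

Under these assumptions a full 4096-token context costs 32 MiB of cache with one shared KV head, against 320 MiB if each of the 10 query heads kept its own.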

Engram n-gram memory

use_engram=True

A lightweight associative memory that hashes local n-grams (up to engram_max_ngram=3) into a learned table (engram_table_size=4096, engram_num_heads=2, engram_compress_dim=32) and gates the result back into the residual stream (engram_gate_init_bias=-1.0, i.e. gated mostly off at init). It gives the small model a cheap surface-pattern lookup without spending depth on it.
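The lookup idea can be sketched in a few lines. This is an illustration only: the rolling hash, the choice to include 1-grams, and the sigmoid gate are all assumptions, and `ngram_bucket` / `engram_buckets` are hypothetical names rather than the repo's functions.

```python
import math

TABLE_SIZE = 4096       # engram_table_size
MAX_NGRAM = 3           # engram_max_ngram
GATE_INIT_BIAS = -1.0   # engram_gate_init_bias

def ngram_bucket(ngram, table_size=TABLE_SIZE):
    """Map a tuple of token ids to a table row (hypothetical polynomial hash)."""
    h = 0
    for token_id in ngram:
        h = (h * 1_000_003 + token_id + 1) % (2**61 - 1)
    return h % table_size

def engram_buckets(token_ids, pos):
    """Table rows for every n-gram (n = 1..MAX_NGRAM) that ends at position pos."""
    rows = []
    for n in range(1, MAX_NGRAM + 1):
        start = pos - n + 1
        if start >= 0:  # skip n-grams that would reach before the sequence start
            rows.append(ngram_bucket(tuple(token_ids[start:pos + 1])))
    return rows

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

ids = [2, 17, 905, 17, 905]
print(engram_buckets(ids, pos=4))            # one row per n-gram length
print(f"{sigmoid(GATE_INIT_BIAS):.3f}")      # gate opening at init, if the gate is a sigmoid
```

If the gate is a plain sigmoid, a bias of -1.0 lets roughly a quarter of the memory signal through at initialization, which matches the "mostly off" description.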

Hash-lookup layers

num_hash_layers=2 — multi-head hash lookups (MultiHeadHashLookup) provide additional content-addressable features alongside the standard token embeddings.

Hyper-Connections (instead of plain residuals)

use_hyper_connections=True (hc_mult=2, hc_sinkhorn_iters=20, hc_eps=1e-6)

Replaces the standard residual add with learned, width-expanded connections mixed through Sinkhorn-normalized routing, letting the network learn how information flows between the residual streams rather than fixing it to a single identity path.
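Sinkhorn normalization itself is simple to illustrate. The sketch below assumes the mixing matrix is hc_mult × hc_mult and that hc_eps is used as a denominator stabilizer; both are guesses about the implementation, and `sinkhorn` is a hypothetical helper.

```python
import math

def sinkhorn(logits, iters=20, eps=1e-6):
    """Alternately normalize rows and columns of exp(logits).

    For a positive square matrix this converges toward a doubly stochastic
    matrix (every row and every column sums to about 1).
    """
    m = [[math.exp(v) for v in row] for row in logits]
    n = len(m)
    for _ in range(iters):
        # Row step: make each row sum to ~1.
        m = [[v / (sum(row) + eps) for v in row] for row in m]
        # Column step: make each column sum to ~1.
        col_sums = [sum(m[i][j] for i in range(n)) + eps for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m

# Two residual streams (hc_mult=2) with arbitrary example logits.
mix = sinkhorn([[0.5, -1.0], [0.2, 1.3]], iters=20, eps=1e-6)
for row in mix:
    print([round(v, 4) for v in row])
```

Because rows and columns both sum to about 1, the mixing neither amplifies nor drops total signal across streams, which is one plausible reason to prefer it over an unconstrained learned matrix.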

HRM refinement

use_hrm_refine=True (hrm_refine_dim=128, hrm_refine_steps=1)

A small Hierarchical Reasoning Model block that performs an extra latent refinement pass over hidden states before the output head: a cheap "think a bit more" step.

Feed-forward (MoE-capable, dense in this release)

The block supports a DeepSeek-style sparse Mixture-of-Experts FFN (n_routed_experts=6, n_shared_experts=1, num_experts_per_tok=2, scoring_func=sqrtsoftplus), but this checkpoint ships dense (use_moe=False, moe_layers=[]) for simplicity and predictable latency.
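For readers curious what the disabled MoE path would do, here is a minimal routing sketch. It assumes `sqrtsoftplus` means sqrt(softplus(logit)) and that the selected experts' scores are renormalized to sum to 1; neither detail is confirmed by the card, and `route` is a hypothetical function.

```python
import math

def softplus(x):
    return math.log1p(math.exp(x))

def route(router_logits, top_k=2):
    """Pick the top_k routed experts and return {expert_index: mixing_weight}."""
    # Assumed scoring: sqrt(softplus(z)) is positive and monotonic in z.
    scores = [math.sqrt(softplus(z)) for z in router_logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in top)
    return {i: scores[i] / total for i in top}

# Six routed experts (n_routed_experts=6), two active per token (num_experts_per_tok=2).
weights = route([0.1, 2.0, -1.5, 0.7, -0.3, 1.2], top_k=2)
print(weights)
```

The shared expert (n_shared_experts=1) would run for every token regardless of routing, so it is left out of the selection above.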

Training-time auxiliary objectives

These shape the representations during pretraining (they add no inference cost):

JEPA (use_jepa=True, jepa_pred_dim=256, jepa_horizon=1, jepa_loss_weight=0.1) — a Joint-Embedding Predictive auxiliary loss predicting future latent states.
Multi-Token Prediction (MTP) (num_nextn_predict_layers=1, mtp_loss_weight=0.3) — a DeepSeek-V3-style extra head predicting more than one next token.
z-loss (zloss_coef=1e-4) for logit stability.
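As a rough picture of how these terms might combine, the sketch below adds them as a weighted sum using the coefficients listed above. The weighted-sum form is an assumption, z-loss is taken in its common form (squared log-partition function), and the JEPA and MTP values are placeholders passed in as plain numbers.

```python
import math

JEPA_LOSS_WEIGHT = 0.1   # jepa_loss_weight
MTP_LOSS_WEIGHT = 0.3    # mtp_loss_weight
ZLOSS_COEF = 1e-4        # zloss_coef

def logsumexp(logits):
    m = max(logits)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in logits))

def cross_entropy(logits, target):
    """Next-token cross-entropy in nats for one position."""
    return logsumexp(logits) - logits[target]

def z_loss(logits):
    """Squared log-partition function; penalizes logits drifting to large magnitudes."""
    return logsumexp(logits) ** 2

def total_loss(logits, target, jepa_loss, mtp_loss):
    # Assumed combination: main loss plus weighted auxiliaries.
    return (cross_entropy(logits, target)
            + JEPA_LOSS_WEIGHT * jepa_loss
            + MTP_LOSS_WEIGHT * mtp_loss
            + ZLOSS_COEF * z_loss(logits))

example = total_loss([2.0, 0.5, -1.0, 0.0], target=0, jepa_loss=0.8, mtp_loss=3.1)
print(round(example, 4))
```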

Tokenizer & chat format

Escarda uses a custom ChatML-aware tokenizer (16,512 vocab) with atomic special tokens for framing and reasoning/tool markers (<|im_start|>, <|im_end|>, <think>, <begin_solution>, …). <bos> (id 2) is prepended to every sequence; <|im_end|> (and <eos>, id 3) terminate a turn.

A single turn is:

<|im_start|>{role}\n{content}<|im_end|>\n

and generation begins right after a trailing <|im_start|>assistant\n.
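The template is easy to reproduce by hand. The function below is a minimal stand-in written for illustration (`render_chatml` is a hypothetical name, not the repo's `format_chat`), and it assumes `<bos>` is added later by the tokenizer rather than in the string.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in the turn format shown above."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Leave an open assistant turn so the model continues from here.
        text += "<|im_start|>assistant\n"
    return text

prompt = render_chatml([
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Name one primary color."},
])
print(prompt)
```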

Inference

The settings below reproduce the model's best generations (ChatML prompt, temperature 0.3, nucleus sampling with top-p 0.9, stop on <|im_end|>):

import torch
import torch.nn.functional as F

from model_v2 import SpikeWhaleLM        # custom architecture (ships with the repo)
from spike_tokenizer import SpikeTokenizer
from chat_format import format_chat, IM_END

TEMPERATURE, TOP_P, MAX_NEW_TOKENS = 0.3, 0.9, 120

tok = SpikeTokenizer("tokenizer.json")
model = SpikeWhaleLM.from_pretrained("Quazim0t0/Escarda-86M").eval()
end_id = tok.convert_tokens_to_ids(IM_END)

prompt = format_chat([{"role": "user", "content": "Explain photosynthesis in one sentence."}],
                     add_generation_prompt=True)
ids = torch.tensor(tok.encode(prompt)).unsqueeze(0)

gen = []
with torch.no_grad():                     # inference only: do not build an autograd graph
    # Prefill: run the whole prompt once and keep the KV cache.
    out = model(ids, use_cache=True)
    past, last = out.past_key_values, out.logits[0, -1]
    for _ in range(MAX_NEW_TOKENS):
        # Temperature-scaled distribution over the next token.
        p = F.softmax(last.float() / TEMPERATURE, -1)
        # Nucleus (top-p) filter: keep the smallest sorted prefix whose mass reaches TOP_P.
        sp, si = p.sort(descending=True)
        cut = sp.cumsum(0) > TOP_P
        cut[1:] = cut[:-1].clone()        # shift right so the token that crosses TOP_P is kept
        cut[0] = False                    # always keep the most likely token
        sp[cut] = 0
        nxt = si[torch.multinomial(sp / sp.sum(), 1)].item()
        if nxt == end_id:                 # end of the assistant turn
            break
        gen.append(nxt)
        # Decode step: feed only the new token and reuse the cache.
        out = model(torch.tensor([[nxt]]), past_key_values=past, use_cache=True)
        past, last = out.past_key_values, out.logits[0, -1]
print(tok.decode(gen, skip_special_tokens=True))

Note: Escarda is a custom architecture, not a stock transformers model. Loading requires the SpikeWhale modeling code (model_v2.py, config.py) and the tokenizer helpers (spike_tokenizer.py, chat_format.py). The easiest way to try it is the demo Space.

Evaluation

Zero-shot multiple-choice accuracy, scored by continuation log-likelihood in the lm-eval-harness style (acc = raw, acc_norm = byte-length-normalized) over the full validation/test split of each task. Standard error is binomial (sqrt(p(1-p)/n)).
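The scoring rule and the error bar can be written out in a few lines. This is a sketch of the general method, not the local harness itself; `pick_choice` and `binomial_stderr` are hypothetical names and the log-likelihoods are made-up inputs.

```python
import math

def pick_choice(choice_logprobs, choice_texts, normalize=False):
    """Return the index of the highest-scoring continuation.

    normalize=False -> acc      (raw summed log-likelihood)
    normalize=True  -> acc_norm (log-likelihood divided by UTF-8 byte length)
    """
    best_index, best_score = 0, -math.inf
    for i, (logprob, text) in enumerate(zip(choice_logprobs, choice_texts)):
        score = logprob / len(text.encode("utf-8")) if normalize else logprob
        if score > best_score:
            best_index, best_score = i, score
    return best_index

def binomial_stderr(p, n):
    """Standard error of an accuracy p measured on n examples."""
    return math.sqrt(p * (1 - p) / n)

texts = [" a cat", " a very large domestic cat"]
logprobs = [-6.0, -13.0]   # illustrative values only
print(pick_choice(logprobs, texts))                  # raw: the short answer wins
print(pick_choice(logprobs, texts, normalize=True))  # normalized: the long answer wins
print(round(binomial_stderr(0.2932, 2500), 4))       # ArithMark acc, n = 2,500
```

The example shows why acc and acc_norm can disagree: raw log-likelihood favors short continuations, while byte-length normalization removes that advantage.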

⚠️ These were produced with a local harness that approximates lm-eval-harness (same scoring method; prompt formatting / normalization differ slightly). Treat sub-0.02 gaps as noise. For an official leaderboard number, re-run with lm-eval directly.

Language modeling

byte_ppl is exp(sum_NLL_nats / total_UTF8_bytes) on WikiText-2 test (tokenizer-independent); BLiMP is the fraction of minimal pairs with logprob(good) > logprob(bad) (12 paradigms × 150).
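Both metrics reduce to short formulas. The sketch below follows the definitions in the paragraph above; the function names are hypothetical and the inputs are toy values, not model outputs.

```python
import math

def byte_perplexity(total_nll_nats, text):
    """exp(total negative log-likelihood in nats / number of UTF-8 bytes)."""
    return math.exp(total_nll_nats / len(text.encode("utf-8")))

def blimp_accuracy(pairs):
    """Fraction of (logprob_good, logprob_bad) pairs where the good sentence scores higher."""
    return sum(good > bad for good, bad in pairs) / len(pairs)

# "naïve" is 5 characters but 6 UTF-8 bytes, so the denominator is 6.
print(byte_perplexity(6 * math.log(2.5), "naïve"))

toy_pairs = [(-10.2, -11.0), (-8.4, -8.1), (-15.0, -19.5), (-7.7, -7.9)]
print(blimp_accuracy(toy_pairs))
```

Dividing by bytes rather than tokens is what makes the number comparable across models with different tokenizers.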

Metric	Value
WikiText-2 byte_ppl ↓	2.4898
BLiMP acc ↑	0.7483

Note: the chat model actually has the best BLiMP (grammatical competence) of the Escarda family, even though the distilled Base has lower perplexity — perplexity alone does not track capability here.

Standard small-model suite

Task	acc	±	acc_norm	±
arc_easy	0.3683	0.0099	0.3628	0.0099
arc_challenge	0.1988	0.0117	0.2312	0.0123
hellaswag	0.2845	0.0045	0.2928	0.0045
winogrande	0.5067	0.0140	—	—
piqa	0.5881	0.0115	0.5800	0.0115
openbookqa	0.1600	0.0164	0.2720	0.0199
boolq	0.4624	0.0087	—	—

Random baselines: arc/hellaswag/openbookqa ≈ 0.25; winogrande/boolq/piqa ≈ 0.50. As expected at this scale, several tasks sit near chance: winogrande (0.51) is indistinguishable from it and boolq (0.46) falls below it. arc_easy (0.37 vs. 0.25) and piqa (0.59 vs. 0.50) carry the most above-baseline signal.
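One way to read the table against those baselines is to express each raw accuracy as a distance from chance in standard-error units. The sketch below uses the acc and ± columns above; the piqa baseline of 0.50 comes from its two-choice format, and `z_from_chance` is a hypothetical helper.

```python
# task: (acc, stderr, chance baseline)
RESULTS = {
    "arc_easy":      (0.3683, 0.0099, 0.25),
    "arc_challenge": (0.1988, 0.0117, 0.25),
    "hellaswag":     (0.2845, 0.0045, 0.25),
    "winogrande":    (0.5067, 0.0140, 0.50),
    "piqa":          (0.5881, 0.0115, 0.50),
    "openbookqa":    (0.1600, 0.0164, 0.25),
    "boolq":         (0.4624, 0.0087, 0.50),
}

def z_from_chance(acc, stderr, chance):
    """How many standard errors the accuracy sits above (+) or below (-) chance."""
    return (acc - chance) / stderr

z_scores = {task: z_from_chance(*row) for task, row in RESULTS.items()}
for task, z in sorted(z_scores.items(), key=lambda kv: -kv[1]):
    print(f"{task:14s} {z:+6.1f}")
```

Values within about ±2 are hard to separate from guessing given these error bars.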

ArithMark-2.0 (AxiomicLabs)

Multiple-choice integer arithmetic (n = 2,500, chance = 0.25).

Metric	Value
acc	0.2932 ± 0.0091
acc_norm	0.2816 ± 0.0090

The flat aggregate hides real structure: Escarda scores roughly 2× chance on multiplication and division (≈0.53 vs. 0.25) but at or below chance on addition and subtraction:

Topic	acc_norm	n
division	0.5385	130
multiplication	0.5278	144
parentheses_two_ops	0.3352	355
mixed_two_ops	0.2633	395
parentheses_three_ops	0.2558	258
addition	0.2323	538
mixed_three_ops	0.2314	242
subtraction	0.2009	438

Difficulty	acc_norm	n
easy	0.2872	1250
medium	0.2973	750
hard	0.2440	500

So the model has genuinely learned multiplicative patterns rather than guessing uniformly.

Intended use & limitations

Intended use. General short-form chat, simple how-to/step answers, definitions, drafting, and as a base for further fine-tuning or on-device/edge experiments. The whole point is a model that stays coherent and follows instructions at near-zero marginal cost.

Limitations. At ~86M parameters this is a small model:

Factual recall and multi-step arithmetic are weak and it will confidently get hard facts wrong — verify anything important.
Outputs can be repetitive or off-target; it is best at bounded, short responses.
English-centric; no safety/RLHF alignment tuning — do not deploy in sensitive settings without your own guardrails.

Training

Compute: Modal credits (Small Models, Big Adventures Hackathon).
Pipeline: from-scratch pretraining of the SpikeWhale architecture, followed by ChatML supervised fine-tuning and an RL-prep stage; the released rl_prep/final checkpoint was picked via a seed-controlled bake-off + battle test over 28 candidates.
Objectives: next-token cross-entropy + JEPA + MTP + z-loss auxiliaries.

Token budget & scaling

Tokens: ~20B (from-scratch pretraining of the SpikeWhale base, ~28k steps), then ChatML SFT.
Token/param ratio: ~233 tokens/param (20B / 85.7M) — roughly 11–12× the Chinchilla ~20-tokens/param compute-optimal heuristic, i.e. a deliberately over-trained small model (the inference-efficient trade-off).

Fitting the Chinchilla data term to this model's own pretraining loss curve gives:

L(D) ≈ 2.611 + 77,715 · D^(−0.537) (nats/token, R² = 0.92)

From that fit:

Compute-optimal tokens for this 86M size ≈ 4.3B → the 20B run is ~4.6× past compute-optimal.
Diminishing-returns knee ≈ 22.5B tokens (where +1B tokens buys < 0.005 nats) — the 20B stopping point lands right at the knee, a well-judged budget.
The model is parameter-bound, not data-bound at 20B: the capacity term (~0.82 nats) exceeds the data term (~0.54), so extra tokens help little. Doubling to 40B is projected to lower loss only ~0.07 nats (~7% perplexity) with negligible downstream gain; the lever for better benchmarks is more parameters, not more tokens.
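The projections from the fitted curve can be checked by plugging numbers into it. The sketch below assumes D is measured in raw tokens and reads the knee as the point where one extra billion tokens buys less than 0.005 nats; the helper names are hypothetical.

```python
FLOOR = 2.611        # fitted constant (irreducible loss + capacity floor), nats/token
COEF = 77_715.0      # fitted data-term coefficient
EXPONENT = 0.537     # fitted data-term exponent

def fitted_loss(tokens):
    """L(D) = FLOOR + COEF * D^(-EXPONENT), with D in tokens."""
    return FLOOR + COEF * tokens ** (-EXPONENT)

def loss_gain(tokens_from, tokens_to):
    """Projected loss reduction (nats/token) from training on more tokens."""
    return fitted_loss(tokens_from) - fitted_loss(tokens_to)

tokens_per_param = 20e9 / 85.7e6
doubling_gain = loss_gain(20e9, 40e9)       # 20B -> 40B tokens
gain_at_20b = loss_gain(20e9, 21e9)         # +1B tokens at the stopping point
gain_at_knee = loss_gain(22.5e9, 23.5e9)    # +1B tokens at the reported knee

print(f"tokens/param       : {tokens_per_param:.0f}")
print(f"20B -> 40B gain    : {doubling_gain:.3f} nats")
print(f"+1B tokens at 20B  : {gain_at_20b:.4f} nats")
print(f"+1B tokens at 22.5B: {gain_at_knee:.4f} nats")
```

Under this reading, the marginal gain per extra billion tokens is still slightly above the 0.005-nat threshold at 20B and drops just below it around 22.5B.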

Caveats: single-size fit (folds irreducible loss + capacity floor into one constant); the cosine-LR decay inflates the fitted exponent, so treat β as an upper bound; token counts are anchored to the ~20B figure and scale linearly if that differs.

⚠️ Honest disclaimer about the SFT. This model was given only a small amount of supervised fine-tuning, done quickly and without a well-organized or carefully-planned data mix — it was rushed to meet the Hackathon deadline. The SFT stage is almost certainly the weakest link here, not the base. Re-running SFT from Escarda-86M-Base with a cleaner, better-curated dataset and a more deliberate recipe would very likely produce noticeably better results. Treat this checkpoint as a rushed proof-of-concept, and the base as the better starting point if you want to take it further.

Acknowledgements

Built with Modal credits during the Small Models, Big Adventures Hackathon. Made freely available to the community in the belief that small models will soon meaningfully contend with much larger ones — and as an open invitation for others to build on it.
