Escarda-86M
Escarda-86M is a ~86M-parameter, from-scratch decoder-only language model built for the community as a general-purpose small chat model. It packs a number of recent architecture ideas (Multi-head Latent Attention, an n-gram "engram" memory, hyper-connections, a hierarchical reasoning refinement step, and JEPA / multi-token-prediction auxiliary objectives) into a model small enough to run on a laptop or a free CPU tier.
It was trained using Modal's credits as part of the Small Models, Big Adventures Hackathon, and was selected as the best chat checkpoint after a seed-controlled bake-off across 28 candidate checkpoints plus a head-to-head battle test (chosen for coherence, instruction-following, and resistance to repetition collapse).
Live demo: Quazim0t0/Escarda-86M-Chat
Related models: base checkpoint Quazim0t0/Escarda-86M-Base (the better starting point for a fresh SFT run).
Benchmarks below (jump to Evaluation).
Model summary
| Property | Value |
|---|---|
| Parameters | ~85.7M (tie_word_embeddings=True) |
| Type | Decoder-only autoregressive LM (SpikeWhaleLM, model_type: spike_whale) |
| Hidden size | 640 |
| Layers | 16 |
| Attention heads | 10 (head_dim=64), 1 KV head (multi-query) |
| Context length | 4096 tokens |
| Vocab size | 16,512 (custom ChatML-aware tokenizer) |
| Positional encoding | Decoupled RoPE (theta=10000) + NoPE split |
| Precision | trained in float32 |
| License | Apache-2.0 |
Architecture
Escarda is a dense decoder Transformer whose blocks are assembled from several
non-standard components. Flags below reflect the released checkpoint's config.json.
Attention: Multi-head Latent Attention (MLA) + XSA
use_xsa=True, use_qk_norm=True
- MLA-style low-rank projections: queries and the output projection are LoRA-compressed (q_lora_rank=128, o_lora_rank=128), keeping the attention parameter/KV footprint small.
- Decoupled position encoding: each head splits into a RoPE part (qk_rope_head_dim=16) and a NoPE part (nope_head_dim=48), so some of the head dimension carries explicit rotary position while the rest stays position-agnostic.
- Multi-query attention: with num_key_value_heads=1, all query heads share a single KV head, shrinking the KV cache for cheap inference.
- QK-norm stabilizes attention logits.
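To make the KV-cache saving from multi-query attention concrete, here is a back-of-the-envelope sketch. It assumes a plain, uncompressed cache that stores one key and one value vector per layer, per KV head, per token in float32; the real SpikeWhale cache layout may differ. `kv_cache_bytes` is a hypothetical helper, and the layer count, head size, and context length come from the model summary.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bytes_per_value=4):
    """Bytes needed to cache keys and values for one sequence (assumed plain layout)."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * context_len


# Values from the model summary: 16 layers, head_dim=64, 4096-token context, float32.
mqa = kv_cache_bytes(num_layers=16, num_kv_heads=1, head_dim=64, context_len=4096)
mha = kv_cache_bytes(num_layers=16, num_kv_heads=10, head_dim=64, context_len=4096)

print(f"multi-query (1 KV head):   {mqa / 2**20:.0f} MiB")   # 32 MiB
print(f"multi-head  (10 KV heads): {mha / 2**20:.0f} MiB")   # 320 MiB
```

Under these assumptions, sharing one KV head across the 10 query heads cuts the full-context cache by a factor of 10.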
Engram n-gram memory
use_engram=True
A lightweight associative memory that hashes local n-grams (up to engram_max_ngram=3)
into a learned table (engram_table_size=4096, engram_num_heads=2,
engram_compress_dim=32) and gates the result back into the residual stream
(engram_gate_init_bias=-1.0, i.e. gated mostly-off at init). It gives the small model a
cheap surface-pattern lookup without spending depth on it.
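The lookup idea can be illustrated with a small sketch. Everything here other than the two config values is an assumption: the rolling hash is a stand-in (the card does not specify the hash), the choice to include unigrams is a guess, and the gate is assumed to be a sigmoid of its bias at initialization.

```python
import math

TABLE_SIZE = 4096  # engram_table_size
MAX_NGRAM = 3      # engram_max_ngram


def ngram_slots(token_ids, position):
    """Table slots for the 1..MAX_NGRAM-grams that end at `position` (hypothetical hash)."""
    slots = []
    for n in range(1, MAX_NGRAM + 1):
        start = position - n + 1
        if start < 0:
            break  # not enough left context for this n-gram
        h = 0
        for t in token_ids[start : position + 1]:
            h = (h * 1_000_003 + t + 1) % (2**61 - 1)  # simple polynomial rolling hash
        slots.append(h % TABLE_SIZE)
    return slots


def gate_at_init(bias=-1.0):
    """Gate value when only the initial bias contributes (assumes a sigmoid gate)."""
    return 1.0 / (1.0 + math.exp(-bias))


print(ngram_slots([5, 9, 2, 7], position=3))  # three slots: unigram, bigram, trigram
print(round(gate_at_init(), 3))               # 0.269
```

With a bias of -1.0 a sigmoid gate starts near 0.27, which matches the "gated mostly-off at init" description.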
Hash-lookup layers
num_hash_layers=2: multi-head hash lookups (MultiHeadHashLookup) provide additional
content-addressable features alongside the standard token embeddings.
Hyper-Connections (instead of plain residuals)
use_hyper_connections=True (hc_mult=2, hc_sinkhorn_iters=20, hc_eps=1e-6)
Replaces the standard residual add with learned, width-expanded connections mixed via a
Sinkhorn-normalized routing, letting the network learn how information flows between the
residual streams rather than fixing it to a single identity path.
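As a rough illustration of Sinkhorn normalization, the sketch below turns a matrix of unconstrained logits into an approximately doubly stochastic mixing matrix by alternately normalizing rows and columns. The 2x2 size mirrors hc_mult=2, and the iteration count and epsilon mirror hc_sinkhorn_iters and hc_eps, but how the released model actually parameterizes and applies the routing is not stated in this card, so treat the details as assumptions.

```python
import math


def sinkhorn(logits, iters=20, eps=1e-6):
    """Alternate row and column normalization of exp(logits)."""
    m = [[math.exp(x) for x in row] for row in logits]
    for _ in range(iters):
        # Normalize each row to sum to (almost) 1.
        m = [[x / (sum(row) + eps) for x in row] for row in m]
        # Normalize each column to sum to (almost) 1.
        col_sums = [sum(row[j] for row in m) + eps for j in range(len(m[0]))]
        m = [[x / col_sums[j] for j, x in enumerate(row)] for row in m]
    return m


# Hypothetical logits for mixing two residual streams.
mix = sinkhorn([[0.5, -1.0], [0.2, 1.5]])
for row in mix:
    print([round(x, 3) for x in row])
```

Each row and each column of `mix` sums to roughly 1, so the streams are re-mixed without systematically growing or shrinking the total signal.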
HRM refinement
use_hrm_refine=True (hrm_refine_dim=128, hrm_refine_steps=1)
A small Hierarchical Reasoning Model block that performs an extra latent refinement
pass over hidden states before the output head, a cheap "think a bit more" step.
Feed-forward (MoE-capable, dense in this release)
The block supports a DeepSeek-style sparse Mixture-of-Experts FFN
(n_routed_experts=6, n_shared_experts=1, num_experts_per_tok=2,
scoring_func=sqrtsoftplus), but this checkpoint ships dense (use_moe=False,
moe_layers=[]) for simplicity and predictable latency.
Training-time auxiliary objectives
These shape the representations during pretraining (they add no inference cost):
- JEPA (use_jepa=True, jepa_pred_dim=256, jepa_horizon=1, jepa_loss_weight=0.1): a Joint-Embedding Predictive auxiliary loss predicting future latent states.
- Multi-Token Prediction (MTP) (num_nextn_predict_layers=1, mtp_loss_weight=0.3): a DeepSeek-V3-style extra head predicting more than one next token.
- z-loss (zloss_coef=1e-4) for logit stability.
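A minimal sketch of how these weights might combine into one training loss, assuming a simple weighted sum and the commonly used z-loss definition (the coefficient times the squared log-partition of the logits). The function names are hypothetical and the exact combination used in training is not given in this card.

```python
import math


def z_loss(logits, coef=1e-4):
    """coef * (log-sum-exp of the logits) squared, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return coef * log_z**2


def total_loss(ce, jepa, mtp, logits, jepa_weight=0.1, mtp_weight=0.3):
    """Assumed weighted sum of the main cross-entropy and the auxiliary terms."""
    return ce + jepa_weight * jepa + mtp_weight * mtp + z_loss(logits)


# Illustrative numbers only, not measurements from the training run.
print(total_loss(ce=2.0, jepa=1.0, mtp=3.0, logits=[0.0, 0.0, 0.0, 0.0]))
```

Because the z-loss coefficient is tiny, the term contributes little to the total and mainly discourages the logits from drifting to large magnitudes.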
Tokenizer & chat format
Escarda uses a custom ChatML-aware tokenizer (16,512 vocab) with atomic special tokens
for framing and reasoning/tool markers (<|im_start|>, <|im_end|>, <think>,
<begin_solution>, …). <bos> (id 2) is prepended to every sequence; <|im_end|> (and
<eos>, id 3) terminate a turn.
A single turn is:
<|im_start|>{role}\n{content}<|im_end|>\n
and generation begins right after a trailing <|im_start|>assistant\n.
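The turn template above can be rendered with a few lines of string formatting. `render_chatml` is a hypothetical stand-in written for illustration, not the repo's `format_chat` helper, and it leaves `<bos>` handling to the tokenizer.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Join messages into ChatML turns, optionally opening an assistant turn."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"  # generation starts right after this
    return text


prompt = render_chatml([{"role": "user", "content": "Hi"}])
print(repr(prompt))
# '<|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\n'
```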
Inference
The settings below reproduce the model's best generations (ChatML prompt, temperature 0.3,
nucleus sampling with top-p 0.9, stop on <|im_end|>):
import torch
import torch.nn.functional as F

from model_v2 import SpikeWhaleLM  # custom architecture (ships with the repo)
from spike_tokenizer import SpikeTokenizer
from chat_format import format_chat, IM_END

torch.set_grad_enabled(False)  # inference only, no autograd graph needed

tok = SpikeTokenizer("tokenizer.json")
model = SpikeWhaleLM.from_pretrained("Quazim0t0/Escarda-86M").eval()
end_id = tok.convert_tokens_to_ids(IM_END)

prompt = format_chat(
    [{"role": "user", "content": "Explain photosynthesis in one sentence."}],
    add_generation_prompt=True,
)
ids = torch.tensor(tok.encode(prompt)).unsqueeze(0)

out = model(ids, use_cache=True)  # prefill the prompt once
past, last = out.past_key_values, out.logits[0, -1]
gen = []
for _ in range(120):  # cap at 120 new tokens
    p = F.softmax(last.float() / 0.3, -1)  # temperature 0.3
    sp, si = p.sort(descending=True)
    cut = sp.cumsum(0) > 0.9  # top-p 0.9
    cut[1:] = cut[:-1].clone()  # shift right so the token that crosses 0.9 is kept
    cut[0] = False  # always keep the most likely token
    sp[cut] = 0
    nxt = si[torch.multinomial(sp / sp.sum(), 1)].item()
    if nxt == end_id:  # stop on <|im_end|>
        break
    gen.append(nxt)
    out = model(torch.tensor([[nxt]]), past_key_values=past, use_cache=True)
    past, last = out.past_key_values, out.logits[0, -1]
print(tok.decode(gen, skip_special_tokens=True))
Note: Escarda is a custom architecture, not a stock transformers model. Loading requires the SpikeWhale modeling code (model_v2.py, config.py) and the tokenizer helpers (spike_tokenizer.py, chat_format.py). The easiest way to try it is the demo Space.
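The nucleus (top-p) step in the sampling loop is compact, so here is the same filtering rule written out in plain Python without tensors. `top_p_filter` is a hypothetical helper for illustration; it keeps tokens in descending probability order up to and including the one that pushes the cumulative mass past the threshold, then renormalizes.

```python
def top_p_filter(probs, top_p=0.9):
    """Return {token_index: renormalized_prob} for the nucleus of `probs`."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)  # the token that crosses the threshold is still kept
        cumulative += probs[i]
        if cumulative > top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9)
print(sorted(nucleus))  # [0, 1, 2]
```

Here the first two tokens cover 0.8 of the mass, the third crosses 0.9 and is kept, and the 0.05 tail token is dropped before sampling.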
Evaluation
Zero-shot multiple-choice accuracy, scored by continuation log-likelihood in the
lm-eval-harness style (acc = raw, acc_norm = byte-length-normalized) over the full
validation/test split of each task. Standard error is binomial (sqrt(p(1-p)/n)).
⚠️ These were produced with a local harness that approximates lm-eval-harness (same scoring method; prompt formatting / normalization differ slightly). Treat sub-0.02 gaps as noise. For an official leaderboard number, re-run with lm-eval directly.
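The binomial standard error can be checked by hand. The sketch below applies sqrt(p(1-p)/n) to the ArithMark-2.0 figures reported in this card (n = 2,500), since that is the one task whose sample size is stated here; `binomial_se` is a hypothetical helper name.

```python
import math


def binomial_se(p, n):
    """Standard error of an accuracy p measured on n independent items."""
    return math.sqrt(p * (1.0 - p) / n)


print(round(binomial_se(0.2932, 2500), 4))  # 0.0091
print(round(binomial_se(0.2816, 2500), 4))  # 0.009
```

Both values agree with the ± columns of the ArithMark table to four decimal places.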
Language modeling
byte_ppl is exp(sum_NLL_nats / total_UTF8_bytes) on WikiText-2 test (tokenizer-independent);
BLiMP is the fraction of minimal pairs with logprob(good) > logprob(bad) (12 paradigms × 150).
| Metric | Value |
|---|---|
| WikiText-2 byte_ppl ↓ | 2.4898 |
| BLiMP acc ↑ | 0.7483 |
Note: the chat model has the best BLiMP score (grammatical competence) of the Escarda family, even though the distilled Base has lower perplexity; perplexity alone does not track capability here.
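Both metrics are simple to compute once per-sequence log-likelihoods are available. This is a minimal sketch of the two definitions given above, with hypothetical helper names and toy inputs; it is not the harness used to produce the table.

```python
import math


def byte_perplexity(total_nll_nats, text):
    """exp(total negative log-likelihood in nats / number of UTF-8 bytes)."""
    return math.exp(total_nll_nats / len(text.encode("utf-8")))


def blimp_accuracy(pairs):
    """Fraction of (logprob_good, logprob_bad) pairs where the good sentence wins."""
    pairs = list(pairs)
    return sum(good > bad for good, bad in pairs) / len(pairs)


# Toy inputs: 4 bytes of text scored at ln(2) nats per byte gives a byte_ppl of 2.
print(byte_perplexity(4 * math.log(2), "abcd"))
print(blimp_accuracy([(-1.0, -2.0), (-3.0, -2.5), (-0.5, -4.0), (-1.0, -1.5)]))  # 0.75
```

Dividing by bytes rather than tokens is what makes the perplexity comparable across models with different tokenizers.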
Standard small-model suite
| Task | acc | ± | acc_norm | ± |
|---|---|---|---|---|
| arc_easy | 0.3683 | 0.0099 | 0.3628 | 0.0099 |
| arc_challenge | 0.1988 | 0.0117 | 0.2312 | 0.0123 |
| hellaswag | 0.2845 | 0.0045 | 0.2928 | 0.0045 |
| winogrande | 0.5067 | 0.0140 | n/a | n/a |
| piqa | 0.5881 | 0.0115 | 0.5800 | 0.0115 |
| openbookqa | 0.1600 | 0.0164 | 0.2720 | 0.0199 |
| boolq | 0.4624 | 0.0087 | n/a | n/a |
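Random baselines: arc/hellaswag/openbookqa ≈ 0.25; winogrande/boolq ≈ 0.50. As expected at this scale, several tasks sit near or below chance: winogrande (0.51) is at its baseline and boolq (0.46) falls under it. The largest margins over chance in raw acc come from arc_easy (0.37 vs. 0.25) and piqa (0.59 vs. the 0.50 expected for a two-choice task).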
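One way to read the table against its baselines is to express each task's raw acc as a margin over chance, both in absolute terms and in units of its standard error. The sketch below uses the acc and ± values from the table; the 0.50 chance level for piqa is an assumption based on it being a two-choice task, since the card does not list it.

```python
# task: (acc, standard error, chance level)
results = {
    "arc_easy": (0.3683, 0.0099, 0.25),
    "arc_challenge": (0.1988, 0.0117, 0.25),
    "hellaswag": (0.2845, 0.0045, 0.25),
    "winogrande": (0.5067, 0.0140, 0.50),
    "piqa": (0.5881, 0.0115, 0.50),  # chance level assumed (two-choice task)
    "openbookqa": (0.1600, 0.0164, 0.25),
    "boolq": (0.4624, 0.0087, 0.50),
}

# task: (absolute margin over chance, margin in standard errors)
margins = {
    task: (acc - chance, (acc - chance) / se)
    for task, (acc, se, chance) in results.items()
}

for task, (margin, z) in sorted(margins.items(), key=lambda kv: -kv[1][0]):
    print(f"{task:14s} margin={margin:+.4f}  z={z:+.1f}")
```

A margin within about two standard errors of zero, as for winogrande, is not distinguishable from guessing on this evidence.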
ArithMark-2.0 (AxiomicLabs)
Multiple-choice integer arithmetic (n = 2,500, chance = 0.25).
| Metric | Value |
|---|---|
| acc | 0.2932 ± 0.0091 |
| acc_norm | 0.2816 ± 0.0090 |
The flat aggregate hides real structure: Escarda scores roughly 2× chance on multiplication and division, while sitting at or below chance on addition and subtraction:
By topic:
| Topic | acc_norm | n |
|---|---|---|
| division | 0.5385 | 130 |
| multiplication | 0.5278 | 144 |
| parentheses_two_ops | 0.3352 | 355 |
| mixed_two_ops | 0.2633 | 395 |
| parentheses_three_ops | 0.2558 | 258 |
| addition | 0.2323 | 538 |
| mixed_three_ops | 0.2314 | 242 |
| subtraction | 0.2009 | 438 |
By difficulty:
| Difficulty | acc_norm | n |
|---|---|---|
| easy | 0.2872 | 1250 |
| medium | 0.2973 | 750 |
| hard | 0.2440 | 500 |
So the model has genuinely learned multiplicative patterns rather than guessing uniformly.
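As a consistency check, the per-topic rows can be recombined into the aggregate. This sketch recovers the number of correct answers per topic by rounding acc_norm × n and confirms that the pooled figure matches the reported acc_norm of 0.2816; the variable names are illustrative.

```python
# topic: (acc_norm, n) as reported in the per-topic breakdown
by_topic = {
    "division": (0.5385, 130),
    "multiplication": (0.5278, 144),
    "parentheses_two_ops": (0.3352, 355),
    "mixed_two_ops": (0.2633, 395),
    "parentheses_three_ops": (0.2558, 258),
    "addition": (0.2323, 538),
    "mixed_three_ops": (0.2314, 242),
    "subtraction": (0.2009, 438),
}

total_n = sum(n for _, n in by_topic.values())
total_correct = sum(round(acc * n) for acc, n in by_topic.values())

print(total_n, total_correct, total_correct / total_n)  # 2500 704 0.2816
```

Multiplication and division together account for only 274 of the 2,500 items, which is why their strong scores barely move the aggregate.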
Intended use & limitations
Intended use. General short-form chat, simple how-to/step answers, definitions, drafting, and as a base for further fine-tuning or on-device/edge experiments. The whole point is a model that stays coherent and follows instructions at near-zero marginal cost.
Limitations. At ~86M parameters this is a small model:
- Factual recall and multi-step arithmetic are weak, and it will confidently get hard facts wrong; verify anything important.
- Outputs can be repetitive or off-target; it is best at bounded, short responses.
- English-centric; no safety/RLHF alignment tuning, so do not deploy in sensitive settings without your own guardrails.
Training
- Compute: Modal credits (Small Models, Big Adventures Hackathon).
- Pipeline: from-scratch pretraining of the SpikeWhale architecture, followed by ChatML supervised fine-tuning and an RL-prep stage; the released rl_prep/final checkpoint was picked via a seed-controlled bake-off + battle test over 28 candidates.
- Objectives: next-token cross-entropy + JEPA + MTP + z-loss auxiliaries.
Token budget & scaling
- Tokens: ~20B (from-scratch pretraining of the SpikeWhale base, ~28k steps), then ChatML SFT.
- Token/param ratio: ~233 tokens/param (20B / 85.7M), roughly 11-12× the Chinchilla ~20-tokens/param compute-optimal heuristic, i.e. a deliberately over-trained small model (the inference-efficient trade-off).
Fitting the Chinchilla data term to this model's own pretraining loss curve gives:
L(D) ≈ 2.611 + 77,715 · D^(−0.537) (nats/token, R² = 0.92)
From that fit:
- Compute-optimal tokens for this 86M size ≈ 4.3B; the 20B run is ~4.6× past compute-optimal.
- Diminishing-returns knee ≈ 22.5B tokens (where +1B tokens buys < 0.005 nats); the 20B stopping point lands close to the knee, a well-judged budget.
- The model is parameter-bound, not data-bound, at 20B: the capacity term (~0.82 nats) exceeds the data term (~0.54), so extra tokens help little. Doubling to 40B is projected to lower loss by only ~0.07 nats (~7% perplexity) with negligible downstream gain; the lever for better benchmarks is more parameters, not more tokens.
Caveats: single-size fit (folds irreducible loss + capacity floor into one constant); the cosine-LR decay inflates the fitted exponent, so treat β as an upper bound; token counts are anchored to the ~20B figure and scale linearly if that differs.
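The projections quoted from the fit can be re-derived directly from the fitted curve. The sketch below plugs token counts into L(D) = 2.611 + 77,715 · D^(−0.537), with D measured in tokens (an assumption that is consistent with the quoted projections), and computes the tokens-per-parameter ratio, the projected gain from doubling the budget, and the gain from one extra billion tokens at the knee. The constant and function names are illustrative.

```python
import math

E, A, BETA = 2.611, 77_715.0, 0.537  # fitted constants from the curve above


def fitted_loss(tokens):
    """Predicted pretraining loss in nats/token after `tokens` training tokens."""
    return E + A * tokens ** (-BETA)


tokens_per_param = 20e9 / 85.7e6
gain_from_doubling = fitted_loss(20e9) - fitted_loss(40e9)
gain_next_billion = fitted_loss(22.5e9) - fitted_loss(23.5e9)
perplexity_drop = 1.0 - math.exp(-gain_from_doubling)

print(f"tokens per parameter:       {tokens_per_param:.0f}")
print(f"loss gain, 20B -> 40B:      {gain_from_doubling:.3f} nats")
print(f"perplexity drop, 20B -> 40B: {perplexity_drop:.1%}")
print(f"loss gain, 22.5B -> 23.5B:  {gain_next_billion:.4f} nats")
```

Run as written, this gives about 233 tokens per parameter, a gain of roughly 0.07 nats (close to 7% perplexity) from doubling, and just under 0.005 nats for the extra billion tokens at 22.5B.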
⚠️ Honest disclaimer about the SFT. This model was given only a small amount of supervised fine-tuning, done quickly and without a well-organized or carefully planned data mix; it was rushed to meet the Hackathon deadline. The SFT stage is almost certainly the weakest link here, not the base. Re-running SFT from Escarda-86M-Base with a cleaner, better-curated dataset and a more deliberate recipe would very likely produce noticeably better results. Treat this checkpoint as a rushed proof-of-concept, and the base as the better starting point if you want to take it further.
Acknowledgements
Built with Modal credits during the Small Models, Big Adventures Hackathon. Made freely available to the community in the belief that small models will soon meaningfully contend with much larger ones, and as an open invitation for others to build on it.