Bifrost Flash 430M
A fast, compact 430M-parameter translation model for the Nordic languages ↔ English (sv, da, nb, nn, fi, is ↔ en), distilled from Bifrost 1.2B via top-32 logit (KL) distillation. At ~⅓ the size of the teacher, it is the "flash" option when you want Nordic MT cheap and quick.
Part of the Bifrost Nordic-translation family from NodeNestor. Same tokenizer and prompt format as the teacher.
Results — FLORES-200 devtest, chrF++ (sacrebleu, n=200/direction)
Overall chrF++ = 54.5 against the 1.2B teacher's 58.1 (about 94% of the teacher's score), closing ~60% of the gap to the teacher at ~⅓ the parameters.
| Direction group | Flash 430M | Teacher 1.2B |
|---|---|---|
| English → Nordic | 53.1 | 57.4 |
| Nordic → English | 60.9 | 63.6 |
| Nordic ↔ Nordic | 50.7 | 54.5 |
| Overall | 54.5 | 58.1 |
Per-direction (chrF++):
| Direction | chrF++ | Direction | chrF++ |
|---|---|---|---|
| en→sv | 61.5 | sv→en | 65.2 |
| en→da | 62.2 | da→en | 66.0 |
| en→nb | 55.6 | nb→en | 63.9 |
| en→nn | 55.0 | nn→en | 67.5 |
| en→fi | 42.7 | fi→en | 49.9 |
| en→is | 41.8 | is→en | 52.9 |
Strong into English (50–68) and across the Scandinavian pairs. Weakest out of English into Finnish and Icelandic (the low-resource legs), where off-target output (text in the wrong language) is also elevated.
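As a quick consistency check, the two English-centric group averages can be recomputed from the per-direction table. The sketch below hard-codes the scores listed above. The dictionary and helper names are illustrative and not part of the release, and the Nordic ↔ Nordic group is left out because its per-direction scores are not listed here.

```python
# Per-direction chrF++ for Flash 430M, copied from the table above.
EN_TO_NORDIC = {"sv": 61.5, "da": 62.2, "nb": 55.6, "nn": 55.0, "fi": 42.7, "is": 41.8}
NORDIC_TO_EN = {"sv": 65.2, "da": 66.0, "nb": 63.9, "nn": 67.5, "fi": 49.9, "is": 52.9}


def mean(scores):
    """Unweighted mean over translation directions."""
    values = list(scores.values())
    return sum(values) / len(values)


en_to_nordic = round(mean(EN_TO_NORDIC), 1)
nordic_to_en = round(mean(NORDIC_TO_EN), 1)
# Share of the teacher's overall score (58.1) that the student (54.5) retains.
retained = round(100 * 54.5 / 58.1, 1)

print(en_to_nordic, nordic_to_en, retained)  # 53.1 60.9 93.8
```

Both averages match the group table, which suggests the group rows are unweighted means over directions.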
Usage
The weights ship as model.safetensors with a self-contained pure-PyTorch implementation in modeling_flash.py (no external deps beyond torch). The prompt uses a control-token format: [BOS] [<2{tgt}>] {source_ids} [<eos_src>]. Generate until [EOS], then decode only ids < 65000 (ids from 65000 up are control tokens).
Standalone:
```python
import torch
import sentencepiece as spm
from modeling_flash import NordicFlash

# Tokenizer shipped with the model.
sp = spm.SentencePieceProcessor(model_file="nordic_unigram_65k.model")
# Target-language control-token ids (<2en> ... <2is>).
LANG = {"en": 65000, "sv": 65001, "da": 65002, "nb": 65003, "nn": 65004, "fi": 65005, "is": 65006}
m = NordicFlash.from_checkpoint("model.safetensors", device="cuda")
src_ids = sp.encode("Hello, how are you?", out_type=int)
print(sp.decode(m.translate(src_ids, LANG["sv"])))
# -> Hej, hur är du idag?
```
HuggingFace (trust_remote_code):
```python
import torch
import sentencepiece as spm
from transformers import AutoModelForCausalLM

sp = spm.SentencePieceProcessor(model_file="nordic_unigram_65k.model")
m = AutoModelForCausalLM.from_pretrained(".", trust_remote_code=True, dtype=torch.bfloat16).cuda().eval()
# Prompt layout: [BOS]=1, <2sv>=65001, source ids, <eos_src>=65007.
ids = [1, 65001] + sp.encode("Hello, how are you?", out_type=int) + [65007]
out = m.generate(torch.tensor([ids]).cuda(), max_new_tokens=128, do_sample=False, eos_token_id=2)
# Decode only the newly generated ids, skipping control tokens (ids >= 65000).
print(sp.decode([t for t in out[0, len(ids):].tolist() if t < 65000]))
```
Control-token ids: <2en>=65000, <2sv>=65001, <2da>=65002, <2nb>=65003,
<2nn>=65004, <2fi>=65005, <2is>=65006, <eos_src>=65007; [BOS]=1, [EOS]=2.
Run in bf16.
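The prompt layout and the output filtering can be wrapped in two small helpers. This is a minimal sketch based only on the ids listed above: build_prompt and clean_output are hypothetical names, not functions from modeling_flash.py, and the example ids are placeholders rather than real tokenizer output.

```python
BOS, EOS, EOS_SRC = 1, 2, 65007
FIRST_CONTROL_ID = 65000  # ids from here up are control tokens, not text
LANG = {"en": 65000, "sv": 65001, "da": 65002, "nb": 65003,
        "nn": 65004, "fi": 65005, "is": 65006}


def build_prompt(source_ids, tgt):
    """Return [BOS] [<2tgt>] source_ids [<eos_src>] as a flat list of ids."""
    if tgt not in LANG:
        raise ValueError(f"unsupported target language: {tgt!r}")
    return [BOS, LANG[tgt], *source_ids, EOS_SRC]


def clean_output(generated_ids):
    """Cut at the first [EOS], then drop any remaining control-token ids."""
    kept = []
    for token in generated_ids:
        if token == EOS:
            break
        if token < FIRST_CONTROL_ID:
            kept.append(token)
    return kept


prompt = build_prompt([812, 45, 9], "sv")    # placeholder source ids
print(prompt)                                # [1, 65001, 812, 45, 9, 65007]
print(clean_output([301, 65001, 77, 2, 5]))  # [301, 77]
```

Stopping at the first [EOS] is a choice made for this sketch. The snippets above rely on eos_token_id to end generation and only filter ids below 65000.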
Model details
- Hybrid decoder, ~430M params. 18 layers in a [dynamic_conv, dynamic_conv, gqa]×6 pattern: data-dependent causal depthwise convolution (local mixing) interleaved with grouped-query attention every 3rd layer (global mixing).
- DynaConv layers: per-token softmax kernel (14 taps, 80 kernels × 16 channels), silu gate.
- GQA layers: 16 query / 4 KV heads, head_dim 80, partial rotary (first 25%).
- SwiGLU FFN (3584), RMSNorm, parallel residual, hidden 1280, tied embeddings.
- Context 4096, bf16, vocab 65008 (nordic_unigram_65k SentencePiece).
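To make the layer layout concrete, the sketch below first checks that the numbers in the list are mutually consistent, then runs a toy per-token causal depthwise convolution. It is an illustrative reading of the description, not the code in modeling_flash.py. The function names are hypothetical, the kernel logits are passed in directly rather than predicted from each token by a learned projection, missing history is assumed to be zero, and the silu gate is omitted.

```python
import math

# Numbers quoted in the list above.
HIDDEN, N_LAYERS = 1280, 18
KERNELS, CHANNELS_PER_KERNEL, TAPS = 80, 16, 14
Q_HEADS, KV_HEADS, HEAD_DIM = 16, 4, 80

# Two convolution layers followed by one attention layer, repeated six times.
PATTERN = ["dynamic_conv", "dynamic_conv", "gqa"] * 6
assert len(PATTERN) == N_LAYERS
assert KERNELS * CHANNELS_PER_KERNEL == HIDDEN  # each channel maps to one kernel
assert Q_HEADS * HEAD_DIM == HIDDEN
ROTARY_DIMS = HEAD_DIM // 4            # "first 25%" of each head: 20 dims
QUERIES_PER_KV = Q_HEADS // KV_HEADS   # 4 query heads share one KV head


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_causal_conv(x, kernel_logits, channels_per_kernel):
    """x: [T][C] activations; kernel_logits: [T][K][taps], one set per token.

    Channel c uses kernel c // channels_per_kernel, and tap j reads the
    input j steps in the past, so position t never sees positions > t.
    Positions before the start contribute zero (an assumption).
    """
    out = [[0.0] * len(x[0]) for _ in x]
    for t, row in enumerate(x):
        weights = [softmax(k) for k in kernel_logits[t]]
        for c in range(len(row)):
            w = weights[c // channels_per_kernel]
            out[t][c] = sum(w[j] * x[t - j][c] for j in range(len(w)) if j <= t)
    return out


# Toy run: 3 positions, 2 channels sharing 1 kernel with 2 uniform taps.
x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
logits = [[[0.0, 0.0]] for _ in x]
y = dynamic_causal_conv(x, logits, channels_per_kernel=2)
print(y)  # [[0.5, 5.0], [1.5, 15.0], [2.5, 25.0]]
```

With uniform taps each output is a windowed average of the current and previous inputs. In the described layers the tap weights would differ per token and per kernel.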
Training
- Distilled from Bifrost 1.2B via full-probability top-32 logit KL.
- Data (for the teacher): parallel + monolingual Nordic/English (Wikipedia parallel, DCLM en↔Nordic, Aya cross-lingual, FineWeb-Edu, Nemotron-CC).
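One plausible reading of "full-probability top-32 logit KL" is that teacher and student distributions are both normalized over the full vocabulary, and the KL sum is then restricted to the teacher's 32 most likely tokens. The sketch below implements that reading in plain Python for a single position. It is an assumption about the objective, not the training code, and topk_kl is a hypothetical name.

```python
import math


def log_softmax(logits):
    """Log-probabilities normalized over the whole vocabulary."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_z for v in logits]


def topk_kl(teacher_logits, student_logits, k=32):
    """Sum p_t * (log p_t - log p_s) over the teacher's k most likely tokens."""
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    top = sorted(range(len(t_logp)), key=t_logp.__getitem__, reverse=True)[:k]
    return sum(math.exp(t_logp[i]) * (t_logp[i] - s_logp[i]) for i in top)


teacher = [2.0, 1.0, 0.5, -1.0, -3.0]      # toy 5-token vocabulary
print(topk_kl(teacher, teacher, k=3))      # identical distributions give 0.0
print(topk_kl(teacher, [0.0] * 5, k=3))    # a uniform student is penalized
```

Because the sum is truncated to k tokens, it is not guaranteed to be non-negative the way the full KL divergence is. With k equal to the vocabulary size it reduces to ordinary KL.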
Limitations
- Smaller/faster than the teacher → lower quality, especially en→Finnish / Icelandic (elevated off-target there).
- 4096-token context; greedy decoding; not instruction-tuned.
Acknowledgments
- Tokenizer (nordic_unigram_65k) developed by a collaborator; included here with permission.
- Distilled from Bifrost 1.2B.
Citation
@misc{nodenestor_bifrost_flash_2026,
title = {Bifrost Flash 430M},
author = {Nilsson, Ludvig},
year = {2026},
howpublished = {\url{https://huggingface.co/NodeNestor/bifrost-flash-430m}},
note = {NodeNestor; distilled from Bifrost 1.2B}
}