CA-front-end byte-LM — multilingual pretraining (NE-Indian languages), 61M

A vocabulary-free, UTF-8-byte, decoder-only language model with a causal Neural Cellular Automaton (NCA) front-end, pretrained from scratch jointly on six low-resource North-East Indian languages (Assamese, Khasi, Manipuri/Meiteilon in Bengali & Meitei-Mayek scripts, Mizo, Nyishi) + English. WMT-2026 Indic MT research track. ~61M params with an enlarged NCA front-end (hidden=1536, steps=12, kernel=5), trained on a ~198 MB byte stream (parallel + scraped monolingual).

Idea: one weight-shared local NCA update rule iterated K=12 steps, perceiving left neighbours only (causal — no future-byte leakage in a next-byte LM), to absorb unseen scripts (Khasi, Nyishi) that no pretrained model covers.

Params: ~61M (d_model=768, layers=8, heads=12, ff=3072, seq_len=512)
Vocab: 266 (256 byte values + PAD/BOS/EOS + 7 language tags)
Data: parallel-corpus target text + general-domain monolingual (Wikipedia + GlotCC); Nyishi is parallel-only (the web has almost no Nyishi text).

Validation bits-per-byte (lower = better) — best @ step 39000

Language	val bpb	byte-perplexity
English	1.399	2.637
Assamese	0.644	1.563
Khasi	1.116	2.167
Manipuri	0.571	1.486
Meitei-Mayek	0.743	1.674
Mizo	1.4	2.639
Nyishi	1.912	3.763

Mean val bpb: 1.112. (bpb = cross-entropy in bits per UTF-8 byte.) Note: mean bpb is ~1.11 across the 26M / 59M / this ~61M variant — these languages are data-limited, not capacity-limited. Scaling the model and adding monolingual data mainly improved the language that received the most new text (Assamese); Khasi/Nyishi have almost no extra web data.

Usage

# pip install torch huggingface_hub; download ca_byte_lm.py from this repo
from ca_byte_lm import from_hub, generate
model, cfg, meta = from_hub("sujayrittikar/ca-byte-lm-indic6", device="cuda")
print(generate(model, cfg, meta["lang_tag"]["Khasi"], device="cuda"))

Architecture in ca_byte_lm.py; weights in ca_byte_lm.pt.

Status & limitations

A pretrained LM, not a translator yet — translation-tuning (prompted continuation on parallel data) is the downstream step. bpb measures per-script LM quality, not translation. Trained on CC BY-SA (Wikipedia) + CommonCrawl-derived (GlotCC) text; inherits their biases.

Downloads last month: 50