CA-front-end byte-LM: multilingual pretraining (NE-Indian languages), 61M

A vocabulary-free, UTF-8-byte, decoder-only language model with a causal Neural Cellular Automaton (NCA) front-end, pretrained from scratch jointly on six low-resource North-East Indian languages (Assamese, Khasi, Manipuri/Meiteilon in Bengali & Meitei-Mayek scripts, Mizo, Nyishi) + English. WMT-2026 Indic MT research track. ~61M params with an enlarged NCA front-end (hidden=1536, steps=12, kernel=5), trained on a ~198 MB byte stream (parallel + scraped monolingual).

Idea: one weight-shared local NCA update rule iterated K=12 steps, perceiving left neighbours only (causal, so no future-byte leakage in a next-byte LM), to absorb unseen scripts (Khasi, Nyishi) that no pretrained model covers.
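
The causal update described above can be sketched in plain Python. This is a minimal illustration, not the code in ca_byte_lm.py: the residual tanh rule, the 4-channel state, the random weights, and the function names (causal_nca_step, run_causal_nca, make_demo) are all assumptions chosen for readability. Only kernel=5, steps=12, weight sharing, and left-only perception come from the description.

```python
import math
import random

def causal_nca_step(state, weights, kernel):
    """One update: each cell sees itself and its (kernel - 1) left neighbours."""
    channels = len(state[0])
    zero = [0.0] * channels
    new_state = []
    for t in range(len(state)):
        # Window covers positions t-(kernel-1) .. t; missing cells are zero-padded.
        window = [state[t - off] if t - off >= 0 else zero for off in range(kernel)]
        cell = []
        for c_out in range(channels):
            pre = sum(
                weights[off][c_out][c_in] * window[off][c_in]
                for off in range(kernel)
                for c_in in range(channels)
            )
            # Residual update (an assumption of this sketch).
            cell.append(state[t][c_out] + math.tanh(pre))
        new_state.append(cell)
    return new_state

def run_causal_nca(state, weights, kernel=5, steps=12):
    # The same weights are reused at every position and every step.
    for _ in range(steps):
        state = causal_nca_step(state, weights, kernel)
    return state

def make_demo(seq_len=16, channels=4, kernel=5, seed=0):
    # Toy sizes and random weights, purely for illustration.
    rng = random.Random(seed)
    weights = [[[rng.uniform(-0.3, 0.3) for _ in range(channels)]
                for _ in range(channels)] for _ in range(kernel)]
    state = [[rng.uniform(-1, 1) for _ in range(channels)] for _ in range(seq_len)]
    return state, weights

state, weights = make_demo()
out = run_causal_nca(state, weights)

# Perturb the last position only; every earlier output should be unchanged.
perturbed = [row[:] for row in state]
perturbed[-1] = [v + 1.0 for v in perturbed[-1]]
out_perturbed = run_causal_nca(perturbed, weights)
causal_ok = out[:-1] == out_perturbed[:-1]
print("earlier positions unaffected by a future change:", causal_ok)
```

Under these assumptions each step widens the receptive field by kernel - 1 = 4 positions, so 12 steps let a cell draw on up to 48 bytes to its left.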

  • Params: ~61M (d_model=768, layers=8, heads=12, ff=3072, seq_len=512)
  • Vocab: 266 (256 byte values + PAD/BOS/EOS + 7 language tags)
  • Data: parallel-corpus target text + general-domain monolingual (Wikipedia + GlotCC); Nyishi is parallel-only (the web has almost no Nyishi text).
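
The vocabulary arithmetic above (256 + 3 + 7 = 266) can be illustrated with a small encoder. This is a sketch under assumptions: the id order (bytes first, then PAD/BOS/EOS, then language tags), the tag order, the BOS-then-tag prefix, and the helper names encode and decode are hypothetical and may differ from ca_byte_lm.py.

```python
# Assumed id layout: 0..255 raw bytes, then special tokens, then language tags.
PAD, BOS, EOS = 256, 257, 258
LANGS = ["English", "Assamese", "Khasi", "Manipuri", "Meitei-Mayek", "Mizo", "Nyishi"]
LANG_TAG = {name: 259 + i for i, name in enumerate(LANGS)}  # hypothetical ids
VOCAB_SIZE = 256 + 3 + len(LANGS)

def encode(text, lang):
    """Map text to ids: BOS, language tag, UTF-8 bytes, EOS."""
    return [BOS, LANG_TAG[lang]] + list(text.encode("utf-8")) + [EOS]

def decode(ids):
    """Drop special and tag ids, then decode the remaining bytes."""
    return bytes(i for i in ids if i < 256).decode("utf-8", errors="replace")

ids = encode("Khublei", "Khasi")
print(VOCAB_SIZE, len(ids), decode(ids))

# Bengali-script and Meitei-Mayek characters take 3 bytes each in UTF-8,
# so those languages spend more positions per character than Latin-script text.
print(len("অ".encode("utf-8")))
```

Because every id below 256 is a raw byte, no tokenizer has to be trained or shipped, and any script can be represented without out-of-vocabulary symbols.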

Validation bits-per-byte (lower = better), best @ step 39000

Language        val bpb   byte-perplexity
English         1.399     2.637
Assamese        0.644     1.563
Khasi           1.116     2.167
Manipuri        0.571     1.486
Meitei-Mayek    0.743     1.674
Mizo            1.400     2.639
Nyishi          1.912     3.763

Mean val bpb: 1.112. (bpb = cross-entropy in bits per UTF-8 byte.) Note: mean bpb stays at ~1.11 across the 26M, 59M, and this ~61M variant, which suggests these languages are data-limited, not capacity-limited. Scaling the model and adding monolingual data mainly improved the language that received the most new text (Assamese); Khasi and Nyishi have almost no extra web data.
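
The two table columns are related by byte-perplexity = 2^bpb, and the reported mean matches an unweighted average over the seven rows. The short check below reproduces both from the table values; the helper names are illustrative, and bpb_from_nats assumes a training loss in nats averaged per byte.

```python
import math

# Validation bpb values copied from the table above.
VAL_BPB = {
    "English": 1.399, "Assamese": 0.644, "Khasi": 1.116, "Manipuri": 0.571,
    "Meitei-Mayek": 0.743, "Mizo": 1.400, "Nyishi": 1.912,
}

def byte_perplexity(bpb):
    """Perplexity per byte for a cross-entropy given in bits per byte."""
    return 2.0 ** bpb

def bpb_from_nats(loss_nats):
    """Convert a per-byte cross-entropy in nats (natural log) to bits."""
    return loss_nats / math.log(2)

mean_bpb = sum(VAL_BPB.values()) / len(VAL_BPB)  # unweighted mean over rows
print(f"mean val bpb: {mean_bpb:.3f}")
for lang, bpb in VAL_BPB.items():
    print(f"{lang:<13} {bpb:.3f}  {byte_perplexity(bpb):.3f}")
```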

Usage

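# Setup: pip install torch huggingface_hub, then download ca_byte_lm.py from this repo
from ca_byte_lm import from_hub, generate

# Load the model, its config, and metadata from the Hub onto the GPU
model, cfg, meta = from_hub("sujayrittikar/ca-byte-lm-indic6", device="cuda")
# Sample text conditioned on the Khasi language tag
print(generate(model, cfg, meta["lang_tag"]["Khasi"], device="cuda"))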

Architecture in ca_byte_lm.py; weights in ca_byte_lm.pt.

Status & limitations

A pretrained LM, not a translator yet: translation-tuning (prompted continuation on parallel data) is the downstream step. bpb measures per-script LM quality, not translation quality. Trained on CC BY-SA (Wikipedia) + CommonCrawl-derived (GlotCC) text; inherits their biases.
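
One way the "prompted continuation on parallel data" step could be framed is sketched below. Everything here is hypothetical: the layout (BOS, source tag, source bytes, target tag, target bytes, EOS), the token ids, the loss mask restricted to the target side, and the name build_example are assumptions for illustration, not the author's recipe.

```python
# Assumed id layout: 0..255 raw bytes, then special tokens, then language tags.
PAD, BOS, EOS = 256, 257, 258
LANGS = ["English", "Assamese", "Khasi", "Manipuri", "Meitei-Mayek", "Mizo", "Nyishi"]
LANG_TAG = {name: 259 + i for i, name in enumerate(LANGS)}  # hypothetical ids

def build_example(src_text, src_lang, tgt_text, tgt_lang):
    """Pack one sentence pair into a single sequence for next-byte training."""
    prompt = ([BOS, LANG_TAG[src_lang]]
              + list(src_text.encode("utf-8"))
              + [LANG_TAG[tgt_lang]])
    target = list(tgt_text.encode("utf-8")) + [EOS]
    ids = prompt + target
    # 1 marks tokens that count toward the loss: only the translation and EOS.
    loss_mask = [0] * len(prompt) + [1] * len(target)
    return ids, loss_mask

ids, loss_mask = build_example("Thank you", "English", "Khublei", "Khasi")
print(len(ids), sum(loss_mask))
```

At inference time, the same prompt prefix (ending in the target-language tag) would be fed to the model and bytes generated until EOS.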
