CA-front-end byte-LM β multilingual pretraining (NE-Indian languages), 61M
A vocabulary-free, UTF-8-byte, decoder-only language model with a causal Neural Cellular Automaton (NCA) front-end, pretrained from scratch jointly on six low-resource North-East Indian languages (Assamese, Khasi, Manipuri/Meiteilon in Bengali & Meitei-Mayek scripts, Mizo, Nyishi) + English. WMT-2026 Indic MT research track. ~61M params with an enlarged NCA front-end (hidden=1536, steps=12, kernel=5), trained on a ~198 MB byte stream (parallel + scraped monolingual).
Idea: one weight-shared local NCA update rule iterated K=12 steps, perceiving left neighbours only (causal β no future-byte leakage in a next-byte LM), to absorb unseen scripts (Khasi, Nyishi) that no pretrained model covers.
- Params: ~61M (d_model=768, layers=8, heads=12, ff=3072, seq_len=512)
- Vocab: 266 (256 byte values + PAD/BOS/EOS + 7 language tags)
- Data: parallel-corpus target text + general-domain monolingual (Wikipedia + GlotCC); Nyishi is parallel-only (the web has almost no Nyishi text).
Validation bits-per-byte (lower = better) β best @ step 39000
| Language | val bpb | byte-perplexity |
|---|---|---|
| English | 1.399 | 2.637 |
| Assamese | 0.644 | 1.563 |
| Khasi | 1.116 | 2.167 |
| Manipuri | 0.571 | 1.486 |
| Meitei-Mayek | 0.743 | 1.674 |
| Mizo | 1.4 | 2.639 |
| Nyishi | 1.912 | 3.763 |
Mean val bpb: 1.112. (bpb = cross-entropy in bits per UTF-8 byte.) Note: mean bpb is ~1.11 across the 26M / 59M / this ~61M variant β these languages are data-limited, not capacity-limited. Scaling the model and adding monolingual data mainly improved the language that received the most new text (Assamese); Khasi/Nyishi have almost no extra web data.
Usage
# pip install torch huggingface_hub; download ca_byte_lm.py from this repo
from ca_byte_lm import from_hub, generate
model, cfg, meta = from_hub("sujayrittikar/ca-byte-lm-indic6", device="cuda")
print(generate(model, cfg, meta["lang_tag"]["Khasi"], device="cuda"))
Architecture in ca_byte_lm.py; weights in ca_byte_lm.pt.
Status & limitations
A pretrained LM, not a translator yet β translation-tuning (prompted continuation on parallel data) is the downstream step. bpb measures per-script LM quality, not translation. Trained on CC BY-SA (Wikipedia) + CommonCrawl-derived (GlotCC) text; inherits their biases.
- Downloads last month
- 50