CA byte-LM translator: x2en

A vocabulary-free, UTF-8 byte-level, decoder-only translator with a causal Neural-Cellular-Automaton front-end. It is fine-tuned for the x2en direction from the pretrained base sujayrittikar/ca-byte-lm-indic6 as a prefix-LM: the model continues a prompt, and the loss is computed only on the target span. Built for the WMT-2026 Indic MT research track. Covers Assamese, Khasi, Manipuri, Meitei-Mayek, Mizo and Nyishi ↔ English, and is the only system for Khasi, Nyishi and Meitei-Mayek.
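To make "vocabulary-free" and "target-span loss" concrete, here is a minimal sketch of how one training example could be laid out as bytes. The prompt template and the helper names (`encode_bytes`, `decode_bytes`, `build_example`) are assumptions for illustration only; the template actually used for fine-tuning is not stated in this card.

```python
def encode_bytes(text: str) -> list[int]:
    """Map text to UTF-8 byte ids in 0..255; no vocabulary file is needed."""
    return list(text.encode("utf-8"))


def decode_bytes(ids: list[int]) -> str:
    """Inverse of encode_bytes; invalid byte sequences become U+FFFD."""
    return bytes(ids).decode("utf-8", errors="replace")


def build_example(src_text: str, tgt_text: str, src_lang: str, tgt_lang: str):
    """Return (input_ids, loss_mask) for one prefix-LM training example.

    The prompt format below is hypothetical, not the one used by the model.
    """
    prompt = f"{src_lang} to {tgt_lang}: {src_text}\n"
    prompt_ids = encode_bytes(prompt)
    target_ids = encode_bytes(tgt_text)
    input_ids = prompt_ids + target_ids
    # Target-span loss: 0 on prompt bytes, 1 on target bytes.
    loss_mask = [0] * len(prompt_ids) + [1] * len(target_ids)
    return input_ids, loss_mask


if __name__ == "__main__":
    ids, mask = build_example(
        "Ka sorkar ka la pynbna ia ka jingiaseng thymmai.",
        "<English reference goes here>",  # placeholder, not a real translation
        "Khasi",
        "English",
    )
    print(len(ids), sum(mask))
```

Because every input is a byte sequence, scripts such as Bengali-Assamese or Meitei-Mayek need no special handling; they simply use more bytes per character than Latin-script text.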

Dev chrF++ (best checkpoint @ step 60000)

Language    chrF++
Assamese    21.98
Manipuri    27.66
Mizo        36.73
Khasi       26.43
Nyishi      76.03

Dev numbers are measured on the 2025 test set, which was folded into training, so they are indicative only and not held-out results.

Usage

from ca_byte_lm import from_hub, translate
model, cfg, meta = from_hub("sujayrittikar/ca-byte-mt-x2en", device="cuda")
print(translate(model, cfg, meta, "Ka sorkar ka la pynbna ia ka jingiaseng thymmai.", "Khasi", "English", device="cuda"))
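For more than one sentence, a small wrapper around the call above may be convenient. The sketch below is an assumption rather than part of `ca_byte_lm`: `translate_many` is a hypothetical helper that accepts any single-argument callable, so the model call is bound once and reused.

```python
from typing import Callable, Iterable


def translate_many(translate_fn: Callable[[str], str],
                   sentences: Iterable[str]) -> list[str]:
    """Apply translate_fn to each sentence, one at a time.

    Whitespace is stripped first; blank lines map to "" without calling
    translate_fn, so the output stays aligned with the input.
    """
    results = []
    for sentence in sentences:
        sentence = sentence.strip()
        results.append(translate_fn(sentence) if sentence else "")
    return results


# Intended use with the objects loaded above (not run here):
#   kha2en = lambda s: translate(model, cfg, meta, s, "Khasi", "English", device="cuda")
#   outputs = translate_many(kha2en, open("input.kha", encoding="utf-8"))
```

Keeping one output per input line, including blanks, makes it easier to pair hypotheses with references when scoring.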

Weights: ca_byte_lm.pt; architecture: ca_byte_lm.py; config: config.json.
