CA byte-LM translator — x2en

A vocabulary-free, decoder-only translator that operates directly on UTF-8 bytes, with a causal Neural-Cellular-Automaton front-end. It is fine-tuned from the pretrained base sujayrittikar/ca-byte-lm-indic6 for the x2en (into-English) direction, using prompted continuation in a prefix-LM setup with the loss computed only on the target span. Built for the WMT-2026 Indic MT research track. It covers Assamese, Khasi, Manipuri, Meitei-Mayek, Mizo and Nyishi ↔ English, and is the only system for Khasi, Nyishi and Meitei-Mayek.
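
To make "vocabulary-free" and "target-span loss" concrete, here is a minimal sketch of how a training example could be laid out. The prompt template, the function names and the placeholder target string are assumptions for illustration only; the actual format is defined in ca_byte_lm.py and may differ.

```python
def encode_bytes(text: str) -> list[int]:
    """Map text to its UTF-8 byte values (0..255); no vocabulary file is needed."""
    return list(text.encode("utf-8"))


def build_example(source: str, target: str, src_lang: str, tgt_lang: str = "English"):
    """Return (input_ids, loss_mask) for one prompted-continuation example.

    NOTE: the prompt template below is hypothetical, not the model's real one.
    """
    prompt = f"{src_lang}: {source}\n{tgt_lang}: "
    prompt_ids = encode_bytes(prompt)
    target_ids = encode_bytes(target)
    input_ids = prompt_ids + target_ids
    # Target-span loss: 0 on prompt bytes (conditioning only), 1 on target bytes.
    loss_mask = [0] * len(prompt_ids) + [1] * len(target_ids)
    return input_ids, loss_mask


if __name__ == "__main__":
    # The target here is a placeholder string, not a real reference translation.
    ids, mask = build_example("Ka sorkar", "placeholder target", "Khasi")
    print(len(ids), sum(mask))
```

Because every input is a byte sequence, scripts such as Assamese or Meitei-Mayek need no tokenizer training; the cost is longer sequences, since one character can span several bytes.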

Dev chrF++ (best checkpoint @ step 60000)

language	chrF++
Assamese	21.98
Manipuri	27.66
Mizo	36.73
Khasi	26.43
Nyishi	76.03

Dev numbers are measured on the 2025 test set, which was folded into training, so they are indicative only and not held-out.

Usage

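# Assumes ca_byte_lm.py from this repository is on the import path.
from ca_byte_lm import from_hub, translate

# Load the weights, config and metadata for the x2en model.
model, cfg, meta = from_hub("sujayrittikar/ca-byte-mt-x2en", device="cuda")

# Translate one Khasi sentence into English.
print(translate(model, cfg, meta, "Ka sorkar ka la pynbna ia ka jingiaseng thymmai.", "Khasi", "English", device="cuda"))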

Weights: ca_byte_lm.pt; architecture: ca_byte_lm.py; config: config.json.
