NE-Trans v0 — 600M
MWire Labs | Kren Stack
Baseline validation checkpoint for NE-Trans, a multilingual machine translation (MT) system for Northeast Indian languages. Built on NLLB-200-distilled-600M with LoRA fine-tuning.
Languages
- Bodo (brx_Deva) — existing NLLB token
- Kokborok (trp_Latn) — new token added
- Khasi (kha_Latn) — new token added
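One way the new language codes could be registered is sketched below, assuming the Hugging Face `transformers` API and the `facebook/nllb-200-distilled-600M` base checkpoint. The helper names `missing_codes` and `add_language_tokens` are illustrative and not part of this release, and the actual procedure used for this checkpoint may differ.

```python
# Language codes from the list above.
WANTED_CODES = ["brx_Deva", "trp_Latn", "kha_Latn"]


def missing_codes(vocab, wanted=WANTED_CODES):
    """Return the codes in `wanted` that are absent from `vocab`, in order."""
    return [code for code in wanted if code not in vocab]


def add_language_tokens(base_model="facebook/nllb-200-distilled-600M"):
    """Load the base model and register any missing language codes (sketch)."""
    # Imported lazily so missing_codes() works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSeq2SeqLM.from_pretrained(base_model)

    # Only codes the tokenizer does not already know are added.
    new_codes = missing_codes(tokenizer.get_vocab())
    tokenizer.add_tokens(new_codes, special_tokens=True)
    # The embedding matrix grows by one row per new token; those rows
    # carry no pretrained values for the new languages.
    model.resize_token_embeddings(len(tokenizer))
    return tokenizer, model, new_codes
```

Checking the vocabulary first keeps the sketch independent of which codes the base tokenizer already ships with.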
Training
- Data: WMT 2025 official training data only (no augmentation)
- Bodo: 12,129 pairs | Kokborok: 2,152 pairs | Khasi: 24,699 pairs
- Both directions (en→X and X→en) jointly trained
- LoRA r=16, alpha=32, target: q_proj/v_proj
- 10 epochs, batch size 32, lr 5e-4
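The settings above can be written down as a small configuration sketch. It assumes the `peft` library for the LoRA config, that every sentence pair yields one example per direction, and that batch size 32 is the global batch with no gradient accumulation. None of these assumptions is confirmed by this card, and the function names are illustrative.

```python
import math

# Values copied from the training list above.
PAIRS = {"bodo": 12_129, "kokborok": 2_152, "khasi": 24_699}
LORA = {"r": 16, "lora_alpha": 32, "target_modules": ["q_proj", "v_proj"]}
EPOCHS, BATCH_SIZE, LEARNING_RATE = 10, 32, 5e-4


def training_examples(pairs=PAIRS, directions=2):
    """Examples per epoch, assuming each pair is used once per direction."""
    return directions * sum(pairs.values())


def optimizer_steps(examples, batch_size=BATCH_SIZE, epochs=EPOCHS):
    """Total steps, assuming no gradient accumulation and a partial last batch."""
    return math.ceil(examples / batch_size) * epochs


def build_lora_config():
    """Build a peft LoraConfig from the values above (requires peft)."""
    # Imported lazily so the arithmetic helpers work without peft installed.
    from peft import LoraConfig, TaskType

    return LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, **LORA)
```

Under these assumptions the run covers 77,960 examples per epoch and 24,370 optimizer steps in total.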
Val BLEU (WMT 2025 val split, no augmentation)
| Direction | BLEU | 2025 Best |
|---|---|---|
| en→bodo | 21.91 | 19.71 |
| bodo→en | 32.19 | 21.68 |
| en→kokborok | 5.14 | 6.90 |
| kokborok→en | 10.06 | 2.99 |
| en→khasi | 24.88 | 10.81 |
| khasi→en | 24.99 | 14.26 |
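To read the table at a glance, the gap to the "2025 Best" column can be computed per direction. The snippet below only restates the numbers from the table. The names `ROWS`, `deltas`, and `below_best` are illustrative.

```python
# (direction, validation BLEU, "2025 Best") copied from the table above.
ROWS = [
    ("en→bodo", 21.91, 19.71),
    ("bodo→en", 32.19, 21.68),
    ("en→kokborok", 5.14, 6.90),
    ("kokborok→en", 10.06, 2.99),
    ("en→khasi", 24.88, 10.81),
    ("khasi→en", 24.99, 14.26),
]


def deltas(rows=ROWS):
    """BLEU minus the '2025 Best' value, rounded to two decimals."""
    return {direction: round(bleu - best, 2) for direction, bleu, best in rows}


def below_best(rows=ROWS):
    """Directions where this checkpoint scores lower than '2025 Best'."""
    return [direction for direction, bleu, best in rows if bleu < best]
```

With the table's values, `below_best()` returns only `en→kokborok`, and the largest gain is on `en→khasi`.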
Notes
- This is a baseline validation run — no back-translation, no QE filtering, no reranker
- en→kokborok underperforms due to extremely limited data (2,152 pairs) and a randomly initialized trp_Latn token
- The full NE-Trans system (3.3B, augmented data, CPT reranker) will follow for the WMT 2026 submission
- Part of the Kren Stack: NE-LID, NE-BERT, NE-ASR, NE-OCR, NE-CLIP, NE-Trans
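For trying the checkpoint, a minimal inference sketch follows. It assumes a `transformers` NLLB-style tokenizer and seq2seq model have already been loaded and are passed in. This card does not state the repository id or whether the published weights are merged or a LoRA adapter, so loading is left to the reader. `LANG_CODES`, `direction_codes`, and `translate` are illustrative names.

```python
# Language codes from this card, plus eng_Latn (the NLLB code for English).
LANG_CODES = {
    "en": "eng_Latn",
    "bodo": "brx_Deva",
    "kokborok": "trp_Latn",
    "khasi": "kha_Latn",
}


def direction_codes(direction):
    """Map a direction such as 'en→khasi' to (source code, target code)."""
    src, tgt = direction.split("→")
    return LANG_CODES[src], LANG_CODES[tgt]


def translate(text, direction, tokenizer, model, max_new_tokens=128):
    """Translate one sentence with an already loaded tokenizer and model."""
    src_code, tgt_code = direction_codes(direction)
    # NLLB tokenizers prepend the source language code to the input.
    tokenizer.src_lang = src_code
    inputs = tokenizer(text, return_tensors="pt")
    # Forcing the first generated token selects the target language.
    output_ids = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_code),
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```

A call would look like `translate("How are you?", "en→khasi", tokenizer, model)`.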
Citation
If you use this model, please cite MWire Labs and the WMT 2026 NE-Trans system paper (forthcoming).