NLLB-200-600M · Abkhaz → Russian (AB→RU) — back-translator

Fine-tuned facebook/nllb-200-distilled-600M for Abkhaz (apsua) → Russian translation. Its primary purpose was to serve as the back-translation model for a low-resource RU→AB pipeline: it converts the large monolingual Abkhaz corpus into synthetic Russian, creating extra (synthetic-RU, real-AB) training pairs.

Built for the Yandex Data Dojo 2026 low-resource MT track.

Results

Metric | Value
Clean held-out AB→RU (sentence-BLEU, beam=4) | 18.99
Earlier undertrained checkpoint (ckpt-2250) | 11.99

Improving this model from 11.99 to 18.99 sentence-BLEU (+7.00) made the synthetic Russian fluent and grammatical. For the downstream RU→AB model, that was a larger lever than domain filtering.

Usage

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

REPO = "audiosurffer0/Ab_ru_dojo26_7000check"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"   # fall back to CPU when no GPU is present
tok = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForSeq2SeqLM.from_pretrained(REPO).to(DEVICE).eval()

tok.src_lang = "abk_Cyrl"                          # Abkhaz source (vocab id 256230)
rus = tok.convert_tokens_to_ids("rus_Cyrl")        # target language id, forced as the first decoded token

def translate(text):
    # Truncate long inputs to 128 tokens, then decode with beam search (beam=4, no sampling).
    enc = tok(text, return_tensors="pt", truncation=True, max_length=128).to(DEVICE)
    out = model.generate(**enc, forced_bos_token_id=rus,
                         max_new_tokens=128, num_beams=4, do_sample=False)
    return tok.decode(out[0], skip_special_tokens=True).replace("rus_Cyrl", "").strip()

Note: the target is rus_Cyrl, which is a proper special token, so the output is clean (the .replace above is just defensive). The Abkhaz abk_Cyrl source token is a regular vocab token (id 256230) added via tokenizer surgery — see the companion RU→AB card.
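
Back-translation with this model amounts to pairing each real Abkhaz sentence with its machine-translated Russian. The following is a minimal sketch of that pairing step, not the pipeline actually used for the contest: the name build_synthetic_pairs, the blank-line skip, and the empty-output filter are all illustrative assumptions, and translate_fn stands for any callable that maps one Abkhaz string to one Russian string.

```python
from typing import Callable, Iterable, List, Tuple


def build_synthetic_pairs(
    ab_sentences: Iterable[str],
    translate_fn: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Return (synthetic-RU, real-AB) pairs for RU→AB training.

    Hypothetical helper: translate_fn is any AB→RU callable.
    """
    pairs: List[Tuple[str, str]] = []
    for ab in ab_sentences:
        ab = ab.strip()
        if not ab:
            continue  # skip blank lines in the monolingual corpus
        ru = translate_fn(ab).strip()
        if ru:  # assumed filter: drop sentences with empty model output
            pairs.append((ru, ab))
    return pairs
```

Passing the translation function in as an argument keeps the pairing logic independent of the model, so it can be checked with a stub before spending GPU time on the full monolingual corpus.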

Training data & procedure

  • Real parallel AB↔RU corpus only (~185K pairs). ⚠️ This model was trained exclusively on real data — mixing in back-translation here would be self-poisoning (its Russian side is synthetic and circular).
  • Tokenizer: same surgical NLLB-200 tokenizer (abk_Cyrl + 26 Abkhaz chars in tokenizer.json, vocab 256231).
  • Full fine-tune (LoRA fails on tied NLLB embeddings), warm-start, checkpoint 7000 (epoch ≈ 0.61, loss 9.15 → 7.39, single GPU).
  • Tokenization: x = tok(ab), labels = tok(text_target=ru), forced_bos = rus_Cyrl.
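
The tokenization bullet above can be sketched as a single preprocessing function. This is an illustration under stated assumptions rather than the training script itself: the function name preprocess, the column names "ab" and "ru", and max_length=128 (borrowed from the usage snippet) are hypothetical. It relies on the Hugging Face tokenizer call accepting text_target, which returns the target ids under the "labels" key.

```python
from typing import Any, Dict, List


def preprocess(batch: Dict[str, List[str]], tok: Any, max_length: int = 128) -> Dict[str, Any]:
    """Tokenize Abkhaz sources and Russian targets in one tokenizer call.

    batch["ab"] and batch["ru"] are assumed column names for the parallel data.
    """
    tok.src_lang = "abk_Cyrl"   # language tag the tokenizer adds to the inputs
    tok.tgt_lang = "rus_Cyrl"   # language tag the tokenizer adds to the labels
    return tok(
        batch["ab"],
        text_target=batch["ru"],   # produces the "labels" field
        truncation=True,
        max_length=max_length,     # assumed limit, not confirmed for training
    )
```

With a datasets.Dataset this would typically be applied via ds.map(lambda b: preprocess(b, tok), batched=True); padding is left to the data collator.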

Limitations & bias

  • Training corpus is heavily religious/biblical → register bias.
  • Checkpoint 7000 is strong but not fully converged (it covers only about 0.61 of an epoch); good enough for back-translation, not for production.
  • Inherits NLLB-200 limitations and its CC-BY-NC-4.0 (non-commercial) license.

Companion model

This model produced the back-translation data for the RU→AB model (audiosurffer0/nllb-600m-ru-ab-dojo26), which reached 9.98 sentence-BLEU on the contest test set.

Citation / license

Base model © Meta AI (NLLB-200), CC-BY-NC-4.0. This derivative is released under the same non-commercial license. Educational/research use.
