NLLB-200-600M · Abkhaz → Russian (AB→RU) — back-translator
Fine-tuned facebook/nllb-200-distilled-600M
for Abkhaz (apsua) → Russian translation. Its primary purpose was to serve as the
back-translation model for a low-resource RU→AB pipeline: it converts the large monolingual
Abkhaz corpus into synthetic Russian, creating extra (synthetic-RU, real-AB) training pairs.
Built for the Yandex Data Dojo 2026 low-resource MT track.
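The back-translation step described above can be sketched roughly as follows. This is an illustration, not the author's actual pipeline: `build_synthetic_pairs` and its `translate_fn` parameter are hypothetical names, and dropping empty outputs is an assumed filtering step. Any AB→RU callable can be passed in.

```python
from typing import Callable, Iterable, List, Tuple


def build_synthetic_pairs(
    abkhaz_sentences: Iterable[str],
    translate_fn: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Return (synthetic-RU, real-AB) pairs for RU->AB training.

    translate_fn is any AB->RU translator (hypothetical parameter name).
    """
    pairs = []
    for ab in abkhaz_sentences:
        ab = ab.strip()
        if not ab:
            continue  # skip blank lines in the monolingual corpus
        ru = translate_fn(ab).strip()
        if ru:  # assumption: drop sentences with an empty translation
            pairs.append((ru, ab))
    return pairs
```

The Russian side of each pair is model output and the Abkhaz side is real text, which matches the (synthetic-RU, real-AB) orientation needed by the RU→AB model.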
Results
| Metric | Value |
|---|---|
| Clean held-out AB→RU (sentence-BLEU, beam=4) | 18.99 |
| (earlier undertrained checkpoint, ckpt-2250) | 11.99 |
Improving this model from 11.99 → 18.99 (+7 BLEU) made the synthetic Russian fluent and grammatical, which was a larger lever for the downstream RU→AB model than domain filtering.
Usage
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

REPO = "audiosurffer0/Ab_ru_dojo26_7000check"
tok = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForSeq2SeqLM.from_pretrained(REPO).to("cuda").eval()

tok.src_lang = "abk_Cyrl"  # Abkhaz source (vocab id 256230)
rus = tok.convert_tokens_to_ids("rus_Cyrl")

def translate(text):
    enc = tok(text, return_tensors="pt", truncation=True, max_length=128).to("cuda")
    out = model.generate(**enc, forced_bos_token_id=rus,
                         max_new_tokens=128, num_beams=4, do_sample=False)
    return tok.decode(out[0], skip_special_tokens=True).replace("rus_Cyrl", "").strip()
Note: the target is `rus_Cyrl`, which is a proper special token, so the output is clean (the `.replace` above is just defensive). The Abkhaz `abk_Cyrl` source token is a regular vocab token (id 256230) added via tokenizer surgery; see the companion RU→AB card.
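One way to check the distinction drawn in this note is to look up both language codes in the loaded tokenizer. The helper below is a small sketch: `describe_lang_token` is a hypothetical name, and it relies only on the `convert_tokens_to_ids` method and the `all_special_tokens` attribute of Transformers tokenizers. It takes the tokenizer as an argument, so it assumes `tok` has been loaded as in the usage snippet.

```python
def describe_lang_token(tok, token: str) -> dict:
    """Report a language code's vocab id and whether it is registered as special."""
    return {
        "token": token,
        "id": tok.convert_tokens_to_ids(token),
        "is_special": token in tok.all_special_tokens,
    }


# Example (needs the tokenizer from the usage snippet):
# for code in ("abk_Cyrl", "rus_Cyrl"):
#     print(describe_lang_token(tok, code))
```

According to this card, `abk_Cyrl` should come back with id 256230 and as a regular (non-special) token, while `rus_Cyrl` should be reported as special.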
Training data & procedure
- Real parallel AB↔RU corpus only (~185K pairs). ⚠️ This model was trained exclusively on real data — mixing in back-translation here would be self-poisoning (its Russian side is synthetic and circular).
- Tokenizer: same surgical NLLB-200 tokenizer (`abk_Cyrl` + 26 Abkhaz chars in `tokenizer.json`, vocab 256231).
- Full fine-tune (LoRA fails on tied NLLB embeddings), warm-start, checkpoint 7000 (epoch ≈ 0.61, loss 9.15 → 7.39, single GPU).
- Tokenization: `x = tok(ab)`, `labels = tok(text_target=ru)`, `forced_bos = rus_Cyrl`.
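The tokenization step in the last bullet might look like the sketch below. It is not the author's training code: `preprocess` and the `"ab"` / `"ru"` column names are hypothetical, and `max_length=128` is borrowed from the usage snippet as an assumption for training. It uses the `text_target` argument of Transformers tokenizers so that target ids are returned under `labels`.

```python
def preprocess(batch, tok, max_length=128):
    """Tokenize one batch of parallel text for seq2seq fine-tuning.

    batch is assumed to be a dict with "ab" and "ru" lists of strings
    (hypothetical column names). max_length=128 is an assumption.
    """
    tok.src_lang = "abk_Cyrl"  # language code the tokenizer attaches to inputs
    tok.tgt_lang = "rus_Cyrl"  # language code the tokenizer attaches to labels
    # text_target makes the tokenizer return the target ids under "labels"
    return tok(
        batch["ab"],
        text_target=batch["ru"],
        truncation=True,
        max_length=max_length,
    )
```

With the Hugging Face Datasets library, a function like this could be applied via `ds.map(lambda b: preprocess(b, tok), batched=True)`; `forced_bos = rus_Cyrl` is then only needed at generation time, as in the usage snippet.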
Limitations & bias
- Training corpus is heavily religious/biblical → register bias.
- Checkpoint 7000 is strong but not fully converged (about 0.61 of an epoch); good enough for back-translation, not production.
- Inherits NLLB-200 limitations and its CC-BY-NC-4.0 (non-commercial) license.
Companion model
Used to produce back-translation data for the RU→AB model
(audiosurffer0/nllb-600m-ru-ab-dojo26), which reached 9.98 sentence-BLEU on the contest test.
Citation / license
Base model © Meta AI (NLLB-200), CC-BY-NC-4.0. This derivative is released under the same non-commercial license. Educational/research use.