Tibetan Translation Ranking BUDA

Cross-encoder for ranking Tibetan→English translation candidates by contextual relevance. Based on TenzinGayche/nllb_600M_bi_boen_gold (NLLB-200-distilled-600M fine-tuned on Buddhist gold parallel data) — decoder stripped, encoder-only with a mean-pool classification head.

Used in TradutorBUDA to re-rank dictionary lookup results: given a Tibetan term and its Tibetan sentence context, the model scores each candidate English gloss and surfaces the contextually correct meaning first.

Model Files

File Size Description
model.onnx 399 MB INT8 quantized ONNX (deployed)
sentencepiece.bpe.model 4.6 MB NLLB-200 SentencePiece tokenizer
tokenizer.json 30.8 MB HuggingFace tokenizer
tokenizer_config.json Tokenizer config

Architecture

  • Backbone: M2M100 encoder from TenzinGayche/nllb_600M_bi_boen_gold (~413M params, d_model=1024). Decoder is discarded.
  • Pooling: Mean-pool over attended encoder tokens
  • Head: Dropout(0.1) → Linear(1024, 2)
  • Scoring: sigmoid(logits[:, 1]) — higher = better translation match

Why NLLB encoder: bert-base-tibetan is monolingual and ModernBERT-large has negligible Tibetan vocabulary, causing cross-lingual mismatch between Tibetan query and English gloss. TenzinGayche/nllb_600M_bi_boen_gold already holds rich cross-lingual representations for exactly this domain (Buddhist Tibetan↔English), making it the natural backbone for cross-encoder ranking.

Input Format

[bod_Tibt] context_tibetan + tibetan_term [eos] gloss_tokens [eos]
  • Language token [bod_Tibt] is prepended (token id: 256030)
  • Tibetan context + term encoded up to 400 tokens, then [eos] as separator
  • English gloss appended up to 100 tokens, then final [eos]
  • Total cap: 512 tokens

Results (INT8 ONNX, CPU)

Metric Score
Term-only P@1 0.4480
Term+context P@1 0.6619
Avg latency 72 ms / query
Random baseline P@1 0.1191

P@1 = the correct gloss is ranked first. Evaluated on 346 terms / 1,053 context groups from a held-out 20% validation split.

Note: ModernBERT v4b (trained with English context) reached term-only P@1=0.7081 / term+ctx P@1=0.8082 but at 1,379 ms/query (19× slower). This NLLB model runs in Tibetan-only context and is 19× faster.

Training Data

training_v2_merged.csv (34.6 MB, Tier 1 — included in this repo)

Columns: context_english, context_tibetan, tibetan_term, gloss, label

  • 71,050 rows — 386 unique Tibetan terms, 2,942 unique English glosses
  • Positive (label=1): 6,885 rows (9.7%) — correct translation in context
  • Negative (label=0): 64,165 rows (90.3%) — incorrect/irrelevant gloss
  • Imbalance: 9.3:1 (handled by pos_weight in BCEWithLogitsLoss)
  • Split: 80% train (56,840) / 20% val (14,210), stratified by label

Data sources: Rangjung Yeshe Tibetan–English dictionary terms with Tibetan sentence contexts. Positive/negative pairs generated from multi-gloss entries. Tier 1 = human-reviewed and surgically corrected.

Training Configuration

  • Base model: TenzinGayche/nllb_600M_bi_boen_gold
  • Loss: BCEWithLogitsLoss with pos_weight = (9.3)^0.58 ≈ 3.09
  • Optimizer: AdamW (lr=1.4e-5, weight_decay=0.02)
  • LR schedule: 600 warmup steps
  • Batch size: 16 with GroupSampler (positive + negatives for same term batched together)
  • Epochs: up to 20, early stopping patience=10 eval steps
  • HP search: 8 Optuna trials optimizing term-only P@1
  • Mixed precision: BF16 on NVIDIA A100 (40 GB)
  • ONNX export: FP32 (1,658 MB) → INT8 dynamic quantization (399 MB, 4× compression)

Usage (ONNX)

import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("trabten/tibetan_ranking_BUDA")
tokenizer.src_lang = "bod_Tibt"
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

def rank(context_tibetan, tibetan_term, candidates):
    eos_id = tokenizer.eos_token_id
    pad_id = tokenizer.pad_token_id
    all_ids, all_mask = [], []
    for gloss in candidates:
        enc_a = tokenizer(context_tibetan + " " + tibetan_term,
                          add_special_tokens=True, truncation=True, max_length=400)
        enc_b = tokenizer.encode(gloss, add_special_tokens=False)[:100]
        ids = (enc_a["input_ids"] + enc_b + [eos_id])[:512]
        all_ids.append(ids); all_mask.append([1] * len(ids))
    max_len = max(len(x) for x in all_ids)
    input_ids = np.array([x + [pad_id]*(max_len-len(x)) for x in all_ids], dtype=np.int64)
    attn_mask = np.array([x + [0]*(max_len-len(x)) for x in all_mask], dtype=np.int64)
    logits = session.run(["logits"], {"input_ids": input_ids, "attention_mask": attn_mask})[0]
    probs = 1 / (1 + np.exp(-logits[:, 1]))
    return sorted(zip(candidates, probs), key=lambda x: x[1], reverse=True)

results = rank(
    "བྱང་ཆུབ་སེམས་དཔའི་སྤྱོད་པ་ལ་འཇུག་པའི་ལམ་ལ་བཤད་པ།",
    "སེམས་དཔའི་",
    ["bodhisattva", "mind", "being", "hero", "aspirant"]
)
for gloss, score in results:
    print(f"  {score:.3f}  {gloss}")

Background

Classical Tibetan dictionary terms often carry 5–20 English glosses across different translation traditions (e.g. སེམས་དཔའ = "bodhisattva", "hero of mind", "being of awakening", ...). Without context, all are equally valid. This model re-ranks candidates by attending jointly to both the Tibetan sentence context and the English gloss, surfacing the contextually correct meaning.

References

  • NLLB Team. "No Language Left Behind." Meta AI (2022). https://arxiv.org/abs/2207.04672
  • TenzinGayche. nllb_600M_bi_boen_gold — NLLB fine-tuned on Buddhist Tibetan↔English gold data.
  • Rangjung Yeshe Wiki — Tibetan–English dictionary source data.
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for trabten/tibetan_ranking_BUDA

Quantized
(1)
this model

Paper for trabten/tibetan_ranking_BUDA