Tibetan Translation Ranking BUDA

Cross-encoder for ranking Tibetan→English translation candidates by contextual relevance. Based on TenzinGayche/nllb_600M_bi_boen_gold (NLLB-200-distilled-600M fine-tuned on Buddhist gold parallel data). The decoder is removed; only the encoder is kept, topped with a mean-pooled classification head.

Used in TradutorBUDA to re-rank dictionary lookup results: given a Tibetan term and its Tibetan sentence context, the model scores each candidate English gloss and surfaces the contextually correct meaning first.

Model Files

File	Size	Description
`model.onnx`	399 MB	INT8 quantized ONNX (deployed)
`sentencepiece.bpe.model`	4.6 MB	NLLB-200 SentencePiece tokenizer
`tokenizer.json`	30.8 MB	HuggingFace tokenizer
`tokenizer_config.json`	—	Tokenizer config

Architecture

Backbone: M2M100 encoder from TenzinGayche/nllb_600M_bi_boen_gold (~413M params, d_model=1024). Decoder is discarded.
Pooling: Mean-pool over encoder outputs at attended (non-padding) token positions
Head: Dropout(0.1) → Linear(1024, 2)
Scoring: sigmoid(logits[:, 1]) — higher = better translation match
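
The pooling, head, and scoring steps above can be sketched in NumPy as follows. This is a minimal illustration, not the exported model: `hidden`, `W`, and `b` are hypothetical stand-ins for the encoder outputs and the trained Linear(1024, 2) parameters, the function name is made up for this example, and "attended tokens" is assumed to mean positions where the attention mask is 1. Dropout is left out because it is inactive at inference.

```python
import numpy as np

def score_from_hidden(hidden, attention_mask, W, b):
    """Masked mean-pool -> linear head -> sigmoid of the class-1 logit.

    hidden:         (batch, seq_len, d_model) encoder outputs (stand-in values)
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    W, b:           head parameters with shapes (d_model, 2) and (2,)
    """
    mask = attention_mask[:, :, None].astype(hidden.dtype)
    summed = (hidden * mask).sum(axis=1)            # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1.0, None)   # avoid division by zero
    pooled = summed / counts                        # (batch, d_model)
    logits = pooled @ W + b                         # (batch, 2)
    return 1.0 / (1.0 + np.exp(-logits[:, 1]))      # higher = better match

# Random stand-in tensors, only to show the shapes involved.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 5, 1024))
mask = np.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
W = rng.normal(scale=0.02, size=(1024, 2))
b = np.zeros(2)
scores = score_from_hidden(hidden, mask, W, b)
print(scores.shape)  # (2,)
```

Because padded positions are zeroed out before averaging, the score of a short sequence does not depend on how far the batch was padded.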

Why the NLLB encoder: bert-base-tibetan is monolingual, and ModernBERT-large has negligible Tibetan vocabulary, so either one leaves a cross-lingual mismatch between the Tibetan query and the English gloss. TenzinGayche/nllb_600M_bi_boen_gold already holds rich cross-lingual representations for exactly this domain (Buddhist Tibetan↔English), which makes it a natural backbone for cross-encoder ranking.

Input Format

[bod_Tibt] context_tibetan + tibetan_term [eos] gloss_tokens [eos]

Language token [bod_Tibt] is prepended (token id: 256030)
Tibetan context + term encoded up to 400 tokens, then [eos] as separator
English gloss appended up to 100 tokens, then final [eos]
Total cap: 512 tokens
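
The layout above can be sketched on already-tokenized ids. This is an illustration under assumptions rather than the training pipeline: `build_input` is a hypothetical helper, the eos id of 2 is assumed for the example, and the 400-token cap on the first segment is assumed to include the language token and the separator [eos] (which is how Hugging Face truncation with `max_length` counts special tokens).

```python
def build_input(lang_id, tibetan_ids, gloss_ids, eos_id,
                max_a=400, max_b=100, max_total=512):
    """Assemble [lang] tibetan [eos] gloss [eos] from pre-tokenized ids."""
    # Segment A: language token + context/term tokens + separator [eos].
    seg_a = [lang_id] + tibetan_ids[:max_a - 2] + [eos_id]
    # Segment B: gloss tokens + final [eos].
    seg_b = gloss_ids[:max_b] + [eos_id]
    return (seg_a + seg_b)[:max_total]

BOD_TIBT = 256030  # language token id stated above
EOS = 2            # assumed eos id, for illustration only
ids = build_input(BOD_TIBT, [11, 12, 13], [21, 22], EOS)
print(ids)  # [256030, 11, 12, 13, 2, 21, 22, 2]
```

With these caps the two segments add up to at most 501 tokens, so the 512-token limit acts only as a safety net.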

Results (INT8 ONNX, CPU)

Metric	Score
Term-only P@1	0.4480
Term+context P@1	0.6619
Avg latency	72 ms / query
Random baseline P@1	0.1191

P@1 (precision at 1) is the fraction of queries for which the correct gloss is ranked first. Evaluated on 346 terms / 1,053 context groups from a held-out 20% validation split.
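
As a rough sketch of how P@1 can be computed over grouped candidates: each group holds the candidate glosses for one query, and the group counts as a hit when its top-scored candidate is labeled correct. The function name, the `(group_id, score, label)` row layout, and the toy scores are assumptions for illustration; the exact grouping used in the evaluation is not described here.

```python
from collections import defaultdict

def precision_at_1(rows):
    """rows: iterable of (group_id, score, label), label 1 = correct gloss.

    Ties are broken by first occurrence (an assumption of this sketch).
    """
    groups = defaultdict(list)
    for group_id, score, label in rows:
        groups[group_id].append((score, label))
    if not groups:
        return 0.0
    hits = sum(1 for cands in groups.values()
               if max(cands, key=lambda c: c[0])[1] == 1)
    return hits / len(groups)

rows = [
    ("q1", 0.91, 1), ("q1", 0.40, 0), ("q1", 0.12, 0),  # correct gloss on top
    ("q2", 0.30, 1), ("q2", 0.75, 0),                   # wrong gloss on top
]
print(precision_at_1(rows))  # 0.5
```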

Note: ModernBERT v4b (trained with English context) reached higher accuracy (term-only P@1 = 0.7081, term+context P@1 = 0.8082) but at 1,379 ms/query. This NLLB model uses Tibetan-only context and, at 72 ms/query, is about 19× faster.

Training Data

training_v2_merged.csv (34.6 MB, Tier 1 — included in this repo)

Columns: context_english, context_tibetan, tibetan_term, gloss, label

71,050 rows — 386 unique Tibetan terms, 2,942 unique English glosses
Positive (label=1): 6,885 rows (9.7%) — correct translation in context
Negative (label=0): 64,165 rows (90.3%) — incorrect/irrelevant gloss
Imbalance: 9.3:1 (handled by pos_weight in BCEWithLogitsLoss)
Split: 80% train (56,840) / 20% val (14,210), stratified by label

Data sources: Rangjung Yeshe Tibetan–English dictionary terms with Tibetan sentence contexts. Positive/negative pairs generated from multi-gloss entries. Tier 1 = human-reviewed and surgically corrected.
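
To illustrate how a positive-class weight offsets the label imbalance, here is a NumPy sketch of binary cross-entropy on logits with a `pos_weight` factor, following the same formula as PyTorch's BCEWithLogitsLoss. The function name, the example logits, and the weight of 3.0 are illustrative assumptions, and the sketch takes one logit per pair; how the two-logit head feeds the loss during training is not specified here.

```python
import numpy as np

def bce_with_logits(logits, targets, pos_weight=1.0):
    """Mean binary cross-entropy on raw logits with positives up-weighted.

    loss_i = -[pos_weight * y_i * log(sigmoid(x_i))
               + (1 - y_i) * log(1 - sigmoid(x_i))]
    """
    x = np.asarray(logits, dtype=np.float64)
    y = np.asarray(targets, dtype=np.float64)
    log_sig = -np.logaddexp(0.0, -x)        # log(sigmoid(x)), numerically stable
    log_one_minus = -np.logaddexp(0.0, x)   # log(1 - sigmoid(x))
    per_row = -(pos_weight * y * log_sig + (1.0 - y) * log_one_minus)
    return float(per_row.mean())

logits = [2.0, -1.0, 0.5, -3.0]   # toy values
labels = [1, 0, 0, 0]             # one positive among negatives
print(bce_with_logits(logits, labels))                  # unweighted
print(bce_with_logits(logits, labels, pos_weight=3.0))  # positive counts 3x
```

Only the positive row's term is scaled, so errors on the rare correct glosses contribute more to the gradient than errors on the many negatives.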

Training Configuration

Base model: TenzinGayche/nllb_600M_bi_boen_gold
Loss: BCEWithLogitsLoss with pos_weight = (9.3)^0.58 ≈ 3.09
Optimizer: AdamW (lr=1.4e-5, weight_decay=0.02)
LR schedule: 600 warmup steps
Batch size: 16 with GroupSampler (positive + negatives for same term batched together)
Epochs: up to 20, early stopping patience=10 eval steps
HP search: 8 Optuna trials optimizing term-only P@1
Mixed precision: BF16 on NVIDIA A100 (40 GB)
ONNX export: FP32 (1,658 MB) → INT8 dynamic quantization (399 MB, 4× compression)
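
The grouped batching idea (rows for the same term placed in the same batch) might look roughly like the sketch below. This is not the GroupSampler used in training, whose implementation is not shown here: `group_batches` is a hypothetical function, and it assumes rows are dicts with a `tibetan_term` key matching the CSV column.

```python
import random
from collections import defaultdict

def group_batches(rows, batch_size=16, seed=0):
    """Return batches of row indices, each batch drawn from a single term.

    Terms with more than batch_size rows are split across several batches.
    """
    rng = random.Random(seed)
    by_term = defaultdict(list)
    for idx, row in enumerate(rows):
        by_term[row["tibetan_term"]].append(idx)
    batches = []
    for indices in by_term.values():
        rng.shuffle(indices)  # mix positives and negatives within the term
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size])
    rng.shuffle(batches)      # randomize term order across the epoch
    return batches

# Toy rows: 20 candidates for term "a", 5 for term "b" (labels are made up).
rows = [{"tibetan_term": "a", "label": int(i == 0)} for i in range(20)]
rows += [{"tibetan_term": "b", "label": int(i == 0)} for i in range(5)]
batches = group_batches(rows, batch_size=16)
print(sorted(len(b) for b in batches))  # [4, 5, 16]
```

Keeping a term's candidates together lets the model compare the correct gloss against its direct competitors within one gradient step.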

Usage (ONNX)

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("trabten/tibetan_ranking_BUDA")
tokenizer.src_lang = "bod_Tibt"  # source language code for the NLLB tokenizer
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

def rank(context_tibetan, tibetan_term, candidates):
    """Return (gloss, score) pairs sorted from best to worst match."""
    eos_id = tokenizer.eos_token_id
    pad_id = tokenizer.pad_token_id
    # Segment A (context + term) is identical for every candidate,
    # so tokenize it once, capped at 400 tokens including special tokens.
    enc_a = tokenizer(context_tibetan + " " + tibetan_term,
                      add_special_tokens=True, truncation=True, max_length=400)["input_ids"]
    all_ids = []
    for gloss in candidates:
        # Segment B: gloss tokens (at most 100) followed by the final [eos].
        enc_b = tokenizer.encode(gloss, add_special_tokens=False)[:100]
        all_ids.append((enc_a + enc_b + [eos_id])[:512])
    # Right-pad every sequence to the longest one in the batch.
    max_len = max(len(x) for x in all_ids)
    input_ids = np.array([x + [pad_id]*(max_len-len(x)) for x in all_ids], dtype=np.int64)
    attn_mask = np.array([[1]*len(x) + [0]*(max_len-len(x)) for x in all_ids], dtype=np.int64)
    logits = session.run(["logits"], {"input_ids": input_ids, "attention_mask": attn_mask})[0]
    # Sigmoid of the class-1 logit: higher = better translation match.
    probs = 1 / (1 + np.exp(-logits[:, 1]))
    return sorted(zip(candidates, probs), key=lambda x: x[1], reverse=True)

results = rank(
    "བྱང་ཆུབ་སེམས་དཔའི་སྤྱོད་པ་ལ་འཇུག་པའི་ལམ་ལ་བཤད་པ།",
    "སེམས་དཔའི་",
    ["bodhisattva", "mind", "being", "hero", "aspirant"]
)
for gloss, score in results:
    print(f"  {score:.3f}  {gloss}")

Background

Classical Tibetan dictionary terms often carry 5–20 English glosses across different translation traditions (e.g. སེམས་དཔའ = "bodhisattva", "hero of mind", "being of awakening", ...). Without context, all are equally valid. This model re-ranks candidates by attending jointly to both the Tibetan sentence context and the English gloss, surfacing the contextually correct meaning.

References

NLLB Team. "No Language Left Behind." Meta AI (2022). https://arxiv.org/abs/2207.04672
TenzinGayche. nllb_600M_bi_boen_gold — NLLB fine-tuned on Buddhist Tibetan↔English gold data.
Rangjung Yeshe Wiki — Tibetan–English dictionary source data.
