Tibetan Translation Ranking BUDA
Cross-encoder for ranking Tibetan→English translation candidates by contextual relevance. Based on `TenzinGayche/nllb_600M_bi_boen_gold` (NLLB-200-distilled-600M fine-tuned on Buddhist gold parallel data). The decoder is stripped, leaving an encoder-only model with a mean-pool classification head.
Used in TradutorBUDA to re-rank dictionary lookup results: given a Tibetan term and its Tibetan sentence context, the model scores each candidate English gloss and surfaces the contextually correct meaning first.
Model Files
| File | Size | Description |
|---|---|---|
| `model.onnx` | 399 MB | INT8 quantized ONNX (deployed) |
| `sentencepiece.bpe.model` | 4.6 MB | NLLB-200 SentencePiece tokenizer |
| `tokenizer.json` | 30.8 MB | HuggingFace tokenizer |
| `tokenizer_config.json` | - | Tokenizer config |
Architecture
- Backbone: M2M100 encoder from `TenzinGayche/nllb_600M_bi_boen_gold` (~413M params, d_model=1024). Decoder is discarded.
- Pooling: Mean-pool over attended encoder tokens
- Head: Dropout(0.1) → Linear(1024, 2)
- Scoring: `sigmoid(logits[:, 1])`; higher = better translation match
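The pooling, head and scoring steps listed above can be sketched in NumPy. This is an illustration only, not the training code: the function names, the random stand-in for encoder states and the randomly initialised head weights are all assumptions made for the example.

```python
import numpy as np

def masked_mean_pool(hidden, attention_mask):
    """Average encoder states over attended (non-padding) positions.

    hidden: (batch, seq_len, d_model); attention_mask: (batch, seq_len) of 0/1.
    """
    mask = attention_mask[:, :, None].astype(hidden.dtype)
    summed = (hidden * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts

def score(hidden, attention_mask, weight, bias):
    """Linear(1024, 2) head followed by sigmoid on the class-1 logit."""
    pooled = masked_mean_pool(hidden, attention_mask)
    logits = pooled @ weight.T + bias  # dropout is inactive at inference
    return 1.0 / (1.0 + np.exp(-logits[:, 1]))

# Hypothetical stand-ins for real encoder outputs and trained head weights.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 5, 1024)).astype(np.float32)
mask = np.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
weight = rng.normal(scale=0.02, size=(2, 1024)).astype(np.float32)
bias = np.zeros(2, dtype=np.float32)

scores = score(hidden, mask, weight, bias)
print(scores.shape)
```

Padding positions are excluded from the average, so the score of a short candidate does not depend on how much padding its batch neighbours force onto it.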
Why NLLB encoder: bert-base-tibetan is monolingual and ModernBERT-large has negligible Tibetan vocabulary, causing cross-lingual mismatch between Tibetan query and English gloss. TenzinGayche/nllb_600M_bi_boen_gold already holds rich cross-lingual representations for exactly this domain (Buddhist Tibetan↔English), making it the natural backbone for cross-encoder ranking.
Input Format
[bod_Tibt] context_tibetan + tibetan_term [eos] gloss_tokens [eos]
- Language token `[bod_Tibt]` is prepended (token id: 256030)
- Tibetan context + term encoded up to 400 tokens, then `[eos]` as separator
- English gloss appended up to 100 tokens, then final `[eos]`
- Total cap: 512 tokens
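As a rough sketch of that layout, the helper below assembles one input sequence from already-tokenized ids. The function name is hypothetical, `eos_id=2` is an assumption based on the usual NLLB vocabulary (read it from the tokenizer in practice), and counting the language token and separator inside the 400-token budget is likewise an assumption.

```python
def build_input_ids(context_term_ids, gloss_ids, lang_id=256030, eos_id=2,
                    max_source=400, max_gloss=100, max_total=512):
    """Lay out [bod_Tibt] context+term [eos] gloss [eos] as a flat id list."""
    # Reserve two slots of the source budget for the language token and separator.
    source = [lang_id] + list(context_term_ids)[: max_source - 2] + [eos_id]
    # The gloss is capped separately, then closed with a final [eos].
    target = list(gloss_ids)[:max_gloss] + [eos_id]
    return (source + target)[:max_total]

example = build_input_ids([10, 11], [20])
print(example)
```

With these caps the longest possible sequence is 400 + 100 + 1 = 501 ids, so the 512 total cap never removes the final `[eos]`.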
Results (INT8 ONNX, CPU)
| Metric | Score |
|---|---|
| Term-only P@1 | 0.4480 |
| Term+context P@1 | 0.6619 |
| Avg latency | 72 ms / query |
| Random baseline P@1 | 0.1191 |
P@1 = the correct gloss is ranked first. Evaluated on 346 terms / 1,053 context groups from a held-out 20% validation split.
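To make the metric concrete, here is a minimal sketch of P@1 over grouped candidates. The row layout `(group_id, score, label)` and the function name are assumptions for illustration; the actual evaluation script is not shown in this card.

```python
from collections import defaultdict

def precision_at_1(rows):
    """Fraction of groups whose highest-scored candidate has label 1.

    rows: iterable of (group_id, score, label) tuples.
    """
    groups = defaultdict(list)
    for group_id, score, label in rows:
        groups[group_id].append((score, label))
    if not groups:
        raise ValueError("no groups to evaluate")
    hits = 0
    for candidates in groups.values():
        _, best_label = max(candidates, key=lambda c: c[0])
        hits += int(best_label == 1)
    return hits / len(groups)

# Toy data: group "a" ranks its correct gloss first, group "b" does not.
rows = [
    ("a", 0.9, 1), ("a", 0.2, 0),
    ("b", 0.4, 1), ("b", 0.7, 0),
]
p_at_1 = precision_at_1(rows)
print(p_at_1)
```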
Note: ModernBERT v4b (trained with English context) reached term-only P@1 = 0.7081 and term+context P@1 = 0.8082, but at 1,379 ms/query. This NLLB model uses Tibetan-only context and runs about 19× faster (72 ms/query).
Training Data
training_v2_merged.csv (34.6 MB, Tier 1 — included in this repo)
Columns: context_english, context_tibetan, tibetan_term, gloss, label
- 71,050 rows — 386 unique Tibetan terms, 2,942 unique English glosses
- Positive (label=1): 6,885 rows (9.7%) — correct translation in context
- Negative (label=0): 64,165 rows (90.3%) — incorrect/irrelevant gloss
- Imbalance: 9.3:1 (handled by pos_weight in BCEWithLogitsLoss)
- Split: 80% train (56,840) / 20% val (14,210), stratified by label
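A label-stratified 80/20 split like the one described can be sketched in plain Python. The helper name and the seed are assumptions, and the original split may well have been produced with a library routine instead.

```python
import random

def stratified_split(labels, val_fraction=0.2, seed=0):
    """Split row indices so each label keeps the same train/val proportion."""
    rng = random.Random(seed)
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)
    train_idx, val_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Label counts taken from the list above: 6,885 positives and 64,165 negatives.
labels = [1] * 6885 + [0] * 64165
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
```

With those label counts the split sizes come out to 56,840 train and 14,210 validation rows, and the validation set keeps the same 9.7% positive rate.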
Data sources: Rangjung Yeshe Tibetan–English dictionary terms with Tibetan sentence contexts. Positive/negative pairs generated from multi-gloss entries. Tier 1 = human-reviewed and surgically corrected.
Training Configuration
- Base model: `TenzinGayche/nllb_600M_bi_boen_gold`
- Loss: BCEWithLogitsLoss with pos_weight = (9.3)^0.58 ≈ 3.09
- Optimizer: AdamW (lr=1.4e-5, weight_decay=0.02)
- LR schedule: 600 warmup steps
- Batch size: 16 with GroupSampler (positive + negatives for same term batched together)
- Epochs: up to 20, early stopping patience=10 eval steps
- HP search: 8 Optuna trials optimizing term-only P@1
- Mixed precision: BF16 on NVIDIA A100 (40 GB)
- ONNX export: FP32 (1,658 MB) → INT8 dynamic quantization (399 MB, 4× compression)
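The loss weighting in the list above can be illustrated with a small NumPy sketch of binary cross-entropy on logits with a positive-class weight, following the same definition as PyTorch's `BCEWithLogitsLoss`. The function names are hypothetical. Evaluated directly, (9.3)^0.58 comes to roughly 3.65, whereas 3.09 corresponds to an exponent near 0.51, so the sketch keeps the exponent as a parameter rather than assuming which figure was used in training.

```python
import numpy as np

def damped_pos_weight(n_pos, n_neg, exponent):
    """Imbalance ratio raised to a damping exponent (1.0 = full re-weighting)."""
    return (n_neg / n_pos) ** exponent

def weighted_bce_with_logits(logits, targets, pos_weight):
    """Mean of pos_weight * y * softplus(-x) + (1 - y) * softplus(x)."""
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    softplus_neg = np.logaddexp(0.0, -logits)  # -log(sigmoid(x)), numerically stable
    softplus_pos = np.logaddexp(0.0, logits)   # -log(1 - sigmoid(x))
    loss = pos_weight * targets * softplus_neg + (1.0 - targets) * softplus_pos
    return float(loss.mean())

# Row counts from the training data section; the exponent is the open parameter.
w = damped_pos_weight(6885, 64165, 0.58)
loss = weighted_bce_with_logits([2.0, -1.0, 0.5], [1, 0, 0], w)
print(round(w, 2), round(loss, 4))
```

An exponent below 1 softens the correction, so positives are up-weighted without letting the rare class dominate every batch.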
Usage (ONNX)
```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("trabten/tibetan_ranking_BUDA")
tokenizer.src_lang = "bod_Tibt"
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])


def rank(context_tibetan, tibetan_term, candidates):
    """Score each candidate gloss and return (gloss, score) pairs, best first."""
    eos_id = tokenizer.eos_token_id
    pad_id = tokenizer.pad_token_id

    # First segment: [bod_Tibt] + context + term + [eos], truncated to 400 tokens.
    # It is identical for every candidate, so encode it once.
    enc_a = tokenizer(context_tibetan + " " + tibetan_term,
                      add_special_tokens=True, truncation=True, max_length=400)

    all_ids, all_mask = [], []
    for gloss in candidates:
        # Second segment: gloss tokens (max 100) followed by a final [eos].
        enc_b = tokenizer.encode(gloss, add_special_tokens=False)[:100]
        ids = (enc_a["input_ids"] + enc_b + [eos_id])[:512]
        all_ids.append(ids)
        all_mask.append([1] * len(ids))

    # Right-pad every sequence to the longest one in the batch.
    max_len = max(len(x) for x in all_ids)
    input_ids = np.array([x + [pad_id] * (max_len - len(x)) for x in all_ids], dtype=np.int64)
    attn_mask = np.array([x + [0] * (max_len - len(x)) for x in all_mask], dtype=np.int64)

    logits = session.run(["logits"], {"input_ids": input_ids, "attention_mask": attn_mask})[0]
    probs = 1 / (1 + np.exp(-logits[:, 1]))  # sigmoid on the class-1 logit
    return sorted(zip(candidates, probs), key=lambda x: x[1], reverse=True)


results = rank(
    "བྱང་ཆུབ་སེམས་དཔའི་སྤྱོད་པ་ལ་འཇུག་པའི་ལམ་ལ་བཤད་པ།",
    "སེམས་དཔའི་",
    ["bodhisattva", "mind", "being", "hero", "aspirant"],
)
for gloss, score in results:
    print(f"  {score:.3f}  {gloss}")
```
Background
Classical Tibetan dictionary terms often carry 5–20 English glosses across different translation traditions (e.g. སེམས་དཔའ = "bodhisattva", "hero of mind", "being of awakening", ...). Without context, all are equally valid. This model re-ranks candidates by attending jointly to both the Tibetan sentence context and the English gloss, surfacing the contextually correct meaning.
References
- NLLB Team. "No Language Left Behind." Meta AI (2022). https://arxiv.org/abs/2207.04672
- TenzinGayche. `nllb_600M_bi_boen_gold`: NLLB fine-tuned on Buddhist Tibetan↔English gold data.
- Rangjung Yeshe Wiki. Tibetan–English dictionary source data.