Tibetan Word Segmentation BUDA
Fine-tuned KoichiYasuoka/bert-base-tibetan for syllable-level Tibetan word segmentation. Predicts B (begin-word) or I (inside-word) for each syllable. Used as the segmentation engine in TradutorBUDA.
99.95% accuracy — B F1: 99.92% | I F1: 99.97% — on 20,903 validation sequences.
Model Files
| File | Size | Description |
|---|---|---|
| model_quantized.onnx | 104.5 MB | INT8 quantized ONNX (deployed) |
| tokenizer.json | 0.8 MB | HuggingFace tokenizer |
| vocab.txt | 0.4 MB | Vocabulary |
| config.json | - | Model configuration |
| segmentation_config.json | - | Label mappings (B/I) |
| dictionary_words.json | 16.6 MB | 353K Tibetan words for constraint decoding |
The FP32 ONNX (413 MB) is not included — use the INT8 model for CPU deployment.
Architecture
Base: KoichiYasuoka/bert-base-tibetan (BERT-base, 110M parameters, pre-trained on 4.63 GB Tibetan text)
Task head: token classification → 2 labels (B / I)
Results
| Metric | Score |
|---|---|
| Accuracy | 99.95% |
| B precision | 99.97% |
| B recall | 99.88% |
| B F1 | 99.92% |
| I precision | 99.95% |
| I recall | 99.99% |
| I F1 | 99.97% |
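As a quick consistency check, each F1 row is the harmonic mean of the precision and recall rows above it. The sketch below recomputes both from the rounded table values; the helper name `f1` is illustrative, and because the inputs are already rounded, the result can differ in the last digit from a score computed on raw counts.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded percentages copied from the Results table.
b_f1 = f1(99.97, 99.88)
i_f1 = f1(99.95, 99.99)

print(f"B F1 ~ {b_f1:.2f}%")
print(f"I F1 ~ {i_f1:.2f}%")
```

Both values agree with the reported F1 scores to two decimal places.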
Dataset
- Train: 208,077 sequences — Val: 20,903 sequences
- 200K synthetic sequences generated from 353K Tibetan dictionary terms (B/I labels derived from word boundaries)
- 16K silver-standard sequences from Botok segmentation output
- Input format: space-separated Tibetan syllables; each syllable ends with a tsheg (་)
Example:
Syllables: བྱང་ ཆུབ་ སེམས་ དཔའི་ སྤྱོད་ པ
Labels: B I B I B I
Words: བྱང་ཆུབ་ / སེམས་དཔའི་ / སྤྱོད་པ
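The mapping from word boundaries to B/I labels can be sketched as follows. This is a minimal illustration of the labelling scheme in the example, not the project's actual data-generation script; the helper names `split_syllables` and `words_to_bi` are hypothetical.

```python
import re

TSHEG = "\u0f0b"  # Tibetan intersyllabic mark (tsheg)

def split_syllables(word: str) -> list[str]:
    """Split after every tsheg, keeping the tsheg attached to its syllable."""
    return [s for s in re.split(f"(?<={TSHEG})", word) if s]

def words_to_bi(words: list[str]) -> tuple[list[str], list[str]]:
    """Flatten segmented words into syllables with one B/I label each."""
    syllables, labels = [], []
    for word in words:
        parts = split_syllables(word)
        if not parts:
            continue  # skip empty strings
        syllables.extend(parts)
        # First syllable of a word begins it; the rest are inside it.
        labels.extend(["B"] + ["I"] * (len(parts) - 1))
    return syllables, labels

words = ["བྱང་ཆུབ་", "སེམས་དཔའི་", "སྤྱོད་པ"]
syllables, labels = words_to_bi(words)
print(" ".join(syllables))
print(" ".join(labels))
```

For the three words of the example this yields six syllables labelled B I B I B I.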
Training Configuration
- Base model: KoichiYasuoka/bert-base-tibetan
- Optimizer: AdamW (lr=3e-5, weight_decay=0.01)
- LR schedule: 10% linear warmup → constant
- Epochs: 5 (early stopping patience=3; did not trigger)
- Batch size: 64
- Max sequence length: 256 syllable-tokens
- Mixed precision: FP16 on NVIDIA A100 (40 GB)
- Best checkpoint metric: B F1
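To make the schedule concrete, the sketch below works out the step counts implied by the configuration above and expresses "10% linear warmup, then constant" as a function of the optimizer step. It rests on assumptions the card does not state: one optimizer step per batch (no gradient accumulation), the last partial batch kept, and the 10% measured against total training steps. The function name `learning_rate` is illustrative.

```python
import math

TRAIN_SEQUENCES = 208_077
BATCH_SIZE = 64
EPOCHS = 5
BASE_LR = 3e-5
WARMUP_FRACTION = 0.10

# Assumption: one optimizer step per batch, last partial batch kept.
steps_per_epoch = math.ceil(TRAIN_SEQUENCES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
# Assumption: warmup is 10% of total steps.
warmup_steps = round(WARMUP_FRACTION * total_steps)

def learning_rate(step: int) -> float:
    """Linear ramp from 0 to BASE_LR over warmup_steps, then constant."""
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    return BASE_LR

print(steps_per_epoch, total_steps, warmup_steps)
print(learning_rate(0), learning_rate(warmup_steps // 2), learning_rate(total_steps))
```

In a PyTorch training loop, a schedule of this shape is available as `transformers.get_constant_schedule_with_warmup(optimizer, num_warmup_steps)`; whether that helper was used here is not stated.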
Usage (ONNX / Java — TradutorBUDA)
The model is loaded by TibetanSegmentationService.java using ONNX Runtime:
OrtSession session = env.createSession("models/segmentation_onnx/model_quantized.onnx", opts);
Inputs: input_ids, attention_mask, token_type_ids (shape: [batch, 256])
Output: logits (shape: [batch, 256, 2]) — argmax gives B=0 / I=1 per token.
Usage (Python)
import re

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("trabten/tibetan_segmentation_BUDA")
session = ort.InferenceSession("model_quantized.onnx", providers=["CPUExecutionProvider"])

MAX_LENGTH = 256
ID2LABEL = {0: "B", 1: "I"}

def segment(text):
    # Split after each tsheg so every syllable keeps its trailing tsheg.
    syllables = [s for s in re.split(r'(?<=་)', text.strip()) if s.strip()]

    # Tokenize syllable by syllable and record which syllable each token belongs to.
    token_ids, word_ids = [], []
    for i, syl in enumerate(syllables):
        ids = tokenizer.encode(syl, add_special_tokens=False)
        token_ids.extend(ids)
        word_ids.extend([i] * len(ids))

    # Add [CLS]/[SEP], truncate to the model window, then pad to MAX_LENGTH.
    cls, sep = tokenizer.cls_token_id, tokenizer.sep_token_id
    input_ids = [cls] + token_ids[:MAX_LENGTH-2] + [sep]
    word_ids = [-1] + word_ids[:MAX_LENGTH-2] + [-1]
    attn_mask = [1] * len(input_ids)
    tok_types = [0] * len(input_ids)
    pad = MAX_LENGTH - len(input_ids)
    input_ids += [tokenizer.pad_token_id] * pad
    attn_mask += [0] * pad
    tok_types += [0] * pad

    feeds = {
        "input_ids": np.array([input_ids], dtype=np.int64),
        "attention_mask": np.array([attn_mask], dtype=np.int64),
        "token_type_ids": np.array([tok_types], dtype=np.int64),
    }
    logits = session.run(None, feeds)[0][0]  # shape: [MAX_LENGTH, 2]
    preds = logits.argmax(axis=-1)

    # Keep the prediction of the first token of each syllable; skip [CLS]/[SEP].
    syl_labels, prev = [], None
    for wid, pred in zip(word_ids, preds):
        if wid == -1:
            continue
        if wid != prev:
            syl_labels.append(ID2LABEL[int(pred)])
            prev = wid

    # Group syllables into words: a "B" label closes the current word.
    # Syllables truncated out of the window have no label and are dropped by zip().
    words, current = [], []
    for syl, lbl in zip(syllables, syl_labels):
        if lbl == "B" and current:
            words.append("".join(current))
            current = [syl]
        else:
            current.append(syl)
    if current:
        words.append("".join(current))
    return words

print(segment("བྱང་ཆུབ་སེམས་དཔའི་སྤྱོད་པ"))
# ['བྱང་ཆུབ་', 'སེམས་དཔའི་', 'སྤྱོད་པ']
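Because the function above truncates its input to MAX_LENGTH - 2 = 254 tokens, longer passages have to be split before segmentation. The sketch below is one possible way to do that, assuming it is acceptable to cut at syllable boundaries; `chunk_syllables` is a hypothetical helper, and `token_len` stands in for something like `lambda s: len(tokenizer.encode(s, add_special_tokens=False))`.

```python
from typing import Callable, Iterable, Iterator

def chunk_syllables(
    syllables: Iterable[str],
    token_len: Callable[[str], int],
    budget: int = 254,  # MAX_LENGTH minus [CLS] and [SEP]
) -> Iterator[list[str]]:
    """Yield runs of syllables whose total token count fits the budget."""
    chunk, used = [], 0
    for syl in syllables:
        n = token_len(syl)
        # Start a new chunk when adding this syllable would overflow.
        if chunk and used + n > budget:
            yield chunk
            chunk, used = [], 0
        chunk.append(syl)
        used += n
    if chunk:
        yield chunk

# Toy check with a stand-in tokenizer that charges one token per syllable.
chunks = list(chunk_syllables(["a"] * 600, token_len=lambda s: 1))
print([len(c) for c in chunks])  # [254, 254, 92]
```

Each chunk can then be joined and passed to `segment`. A word that straddles a chunk boundary would be split in two; overlapping windows would be one way to address that, which this sketch does not attempt.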
Background
Tibetan has no spaces between words, so segmentation is required before dictionary lookup, POS tagging, or machine translation. Traditional rule-based segmenters (e.g. Botok) rely on hand-curated lexicons and miss out-of-vocabulary terms. This model learns segmentation patterns from context and reaches 99.95% accuracy on a validation set of Buddhist canonical vocabulary.
References
- Yasuoka, K. "Universal Dependencies for Classical Tibetan." (base model)
- Esukhia. Botok — Tibetan NLP toolkit (silver-standard data source)
- Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." (NAACL 2019)