Tibetan Word Segmentation BUDA

Fine-tuned KoichiYasuoka/bert-base-tibetan for syllable-level Tibetan word segmentation. Predicts B (begin-word) or I (inside-word) for each syllable. Used as the segmentation engine in TradutorBUDA.

99.95% accuracy — B F1: 99.92% | I F1: 99.97% — on 20,903 validation sequences.

Model Files

| File | Size | Description |
|---|---|---|
| model_quantized.onnx | 104.5 MB | INT8 quantized ONNX (deployed) |
| tokenizer.json | 0.8 MB | HuggingFace tokenizer |
| vocab.txt | 0.4 MB | Vocabulary |
| config.json | | Model configuration |
| segmentation_config.json | | Label mappings (B/I) |
| dictionary_words.json | 16.6 MB | 353K Tibetan words for constraint decoding |

The FP32 ONNX (413 MB) is not included — use the INT8 model for CPU deployment.

Architecture

Base: KoichiYasuoka/bert-base-tibetan (BERT-base, 110M parameters, pre-trained on 4.63 GB Tibetan text)
Task head: token classification → 2 labels (B / I)

Results

| Metric | Score |
|---|---|
| Accuracy | 99.95% |
| B precision | 99.97% |
| B recall | 99.88% |
| B F1 | 99.92% |
| I precision | 99.95% |
| I recall | 99.99% |
| I F1 | 99.97% |
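
The F1 scores are the harmonic mean of the precision and recall values reported above. A minimal sketch that recomputes them from the reported figures (the helper name `f1_score` is illustrative and not part of this repository):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall per label, as reported in the Results section.
reported = {"B": (0.9997, 0.9988), "I": (0.9995, 0.9999)}

for label, (p, r) in reported.items():
    print(f"{label} F1: {100 * f1_score(p, r):.2f}%")
# B F1: 99.92%
# I F1: 99.97%
```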

Dataset

  • Train: 208,077 sequences — Val: 20,903 sequences
  • 200K synthetic sequences generated from 353K Tibetan dictionary terms (B/I labels derived from word boundaries)
  • 16K silver-standard sequences from Botok segmentation output
  • Input format: space-separated Tibetan syllables; each syllable ends with a tsheg (་)

Example:

Syllables: བྱང་  ཆུབ་  སེམས་  དཔའི་  སྤྱོད་  པ
Labels:    B     I     B      I      B      I
Words:     བྱང་ཆུབ་  /  སེམས་དཔའི་  /  སྤྱོད་པ
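
The relationship between word boundaries and B/I labels in the example can be sketched in a few lines. The function names `words_to_labels` and `labels_to_words` are illustrative assumptions, not the code used to build the dataset; each word is given as a list of its syllables.

```python
from typing import List

def words_to_labels(words: List[List[str]]) -> List[str]:
    """Label the first syllable of each word B and the remaining ones I."""
    labels = []
    for word in words:
        if not word:  # ignore empty words
            continue
        labels.extend(["B"] + ["I"] * (len(word) - 1))
    return labels

def labels_to_words(syllables: List[str], labels: List[str]) -> List[str]:
    """Join syllables into words, starting a new word at every B label."""
    words, current = [], []
    for syl, lbl in zip(syllables, labels):
        if lbl == "B" and current:
            words.append("".join(current))
            current = []
        current.append(syl)
    if current:
        words.append("".join(current))
    return words

# The example above, written as words made of syllables.
segmented = [["བྱང་", "ཆུབ་"], ["སེམས་", "དཔའི་"], ["སྤྱོད་", "པ"]]
syllables = [syl for word in segmented for syl in word]
labels = words_to_labels(segmented)

print(labels)  # ['B', 'I', 'B', 'I', 'B', 'I']
print(labels_to_words(syllables, labels))
```

Because a word always starts at a B label, the two functions are inverses of each other for any non-empty segmentation.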

Training Configuration

  • Base model: KoichiYasuoka/bert-base-tibetan
  • Optimizer: AdamW (lr=3e-5, weight_decay=0.01)
  • LR schedule: 10% linear warmup → constant
  • Epochs: 5 (early stopping patience=3; did not trigger)
  • Batch size: 64
  • Max sequence length: 256 syllable-tokens
  • Mixed precision: FP16 on NVIDIA A100 (40 GB)
  • Best checkpoint metric: B F1
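
The learning-rate schedule listed above (linear warmup over the first 10% of steps, then constant) can be written as a small function. This is a sketch under assumptions: the helper `lr_at_step` is illustrative, the ramp is assumed to start at zero, and the total step count assumes one optimizer update per batch of 64 with the last partial batch kept. None of these details are stated in the training configuration.

```python
import math

PEAK_LR = 3e-5        # from the training configuration
WARMUP_PERCENT = 10   # "10% linear warmup"

def lr_at_step(step: int, total_steps: int) -> float:
    """Learning rate for the 0-based optimizer step `step`."""
    warmup_steps = max(1, total_steps * WARMUP_PERCENT // 100)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return PEAK_LR * step / warmup_steps
    # Constant after warmup (no decay).
    return PEAK_LR

# Assumed step count: 208,077 training sequences, batch size 64, 5 epochs.
steps_per_epoch = math.ceil(208_077 / 64)
total_steps = 5 * steps_per_epoch

for step in (0, total_steps // 20, total_steps // 10, total_steps - 1):
    print(step, lr_at_step(step, total_steps))
```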

Usage (ONNX / Java — TradutorBUDA)

The model is loaded by TibetanSegmentationService.java using ONNX Runtime:

// env is an OrtEnvironment and opts an OrtSession.SessionOptions, both created beforehand
OrtSession session = env.createSession("models/segmentation_onnx/model_quantized.onnx", opts);

Inputs: input_ids, attention_mask, token_type_ids (each of shape [batch, 256])
Output: logits (shape: [batch, 256, 2]). Taking the argmax over the last axis gives the label ID per token (0 = B, 1 = I).

Usage (Python)

import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np
import re

tokenizer = AutoTokenizer.from_pretrained("trabten/tibetan_segmentation_BUDA")
session = ort.InferenceSession("model_quantized.onnx", providers=["CPUExecutionProvider"])

MAX_LENGTH = 256
ID2LABEL = {0: "B", 1: "I"}

def segment(text):
    syllables = [s for s in re.split(r'(?<=་)', text.strip()) if s.strip()]
    token_ids, word_ids = [], []
    for i, syl in enumerate(syllables):
        ids = tokenizer.encode(syl, add_special_tokens=False)
        token_ids.extend(ids)
        word_ids.extend([i] * len(ids))

    cls, sep = tokenizer.cls_token_id, tokenizer.sep_token_id
    input_ids = [cls] + token_ids[:MAX_LENGTH-2] + [sep]
    word_ids  = [-1]  + word_ids[:MAX_LENGTH-2]  + [-1]
    attn_mask = [1] * len(input_ids)
    tok_types = [0] * len(input_ids)

    pad = MAX_LENGTH - len(input_ids)
    input_ids += [tokenizer.pad_token_id] * pad
    attn_mask += [0] * pad
    tok_types += [0] * pad

    feeds = {
        "input_ids":      np.array([input_ids],  dtype=np.int64),
        "attention_mask": np.array([attn_mask],  dtype=np.int64),
        "token_type_ids": np.array([tok_types],  dtype=np.int64),
    }
    logits = session.run(None, feeds)[0][0]
    preds = logits.argmax(axis=-1)

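    # Keep the label predicted for the first sub-token of each syllable.
    # Syllables cut off by the MAX_LENGTH truncation above receive no label
    # and are dropped by the zip() over syllables below.
    syl_labels, prev = [], None
    for wid, pred in zip(word_ids, preds):
        if wid == -1:  # skip [CLS] and [SEP]
            continue
        if wid != prev:
            syl_labels.append(ID2LABEL[int(pred)])
        prev = wid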

    words, current = [], []
    for syl, lbl in zip(syllables, syl_labels):
        if lbl == "B" and current:
            words.append("".join(current))
            current = [syl]
        else:
            current.append(syl)
    if current:
        words.append("".join(current))
    return words

print(segment("བྱང་ཆུབ་སེམས་དཔའི་སྤྱོད་པ"))
# ['བྱང་ཆུབ་', 'སེམས་དཔའི་', 'སྤྱོད་པ']
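
The example above truncates its input to 256 tokens, so longer passages need to be split before segmentation. One possible way to do this, sketched below, is to group syllables into chunks that each fit the token budget. The helper `chunk_by_token_budget` is hypothetical and is not the approach used in TradutorBUDA; the default budget of 254 assumes two positions are reserved for [CLS] and [SEP].

```python
from typing import List

def chunk_by_token_budget(
    syllables: List[str],
    token_counts: List[int],
    budget: int = 254,
) -> List[List[str]]:
    """Greedily group syllables so each chunk uses at most `budget` tokens.

    `token_counts[i]` is the number of sub-word tokens of `syllables[i]`.
    A single syllable larger than the budget still gets its own chunk.
    """
    chunks, current, used = [], [], 0
    for syl, n in zip(syllables, token_counts):
        if current and used + n > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(syl)
        used += n
    if current:
        chunks.append(current)
    return chunks

# Toy example with a budget of 4 tokens and 2 tokens per syllable.
print(chunk_by_token_budget(list("abcde"), [2, 2, 2, 2, 2], budget=4))
# [['a', 'b'], ['c', 'd'], ['e']]
```

The token counts could be obtained with `len(tokenizer.encode(syl, add_special_tokens=False))`, as in `segment`, and each chunk joined with `"".join(chunk)` before being passed on. A chunk boundary may fall inside a word, so splitting at sentence punctuation first is likely to give cleaner results.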

Background

Tibetan is written without spaces between words, so segmentation is required before dictionary lookup, POS tagging, or machine translation. Traditional rule-based segmenters (e.g. Botok) rely on hand-curated lexicons and miss out-of-vocabulary terms. This model instead learns segmentation patterns from context and reaches 99.95% accuracy on its validation set of Buddhist canonical vocabulary.

References

  • Yasuoka, K. "Universal Dependencies for Classical Tibetan." (base model)
  • Esukhia. Botok — Tibetan NLP toolkit (silver-standard data source)
  • Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." (NAACL 2019)