Bo-Boundary-mmBert

A ModernBERT-architecture model (mmBERT-base) fine-tuned on Tibetan text for text-boundary detection, framed as token classification. Each token is labelled either B (boundary) or O (non-boundary).

Model Details

Property Value
Base model jhu-clsp/mmBERT-base
Architecture ModernBertForTokenClassification
Hidden size 768
Layers 22
Attention heads 12
Max sequence length 8,192
Vocab size 256,000
Labels O (non-boundary), B (boundary)

Training

Data

Split Windows
Train (raw) 62,416
Train (after negative sampling @ 0.1) 38,610
Validation 5,874

The boundary label is extremely rare (the B : O ratio is 1 : 638), so negative window sampling (ratio 0.1) and focal loss were used to handle the class imbalance.
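One possible reading of negative window sampling is sketched below: every window that contains at least one B label is kept, and windows with only O labels are kept with probability 0.1. This is an assumption; the card does not say how the ratio is applied (it could equally cap the number of all-O windows relative to the windows that contain a boundary). The function name `sample_windows`, the `keep_prob` parameter, and the list-of-label-lists input are hypothetical.

```python
import random


def sample_windows(windows, keep_prob=0.1, seed=0):
    """Down-sample windows that contain no boundary label.

    windows:   list of label sequences, e.g. [["O", "O", "B"], ["O", "O"]]
    keep_prob: assumed meaning of the "0.1" ratio, i.e. the probability of
               keeping a window that has no "B" label at all
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    kept = []
    for labels in windows:
        has_boundary = "B" in labels
        # Windows with a boundary are always kept; all-O windows are thinned.
        if has_boundary or rng.random() < keep_prob:
            kept.append(labels)
    return kept
```

Thinning only the all-O windows leaves every boundary example in the training set while reducing how often the model sees windows with nothing to detect.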

Hyperparameters

Parameter Value
Epochs 5
Batch size (per GPU) 1
Gradient accumulation steps 16
Effective batch size 16
Learning rate 2 × 10⁻⁵
Warmup steps 1,206 (10 %)
Total optimisation steps 12,065
Loss Focal Loss (gamma = 1.5, alpha = [O: 0.1, B: 0.9])
Early stopping patience 15 evals
Mixed precision bfloat16
torch.compile enabled
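For readers unfamiliar with the loss listed above, here is a minimal per-token sketch of the standard focal loss formulation with the card's gamma and alpha values. The exact implementation used in training is not given, so the way alpha is applied per class here is an assumption, and `focal_loss` is a hypothetical helper written in plain Python rather than the training code.

```python
import math

GAMMA = 1.5                      # focusing parameter from the table above
ALPHA = {"O": 0.1, "B": 0.9}     # per-class weights from the table above


def focal_loss(p_boundary, label, gamma=GAMMA, alpha=ALPHA):
    """Focal loss for a single token.

    p_boundary: predicted probability that the token is a boundary (B)
    label:      gold label, "B" or "O"
    """
    # Probability the model assigns to the correct class.
    p_t = p_boundary if label == "B" else 1.0 - p_boundary
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    # (1 - p_t) ** gamma shrinks the loss of tokens that are already easy.
    return -alpha[label] * (1.0 - p_t) ** gamma * math.log(p_t)
```

With gamma = 0 and all alpha weights set to 1 the expression reduces to ordinary cross-entropy; a larger gamma shifts the training signal towards tokens the model still gets wrong, and the alpha weights favour the rare B class.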

Infrastructure

  • GPU: 1 × NVIDIA GeForce RTX 4090 (23.5 GB)
  • Training time: ~24 h 52 min
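The step counts in the Hyperparameters table follow from the sampled training-set size and the batch configuration. The sketch below reproduces them. Rounding down the steps per epoch and the warmup count is an assumption, chosen because it matches the reported figures; the time per step is only a rough average and presumably includes evaluation passes. The helper `epoch_of` is hypothetical.

```python
train_windows = 38_610   # training windows after negative sampling
per_gpu_batch = 1
grad_accum = 16
gpus = 1
epochs = 5

effective_batch = per_gpu_batch * grad_accum * gpus   # 16
steps_per_epoch = train_windows // effective_batch    # 2413 (assumed floor)
total_steps = steps_per_epoch * epochs                # 12065
warmup_steps = total_steps // 10                      # 1206, i.e. 10 %

# Rough wall-clock cost per optimisation step for a run of 24 h 52 min.
train_seconds = 24 * 3600 + 52 * 60
seconds_per_step = train_seconds / total_steps


def epoch_of(step):
    """1-based epoch that a given optimisation step falls in."""
    return (step - 1) // steps_per_epoch + 1
```

Under these assumptions, `epoch_of` places step 9,800 in epoch 5, which agrees with the epoch reported for the best checkpoint.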

Training Curve

The model was evaluated every 200 optimisation steps. Key milestones:

Step    Epoch  Train Loss  Val F1  Val F2  Val Precision  Val Recall
200     1      0.0278      0.069   0.129   0.039          0.308
1,200   1      0.0051      0.370   0.521   0.249          0.717
2,600   2      0.0002      0.391   0.574   0.255          0.834
3,600   2      0.0002      0.396   0.581   0.258          0.846
5,400   3      0.0002      0.428   0.610   0.286          0.850
7,200   3      0.0002      0.445   0.627   0.300          0.862
9,200   4      0.0001      0.495   0.648   0.355          0.817
9,800   5      0.0001      0.496   0.651   0.355          0.823
12,065  5      0.0001      0.479   0.637   0.340          0.815

Best checkpoint selected by F2 score: 0.6510 at step 9,800 (epoch 5).
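F2 is the F-beta score with beta = 2, which weights recall more heavily than precision. The short sketch below shows the formula and checks it against the precision and recall reported at step 9,800; `f_beta` is a hypothetical helper, not part of the training code.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Validation precision and recall at step 9,800 from the table above.
f1_at_best = f_beta(0.355, 0.823, beta=1.0)
f2_at_best = f_beta(0.355, 0.823, beta=2.0)
```

Selecting on F2 rather than F1 fits a setting where a missed boundary is costlier than a spurious one, since spurious predictions can be filtered later.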

Evaluation on Benchmark

The best checkpoint was evaluated on 30 held-out benchmark documents with tolerance = 25 and threshold = 0.75.
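To illustrate what a tolerance-based evaluation can look like, the sketch below counts a predicted boundary as correct when it lies within the tolerance of a not-yet-matched true boundary. The card specifies neither the unit of the tolerance (characters or tokens) nor the matching procedure, so the one-to-one greedy matching and the name `match_boundaries` are assumptions.

```python
def match_boundaries(predicted, gold, tolerance=25):
    """Match predicted to true boundary positions one-to-one.

    Returns (true_positives, false_positives, false_negatives).
    Positions are integers in whatever unit the tolerance is expressed in.
    """
    pred = sorted(predicted)
    true = sorted(gold)
    i = j = tp = 0
    while i < len(pred) and j < len(true):
        if abs(pred[i] - true[j]) <= tolerance:
            # Close enough: consume both so neither can be matched twice.
            tp += 1
            i += 1
            j += 1
        elif pred[i] < true[j]:
            i += 1  # prediction too far left of any remaining true boundary
        else:
            j += 1  # true boundary has no prediction near it
    return tp, len(pred) - tp, len(true) - tp
```

For example, `match_boundaries([100, 500, 900], [110, 520, 2000])` yields two matches, one unmatched prediction, and one missed boundary.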

Aggregate Metrics

Metric Value
Micro Precision 0.644
Micro Recall 0.854
Micro F1 0.734
Macro F1 0.711
True Positives 543
False Positives 300
False Negatives 93
Total Predicted 843
Total True 636
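The micro metrics can be recomputed from the counts in the table, which also shows how micro and macro averaging differ. In this sketch micro scores pool the counts over all documents, while macro F1 averages per-document F1. Scoring a document with no matches as F1 = 0 is an assumption, and the helper names are hypothetical.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def micro_prf(doc_counts):
    """Pool (tp, fp, fn) over all documents, then compute the metrics."""
    tp = sum(c[0] for c in doc_counts)
    fp = sum(c[1] for c in doc_counts)
    fn = sum(c[2] for c in doc_counts)
    return prf(tp, fp, fn)


def macro_f1(doc_counts):
    """Average of per-document F1, so every document weighs the same."""
    return sum(prf(*c)[2] for c in doc_counts) / len(doc_counts)


# Aggregate counts reported in the table above.
micro_p, micro_r, micro_f1 = prf(543, 300, 93)
```

Rounded to three decimals, the pooled counts give 0.644, 0.854 and 0.734, matching the reported micro precision, recall and F1.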

Per-Document Highlights

Document                  Precision  Recall  F1
W8LS76156 (google_books)  0.931      1.000   0.964
W1KG16597 (google_books)  0.893      1.000   0.943
W1KG22443 (ocrv1)         0.886      0.987   0.934
W8LS31006 (ocrv1)         1.000      0.875   0.933
W3CN3089 (ocrv1)          0.864      1.000   0.927
W3KG466 (ocrv1)           0.882      0.938   0.909
W3KG439 (ocrv1)           0.767      0.959   0.853
IE3CN3396 (tei)           0.778      0.875   0.824

The model performs best on clean sources (google_books, ocrv1). It can struggle on noisier OCR and on pages with no true boundaries, where every predicted boundary is a false positive.

Limitations

  • The model was trained on a specific corpus of Tibetan texts; generalisation to unseen document styles or OCR engines may vary.
  • High recall (0.854) combined with moderate precision (0.644) means the model tends to over-predict boundaries; downstream consumers should apply confidence thresholding.
  • Pages with zero true boundaries can produce false-positive boundary predictions.
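Confidence thresholding of the kind recommended above could look like the following sketch, which turns per-token logits into boundary positions using the 0.75 threshold from the benchmark evaluation. It assumes each token has two logits ordered (O, B); that index order and both helper names are assumptions, so check the model's label mapping before relying on them.

```python
import math


def boundary_probability(logit_o, logit_b):
    """Two-class softmax probability of the B label."""
    m = max(logit_o, logit_b)  # subtract the max for numerical stability
    exp_o = math.exp(logit_o - m)
    exp_b = math.exp(logit_b - m)
    return exp_b / (exp_o + exp_b)


def predict_boundaries(token_logits, threshold=0.75):
    """Indices of tokens whose B probability reaches the threshold.

    token_logits: list of (logit_o, logit_b) pairs, one per token
                  (assumed order: O first, B second)
    """
    return [
        i
        for i, (logit_o, logit_b) in enumerate(token_logits)
        if boundary_probability(logit_o, logit_b) >= threshold
    ]
```

Raising the threshold trades recall for precision, which may help on pages that are expected to contain few or no boundaries.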

Credits

This model was trained by Dharmaduta from specifications provided by the Buddhist Digital Resource Center (BDRC) for the BDRC Etext Corpus, with funding from the Khyentse Foundation.
