Bo-Boundary-mmBert
A ModernBERT model fine-tuned on Tibetan text for text-boundary detection (token classification). Each token is classified as either a boundary token (B) or a non-boundary token (O).
Model Details
| Property | Value |
|---|---|
| Base model | jhu-clsp/mmBERT-base |
| Architecture | ModernBertForTokenClassification |
| Hidden size | 768 |
| Layers | 22 |
| Attention heads | 12 |
| Max sequence length | 8,192 |
| Vocab size | 256,000 |
| Labels | O (non-boundary), B (boundary) |
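
A minimal inference sketch follows. It assumes the checkpoint is published on the Hugging Face Hub as `BDRC/Bo-Boundary-mmBert`, that label index 1 corresponds to B (check `model.config.id2label` before relying on this), and that `torch` and a `transformers` release with ModernBERT support are installed. The helper names `token_boundary_probs` and `boundary_offsets` are illustrative and not part of the released model.

```python
def token_boundary_probs(text, model_id="BDRC/Bo-Boundary-mmBert", b_index=1):
    """Return [(char_start, char_end, P(B))] for every non-special token."""
    # Imported lazily so the pure helper below works without these packages.
    import torch
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(model_id)
    model.eval()

    enc = tokenizer(text, return_offsets_mapping=True, truncation=True,
                    max_length=8192, return_tensors="pt")
    # The model does not accept offsets, so remove them before the forward pass.
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        logits = model(**enc).logits[0]  # shape: (seq_len, num_labels)
    probs = torch.softmax(logits, dim=-1)[:, b_index].tolist()
    # Special tokens carry an empty (0, 0) span; drop them.
    return [(s, e, p) for (s, e), p in zip(offsets, probs) if e > s]


def boundary_offsets(token_probs, threshold=0.75):
    """Character offsets of tokens whose boundary probability meets the threshold."""
    return [start for start, _end, p in token_probs if p >= threshold]
```

Calling `boundary_offsets(token_boundary_probs(text))` would then give candidate boundary positions as character offsets into `text`. The default of 0.75 mirrors the threshold quoted in the benchmark section; whether that value is applied to the softmax probability in exactly this way is an assumption.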
Training
Data
| Split | Windows |
|---|---|
| Train (raw) | 62,416 |
| Train (after negative sampling, ratio 0.1) | 38,610 |
| Validation | 5,874 |
The boundary label is extremely rare (B : O ratio of 1 : 638), so negative window sampling (ratio 0.1) and focal loss were used to handle the class imbalance.
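
One plausible reading of negative window sampling is sketched below: every window containing at least one B label is kept, and each all-O window is kept with probability equal to the ratio. This interpretation, the function name `sample_windows`, and the representation of a window as a list of label strings are assumptions; the actual training pipeline may differ.

```python
import random


def sample_windows(windows, negative_ratio=0.1, seed=0):
    """Keep all windows with a B label; keep all-O windows with prob. negative_ratio."""
    rng = random.Random(seed)  # seeded for reproducible sampling
    kept = []
    for labels in windows:
        # Short-circuit: the random draw only happens for negative windows.
        if "B" in labels or rng.random() < negative_ratio:
            kept.append(labels)
    return kept
```

Under this scheme the positive windows are never discarded, so the share of boundary tokens in the training set rises while the model still sees some boundary-free text.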
Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 5 |
| Batch size (per GPU) | 1 |
| Gradient accumulation steps | 16 |
| Effective batch size | 16 |
| Learning rate | 2 × 10⁻⁵ |
| Warmup steps | 1,206 (10 %) |
| Total optimisation steps | 12,065 |
| Loss | Focal Loss (gamma = 1.5, alpha = [O: 0.1, B: 0.9]) |
| Early stopping patience | 15 evaluations |
| Mixed precision | bfloat16 |
| torch.compile | enabled |
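
The focal loss in the table can be illustrated for a single token as follows. This is a sketch of the standard formulation, FL = -alpha_t * (1 - p_t)^gamma * log(p_t), using the gamma and alpha values listed above. The index order (0 for O, 1 for B) and the function name are assumptions, and the training code's reduction over tokens is not shown.

```python
import math


def focal_loss(logits, target, gamma=1.5, alpha=(0.1, 0.9)):
    """Focal loss for one token. logits: [z_O, z_B]; target: 0 for O, 1 for B."""
    # Numerically stable softmax probability of the true class.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    p_t = exps[target] / sum(exps)
    # (1 - p_t)^gamma shrinks the loss of tokens that are already classified well.
    return -alpha[target] * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` and `alpha=(1, 1)` this reduces to ordinary cross-entropy. The larger alpha for B up-weights the rare class, while the modulating factor keeps the many easy O tokens from dominating the gradient.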
Infrastructure
- GPU: 1 × NVIDIA GeForce RTX 4090 (23.5 GB)
- Training time: ~24 h 52 min
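
The step counts in the hyperparameter table follow from the sampled window count and the batch settings, and dividing the wall-clock time by the step count gives a rough throughput figure. The sketch below assumes partial steps are floored, and the reported training time may well include the periodic evaluation passes, so the per-step figure is an upper bound on pure training cost.

```python
windows = 38_610  # training windows after negative sampling
per_gpu_batch, grad_accum, gpus, epochs = 1, 16, 1, 5

effective_batch = per_gpu_batch * grad_accum * gpus  # 16
total_steps = (windows * epochs) // effective_batch  # 12065 optimisation steps
warmup_steps = int(0.10 * total_steps)               # 1206 (10 % warmup)

train_seconds = 24 * 3600 + 52 * 60                  # 24 h 52 min = 89520 s
seconds_per_step = train_seconds / total_steps       # roughly 7.4 s per step
```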
Training Curve
The model was evaluated every 200 optimisation steps. Key milestones:
| Step | Epoch | Train Loss | Val F1 | Val F2 | Val Precision | Val Recall |
|---|---|---|---|---|---|---|
| 200 | 1 | 0.0278 | 0.069 | 0.129 | 0.039 | 0.308 |
| 1,200 | 1 | 0.0051 | 0.370 | 0.521 | 0.249 | 0.717 |
| 2,600 | 2 | 0.0002 | 0.391 | 0.574 | 0.255 | 0.834 |
| 3,600 | 2 | 0.0002 | 0.396 | 0.581 | 0.258 | 0.846 |
| 5,400 | 3 | 0.0002 | 0.428 | 0.610 | 0.286 | 0.850 |
| 7,200 | 3 | 0.0002 | 0.445 | 0.627 | 0.300 | 0.862 |
| 9,200 | 4 | 0.0001 | 0.495 | 0.648 | 0.355 | 0.817 |
| 9,800 | 5 | 0.0001 | 0.496 | 0.651 | 0.355 | 0.823 |
| 12,065 | 5 | 0.0001 | 0.479 | 0.637 | 0.340 | 0.815 |
Best checkpoint selected by F2 score: 0.6510 at step 9,800 (epoch 5).
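
For reference, F2 is the F-beta score with beta = 2, which weights recall more heavily than precision. The sketch below recomputes it from the rounded precision and recall values of a few milestone rows and picks the best step; the function and variable names are illustrative.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# step -> (validation precision, validation recall), copied from the table above
milestones = {
    7200: (0.300, 0.862),
    9200: (0.355, 0.817),
    9800: (0.355, 0.823),
    12065: (0.340, 0.815),
}
best_step = max(milestones, key=lambda s: f_beta(*milestones[s]))
```

Selecting on F2 rather than F1 fits a setting where missing a boundary costs more than flagging a spurious one.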
Evaluation on Benchmark
The best checkpoint was evaluated on 30 held-out benchmark documents with tolerance = 25 and threshold = 0.75.
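
Tolerance-based scoring can be sketched as below. The source does not state the unit of the tolerance (characters or tokens) or the exact matching rule, so this example assumes integer positions and a greedy one-to-one match to the nearest unmatched true boundary; `match_boundaries` is a hypothetical helper, not the benchmark script.

```python
def match_boundaries(predicted, true, tolerance=25):
    """Greedy one-to-one matching within a tolerance; returns (tp, fp, fn)."""
    unmatched = sorted(true)
    tp = 0
    for p in sorted(predicted):
        # Nearest true boundary that has not been claimed yet.
        best = min(unmatched, key=lambda t: abs(t - p), default=None)
        if best is not None and abs(best - p) <= tolerance:
            unmatched.remove(best)  # each true boundary can be matched once
            tp += 1
    fp = len(predicted) - tp
    fn = len(true) - tp
    return tp, fp, fn
```

For example, `match_boundaries([100, 110, 500], [105, 900])` counts one true positive, two false positives and one false negative, and any prediction on a page with no true boundaries is a false positive.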
Aggregate Metrics
| Metric | Value |
|---|---|
| Micro Precision | 0.644 |
| Micro Recall | 0.854 |
| Micro F1 | 0.734 |
| Macro F1 | 0.711 |
| True Positives | 543 |
| False Positives | 300 |
| False Negatives | 93 |
| Total Predicted | 843 |
| Total True | 636 |
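
The micro-averaged figures follow directly from the pooled counts in the table. A short check, with `micro_prf` as an illustrative helper name:

```python
def micro_prf(tp, fp, fn):
    """Micro precision, recall and F1 from counts pooled over all documents."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1


precision, recall, f1 = micro_prf(tp=543, fp=300, fn=93)
```

Macro F1, by contrast, averages the per-document F1 scores, so small documents weigh as much as large ones; it cannot be recomputed from the pooled counts alone.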
Per-Document Highlights
| Document | Precision | Recall | F1 |
|---|---|---|---|
| W8LS76156 (google_books) | 0.931 | 1.000 | 0.964 |
| W1KG16597 (google_books) | 0.893 | 1.000 | 0.943 |
| W1KG22443 (ocrv1) | 0.886 | 0.987 | 0.934 |
| W8LS31006 (ocrv1) | 1.000 | 0.875 | 0.933 |
| W3CN3089 (ocrv1) | 0.864 | 1.000 | 0.927 |
| W3KG466 (ocrv1) | 0.882 | 0.938 | 0.909 |
| W3KG439 (ocrv1) | 0.767 | 0.959 | 0.853 |
| IE3CN3396 (tei) | 0.778 | 0.875 | 0.824 |
The model performs best on clean sources (google_books, ocrv1). It can struggle on noisier OCR and on pages with no true boundaries, where every predicted boundary counts as a false positive.
Limitations
- The model was trained on a specific corpus of Tibetan texts; generalisation to unseen document styles or OCR engines may vary.
- High recall (0.854) but moderate precision (0.644) means the model tends to over-predict boundaries; downstream consumers should apply confidence thresholding.
- Pages with zero true boundaries can produce false-positive boundary predictions.
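
To choose a confidence threshold for a downstream use case, one option is to sweep candidate thresholds over a small labelled sample and inspect the precision/recall trade-off. The sketch below assumes each prediction has already been scored and marked correct or incorrect; the function name and the convention of reporting precision as 0.0 when nothing is kept are choices made for this example.

```python
def sweep_thresholds(scored, n_true, thresholds=(0.5, 0.75, 0.9)):
    """scored: [(confidence, is_correct)]. Returns [(threshold, precision, recall)]."""
    rows = []
    for th in thresholds:
        kept = [ok for score, ok in scored if score >= th]
        tp = sum(kept)  # True counts as 1
        precision = tp / len(kept) if kept else 0.0
        recall = tp / n_true if n_true else 0.0
        rows.append((th, precision, recall))
    return rows


# Hypothetical scored predictions, for illustration only.
toy = [(0.95, True), (0.8, True), (0.7, False), (0.6, True), (0.55, False)]
rows = sweep_thresholds(toy, n_true=4)
```

Raising the threshold trades recall for precision, which may be preferable when false boundaries are costly, for instance on pages that are likely to contain none.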
Credits
This model was trained by Dharmaduta from specifications provided by the Buddhist Digital Resource Center (BDRC) for the BDRC Etext Corpus, with funding from the Khyentse Foundation.