Bo-Boundary-mmBert

A ModernBERT model fine-tuned on Tibetan text for text-boundary detection (token classification). Each token is classified as either a boundary token (B) or a non-boundary token (O).

Model Details

Property	Value
Base model	jhu-clsp/mmBERT-base
Architecture	`ModernBertForTokenClassification`
Hidden size	768
Layers	22
Attention heads	12
Max sequence length	8,192
Vocab size	256,000
Labels	`O` (non-boundary), `B` (boundary)
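
A minimal inference sketch follows, assuming the checkpoint is published under the repository id `BDRC/Bo-Boundary-mmBert`, that it loads through the standard `transformers` auto classes, and that its config exposes a `label2id` entry for `B`. The function name `predict_boundary_probs` is hypothetical and not part of any released API.

```python
def predict_boundary_probs(text, model_id="BDRC/Bo-Boundary-mmBert"):
    """Return [(char_start, char_end, p_boundary), ...] for each token of `text`."""
    # Heavy dependencies are imported lazily so that defining this helper
    # does not require torch/transformers to be installed.
    import torch
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(model_id)
    model.eval()

    enc = tokenizer(
        text,
        return_tensors="pt",
        return_offsets_mapping=True,  # character span of every token
        truncation=True,
        max_length=8192,              # max sequence length from the table above
    )
    # The model does not accept offsets, so remove them before the forward pass.
    offsets = enc.pop("offset_mapping")[0].tolist()

    with torch.no_grad():
        logits = model(**enc).logits[0]  # shape: (num_tokens, num_labels)

    # Assumption: the checkpoint config maps the label "B" to a class index.
    b_index = model.config.label2id["B"]
    probs = torch.softmax(logits, dim=-1)[:, b_index].tolist()

    # Special tokens carry empty (0, 0) offsets; drop them.
    return [(s, e, p) for (s, e), p in zip(offsets, probs) if e > s]
```

Calling `predict_boundary_probs(some_tibetan_text)` downloads the weights on first use and returns one probability per token, which can then be thresholded to obtain boundary positions.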

Training

Data

Split	Windows
Train (raw)	62,416
Train (after negative sampling @ 0.1)	38,610
Validation	5,874

The boundary label is extremely rare (the B : O ratio is 1 : 638), so negative window sampling (ratio 0.1) and focal loss were used to counter the class imbalance.
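
One plausible reading of negative window sampling is sketched below: every window that contains at least one `B` label is kept, and each window without one is kept with probability 0.1. This interpretation, the function name `sample_windows`, and the toy data are assumptions; the actual training pipeline may differ.

```python
import random

def sample_windows(windows, neg_ratio=0.1, seed=0):
    """Keep every window with a "B" label; keep the rest with probability neg_ratio."""
    rng = random.Random(seed)  # seeded for reproducibility
    kept = []
    for labels in windows:
        has_boundary = "B" in labels
        if has_boundary or rng.random() < neg_ratio:
            kept.append(labels)
    return kept

# Toy data: 3 windows with a boundary, 1000 windows without one.
windows = [["O", "O", "B", "O"]] * 3 + [["O", "O", "O", "O"]] * 1000
kept = sample_windows(windows)
n_pos = sum("B" in w for w in kept)
n_neg = len(kept) - n_pos
print(n_pos, n_neg)  # all 3 positives survive; roughly 10 % of negatives do
```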

Hyperparameters

Parameter	Value
Epochs	5
Batch size (per GPU)	1
Gradient accumulation steps	16
Effective batch size	16
Learning rate	2 × 10^-5
Warmup steps	1,206 (10 %)
Total optimisation steps	12,065
Loss	Focal Loss (gamma = 1.5, alpha = [O: 0.1, B: 0.9])
Early stopping patience	15 evals
Mixed precision	bfloat16
torch.compile	enabled
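
To illustrate the loss setting in the table, here is a per-token sketch of the standard alpha-weighted focal loss, FL = -alpha_t (1 - p_t)^gamma log(p_t), with gamma = 1.5 and the listed alpha weights. It assumes training used this common formulation; the training code itself is not shown in this card.

```python
import math

ALPHA = {"O": 0.1, "B": 0.9}  # class weights from the hyperparameter table
GAMMA = 1.5                   # focusing parameter from the hyperparameter table

def focal_loss(p_true, label, gamma=GAMMA, alpha=ALPHA):
    """Focal loss for one token, given the probability assigned to its true label."""
    return -alpha[label] * (1.0 - p_true) ** gamma * math.log(p_true)

easy_o = focal_loss(0.99, "O")  # confidently correct non-boundary token
hard_b = focal_loss(0.20, "B")  # boundary token the model mostly missed
print(easy_o, hard_b)
```

The easy `O` token contributes almost nothing, while the poorly predicted `B` token dominates, which is the intended effect when one class outnumbers the other by several hundred to one.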

Infrastructure

GPU: 1 × NVIDIA GeForce RTX 4090 (23.5 GB)
Training time: ~24 h 52 min

Training Curve

The model was evaluated every 200 optimisation steps. Key milestones:

Step	Epoch	Train Loss	Val F1	Val F2	Val Precision	Val Recall
200	1	0.0278	0.069	0.129	0.039	0.308
1,200	1	0.0051	0.370	0.521	0.249	0.717
2,600	2	0.0002	0.391	0.574	0.255	0.834
3,600	2	0.0002	0.396	0.581	0.258	0.846
5,400	3	0.0002	0.428	0.610	0.286	0.850
7,200	3	0.0002	0.445	0.627	0.300	0.862
9,200	4	0.0001	0.495	0.648	0.355	0.817
9,800	5	0.0001	0.496	0.651	0.355	0.823
12,065	5	0.0001	0.479	0.637	0.340	0.815

Best checkpoint selected by F2 score: 0.6510 at step 9,800 (epoch 5).
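
The F2 score weights recall more heavily than precision. The sketch below recomputes F1 and F2 from the precision and recall values of the last three rows of the table; the helper name `f_beta` is illustrative.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# (step, validation precision, validation recall) taken from the table above
rows = [(9200, 0.355, 0.817), (9800, 0.355, 0.823), (12065, 0.340, 0.815)]

best_step = max(rows, key=lambda r: f_beta(r[1], r[2], beta=2.0))[0]
print(best_step, round(f_beta(0.355, 0.823, beta=2.0), 3))
```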

Evaluation on Benchmark

The best checkpoint was evaluated on 30 held-out benchmark documents with tolerance = 25 and threshold = 0.75.
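
Tolerance-based scoring can be sketched as follows, under the assumptions that the tolerance is a distance in characters, that the threshold is a cutoff on the predicted `B` probability applied beforehand, and that each true boundary can be matched by at most one prediction. The card does not specify these details, and `match_boundaries` is a hypothetical helper.

```python
def match_boundaries(predicted, true, tolerance=25):
    """Greedy one-to-one matching of boundary positions; returns (tp, fp, fn)."""
    predicted, true = sorted(predicted), sorted(true)
    tp = 0
    i = j = 0
    while i < len(predicted) and j < len(true):
        diff = predicted[i] - true[j]
        if abs(diff) <= tolerance:
            tp += 1   # prediction lies within the tolerance of a true boundary
            i += 1
            j += 1
        elif diff < 0:
            i += 1    # prediction is too far before the next true boundary
        else:
            j += 1    # true boundary has no prediction close enough
    return tp, len(predicted) - tp, len(true) - tp

print(match_boundaries([100, 130, 500], [110, 480, 900]))
print(match_boundaries([40], []))  # page with no true boundaries
```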

Aggregate Metrics

Metric	Value
Micro Precision	0.644
Micro Recall	0.854
Micro F1	0.734
Macro F1	0.711
True Positives	543
False Positives	300
False Negatives	93
Total Predicted	843
Total True	636
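
The micro-averaged figures follow directly from the pooled counts in the table, as this short check shows. Macro F1, by contrast, is the mean of the per-document F1 scores and cannot be derived from the pooled counts alone.

```python
def prf_from_counts(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1

# Pooled counts over the 30 benchmark documents
p, r, f1 = prf_from_counts(tp=543, fp=300, fn=93)
print(round(p, 3), round(r, 3), round(f1, 3))
```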

Per-Document Highlights

Document	Precision	Recall	F1
W8LS76156 (google_books)	0.931	1.000	0.964
W1KG16597 (google_books)	0.893	1.000	0.943
W1KG22443 (ocrv1)	0.886	0.987	0.934
W8LS31006 (ocrv1)	1.000	0.875	0.933
W3CN3089 (ocrv1)	0.864	1.000	0.927
W3KG466 (ocrv1)	0.882	0.938	0.909
W3KG439 (ocrv1)	0.767	0.959	0.853
IE3CN3396 (tei)	0.778	0.875	0.824

The model performs best on clean sources (google_books, ocrv1) and can struggle on noisier OCR. On pages with no true boundaries, any boundary it predicts counts as a false positive.

Limitations

The model was trained on a specific corpus of Tibetan texts; generalisation to unseen document styles or OCR engines may vary.
High recall (0.854) but moderate precision (0.644) means the model tends to over-predict boundaries. Downstream consumers should apply a confidence threshold (the benchmark above used 0.75) and raise it if precision matters more than recall.
Pages with zero true boundaries can produce false-positive boundary predictions.
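
Confidence thresholding can be as simple as the sketch below, which assumes per-token boundary probabilities are available as `(char_start, char_end, p_boundary)` tuples and that a boundary is reported at the start of each token whose probability reaches the threshold. The helper name, the tuple layout, and the toy values are illustrative assumptions.

```python
def boundaries_from_probs(token_probs, threshold=0.75):
    """Return the start offset of every token whose boundary probability >= threshold."""
    return [start for start, _end, p in token_probs if p >= threshold]

# Toy per-token output: (char_start, char_end, p_boundary)
token_probs = [(0, 3, 0.02), (3, 7, 0.81), (7, 9, 0.60), (9, 14, 0.97)]

print(boundaries_from_probs(token_probs))                 # default threshold 0.75
print(boundaries_from_probs(token_probs, threshold=0.9))  # stricter, fewer boundaries
```

Raising the threshold trades recall for precision, which may help on pages where no boundary is expected.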

Credits

This model was trained by Dharmaduta from specifications provided by the Buddhist Digital Resource Center (BDRC) for the BDRC Etext Corpus, with funding from the Khyentse Foundation.
