bod_uchen — Fine-Tuned Tesseract Model for Tibetan Uchen Script
A fine-tuned Tesseract OCR model for recognizing Tibetan Uchen (དབུ་ཅན་) script, trained on 121K+ line-level image–text pairs from historical Tibetan woodblock prints.
Model Details
| Field | Value |
|---|---|
| Model name | bod_uchen |
| Base model | bod.traineddata from tessdata_best |
| Script | Tibetan Uchen (དབུ་ཅན་) |
| Language | Tibetan (bod) |
| Engine | Tesseract LSTM |
| PSM | 13 (Raw line) |
| Epochs | 5 |
| Learning rate | 0.001 |
| Target error rate | 0.005 |
Training Data
The model was trained on a curated subset of Tibetan OCR line-level images paired with ground-truth text transcriptions, formatted for Tesseract training. The dataset is derived from openpecha/OCR-Tibetan_line_to_text_benchmark and prepared by the bo_tessaract_data_prep pipeline.
| Split | Directory | Samples |
|---|---|---|
| Train | train-ground-truth/ | 96,978 |
| Eval (Val) | val-ground-truth/ | 12,123 |
| Test | test-ground-truth/ | 12,125 |
| Total | | 121,226 |
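The counts work out to roughly an 80/10/10 split. A quick check, using only the figures from the table (the variable names are illustrative):

```python
# Sample counts copied from the split table.
splits = {"train": 96_978, "val": 12_123, "test": 12_125}

total = sum(splits.values())
# Share of each split as a percentage of all samples.
shares = {name: 100 * count / total for name, count in splits.items()}

print(f"total: {total:,}")
for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
```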
Training Results
Training converged at iteration 526,010 / 545,515 total:
| Metric | Value |
|---|---|
| Best BCER (selected model) | 9.719% |
| Final BCER (train) | 10.690% |
| Mean RMS | 1.077% |
| Delta | 2.498% |
| Skip ratio | 0.000% |
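For readers unfamiliar with the metric: the sketch below assumes that BCER is the bag-of-characters error rate that Tesseract's `lstmtraining` writes to its log. That reading is an assumption, not something this card states. It compares per-character counts and ignores character order. Tesseract computes the value over its own unicharset labels, while this sketch works on Unicode code points, so it illustrates the idea and will not reproduce the numbers above. The function name `bag_of_chars_error` is hypothetical.

```python
from collections import Counter


def bag_of_chars_error(truth: str, ocr: str) -> float:
    """Bag-of-characters error rate between a reference and an OCR string.

    Character order is ignored: only per-character count mismatches are
    penalized, normalized by the length of the reference.
    """
    truth_counts = Counter(truth)
    ocr_counts = Counter(ocr)
    # Sum of absolute count differences over every character seen in either string.
    errors = sum(
        abs(truth_counts[ch] - ocr_counts[ch])
        for ch in truth_counts.keys() | ocr_counts.keys()
    )
    if len(truth) <= errors:
        # Clamp to 1.0; two empty strings count as a perfect match.
        return 0.0 if errors == 0 else 1.0
    return errors / len(truth)


if __name__ == "__main__":
    # One vowel sign is missing from the OCR output.
    print(bag_of_chars_error("བཀྲ་ཤིས", "བཀྲ་ཤས"))
```

Because order is ignored, this measure can understate errors compared with an edit-distance character error rate. Keep that in mind when comparing against figures reported for other models.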
Collections Included
Five historical Tibetan text collections are included from the source dataset:
| Collection | Description |
|---|---|
| Lithang_Kanjur | Lithang edition of the Tibetan Buddhist canon (Kangyur) |
| Derge_Tenjur | Derge edition of the Tengyur commentarial collection |
| Lhasa_Kanjur | Lhasa edition of the Kangyur |
| KhyentseWangpo | Writings of Jamyang Khyentse Wangpo |
| Karmapa8 | Texts attributed to the 8th Karmapa, Mikyö Dorje |
Usage
Prerequisites
Install Tesseract 5.x:
# macOS
brew install tesseract
# Ubuntu/Debian
sudo apt install tesseract-ocr
Download & Install the Model
Download bod_uchen.traineddata from this repository and place it in your Tesseract tessdata directory:
# Find your tessdata path (shown in the first line of the output)
tesseract --list-langs
# Typical locations:
# macOS (Homebrew): /opt/homebrew/share/tessdata/
# Linux: /usr/share/tesseract-ocr/5/tessdata/
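The copy step can also be scripted. The following is a minimal sketch, assuming the model file has already been downloaded locally; the helper name `install_model` is hypothetical, and writing to a system tessdata directory may require elevated permissions.

```python
import shutil
from pathlib import Path


def install_model(model_file: str, tessdata_dir: str) -> Path:
    """Copy a .traineddata file into a tessdata directory and return its new path."""
    src = Path(model_file)
    dest_dir = Path(tessdata_dir)
    if src.suffix != ".traineddata":
        raise ValueError(f"expected a .traineddata file, got {src.name}")
    if not dest_dir.is_dir():
        raise FileNotFoundError(f"tessdata directory not found: {dest_dir}")
    # copy2 preserves file metadata and returns the destination path.
    return Path(shutil.copy2(src, dest_dir / src.name))


if __name__ == "__main__":
    # Example paths; adjust to your system.
    print(install_model("bod_uchen.traineddata", "/opt/homebrew/share/tessdata"))
```

If you prefer not to touch the system directory, Tesseract also accepts a `--tessdata-dir` option and honors the `TESSDATA_PREFIX` environment variable, so the file can live in any folder you control.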
Run OCR
tesseract input_image.png output -l bod_uchen --psm 13
For an image that contains a single uniform block of several lines, use PSM 6:
tesseract input_image.png output -l bod_uchen --psm 6
Python (pytesseract)
import pytesseract
from PIL import Image
image = Image.open("tibetan_line.png")
text = pytesseract.image_to_string(image, lang="bod_uchen", config="--psm 13")
print(text)
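Since the model works best on pre-segmented lines, a common pattern is to run it over a folder of line images. The sketch below assumes one PNG per line in a single directory; the helper names `find_line_images` and `ocr_lines` and the folder name `lines/` are illustrative.

```python
from pathlib import Path


def find_line_images(folder: str, pattern: str = "*.png") -> list[Path]:
    """Return line images in `folder`, sorted by name for a stable order."""
    return sorted(Path(folder).glob(pattern))


def ocr_lines(folder: str, lang: str = "bod_uchen") -> dict[str, str]:
    """Run single-line OCR (PSM 13) on every image; map file name to text."""
    # Imported here so the helper above works without the OCR dependencies.
    import pytesseract
    from PIL import Image

    results = {}
    for path in find_line_images(folder):
        with Image.open(path) as image:
            text = pytesseract.image_to_string(image, lang=lang, config="--psm 13")
        results[path.name] = text.strip()
    return results


if __name__ == "__main__":
    for name, text in ocr_lines("lines").items():
        print(f"{name}\t{text}")
```

Sorting by file name keeps the output in reading order only if the line images are named sequentially (for example with zero-padded indices).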
Training Command
The model was fine-tuned using tesstrain:
make training \
MODEL_NAME=bod_uchen \
START_MODEL=bod \
TESSDATA=data \
GROUND_TRUTH_DIR=data/bo_tesseract \
LANG_TYPE=Indic \
PSM=13 \
EPOCHS=5 \
LEARNING_RATE=0.001 \
TARGET_ERROR_RATE=0.005 \
WORDLIST_FILE=data/langdata/bod/bod.wordlist \
NUMBERS_FILE=data/langdata/bod/bod.numbers \
PUNC_FILE=data/langdata/bod/bod.punc \
2>&1 | tee -a data/bod_uchen/training.log
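tesstrain expects each line image in the ground-truth directory (`.png`, `.bin.png`, `.nrm.png`, or `.tif`) to have a transcription with the same base name and the extension `.gt.txt`. Before starting a long run it can help to check that every image has a partner. The following is a minimal sketch of such a check; the function names are hypothetical and not part of tesstrain.

```python
from pathlib import Path

# Longer suffixes first so "x.bin.png" is not reduced to "x.bin".
IMAGE_SUFFIXES = (".bin.png", ".nrm.png", ".png", ".tif")
TEXT_SUFFIX = ".gt.txt"


def stem_of(name: str, suffixes=IMAGE_SUFFIXES):
    """Strip a known image suffix; return None if the name has none."""
    for suffix in suffixes:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return None


def check_pairs(ground_truth_dir: str):
    """Return (images without .gt.txt, .gt.txt without image) as sorted base names."""
    names = [p.name for p in Path(ground_truth_dir).iterdir() if p.is_file()]
    image_stems = {s for s in (stem_of(n) for n in names) if s is not None}
    text_stems = {n[: -len(TEXT_SUFFIX)] for n in names if n.endswith(TEXT_SUFFIX)}
    return sorted(image_stems - text_stems), sorted(text_stems - image_stems)


if __name__ == "__main__":
    missing_text, missing_image = check_pairs("data/bo_tesseract")
    print(f"images without transcription: {len(missing_text)}")
    print(f"transcriptions without image: {len(missing_image)}")
```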
Intended Use
- OCR of historical Tibetan woodblock-printed texts in Uchen script
- Digitization of Kangyur, Tengyur, and other classical Tibetan collections
- Building searchable archives of Tibetan Buddhist literature
Limitations
- Optimized for Uchen (དབུ་ཅན་) woodblock print style; may underperform on Umé (དབུ་མེད་) or handwritten Tibetan
- Trained primarily on the five collections listed above; generalization to other print styles or modern typeset Tibetan may vary
- Best results when input images are pre-segmented into individual text lines (PSM 13)
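For page images that are not yet split into lines, one simple way to pre-segment them is a horizontal projection profile: count the dark pixels in each row and cut wherever the count drops to zero. The sketch below is not the method used to build the training data. It assumes clean, unskewed scans, and woodblock prints with touching or slanted lines will usually need a dedicated layout tool. The function name and thresholds are illustrative.

```python
def find_line_bands(ink_per_row, min_ink=1, min_height=3):
    """Return (top, bottom) row ranges where the ink count is at least min_ink.

    ink_per_row: number of dark pixels in each image row, top to bottom.
    bottom is exclusive, so image.crop((0, top, width, bottom)) cuts one band.
    Bands shorter than min_height rows are treated as noise and dropped.
    """
    bands = []
    start = None
    for y, ink in enumerate(ink_per_row):
        if ink >= min_ink and start is None:
            start = y  # entering a text band
        elif ink < min_ink and start is not None:
            if y - start >= min_height:
                bands.append((start, y))
            start = None  # leaving the band
    # Close a band that runs to the bottom edge of the image.
    if start is not None and len(ink_per_row) - start >= min_height:
        bands.append((start, len(ink_per_row)))
    return bands


if __name__ == "__main__":
    profile = [0, 0, 5, 6, 7, 0, 0, 4, 4, 4, 4, 0, 2, 0]
    print(find_line_bands(profile))
```

With Pillow and NumPy, the row profile can be obtained from a grayscale page with, for example, `(np.asarray(img.convert("L")) < 128).sum(axis=1)`. Each resulting band can then be cropped and passed to the model with `--psm 13`.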
Citation
If you use this model, please cite:
@misc{bod_uchen_tesseract,
title = {bod_uchen: Fine-Tuned Tesseract Model for Tibetan Uchen Script},
author = {Buddhist Digital Resource Center (BDRC) and Dharmaduta},
year = {2025},
url = {https://huggingface.co/bdrc/bod_uchen_tesseract},
note = {Fine-tuned from tessdata_best/bod.traineddata, funded by the Khyentse Foundation}
}
Acknowledgements
This model was developed by Dharmaduta from specifications provided by the Buddhist Digital Resource Center (BDRC) for the BDRC Etext Corpus, with funding from the Khyentse Foundation.