bod_uchen — Fine-Tuned Tesseract Model for Tibetan Uchen Script

A fine-tuned Tesseract OCR model for recognizing Tibetan Uchen (དབུ་ཅན་) script, trained on 121K+ line-level image–text pairs from historical Tibetan woodblock prints.

Model Details

| Field | Value |
| --- | --- |
| Model name | `bod_uchen` |
| Base model | `bod.traineddata` from `tessdata_best` |
| Script | Tibetan Uchen (དབུ་ཅན་) |
| Language | Tibetan (bod) |
| Engine | Tesseract LSTM |
| PSM | 13 (Raw line) |
| Epochs | 5 |
| Learning rate | 0.001 |
| Target error rate | 0.005 |

Training Data

The model was trained on a curated subset of Tibetan OCR line-level images paired with ground-truth text transcriptions, formatted for Tesseract training. The dataset is derived from openpecha/OCR-Tibetan_line_to_text_benchmark and prepared by the bo_tessaract_data_prep pipeline.

| Split | Directory | Samples |
| --- | --- | --- |
| Train | `train-ground-truth/` | 96,978 |
| Eval (Val) | `val-ground-truth/` | 12,123 |
| Test | `test-ground-truth/` | 12,125 |
| Total | | 121,226 |
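
tesstrain expects each line image to sit next to a transcription file with the same base name and a `.gt.txt` extension. The sketch below checks one split directory for files that lack a partner. The function name and the set of image extensions are assumptions made for illustration, not part of the data-preparation pipeline.

```python
from pathlib import Path

# Image extensions are an assumption; adjust to match the prepared data.
IMAGE_SUFFIXES = {".png", ".tif"}
TEXT_SUFFIX = ".gt.txt"


def find_unpaired(ground_truth_dir):
    """Return (images without a .gt.txt, .gt.txt files without an image)."""
    root = Path(ground_truth_dir)
    files = [p for p in root.iterdir() if p.is_file()]
    images = {p.stem for p in files if p.suffix.lower() in IMAGE_SUFFIXES}
    # "line_0001.gt.txt" -> "line_0001"
    texts = {p.name[: -len(TEXT_SUFFIX)] for p in files if p.name.endswith(TEXT_SUFFIX)}
    return sorted(images - texts), sorted(texts - images)
```

Calling `find_unpaired("train-ground-truth")` returns two lists of base names; both should be empty before training starts.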

Training Results

Training converged at iteration 526,010 of 545,515. The trainer reported the following metrics:

| Metric | Value |
| --- | --- |
| Best BCER (selected model) | 9.719% |
| Final BCER (train) | 10.690% |
| Mean RMS | 1.077% |
| Delta | 2.498% |
| Skip ratio | 0.000% |
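
The figures above come from the trainer itself. To measure the model independently on the held-out test split, a conventional edit-distance character error rate can be computed over reference and recognized lines. The sketch below is one such measure under stated assumptions; it may not match exactly how Tesseract computes BCER, and the function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(pairs):
    """Character error rate over (reference, hypothesis) pairs, in percent."""
    errors = sum(levenshtein(ref, hyp) for ref, hyp in pairs)
    chars = sum(len(ref) for ref, _ in pairs)
    return 100.0 * errors / chars if chars else 0.0
```

Characters are counted as Unicode code points here, so a Tibetan syllable with a vowel sign or subjoined letter contributes several characters to the total.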

Collections Included

Five historical Tibetan text collections are included from the source dataset:

| Collection | Description |
| --- | --- |
| Lithang_Kanjur | Lithang edition of the Tibetan Buddhist canon (Kangyur) |
| Derge_Tenjur | Derge edition of the Tengyur commentarial collection |
| Lhasa_Kanjur | Lhasa edition of the Kangyur |
| KhyentseWangpo | Writings of Jamyang Khyentse Wangpo |
| Karmapa8 | Texts attributed to the 8th Karmapa, Mikyö Dorje |

Usage

Prerequisites

Install Tesseract 5.x:

# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt install tesseract-ocr

Download & Install the Model

Download bod_uchen.traineddata from this repository and place it in your Tesseract tessdata directory:

# Find your tessdata path (Tesseract 5.x prints it in the first line of the output)
tesseract --list-langs

# Typical locations:
#   macOS (Homebrew): /opt/homebrew/share/tessdata/
#   Linux:           /usr/share/tesseract-ocr/5/tessdata/
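
If copying into the system directory is inconvenient, Tesseract also accepts a `--tessdata-dir` option pointing at any directory that holds the model. The sketch below looks for `bod_uchen.traineddata` in a list of candidate directories and builds a config string for pytesseract. The helper names are hypothetical, and the candidate list simply repeats the typical locations above.

```python
from pathlib import Path

MODEL_FILE = "bod_uchen.traineddata"

# Typical locations; extend with your own download directory as needed.
CANDIDATE_DIRS = [
    "/opt/homebrew/share/tessdata/",
    "/usr/share/tesseract-ocr/5/tessdata/",
]


def find_model_dir(candidates=CANDIDATE_DIRS):
    """Return the first directory that contains the model file, or None."""
    for directory in candidates:
        if (Path(directory) / MODEL_FILE).is_file():
            return str(Path(directory))
    return None


def tesseract_config(tessdata_dir=None, psm=13):
    """Build a pytesseract config string; --tessdata-dir is optional."""
    parts = []
    if tessdata_dir:
        # Quoted so that paths containing spaces survive argument splitting.
        parts.append(f'--tessdata-dir "{tessdata_dir}"')
    parts.append(f"--psm {psm}")
    return " ".join(parts)
```

The resulting string can be passed as the `config` argument of `pytesseract.image_to_string`.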

Run OCR

tesseract input_image.png output -l bod_uchen --psm 13

For images that contain several lines, use PSM 6 (assume a single uniform block of text):

tesseract input_image.png output -l bod_uchen --psm 6

Python (pytesseract)

import pytesseract
from PIL import Image

image = Image.open("tibetan_line.png")
text = pytesseract.image_to_string(image, lang="bod_uchen", config="--psm 13")
print(text)
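
To process a whole directory of pre-segmented line images, the single-image call can be wrapped in a loop. This is a minimal sketch: `ocr_directory` and `recognize_line` are illustrative names, the `*.png` pattern is an assumption about the input files, and the recognizer is passed in as a parameter so that it can be swapped or stubbed.

```python
from pathlib import Path


def recognize_line(image_path):
    """OCR one line image with the same settings as the single-image call."""
    # Imported lazily so the rest of the module works without pytesseract.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang="bod_uchen", config="--psm 13")


def ocr_directory(image_dir, recognize=recognize_line, pattern="*.png"):
    """Return {file name: recognized text} for every matching image."""
    results = {}
    for path in sorted(Path(image_dir).glob(pattern)):
        # Drop the trailing whitespace that the OCR output usually carries.
        results[path.name] = recognize(path).strip()
    return results
```

For example, `ocr_directory("lines/")` returns a dictionary keyed by file name, which can then be written out or compared against ground truth.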

Training Command

The model was fine-tuned using tesstrain:

make training \
  MODEL_NAME=bod_uchen \
  START_MODEL=bod \
  TESSDATA=data \
  GROUND_TRUTH_DIR=data/bo_tesseract \
  LANG_TYPE=Indic \
  PSM=13 \
  EPOCHS=5 \
  LEARNING_RATE=0.001 \
  TARGET_ERROR_RATE=0.005 \
  WORDLIST_FILE=data/langdata/bod/bod.wordlist \
  NUMBERS_FILE=data/langdata/bod/bod.numbers \
  PUNC_FILE=data/langdata/bod/bod.punc \
  2>&1 | tee -a data/bod_uchen/training.log

Intended Use

- OCR of historical Tibetan woodblock-printed texts in Uchen script
- Digitization of Kangyur, Tengyur, and other classical Tibetan collections
- Building searchable archives of Tibetan Buddhist literature

Limitations

- Optimized for Uchen (དབུ་ཅན་) woodblock print style; may underperform on Umé (དབུ་མེད་) or handwritten Tibetan
- Trained primarily on the five collections listed above; generalization to other print styles or modern typeset Tibetan may vary
- Best results when input images are pre-segmented into individual text lines (PSM 13)
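
Pre-segmenting a page into lines can be approximated with a horizontal projection profile: rows that contain ink are grouped into runs, and each run is treated as one text line. The sketch below is a deliberately naive version that works on an already binarized image given as nested lists; the function name and thresholds are assumptions, and skewed scans or lines whose marks touch may need a more robust method.

```python
def find_line_spans(binary_rows, min_ink=1, min_height=2):
    """Return (top, bottom) row ranges that contain ink.

    binary_rows: list of rows, each a list of 0/1 values (1 = ink).
    bottom is exclusive, so binary_rows[top:bottom] is one text line.
    """
    spans = []
    start = None
    for y, row in enumerate(binary_rows):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y  # a new run of inked rows begins
        elif not has_ink and start is not None:
            if y - start >= min_height:  # ignore runs that are too thin
                spans.append((start, y))
            start = None
    # Close a run that reaches the bottom edge of the image.
    if start is not None and len(binary_rows) - start >= min_height:
        spans.append((start, len(binary_rows)))
    return spans
```

Each returned span can be cropped from the page image and recognized as a single line with `--psm 13`.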

Citation

If you use this model, please cite:

@misc{bod_uchen_tesseract,
  title   = {bod_uchen: Fine-Tuned Tesseract Model for Tibetan Uchen Script},
  author  = {Buddhist Digital Resource Center (BDRC) and Dharmaduta},
  year    = {2025},
  url     = {https://huggingface.co/bdrc/bod_uchen_tesseract},
  note    = {Fine-tuned from tessdata_best/bod.traineddata, funded by the Khyentse Foundation}
}

Acknowledgements

This model was developed by Dharmaduta from specifications provided by the Buddhist Digital Resource Center (BDRC) for the BDRC Etext Corpus, with funding from the Khyentse Foundation.
