
bod_uchen — Fine-Tuned Tesseract Model for Tibetan Uchen Script

A fine-tuned Tesseract OCR model for recognizing Tibetan Uchen (དབུ་ཅན་) script. It was fine-tuned on a dataset of 121,226 line-level image-text pairs from historical Tibetan woodblock prints, split into training, validation, and test sets.

Model Details

| Field | Value |
| --- | --- |
| Model name | bod_uchen |
| Base model | bod.traineddata from tessdata_best |
| Script | Tibetan Uchen (དབུ་ཅན་) |
| Language | Tibetan (bod) |
| Engine | Tesseract LSTM |
| PSM | 13 (Raw line) |
| Epochs | 5 |
| Learning rate | 0.001 |
| Target error rate | 0.005 |

Training Data

The model was trained on a curated subset of Tibetan OCR line-level images paired with ground-truth text transcriptions, formatted for Tesseract training. The dataset is derived from openpecha/OCR-Tibetan_line_to_text_benchmark and prepared by the bo_tessaract_data_prep pipeline.

| Split | Directory | Samples |
| --- | --- | --- |
| Train | train-ground-truth/ | 96,978 |
| Eval (Val) | val-ground-truth/ | 12,123 |
| Test | test-ground-truth/ | 12,125 |
| Total | | 121,226 |
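tesstrain expects each line image to sit next to a transcription file with the same base name and a `.gt.txt` extension. The following is a minimal sketch for checking that a split directory is fully paired before training. The function name `find_unpaired` is hypothetical, and the sketch only handles plain `.png` and `.tif` images, which is an assumption about how the files in this dataset are named.

```python
from pathlib import Path

# Assumption: line images use a plain .png or .tif extension.
IMAGE_SUFFIXES = {".png", ".tif"}


def find_unpaired(ground_truth_dir):
    """Return (images without a .gt.txt, .gt.txt files without an image)."""
    root = Path(ground_truth_dir)
    images = {p.stem for p in root.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES}
    # "line_0001.gt.txt" -> "line_0001"
    texts = {p.name[: -len(".gt.txt")] for p in root.glob("*.gt.txt")}
    return sorted(images - texts), sorted(texts - images)
```

Calling `find_unpaired("train-ground-truth")` on a clean split should return two empty lists.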

Training Results

Training converged at iteration 526,010 of 545,515 total iterations:

| Metric | Value |
| --- | --- |
| Best BCER (selected model) | 9.719% |
| Final BCER (train) | 10.690% |
| Mean RMS | 1.077% |
| Delta | 2.498% |
| Skip ratio | 0.000% |
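The BCER values above are reported by Tesseract's trainer during fine-tuning. To score the model independently on the test split, one common choice is a character error rate based on edit distance. The sketch below shows that conventional measure; it is not necessarily the same calculation the trainer uses, so its numbers should not be expected to match the table exactly. Both function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in Unicode code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit-distance CER in percent, relative to the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)
```

Because Tibetan stacks and vowel signs are encoded as several code points, this measure counts each code point separately rather than each visual syllable.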

Collections Included

Five historical Tibetan text collections are included from the source dataset:

| Collection | Description |
| --- | --- |
| Lithang_Kanjur | Lithang edition of the Tibetan Buddhist canon (Kangyur) |
| Derge_Tenjur | Derge edition of the Tengyur commentarial collection |
| Lhasa_Kanjur | Lhasa edition of the Kangyur |
| KhyentseWangpo | Writings of Jamyang Khyentse Wangpo |
| Karmapa8 | Texts attributed to the 8th Karmapa, Mikyö Dorje |

Usage

Prerequisites

Install Tesseract 5.x:

# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt install tesseract-ocr

Download & Install the Model

Download bod_uchen.traineddata from this repository and place it in your Tesseract tessdata directory:

# Find your tessdata path (Tesseract 5 shows the directory in the header of the language list)
tesseract --list-langs

# Typical locations:
#   macOS (Homebrew): /opt/homebrew/share/tessdata/
#   Linux:           /usr/share/tesseract-ocr/5/tessdata/
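
If copying the file into the system tessdata directory is not possible, Tesseract can instead be pointed at another directory with its `--tessdata-dir` option. The sketch below builds such an option string for use with pytesseract; the helper name `tessdata_config` and the example directory are hypothetical.

```python
from pathlib import Path


def tessdata_config(model_dir, lang="bod_uchen", psm=13):
    """Build a Tesseract option string that points at a custom tessdata directory."""
    model_path = Path(model_dir) / f"{lang}.traineddata"
    if not model_path.is_file():
        # Fail early with a clear message instead of a Tesseract load error.
        raise FileNotFoundError(f"{model_path} not found")
    return f'--tessdata-dir "{Path(model_dir)}" --psm {psm}'
```

The result can be passed as `config=tessdata_config("models")` together with `lang="bod_uchen"` in a `pytesseract.image_to_string` call.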

Run OCR

tesseract input_image.png output -l bod_uchen --psm 13

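For an image that contains a uniform block of several lines, use PSM 6 (single uniform block of text). The model was trained on single lines, so expect the best accuracy from pre-segmented lines with PSM 13: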

tesseract input_image.png output -l bod_uchen --psm 6

Python (pytesseract)

import pytesseract
from PIL import Image

image = Image.open("tibetan_line.png")
text = pytesseract.image_to_string(image, lang="bod_uchen", config="--psm 13")
print(text)
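
Since the model works best on single lines, a typical workflow runs it over a folder of pre-segmented line images. The following is a minimal sketch of such a loop, assuming the line images are PNG files whose sorted filenames reflect reading order. The function name `ocr_lines` and the injectable `recognize` argument are illustrative choices, not part of pytesseract.

```python
from pathlib import Path


def ocr_lines(image_dir, recognize=None, pattern="*.png"):
    """Run OCR on each line image in sorted filename order; return {filename: text}."""
    if recognize is None:
        # Imported lazily so the loop can be exercised without Tesseract installed.
        import pytesseract
        from PIL import Image

        def recognize(path):
            with Image.open(path) as image:
                return pytesseract.image_to_string(
                    image, lang="bod_uchen", config="--psm 13"
                )

    results = {}
    for path in sorted(Path(image_dir).glob(pattern)):
        # Strip the trailing whitespace Tesseract appends to its output.
        results[path.name] = recognize(path).strip()
    return results
```

Joining the values with newlines gives the text of a page, provided the filenames sort in the intended order.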

Training Command

The model was fine-tuned using tesstrain:

make training \
  MODEL_NAME=bod_uchen \
  START_MODEL=bod \
  TESSDATA=data \
  GROUND_TRUTH_DIR=data/bo_tesseract \
  LANG_TYPE=Indic \
  PSM=13 \
  EPOCHS=5 \
  LEARNING_RATE=0.001 \
  TARGET_ERROR_RATE=0.005 \
  WORDLIST_FILE=data/langdata/bod/bod.wordlist \
  NUMBERS_FILE=data/langdata/bod/bod.numbers \
  PUNC_FILE=data/langdata/bod/bod.punc \
  2>&1 | tee -a data/bod_uchen/training.log
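
The command does not set `RATIO_TRAIN`, so tesstrain's default of 0.90 would apply when it splits the ground-truth directory into its own training and evaluation lists. The arithmetic below checks what that would imply for the iteration budget. It assumes that `GROUND_TRUTH_DIR` held all 121,226 lines and that the training list size is truncated to an integer; neither is confirmed by the source. Under those assumptions the result matches the 545,515 total iterations reported under Training Results.

```python
def planned_iterations(total_lines, ratio_train=0.90, epochs=5):
    """Lines in the training list times the number of epochs."""
    # Assumption: the training list size is truncated, not rounded.
    train_lines = int(total_lines * ratio_train)
    return train_lines * epochs


print(planned_iterations(121_226))  # 545515
```

This is only a consistency check on the reported numbers, not a description of how the run was actually configured.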

Intended Use

  • OCR of historical Tibetan woodblock-printed texts in Uchen script
  • Digitization of Kangyur, Tengyur, and other classical Tibetan collections
  • Building searchable archives of Tibetan Buddhist literature

Limitations

  • Optimized for Uchen (དབུ་ཅན་) woodblock print style; may underperform on Umé (དབུ་མེད་) or handwritten Tibetan
  • Trained primarily on the five collections listed above; generalization to other print styles or modern typeset Tibetan may vary
  • Best results when input images are pre-segmented into individual text lines (PSM 13)
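
For page images that are not yet split into lines, a horizontal projection profile is one simple way to find candidate line bands. The sketch below is a crude illustration of that idea, not the segmentation used to build the training data. It assumes a clean, deskewed, binarized page given as rows of booleans (True for ink), and it will fail where lines touch or where vowel signs bridge the gap between lines. The function name `find_line_bands` is hypothetical.

```python
def find_line_bands(binary, min_height=2):
    """Return (top, bottom) row ranges of text lines; bottom is exclusive."""
    # A row belongs to a line if it contains at least one ink pixel.
    has_ink = [any(row) for row in binary]
    bands, start = [], None
    for y, ink in enumerate(has_ink):
        if ink and start is None:
            start = y                      # first inked row of a new band
        elif not ink and start is not None:
            if y - start >= min_height:    # ignore specks thinner than min_height
                bands.append((start, y))
            start = None
    # Close a band that runs to the bottom edge of the page.
    if start is not None and len(has_ink) - start >= min_height:
        bands.append((start, len(has_ink)))
    return bands
```

Each band can then be cropped from the page and passed to Tesseract with `--psm 13`. The input can be a nested list or a 2-D boolean NumPy array, since the function only iterates over rows.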

Citation

If you use this model, please cite:

@misc{bod_uchen_tesseract,
  title   = {bod_uchen: Fine-Tuned Tesseract Model for Tibetan Uchen Script},
  author  = {Buddhist Digital Resource Center (BDRC) and Dharmaduta},
  year    = {2025},
  url     = {https://huggingface.co/bdrc/bod_uchen_tesseract},
  note    = {Fine-tuned from tessdata_best/bod.traineddata, funded by the Khyentse Foundation}
}

Acknowledgements

This model was developed by Dharmaduta from specifications provided by the Buddhist Digital Resource Center (BDRC) for the BDRC Etext Corpus, with funding from the Khyentse Foundation.
