tat_cyrl — Tatar (Cyrillic) OCR model for Tesseract 5

A Tesseract 5 LSTM model that reads printed Tatar in Cyrillic, including the six Tatar-specific letters ә ө ү җ ң һ. Trained for digitizing Tatar books and periodicals.

⚠️ The official tat model (tesseract -l tat) targets a different script. It is a Latin-script (Jaꞑalif-era) model, so it is the right choice for historical Latin/Jaꞑalif Tatar, but on modern Cyrillic Tatar it produces garbage. This model (tat_cyrl) is the Cyrillic counterpart, fine-tuned from the Russian (rus) model.

Results

External benchmark — yasalma/tatar-ocr-benchmark (151 real document pages, region-level, micro-averaged CER):

Category	CER	WER
books	1.84 %	7.7 %
legal	2.78 %	8.9 %
periodicals	3.45 %	12.6 %
misc* (web/graphics/forms)	18.4 %	27.9 %
Overall	2.84 %	10.0 %

* misc is a catch-all of non-prose pages (web/news link lists, certificates and posters, tables/forms), typically full of Latin URLs and decorative fonts; see Intended use & limitations below.

Internal held-out set of 225 real book lines (from books not in training; ground truth is the publishers' digital text layer):

input	CER	WER
clean print	0.73 %	5.7 %
scan-degraded	3.10 %	13.7 %

On real library scans the output is clean, fluent Tatar with correct ә ө ү җ ң һ. See BENCHMARK.md for details, caveats, and a head-to-head against PP-OCRv5 (cyrillic) and Marker/Surya.
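As a reading aid for the tables above, here is a minimal sketch of how micro-averaged CER and WER are commonly computed: edit distances are summed over all lines and divided by the total reference length, rather than averaging per-line rates. It illustrates the metric only and is not the repo's eval/cer.py; the function names are hypothetical, and the benchmark's exact text normalization is not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def micro_error_rates(pairs):
    """pairs: iterable of (reference, hypothesis) strings. Returns (CER, WER)."""
    char_err = char_total = word_err = word_total = 0
    for ref, hyp in pairs:
        char_err += edit_distance(ref, hyp)
        char_total += len(ref)
        ref_words, hyp_words = ref.split(), hyp.split()
        word_err += edit_distance(ref_words, hyp_words)
        word_total += len(ref_words)
    # Guard against an empty reference set.
    return char_err / max(char_total, 1), word_err / max(word_total, 1)
```

Micro-averaging weights long lines more heavily than short ones, which is why a few short, badly garbled lines move the overall figure less than they would under a per-line mean.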

Usage

# put tat_cyrl.traineddata in a tessdata directory
export TESSDATA_PREFIX=/path/to/dir/containing/tat_cyrl.traineddata
tesseract page.png stdout -l tat_cyrl --psm 3        # full page
tesseract line.png stdout -l tat_cyrl --psm 7        # single line

Python:

import pytesseract
from PIL import Image
text = pytesseract.image_to_string(Image.open("page.png"), lang="tat_cyrl", config="--psm 3")

Requires Tesseract ≥ 5.0 (LSTM engine). 300 DPI input is recommended for scans.

How it was trained

Two-stage fine-tuning from the Apache-2.0 rus (Russian) tessdata_best model, which provides a strong Cyrillic foundation; we extend the character set with the six Tatar letters and adapt to Tatar typography and real-scan noise.

Stage A — synthetic warm-up (tat_cyrl_a)

gen_synth.py: render corpus text as line images in 20 Tatar-capable fonts (each verified via the font cmap to actually encode ә ө ү җ ң һ — not just render tofu), with light augmentation (blur, ±1.3° rotation, background/contrast jitter). 16 000 line/ground-truth pairs.
fine_tune.sh: tesstrain fine-tune rus → tat_cyrl_a (char-set extension via --continue_from/--old_traineddata). Converges to ~0.5 % CER on the synthetic set.
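The font check in Stage A can be sketched as follows. This is an illustration under stated assumptions, not the code in gen_synth.py: it assumes the fontTools package is available, the function names are hypothetical, and checking the uppercase forms as well as the six lowercase letters is my assumption.

```python
TATAR_LOWER = "әөүҗңһ"
# Assumption: uppercase forms are required too; derived via Unicode case mapping.
REQUIRED = TATAR_LOWER + TATAR_LOWER.upper()


def missing_tatar_letters(cmap, required=REQUIRED):
    """Return the required characters that have no entry in a codepoint -> glyph map."""
    return [ch for ch in required if ord(ch) not in cmap]


def font_supports_tatar(font_path):
    """True if the font's best Unicode cmap encodes every required letter."""
    from fontTools.ttLib import TTFont  # third-party: pip install fonttools

    font = TTFont(font_path)
    cmap = font.getBestCmap() or {}  # maps int codepoints to glyph names
    return not missing_tatar_letters(cmap)
```

A cmap lookup avoids the failure mode where a renderer silently substitutes a fallback font or draws a placeholder box, which would produce training images whose pixels do not match the ground-truth text.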

Stage B — real typography + scan robustness (tat_cyrl)

Real book lines (data/real_lines/): 2 550 real line-image + ground-truth pairs extracted from born-digital Tatar PDFs via their text-layer line coordinates (PyMuPDF), cleaned of soft hyphens — real publisher fonts with perfect labels, the same typography as real scans.
build_train_data.py: build a 12 650-line mix = real lines × (1 clean + 2 scan-degraded) + 5 000 scan-degraded synthetic lines. Degradation models real scans: blur, Gaussian noise, JPEG artifacts, grey aged-paper background, ±1.6° skew, and ink erosion/dilation.
fine_tune2.sh: continued fine-tune tat_cyrl_a → tat_cyrl on the mix (15 000 iterations, learning rate 5e-4).

This lifted scan-degraded accuracy (char-similarity 0.83 → 0.92) with no regression on clean print.
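The scan degradation described in Stage B could look roughly like the sketch below, written with Pillow and NumPy. It is not the code in build_train_data.py: only the ±1.6° skew range comes from the description above, while the function name, the order of operations, and every other parameter range are assumptions chosen for illustration.

```python
import io
import random

import numpy as np
from PIL import Image, ImageFilter


def degrade(line_img, rng=None):
    """Apply a random scan-like degradation to a PIL line image; returns mode 'L'."""
    rng = rng or random.Random()
    img = line_img.convert("L")

    # Ink dilation thickens dark strokes (MinFilter); erosion thins them (MaxFilter).
    stroke = rng.choice(["erode", "dilate", "none"])
    if stroke == "dilate":
        img = img.filter(ImageFilter.MinFilter(3))
    elif stroke == "erode":
        img = img.filter(ImageFilter.MaxFilter(3))

    # Small skew within +/-1.6 degrees, padding the corners with white.
    img = img.rotate(rng.uniform(-1.6, 1.6), resample=Image.BICUBIC,
                     expand=True, fillcolor=255)

    # Mild optical blur (radius range is an assumption).
    img = img.filter(ImageFilter.GaussianBlur(radius=rng.uniform(0.3, 1.0)))

    # Grey aged-paper background: scale so white becomes light grey,
    # then add Gaussian sensor noise (both ranges are assumptions).
    arr = np.asarray(img, dtype=np.float32) * (rng.uniform(200, 245) / 255.0)
    np_rng = np.random.default_rng(rng.randrange(2**32))
    arr = arr + np_rng.normal(0.0, rng.uniform(2.0, 8.0), arr.shape)
    img = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

    # JPEG artifacts: round-trip through an in-memory low-quality JPEG.
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=rng.randint(35, 75))
    buf.seek(0)
    return Image.open(buf).convert("L")
```

Passing a seeded random.Random makes a degraded set reproducible, which helps when comparing runs on the same training mix.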

Reproduce

training/build_tess.sh          # build Tesseract 5.5 + training tools from source
training/gen_synth.py           # 16k synthetic line images (from your Tatar text corpus)
training/fine_tune.sh           # stage A: rus -> tat_cyrl_a
training/build_train_data.py    # 12.6k real+augmented mix (consumes data/real_lines/)
training/fine_tune2.sh          # stage B: tat_cyrl_a -> tat_cyrl
eval/cer.py                     # line-level CER/WER on a held-out set

ocr/clean_boilerplate.py is a post-OCR cleanup utility: line-level removal of imprint/colophon, table-of-contents, page numbers, and garbled URL/code lines, keeping prose — useful for tidying model output before downstream use.

Intended use & limitations

Use for: printed modern Tatar (Cyrillic) — books, literary periodicals, legal documents, single-column prose.
Weaker on: heavily decorated/illustrated magazine layouts, multi-column poetry, and very degraded or skewed scans. Historical Arabic-script or Jaꞑalif (Latin) Tatar uses different scripts and is out of scope. A per-page confidence threshold can screen such pages out.
Graphic / non-prose layouts and embedded Latin are mangled — this is a Cyrillic-only, prose-oriented model. It is not meant for the grab-bag of "everything that isn't a printed page", and on the benchmark this is the entire reason the misc bucket scores ~18 % CER while books / periodicals / legal stay ≤3.5 %. That bucket is a catch-all of, for example:
- web pages, news-link lists, social-media posts — full of Latin URLs and handles the model can't read;
- certificates, diplomas, posters, flyers, advertisements — decorative display fonts, logos, watermark/background textures, stylised word-art;
- infographics, tables and forms (e.g. multi-column lesson plans) — grid / non-linear reading order the line model isn't built for;
- SMS / parking / public notices and anything mixing the above with embedded Latin, e-mails, phone numbers or addresses.
None of this applies to document, book, or periodical OCR. For graphic or web/social content, run a layout-aware detector first and strip or post-process the Latin / URL tokens separately.
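A per-page confidence screen of the kind mentioned above might be sketched like this, using pytesseract's image_to_data to read word-level confidences. The function names and the threshold of 70 are assumptions for illustration; a usable cutoff would need tuning on your own pages.

```python
def mean_word_confidence(data):
    """Mean confidence (0-100) over recognised words in an image_to_data-style dict."""
    confs = [float(c) for c, t in zip(data["conf"], data["text"])
             if str(t).strip() and float(c) >= 0]  # conf -1 marks non-word boxes
    return sum(confs) / len(confs) if confs else 0.0


def ocr_page_with_screen(image_path, min_conf=70.0):
    """Return (text, mean confidence, passed) for one page image."""
    import pytesseract  # third-party; needs Tesseract and tat_cyrl installed
    from PIL import Image

    image = Image.open(image_path)
    data = pytesseract.image_to_data(
        image, lang="tat_cyrl", config="--psm 3",
        output_type=pytesseract.Output.DICT,
    )
    conf = mean_word_confidence(data)
    text = pytesseract.image_to_string(image, lang="tat_cyrl", config="--psm 3")
    return text, conf, conf >= min_conf
```

Pages that fall below the cutoff can be routed to manual review or to a layout-aware pipeline instead of being dropped outright.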

License

Apache-2.0. Tesseract is Apache-2.0; this model is fine-tuned from the Apache-2.0 rus tessdata_best model; the training/eval/OCR scripts in this repo are released under Apache-2.0.

Citation

@misc{tat_cyrl_2026,
  author = {Ilshat Saetov},
  title  = {tat_cyrl: a Tesseract OCR model for Cyrillic Tatar},
  year   = {2026},
  note   = {Fine-tuned from Tesseract rus; synthetic + real book lines with scan augmentation},
  url    = {https://huggingface.co/yasalma/tatar-ocr-tesseract}
}
