tat_cyrl — Tatar (Cyrillic) OCR model for Tesseract 5
A Tesseract 5 LSTM model that reads printed Tatar in Cyrillic, including the six Tatar-specific letters ә ө ү җ ң һ. Trained for digitizing Tatar books and periodicals.
⚠️ The official tat model (tesseract -l tat) targets a different script; use it only for Latin Tatar.
The bundled tat model is Latin-script (Jaꞑalif-era), so it's the right choice for historical
Latin/Jaꞑalif Tatar — but on modern Cyrillic Tatar it produces garbage. This model
(tat_cyrl) is the Cyrillic counterpart, fine-tuned from the Russian (rus) model.
Results
External benchmark — yasalma/tatar-ocr-benchmark
(151 real document pages, region-level, micro-averaged CER):
| Category | CER | WER |
|---|---|---|
| books | 1.84 % | 7.7 % |
| legal | 2.78 % | 8.9 % |
| periodicals | 3.45 % | 12.6 % |
| misc* (web/graphics/forms)† | 18.4 % | 27.9 % |
| Overall | 2.84 % | 10.0 % |
* misc is a catch-all of non-prose pages (web/news link lists, certificates & posters,
tables/forms), typically full of Latin URLs and decorative fonts; see Intended use & limitations below.
Internal held-out set of 225 real book lines (books not in training; GT = the publishers' digital text layer):
| input | CER | WER |
|---|---|---|
| clean print | 0.73 % | 5.7 % |
| scan-degraded | 3.10 % | 13.7 % |
On real library scans the output is clean, fluent Tatar with correct ә ө ү җ ң һ. See BENCHMARK.md for details, caveats, and a head-to-head
against PP-OCRv5 (cyrillic) and Marker/Surya.
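For readers unfamiliar with the metrics, the following is a minimal sketch of micro-averaged CER and WER: edit distances are summed over all reference/hypothesis pairs and divided by the total reference length, so long regions weigh more than short ones. It is an illustration only, not the code in eval/cer.py; the function names are hypothetical and no text normalisation is applied.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def micro_cer_wer(pairs):
    """Return (CER, WER) micro-averaged over (reference, hypothesis) pairs."""
    char_err = char_total = word_err = word_total = 0
    for ref, hyp in pairs:
        char_err += edit_distance(ref, hyp)
        char_total += len(ref)
        ref_words, hyp_words = ref.split(), hyp.split()
        word_err += edit_distance(ref_words, hyp_words)
        word_total += len(ref_words)
    # Guard against an empty reference set.
    return char_err / max(char_total, 1), word_err / max(word_total, 1)


if __name__ == "__main__":
    cer, wer = micro_cer_wer([("әни килде", "ани килде"), ("китап", "китап")])
    print(f"CER {cer:.2%}  WER {wer:.2%}")
```

A single wrong character inside a word counts as one character error but a whole word error, which is why WER in the tables above is several times higher than CER.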
Usage
# put tat_cyrl.traineddata in a tessdata directory
export TESSDATA_PREFIX=/path/to/dir/containing/tat_cyrl.traineddata
tesseract page.png stdout -l tat_cyrl --psm 3 # full page
tesseract line.png stdout -l tat_cyrl --psm 7 # single line
Python:
import pytesseract
from PIL import Image
text = pytesseract.image_to_string(Image.open("page.png"), lang="tat_cyrl", config="--psm 3")
Requires Tesseract ≥ 5.0 (LSTM engine). 300 DPI input is recommended for scans.
How it was trained
Two-stage fine-tuning from the Apache-2.0 rus (Russian) tessdata_best model, which provides a
strong Cyrillic foundation; we extend the character set with the six Tatar letters and adapt to
Tatar typography and real-scan noise.
Stage A — synthetic warm-up (tat_cyrl_a)
- gen_synth.py: render corpus text as line images in 20 Tatar-capable fonts (each verified via the font cmap to actually encode ә ө ү җ ң һ, not just render tofu), with light augmentation (blur, ±1.3° rotation, background/contrast jitter). 16 000 line/ground-truth pairs.
- fine_tune.sh: tesstrain fine-tune rus → tat_cyrl_a (char-set extension via --continue_from / --old_traineddata). Converges to ~0.5 % CER on the synthetic set.
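The font check described above could look roughly like the sketch below. It assumes the fontTools package and is not taken from gen_synth.py; the function names are hypothetical, and including the uppercase forms of the six letters is an assumption made here for illustration.

```python
# The six Tatar-specific letters; uppercase forms are added as an assumption.
TATAR_LETTERS = "әөүҗңһӘӨҮҖҢҺ"


def missing_tatar_letters(cmap):
    """Return the Tatar letters whose code points are absent from a cmap.

    `cmap` maps Unicode code points (int) to glyph names.
    """
    return [ch for ch in TATAR_LETTERS if ord(ch) not in cmap]


def font_missing_letters(font_path):
    """Load a font file and report which Tatar letters it does not encode."""
    # Third-party dependency, imported lazily: pip install fonttools
    from fontTools.ttLib import TTFont

    font = TTFont(font_path)
    cmap = font.getBestCmap() or {}  # may be None if no Unicode cmap exists
    return missing_tatar_letters(cmap)
```

A font would be accepted for rendering only when `font_missing_letters(path)` returns an empty list; a font that lacks a letter would otherwise silently draw a fallback box into the training images.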
Stage B — real typography + scan robustness (tat_cyrl)
- Real book lines (data/real_lines/): 2 550 real line-image + ground-truth pairs extracted from born-digital Tatar PDFs via their text-layer line coordinates (PyMuPDF), cleaned of soft hyphens: real publisher fonts with perfect labels, the same typography as real scans.
- build_train_data.py: build a 12 650-line mix = real lines × (1 clean + 2 scan-degraded) + 5 000 scan-degraded synthetic lines. Degradation models real scans: blur, Gaussian noise, JPEG artifacts, grey aged-paper background, ±1.6° skew, and ink erosion/dilation.
- fine_tune2.sh: continued fine-tune tat_cyrl_a → tat_cyrl on the mix (15 k iter, LR 5e-4).
This lifted scan-degraded accuracy (char-similarity 0.83 → 0.92) with no regression on clean print.
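To make the scan degradation listed above concrete, here is a minimal Pillow-only sketch of such a pipeline. It is not the code in build_train_data.py: the function name and every parameter range except the ±1.6° skew are assumptions chosen for illustration.

```python
import io
import random

from PIL import Image, ImageChops, ImageFilter


def degrade(line, rng):
    """Apply scan-like degradation to a line image (illustrative ranges)."""
    img = line.convert("L")

    # Small skew; ±1.6° is the range stated in the text. Fill corners white.
    img = img.rotate(rng.uniform(-1.6, 1.6), resample=Image.BICUBIC, fillcolor=255)

    # Ink erosion or dilation (dark text on a light background).
    stroke = rng.choice(["erode", "dilate", "none"])
    if stroke == "erode":
        img = img.filter(ImageFilter.MaxFilter(3))  # thins dark strokes
    elif stroke == "dilate":
        img = img.filter(ImageFilter.MinFilter(3))  # thickens dark strokes

    # Optical blur (assumed radius range).
    img = img.filter(ImageFilter.GaussianBlur(rng.uniform(0.3, 1.0)))

    # Grey aged-paper background: cap brightness at a random grey level.
    paper = Image.new("L", img.size, rng.randint(200, 240))
    img = ImageChops.darker(img, paper)

    # Gaussian noise. effect_noise is centred on 128, so subtract the offset.
    # Note: it uses Pillow's own random source, not `rng`.
    noise = Image.effect_noise(img.size, rng.uniform(4, 12))
    img = ImageChops.add(img, noise, scale=1.0, offset=-128)

    # JPEG artifacts via an in-memory round trip at low quality.
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=rng.randint(30, 70))
    buf.seek(0)
    return Image.open(buf).convert("L")
```

Under the mix described above, each real line would contribute one clean copy plus two calls to a function like `degrade(line, random.Random(seed))` with different seeds, giving 2 550 × 3 + 5 000 = 12 650 training lines.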
Reproduce
training/build_tess.sh # build Tesseract 5.5 + training tools from source
training/gen_synth.py # 16k synthetic line images (from your Tatar text corpus)
training/fine_tune.sh # stage A: rus -> tat_cyrl_a
training/build_train_data.py # 12.6k real+augmented mix (consumes data/real_lines/)
training/fine_tune2.sh # stage B: tat_cyrl_a -> tat_cyrl
eval/cer.py # line-level CER/WER on a held-out set
ocr/clean_boilerplate.py is a post-OCR cleanup utility: line-level removal of imprint/colophon,
table-of-contents, page numbers, and garbled URL/code lines, keeping prose — useful for tidying
model output before downstream use.
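As a rough illustration of line-level cleanup, the sketch below drops page-number lines, URL-like lines, and lines that are mostly non-Cyrillic. It is not the logic of ocr/clean_boilerplate.py (imprint and table-of-contents detection are not covered); the patterns, names, and the 0.5 ratio are assumptions.

```python
import re

# Lines that contain only a short number, optionally wrapped in punctuation.
PAGE_NUMBER = re.compile(r"^\W*\d{1,4}\W*$")
# Lines that look like they contain a URL.
URL_LIKE = re.compile(r"(https?://|www\.)", re.IGNORECASE)
# Any character in the basic Cyrillic block (covers the Tatar letters).
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def keep_line(line, min_cyrillic_ratio=0.5):
    """Return True if the line looks like Cyrillic prose worth keeping."""
    text = line.strip()
    if not text:
        return True  # keep blank lines as paragraph separators
    if PAGE_NUMBER.match(text) or URL_LIKE.search(text):
        return False
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False  # only digits or symbols
    cyrillic = sum(1 for ch in letters if CYRILLIC.match(ch))
    return cyrillic / len(letters) >= min_cyrillic_ratio


def clean(text):
    """Filter OCR output line by line, keeping prose."""
    return "\n".join(line for line in text.splitlines() if keep_line(line))
```

The Cyrillic-ratio test is a blunt heuristic: it also removes legitimate lines that quote Latin text, so the threshold should be tuned on real output before use.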
Intended use & limitations
Use for: printed modern Tatar (Cyrillic) — books, literary periodicals, legal documents, single-column prose.
Weaker on: heavily decorated/illustrated magazine layouts, multi-column poetry, very degraded or skewed scans, and historical Arabic-script or Jaꞑalif (Latin) Tatar (different scripts — out of scope). A per-page confidence threshold screens these out easily.
Graphic / non-prose layouts and embedded Latin are mangled: this is a Cyrillic-only, prose-oriented model. It is not meant for the grab-bag of "everything that isn't a printed page", and on the benchmark this is the entire reason the misc bucket scores ~18 % CER while books / periodicals / legal stay ≤3.5 %. That bucket is a catch-all of, for example:
- web pages, news-link lists, social-media posts, full of Latin URLs and handles the model can't read;
- certificates, diplomas, posters, flyers, advertisements — decorative display fonts, logos, watermark/background textures, stylised word-art;
- infographics, tables and forms (e.g. multi-column lesson plans) — grid / non-linear reading order the line model isn't built for;
- SMS / parking / public notices and anything mixing the above with embedded Latin, e-mails, phone numbers or addresses.
For document / book / periodical OCR none of this applies; for such graphic or web/social content, run a layout-aware detector first and strip or post-process the Latin / URL tokens separately.
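The per-page confidence threshold mentioned above can be sketched with pytesseract's word-level output. This is one possible approach, not a documented part of this repository: the helper names are hypothetical, and any cut-off value has to be chosen empirically on your own scans.

```python
def mean_word_confidence(data):
    """Average confidence over recognised words.

    `data` is the dict returned by pytesseract.image_to_data with
    output_type=Output.DICT; rows with conf == -1 are layout rows, not words.
    """
    confs = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # may arrive as str, int, or float
        if text.strip() and conf >= 0:
            confs.append(conf)
    return sum(confs) / len(confs) if confs else 0.0


def page_confidence(image_path, lang="tat_cyrl"):
    """Run OCR on one page and return its mean word confidence (0 to 100)."""
    # Third-party dependencies, imported lazily: pytesseract, Pillow.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path),
        lang=lang,
        config="--psm 3",
        output_type=pytesseract.Output.DICT,
    )
    return mean_word_confidence(data)
```

Pages whose `page_confidence` falls below the chosen cut-off can be routed to manual review or to a layout-aware pipeline instead of being accepted as is.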
License
Apache-2.0. Tesseract is Apache-2.0; this model is fine-tuned from the Apache-2.0 rus
tessdata_best model; the training/eval/OCR scripts in this repo are released under Apache-2.0.
Citation
@misc{tat_cyrl_2026,
author = {Ilshat Saetov},
title = {tat_cyrl: a Tesseract OCR model for Cyrillic Tatar},
year = {2026},
note = {Fine-tuned from Tesseract rus; synthetic + real book lines with scan augmentation},
url = {https://huggingface.co/yasalma/tatar-ocr-tesseract}
}