magistermilitum/tridis_HTR

TrOCR model adapted to Handwritten Text Recognition on medieval manuscripts (12th-16th centuries)

TRIDIS (Tria Digita Scribunt) is a Handwritten Text Recognition model trained on semi-diplomatic transcriptions of medieval and Early Modern manuscripts. It is suited to documentary manuscripts, that is, manuscripts arising from legal, administrative, and memorial practices, most commonly from the Late Middle Ages (13th century onwards). It can also perform well on documents from other domains, such as literary books, scholarly treatises and cartularies, which makes it a versatile tool for historians and philologists transforming and analyzing historical texts.

A paper presenting the first version of the model is available here: Sergio Torres Aguilar, Vincent Jolivet. Handwritten Text Recognition for Documentary Medieval Manuscripts. Journal of Data Mining and Digital Humanities. 2023. https://hal.science/hal-03892163

Rules of transcription:

The main feature of the semi-diplomatic edition is that abbreviations have been resolved:

both those by suspension (facimꝰ --> facimus) and by contraction (dñi --> domini).
Likewise, those using conventional signs (⁊ --> et ; ꝓ --> pro) have been resolved.
The named entities (names of persons, places and institutions) have been capitalized.
The beginning of a block of text as well as the original capitals used by the scribe are also capitalized.
The consonantal i and u characters have been transcribed as j and v in both French and Latin.
The punctuation marks used in the manuscript, such as . or / or |, have not been systematically transcribed, as the transcription has been standardized with modern punctuation.
Corrections and words that appear cancelled in the manuscript have been transcribed between two $ signs, one at the beginning and one at the end.
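
The $ convention for cancelled words can be handled in post-processing. The following is a minimal sketch, not part of the model or its official tooling: the function name `split_cancellations` and the sample line are hypothetical, and it assumes that each cancelled span is delimited by exactly one opening and one closing $ on the same line.

```python
import re
from typing import List, Tuple

# Assumption: a cancelled word or correction appears as "$...$" within one line.
CANCELLED = re.compile(r"\$([^$]*)\$")


def split_cancellations(line: str) -> Tuple[str, List[str]]:
    """Return the line without cancelled spans, plus the list of cancelled spans."""
    cancelled = [span.strip() for span in CANCELLED.findall(line)]
    # Replace each cancelled span with a space, then collapse repeated whitespace.
    cleaned = CANCELLED.sub(" ", line)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned, cancelled


if __name__ == "__main__":
    # Hypothetical transcription line used only for illustration.
    text, removed = split_cancellations("in nomine $domini$ Domini amen")
    print(text)
    print(removed)
```

Keeping the cancelled spans in a separate list, rather than discarding them, lets an editor decide later whether to report them in an apparatus.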

Corpora

The model was trained on charters, registers, feudal books and legal proceedings from the Late Medieval period (11th-16th centuries).

The training and evaluation ground-truth datasets comprised 2950 pages, 245k lines of text, and almost 2.3M tokens, drawn from the freely available ground-truth corpora listed below.

(Additionally, the model was pre-trained on a synthetic dataset (300k lines) generated using a GAN architecture.)

The Alcar-HOME database: https://zenodo.org/record/5600884
The e-NDP corpus: https://zenodo.org/record/7575693
The Himanis project: https://zenodo.org/record/5535306
Königsfelden Abbey corpus: https://zenodo.org/record/5179361
CODEA
Monumenta Luxemburgensia.

Accuracy

TRIDIS was trained using an encoder-decoder architecture based on a fine-tuned version of TrOCR-large handwritten (microsoft/trocr-large-handwritten) and a RoBERTa model trained on medieval texts (magistermilitum/RoBERTa_medieval).

This final model operates in a multilingual environment (Latin, Old French, and Old Spanish) and can recognize several Latin script families (mostly Textualis and Cursiva) in documents produced circa the 11th-16th centuries.
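
Since the model is based on TrOCR, one plausible way to run it on a single text-line image is through the Hugging Face `transformers` classes `TrOCRProcessor` and `VisionEncoderDecoderModel`. This is a hedged sketch, not the authors' documented procedure: it assumes that the `magistermilitum/tridis_HTR` repository ships a compatible processor and tokenizer configuration, that the input is an already segmented line image, and that `transformers`, `torch` and `Pillow` are installed. The function name `recognize_line` and the `max_new_tokens` default are illustrative choices.

```python
MODEL_ID = "magistermilitum/tridis_HTR"


def recognize_line(image_path: str, model_id: str = MODEL_ID,
                   max_new_tokens: int = 128) -> str:
    """Transcribe one text-line image (assumes a TrOCR-compatible checkpoint)."""
    # Imported lazily so that defining the function does not need the heavy libraries.
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    # Assumption: the repository provides both processor and model files.
    processor = TrOCRProcessor.from_pretrained(model_id)
    model = VisionEncoderDecoderModel.from_pretrained(model_id)

    # TrOCR image processors expect RGB input.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    generated_ids = model.generate(pixel_values, max_new_tokens=max_new_tokens)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

Calling `recognize_line("line_0001.png")`, where the file name is a placeholder, downloads the checkpoint on first use. For a whole page, the lines would first need to be segmented with a separate tool, and the processor and model should be loaded once and reused rather than reloaded per line.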

During evaluation, the model showed an accuracy of 94.3% on the validation set. On three external unseen datasets it reached a CER (Character Error Rate) of about 0.06 to 0.12 and a WER (Word Error Rate) of about 0.14 to 0.26, which is about 30% lower than CRNN+CTC solutions trained on the same corpora.
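
For readers who want to measure CER and WER on their own transcriptions, both metrics are commonly defined as the Levenshtein edit distance between the reference and the prediction, divided by the reference length (in characters for CER, in whitespace-separated words for WER). The sketch below follows that common definition. The exact normalization and tokenization used for the figures above are not stated here, so results may differ slightly. The function names are illustrative.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (insertions, deletions, substitutions cost 1 each)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Hypothetical reference/prediction pair used only for illustration.
    print(cer("in nomine domini", "in nomine domni"))
    print(wer("in nomine domini", "in nomine domni"))
```

Because a single wrong character makes the whole word count as an error, WER is normally higher than CER on the same data, which matches the ranges reported above.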

Other formats

A CRNN+CTC version of this model, trained with Kraken 4.0 (https://github.com/mittagessen/kraken) using the same gold-standard annotation, is available on Zenodo:

Torres Aguilar, S., & Jolivet, V. (2024). TRIDIS: HTR model for Multilingual Medieval and Early Modern Documentary Manuscripts (11th-16th) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10800223

Paper

A journal paper presenting the scientific basis of this model is also available:

Torres Aguilar, Sergio, Jolivet, Vincent. La reconnaissance de l'écriture pour les manuscrits documentaires du Moyen Âge, Journal of Data Mining & Digital Humanities, 22 December 2023 - https://hal.science/hal-03892163/document
