tridis_HTR / README.md
metadata
license: mit
widget:
  - text: Universis presentes [MASK] inspecturis
  - text: eandem [MASK] per omnia parati observare
  - text: yo [MASK] rey de Galicia, de las Indias
  - text: en avant contre les choses [MASK] contenues
datasets:
  - cc100
  - bigscience-historical-texts/Open_Medieval_French
  - latinwikipedia
language:
  - la
  - fr
  - es
tags:
  - handwritten-text-recognition
pipeline_tag: image-to-text

TrOCR model adapted to Handwriting Text Recognition on medieval manuscripts (12th-16th centuries)

TRIDIS (Tria Digita Scribunt) is a Handwriting Text Recognition model trained on semi-diplomatic transcriptions of medieval and Early Modern manuscripts. It is suited to documentary manuscripts, that is, manuscripts arising from legal, administrative, and memorial practices, most commonly from the Late Middle Ages (13th century onwards). It can also perform well on documents from other domains, such as literary books, scholarly treatises and cartularies, making it a versatile tool for historians and philologists transforming and analyzing historical texts.

A paper presenting the first version of the model is available here: Sergio Torres Aguilar, Vincent Jolivet. Handwritten Text Recognition for Documentary Medieval Manuscripts. Journal of Data Mining and Digital Humanities. 2023. https://hal.science/hal-03892163

Rules of transcription:

The main feature of a semi-diplomatic edition is that abbreviations have been resolved:

  • both those by suspension (facimꝰ --> facimus) and by contraction (dñi --> domini).
  • Likewise, those using conventional signs (⁊ --> et ; ꝓ --> pro) have been resolved.
  • The named entities (names of persons, places and institutions) have been capitalized.
  • The beginning of a block of text as well as the original capitals used by the scribe are also capitalized.
  • The consonantal i and u characters have been transcribed as j and v in both French and Latin.
  • Punctuation marks used in the manuscript, such as . or / or |, have not been systematically transcribed, because the transcription has been standardized with modern punctuation.
  • Corrections and words that appear cancelled in the manuscript have been transcribed surrounded by the sign $ at the beginning and at the end.
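
The `$` convention above means model output may contain cancelled words inline. As a minimal post-processing sketch (the helper name `split_cancelled` and the sample line are invented for illustration, and the code assumes `$` signs always come in pairs), one could separate cancelled spans from the running text like this:

```python
import re

# Cancelled words and corrections are wrapped in '$' signs, per the rules above.
CANCELLED = re.compile(r"\$(.*?)\$")


def split_cancelled(transcription: str) -> tuple[str, list[str]]:
    """Return (text without cancelled spans, list of cancelled spans)."""
    cancelled = [span.strip() for span in CANCELLED.findall(transcription)]
    cleaned = CANCELLED.sub(" ", transcription)
    # Collapse the extra whitespace left behind by the removed spans.
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned, cancelled


line = "Universis presentes $litteris$ litteras inspecturis"  # invented sample
text, removed = split_cancelled(line)
print(text)     # Universis presentes litteras inspecturis
print(removed)  # ['litteris']
```

Keeping the cancelled spans in a separate list, rather than discarding them, preserves the scribe's corrections for later philological analysis.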

Corpora

The model was trained on charters, registers, feudal books and legal proceedings from the Late Medieval period (11th-16th centuries).

The training and evaluation involved 2950 pages, 245k lines of text, and almost 2.3M tokens, conducted using three freely available ground-truth corpora:

Accuracy

TRIDIS was trained using an encoder-decoder architecture based on a fine-tuned version of TrOCR-large handwritten (microsoft/trocr-large-handwritten) and a RoBERTa model trained on medieval texts (magistermilitum/RoBERTa_medieval).

This final model operates in a multilingual environment (Latin, Old French, and Old Spanish) and can recognize several Latin script families (mostly Textualis and Cursiva) in documents produced circa the 11th-16th centuries.
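
A minimal inference sketch follows. It rests on several assumptions not confirmed by the text above: that the checkpoint is published on the Hub as `magistermilitum/tridis_HTR` (inferred from the repository name), that it loads through the standard TrOCR classes of the `transformers` library, and that the input is a single cropped text line, as is usual for TrOCR checkpoints. The function name and the image path are placeholders.

```python
MODEL_ID = "magistermilitum/tridis_HTR"  # assumed Hub id, inferred from the repo name


def transcribe_line(image_path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe one cropped manuscript line image (sketch, not the official recipe)."""
    # Heavy dependencies are imported lazily so this module loads without them.
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained(model_id)
    model = VisionEncoderDecoderModel.from_pretrained(model_id)

    # TrOCR expects RGB input; manuscript scans are often grayscale.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    generated_ids = model.generate(pixel_values, max_new_tokens=128)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]


# Example call (needs network access and a real line image):
# print(transcribe_line("line_0001.png"))
```

For full pages, a separate line-segmentation step would be needed before calling this function; that step is outside the scope of this sketch.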

During evaluation, the model reached an accuracy of 94.3% on the validation set. On three external unseen datasets, it obtained a CER (Character Error Rate) of about 0.06 to 0.12 and a WER (Word Error Rate) of about 0.14 to 0.26, which is about 30% lower than CRNN+CTC solutions trained on the same corpora.
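
For readers who want to compute comparable figures on their own data, here is a sketch of the usual definitions of CER and WER: Levenshtein edit distance divided by the reference length, counted in characters or words. It is not the evaluation script behind the numbers above, which may normalize text differently; the sample strings are invented.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits / reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


ref = "Universis presentes litteras inspecturis"  # invented ground truth
hyp = "Universis presentes literas inspecturis"   # invented model output
print(round(cer(ref, hyp), 3))  # 0.025
print(wer(ref, hyp))            # 0.25
```

A single missing character costs 1 of 40 characters but 1 of 4 words, which illustrates why WER values are typically well above CER values, as in the figures reported above.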