
metadata

title: README
emoji: 🚀
colorFrom: blue
colorTo: yellow
sdk: static
pinned: false

MarineLives is a volunteer-led collaboration for the transcription and enrichment of English High Court of Admiralty records from the C16th and C17th. The records provide a rich and underutilised source of social, material and economic history.

RESEARCH FOCUS

Broad objective: Explore the potential for small LLMs to support the cleaning of raw HTR (handwritten text recognition) output after the machine transcription of English High Court of Admiralty depositions. We have both raw HTR output and human-corrected HTR output for the same tokens, with page-to-page congruence and broadly line-by-line congruence.

Fine-tuning and comparing:
- mT5-small model (300 mill parameters)
- GPT-2 Small model (124 mill parameters)
- Llama 3.2 1B model (1 bill parameters)
Starting with testing the capabilities of the mT5-Small model using:

- (a) Page-to-page dataset of 100 .txt pages of raw HTR output which are congruent with 100 pages of hand-corrected HTR output to near Ground Truth standard
- (b) Line-by-line dataset of 40,000 lines of raw HTR output which are congruent with 40,000 lines of hand-corrected HTR output
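A minimal sketch of how the line-by-line dataset could be turned into training pairs, assuming each page exists as two plain-text strings (raw and corrected) with one manuscript line per text line. The function name `pair_lines` and the sample strings in the usage note are hypothetical and are not taken from the corpus.

```python
def pair_lines(raw_text: str, corrected_text: str) -> list[tuple[str, str]]:
    """Pair each raw HTR line with the corrected line at the same position.

    Assumes strict line-by-line congruence. Pages that are only broadly
    congruent raise an error so they can be aligned by hand first.
    """
    raw_lines = raw_text.splitlines()
    corrected_lines = corrected_text.splitlines()
    if len(raw_lines) != len(corrected_lines):
        raise ValueError(
            f"line counts differ: {len(raw_lines)} raw vs "
            f"{len(corrected_lines)} corrected"
        )
    # Strip surrounding whitespace but keep spelling and punctuation untouched.
    return [
        (raw.strip(), fixed.strip())
        for raw, fixed in zip(raw_lines, corrected_lines)
    ]
```

For example, `pair_lines("the shipp\nof Londcn", "the shipp\nof London")` yields two (raw, corrected) pairs, which could then be fed to a sequence-to-sequence model as source and target. Raising on a mismatch, rather than silently truncating, is a design choice that keeps misaligned pages out of the training data.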
Fine-tuning and comparing the same models with increasingly larger training data sets
- 100 pages = 40,000 lines = 0.4 mill words
- 200 pages = 80,000 lines = 0.8 mill words
- 400 pages = 160,000 lines = 1.6 mill words
- 800 pages = 320,000 lines = 3.2 mill words
Examine the following outputs from fine tuning:
- Ability to correct words according to their Parts of Speech
- Ability to correct words according to their semantic context (specifying the number of words or tokens before and after a word in which to look for semantic context)
- Ability to assess grammatical correctness of Early Modern English and Notarial Latin at phrase level
- Ability to identify and distinguish English and Latin language text
- Ability to accurately identify and delete HTR artefacts (produced by non-textual data on original scanned image)
- Ability to identify redundant or duplicated words which were deleted in the original manuscript but have been included without deletion marks in the HTR text output, and to propose them for deletion to a human expert
- Ability to insert text at an insertion mark recorded in the HTR output text, selecting the text to insert from the line above or below the line containing the insertion mark
- Ability to identify structural components of a legal deposition (front matter; section headings; numbered articles in allegations; numbered positions in libels; signatures)
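The semantic-context item in the list above specifies a number of words or tokens before and after a target word. A minimal sketch of that windowing step, assuming simple whitespace tokenisation; the function name `context_window` and its default window sizes are hypothetical.

```python
def context_window(
    tokens: list[str], index: int, before: int = 3, after: int = 3
) -> tuple[list[str], str, list[str]]:
    """Return (left context, target word, right context) for tokens[index].

    The window is clipped at the start and end of the token list, so words
    near a line or page boundary simply receive a shorter context.
    """
    if not 0 <= index < len(tokens):
        raise IndexError("index is outside the token list")
    start = max(0, index - before)
    left = tokens[start:index]
    right = tokens[index + 1 : index + 1 + after]
    return left, tokens[index], right
```

Varying `before` and `after` gives a simple way to compare how much surrounding text a fine-tuned model needs before its corrections stop improving.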
Explore the ability to use a fine-tuned domain specific small LLM to control post-HTR cleanup process steps
- Process Step One: Run rule-based Python script to expand abbreviations and contractions
- Process Step Two: Run LLM-based process to (a) auto-correct clear errors and (b) escalate correction options to a human expert, providing logic and requesting a decision
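The two process steps could be sketched as follows. This is an illustration under stated assumptions, not the project's script: the abbreviation table, the `Suggestion` record, the function names and the 0.9 confidence threshold are all hypothetical, and the LLM itself is not called here. Its proposed corrections are assumed to arrive with a confidence score and a rationale.

```python
import re
from dataclasses import dataclass

# Hypothetical abbreviation table; a real one would be built from the corpus.
ABBREVIATIONS = {"wch": "which", "wth": "with", "yt": "that"}

# Match whole words only, trying longer abbreviations first.
_ABBREVIATION_RE = re.compile(
    r"\b("
    + "|".join(re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True))
    + r")\b"
)


def expand_abbreviations(line: str) -> str:
    """Step One: rule-based expansion of known abbreviations."""
    return _ABBREVIATION_RE.sub(lambda m: ABBREVIATIONS[m.group(1)], line)


@dataclass
class Suggestion:
    original: str
    proposed: str
    confidence: float  # assumed to be supplied by the LLM-based step
    rationale: str     # the logic shown to the human expert


def route_suggestions(
    suggestions: list[Suggestion], threshold: float = 0.9
) -> tuple[list[Suggestion], list[Suggestion]]:
    """Step Two: split suggestions into auto-corrections and escalations."""
    auto_correct, escalate = [], []
    for suggestion in suggestions:
        if suggestion.confidence >= threshold:
            auto_correct.append(suggestion)
        else:
            escalate.append(suggestion)
    return auto_correct, escalate
```

Keeping the rationale alongside each escalated suggestion means the human expert sees the proposed logic together with the request for a decision.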
Examine existing benchmarks for transcription accuracy, develop domain-specific benchmarks for transcription accuracy, and apply both to the fine-tuned models
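One widely used measure of transcription accuracy is character error rate (CER): the character-level edit distance between a transcription and its reference, divided by the length of the reference. A minimal pure-Python sketch follows; the function names are illustrative, and a domain-specific benchmark would probably also need agreed normalisation rules (for example, for abbreviations and spacing), which are not shown here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,                       # deletion
                    current[j - 1] + 1,                    # insertion
                    previous[j - 1] + (char_a != char_b),  # substitution
                )
            )
        previous = current
    return previous[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate of a hypothesis against a reference text."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(hypothesis, reference) / len(reference)
```

Computing `cer` once for raw HTR output and once for model-corrected output, both against the hand-corrected text, gives a simple before-and-after comparison for each fine-tuned model.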
User testing of impact of corrections via fine-tuned small LLMs
- Correction of single letter errors in word
- Correction of double letter errors in word
- Correction of single letter omission in word
- Correction of double letter omission in word
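To group test items into the four categories above, raw and corrected word pairs could be labelled automatically. The sketch below assumes that "single" and "double" refer to the number of letters affected, and that an omission means the raw word is the corrected word with one or two letters missing. Both readings are assumptions, and the function names are hypothetical.

```python
def is_subsequence(short: str, long: str) -> bool:
    """True if the letters of `short` appear in `long` in the same order."""
    remaining = iter(long)
    return all(char in remaining for char in short)


def classify_error(raw: str, corrected: str) -> str:
    """Label the difference between a raw HTR word and its correction."""
    if raw == corrected:
        return "no error"
    missing = len(corrected) - len(raw)
    if missing == 0:
        # Same length: count the positions where the letters differ.
        wrong = sum(a != b for a, b in zip(raw, corrected))
        if wrong == 1:
            return "single letter error"
        if wrong == 2:
            return "double letter error"
    elif missing in (1, 2) and is_subsequence(raw, corrected):
        prefix = "single" if missing == 1 else "double"
        return f"{prefix} letter omission"
    return "other"
```

For instance, `classify_error("mastr", "master")` returns `"single letter omission"`. Pairs that fall outside the four categories are labelled `"other"` so they can be reviewed separately.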
User testing of readability of raw HTR and different levels of machine and hand correction
- Impact on readability of raw HTR + rule-based Python script optimised to domain
- Impact on readability of raw HTR + rule-based Python script optimised to domain + different categories of fine-tuned small LLM machine adjustment

DATASETS

We have three datasets available to researchers working on Early Modern English in the late C16th and early to mid-C17th:

1. Hand transcribed Ground Truth [420,000 tokens]
2. Machine transcribed and hand corrected corpus [4.5 mill tokens]
3. Hand transcribed Early Modern non-elite letters [100,000 tokens]

Dataset 1 is a full diplomatic transcription, preserving abbreviations, contractions, capitalisation, punctuation, spelling variation, and syntax. It comprises roughly thirty different notarial hands drawn from sixteen different volumes of depositions made in the English High Court of Admiralty between 1627 and 1660 [HCA 13/46; HCA 13/48; HCA 13/49; HCA 13/51; HCA 13/52; HCA 13/55; HCA 13/56; HCA 13/57; HCA 13/58; HCA 13/59; HCA 13/60; HCA 13/61; HCA 13/64; HCA 13/65; HCA 13/71; HCA 13/72].

Dataset 1 has been used to train multiple bespoke HTR models. The most recent is 'HCA Secretary Hand 4.404 Pylaia' (Transkribus model ID = 42966). The training parameters are:
- No base model
- Learning rate = 0.00015
- Target epochs = 500
- Early stopping = 400 epochs
- Compressed images
- Deslant turned on

CER = 6.10%, with robust performance in the wild on different notarial hands, including unseen hands.

Dataset 2 is a semi-diplomatic transcription, which expands abbreviations and contractions but preserves capitalisation, punctuation, spelling variation and syntax. It contains over sixty different notarial hands and is drawn from twelve different volumes written between 1607 and 1660 [HCA 13/39; HCA 13/44; HCA 13/51; HCA 13/52; HCA 13/53; HCA 13/57; HCA 13/58; HCA 13/61; HCA 13/63; HCA 13/68; HCA 13/71; HCA 13/73].

We are working on a significantly larger version of Dataset 2, which (when complete) will have circa 30 mill tokens and will comprise fifty-nine complete volumes of Admiralty Court depositions made between 1570 and 1685. We are targeting completion by the end of 2025.

Dataset 3 is a full diplomatic transcription of 400 Early Modern letters, preserving abbreviations, contractions, capitalisation, punctuation, spelling variation, and syntax. It comprises over 250 hands of non-elite writers, largely men but some women, from a range of marine-related occupations (mariners, shore tradesmen, dockyard employees), written between 1600 and 1685.