Post-OCR-Correction: 1 billion words dataset of automated OCR correction by LLM

Community Article Published April 26, 2024

Pleias publishes the largest open dataset with automated OCR correction:

  • Post-OCR Correction includes 1 billion words from Common Corpus, a 500-billion-word open corpus released last month by Pleias.
  • The dataset is multilingual and includes cultural heritage texts from newspapers and monographs in French, English, German and Italian.
  • This recent breakthrough in post-OCR correction was made possible by progress in open LLM research and by several months of dedicated training and alignment by Pleias.
  • Generation of Post-OCR correction was performed using HPC resources from GENCI–IDRIS (Grant 2023-AD011014736) on Jean-Zay.

The problem with OCR

OCR quality is a long-standing issue in digitization. Cultural heritage texts are especially affected, both because the primary sources are old documents (with many artifacts, blots, degradation) and because OCR technology has limitations for historical scripts.

When we released Common Corpus one month ago, this was one of the main criticisms, as most of the texts came with many errors (frequently at least one every 10 words, sometimes many more). This obviously raises many uncertainties regarding the potential use of this resource for cultural analytics and language model training.

OCR models already use a very primitive form of LLM: they tend to replace badly printed words with words that are more probable in contemporary texts. That’s how we end up with many occurrences of “internet” in 19th-century works in digital libraries (typically a distortion of a misprinted "interest").

Transformers or SSM models have the potential to perform this same task considerably better, as they take into account the entire context surrounding a word. In theory, they could provide the same quality of word replacement as a human reader. In practice, it is more complicated. Our initial tests showed a range of issues, including hallucinations and omissions, as well as a fascinating problem that could warrant further research: language switching. OCR mistakes seem to interfere with language detection in the embedding space, so that models generate the corrected text in… French or German. This is very prevalent in zero-shot use of generalist models like Claude; it is much rarer in our dataset but can still occur.

To give a typical example, this excerpt from The New York Herald Tribune (May 9th, 1853):

FTSAffCIAX AJTD COMMKRCIAL.

MONKY MARKET.

Sunday, May &?6 P. M.

the lapector U4**nl of Canada liu Juat laid before ne Is^i-lature the mnutluhlr. of the tr?de au 1 navigation of tlut province far 1H52 mule up to the Oittt of iael January, a document preparrd ?iib <r-?t c > re and fixalif a volume of four huuln^l an i tof f ei>cht pages, it am whieh we perceive then has bran a faiiin/ olT in the iinpertnlione from Greet Uritain during the i?a<t year to the extent at $1,377 000 Tbi? appears to have been the raatdt, however, of an exc?ns of iiu portati.tna froa that oountry in 1861 which exceeded th >?e >f 1850 by $2,416, 212. But there wa- an Increase io 1862 over 1851 from the Nerth Ameiica"

is incorrectly "corrected" into French, with a few occasional anglicisms:

FTS Affiché et Commerce Commercial.

MONNAIE MARKET.

Dimanche, May 26 P. M.

Le législateur a fait devant l'assemblée la présentation de l'importance de la traite et de la navigation de la province pour 1852, faisant up à la fin de janvier, un document préparé sous la direction de et fixé à un volume de quatre heures et demie, contenant huit pages, ce que nous percevons qu'il y a eu une diminution dans l'importance de la Great Britain durant l'année précédente à l'extent de $1,377,000. Cela semble avoir été le résultat, cependant, d'une excédent de l'importation provenant de ce pays en 1861 qui a été de $2,416,212. Mais il y a eu une augmentation en 1862 par rapport à 1851 de la part des colonies d'Amérique.
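Language switching of this kind can be flagged automatically before a correction is accepted. The following is a minimal sketch, not the method used by Pleias: it guesses the language of a corrected text by counting matches against tiny hand-picked function-word lists. The names `STOPWORDS`, `guess_language` and `language_switched` are hypothetical, and the word lists are illustrative assumptions; a real pipeline would more likely rely on a dedicated language identification model.

```python
import re

# Tiny, hand-picked function-word lists (an illustrative assumption,
# far too small for production use).
STOPWORDS = {
    "en": {"the", "of", "and", "to", "in", "is", "that", "was", "by", "from", "which"},
    "fr": {"le", "la", "les", "de", "des", "et", "un", "une", "du", "que", "qui", "dans", "pour"},
    "de": {"der", "die", "das", "und", "den", "von", "zu", "mit", "ist", "ein", "eine", "nicht"},
    "it": {"il", "di", "che", "e", "la", "le", "un", "una", "per", "del", "della", "non", "con"},
}


def guess_language(text):
    """Return the language code whose function words match most tokens, or None."""
    # Keep alphabetic runs only, so "l'assemblée" yields "l" and "assemblée".
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def language_switched(expected_lang, corrected_text):
    """True when the corrected text looks like a different language than expected."""
    guess = guess_language(corrected_text)
    return guess is not None and guess != expected_lang


french_output = (
    "Le législateur a fait devant l'assemblée la présentation de l'importance "
    "de la traite et de la navigation de la province pour 1852"
)
print(language_switched("en", french_output))
```

Applied to the example above, where the source newspaper is in English, the check reports a switch because French function words dominate the corrected output.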

Current results and use cases

Pleias is a French start-up specialized in training Large Language Models for document processing on fully open and auditable corpora.

We have been working on post-OCR correction by LLMs for nearly three months. Results are now encouraging most of the time, on a variety of European languages, even when the text is severely degraded.

For instance, this excerpt with many OCR errors from the Omaha Bee (June 25, 1890):

"THE OMAHA ! DAILY BEE.

TWENTIETH YEAR. OMAHA. WEDNESDAY JMjgNING. ( ! JUNE 25. 1890. NUMBER 7.

LICKED UP BY THE FLAMES , An Incendiary Wreaks His Vengeance o Blue Hill , Nebraska. NEARLY TWENTY STORES BLOTTED OUT , Tlio Amount of lnmnc Done Iloimlily Kutlmnted .nt Over Fifty Thousand DollurH , With Comparatively Little Insurance.

BLUB HIM , Neb. , Juno 24. ( Special Tele-pram to TUB BBK. ) At 2M : this morning a.flro broke out simultaneously In two places on the north sldo of Main street in Blue Hill. The ono at the opera house , nt almost the ex treme cast end of the street , was extinguished by the efforts of O. C. 1C. Lolgman , Mrs. B. II. Munson and the girl help at the Muuson Louse. I"

is reconstructed very accurately, with only a few elements that may be incorrect (for example, "2M" becomes "2:30"):

"THE OMAHA DAILY BEE.

TWENTIETH YEAR. OMAHA, WEDNESDAY MORNING. JUNE 25, 1890. NUMBER 7.

LICKED UP BY THE FLAMES,

An Incendiary Wreaks His Vengeance on Blue Hill, Nebraska.

NEARLY TWENTY STORES BLOTTED OUT,

The Amount of Damage Done Is Estimated at Over Fifty Thousand Dollars, With Comparatively Little Insurance.

BLUE HILL, Neb., June 24. (Special Telegraph to THE BEE.) At 2:30 this morning a fire broke out simultaneously in two places on the north side of Main street in Blue Hill. The one at the opera house, at almost the extreme east end of the street, was extinguished by the efforts of O. C. J. Longman, Mrs. B. H. Munson and the girl help at the Munson House."
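Uncertain substitutions such as "2M" becoming "2:30" are easier to review when the differences between the raw OCR and the corrected text are listed explicitly. Below is a minimal sketch using Python's standard `difflib` module; the helper names `changed_spans` and `length_ratio` are hypothetical, and whitespace tokenization is a simplifying assumption. The length ratio is only a rough signal: a value far below 1 may hint at omissions, a value far above 1 at hallucinated additions.

```python
import difflib


def changed_spans(original, corrected):
    """List (original, corrected) word spans that differ between the two texts."""
    src, dst = original.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=src, b=dst, autojunk=False)
    return [
        (" ".join(src[i1:i2]), " ".join(dst[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def length_ratio(original, corrected):
    """Ratio of corrected word count to original word count."""
    return len(corrected.split()) / max(len(original.split()), 1)


ocr_text = "At 2M : this morning a.flro broke out"
fixed_text = "At 2:30 this morning a fire broke out"

for before, after in changed_spans(ocr_text, fixed_text):
    print(f"{before!r} -> {after!r}")
print(length_ratio(ocr_text, fixed_text))
```

On this short fragment of the Omaha Bee excerpt, the changed spans are "2M :" → "2:30" and "a.flro" → "a fire", which is the kind of list a human reviewer could scan quickly.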

All the texts in the dataset come from collections integrated into Common Corpus. The corpus comprises mostly 19th-century texts in French, English, German and Italian, with the following distribution:

  • French: newspaper texts from Gallica, 438,034,960 words.
  • English: newspaper texts from Chronicling America, 300,522,681 words.
  • Italian: monograph texts from various sources, notably Internet Archive, 144,441,539 words.
  • German: monograph texts from various sources, notably Internet Archive, 97,396,147 words.
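As a quick check of the figures above, the per-language counts can be summed and converted to shares. This is plain arithmetic on the numbers listed; the variable names are illustrative.

```python
# Word counts per language, copied from the list above.
WORD_COUNTS = {
    "French": 438_034_960,
    "English": 300_522_681,
    "Italian": 144_441_539,
    "German": 97_396_147,
}

total_words = sum(WORD_COUNTS.values())

# Share of each language in the dataset, as a percentage rounded to one decimal.
shares = {
    lang: round(100 * count / total_words, 1)
    for lang, count in WORD_COUNTS.items()
}

print(f"Total: {total_words:,} words")
for lang, share in shares.items():
    print(f"{lang}: {share}%")
```

The total comes to 980,395,327 words, which is the "1 billion words" of the title, rounded up. French accounts for roughly 45% of the dataset, English 31%, Italian 15% and German 10%.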

As part of Pleias' commitment to open science, this release aims to collectively assess the quality of the post-OCR correction process, prior to the release of our LLM-based post-OCR correction models.

While we would not recommend using the text output directly at this stage, it can already serve as a resource for OCR editing, especially for community initiatives like Wikisource. Text analysis of large OCR corpora in the Digital Humanities can also benefit from corrected text, even with a residual risk of errors.