@Pclanglais on Hugging Face: "Announcing that we are on our way to solve a long standing issue of document…"

Post

1645

Announcing that we are on our way to solve a long standing issue of document processing: correction of OCR mistakes. Pleias publishes the largest dataset to date with automated OCR correction, 1 billion words in English, French, German and Italian.

OCR quality is long-standing issue of digitization. Cultural heritage texts are especially concerned due to the primary sources being old documents (with many artifacts, blots, degradation) and to the limitation of OCR technology for historical scripts. When we released Common Corpus, a 500 Billion words corpus in the public domain, this was the primary criticism.

Recent breakthrough in post-OCR correction has been made possible thanks to progress in open LLM research and several months of dedicated training and alignment by Pleias as well as the HPC resources from GENCI–IDRIS (Grant 2023-AD011014736) on Jean-Zay.

Announcement: https://huggingface.co/blog/Pclanglais/post-ocr-correction

Post-OCR-Correction dataset: PleIAs/Post-OCR-Correction

Join the conversation