metadata
license: apache-2.0
language:
- en
- fr
- de
OCRerrcr is a small language model specialized in detecting OCR errors.
OCRerrcr was trained by Eliot Jones for PleIAs on a sample of 1000 documents with labelled OCR errors from open data documents (Finance Commons) and cultural heritage sources (Common Corpus).
To date, OCRerrcr provides the most accurate agnostic OCR error rate estimate. PleIAs has also developed an alternative pipeline for this task, OCRoscope, which scales significantly better but is also significantly less accurate, especially for documents with fewer mistakes.
The name OCRerrcr (instead of OCRerror) is a playful allusion to a common OCR misreading.
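
As a rough illustration of what an OCR error rate estimate means, the sketch below computes the share of tokens flagged as erroneous. It assumes per-token binary flags (1 = OCR error, 0 = clean) are already available. The function name `ocr_error_rate` and the flag format are hypothetical, chosen for illustration only, and are not the documented output format or API of OCRerrcr.

```python
from typing import Sequence


def ocr_error_rate(flags: Sequence[int]) -> float:
    """Return the fraction of tokens flagged as OCR errors.

    `flags` is a hypothetical per-token labelling (1 = error, 0 = clean);
    how such flags are obtained from the model is not specified here.
    """
    if not flags:
        # An empty document has no tokens to misread.
        return 0.0
    return sum(1 for flag in flags if flag) / len(flags)


# Made-up flags for an 8-token passage with 2 flagged tokens.
example_flags = [0, 0, 1, 0, 1, 0, 0, 0]
print(f"{ocr_error_rate(example_flags):.2%}")  # prints "25.00%"
```

A document-level rate like this makes it possible to rank or filter documents by OCR quality before further processing.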