---
language:
- bg
- cs
- de
- en
- es
- fi
- fr
- nl
- pl
- sl
- multilingual
tags:
- post-ocr correction
- ocr postcorrection
metrics:
- loss
- F1
---

# OCR postcorrection task 1

This is a BertForTokenClassification model that predicts whether a token is an OCR mistake or not.
It is based on [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) and fine-tuned on the dataset of the [2019 ICDAR competition on post-OCR correction](https://sites.google.com/view/icdar2019-postcorrectionocr).
The dataset contains texts in the following languages:

- BG
- CZ
- DE
- EN
- ES
- FI
- FR
- NL
- PL
- SL

10% of the texts (stratified on language) were selected for validation. The test set is as provided.

The training data consists of (partially overlapping) sequences of 150 tokens. Only sequences with a normalized edit distance below 0.3 were included in the training and validation sets. The test set was not filtered on edit distance.

There are 3 classes in the data:

- 0: No OCR mistake
- 1: Start token of an OCR mistake
- 2: Inside token of an OCR mistake

## Results

| Set | Loss |
| -- | -- |
| Train | 0.224500 |
| Val | 0.285791 |
| Test | 0.4178357720375061 |

Average F1 by language:

| BG | CZ | DE | EN | ES | FI | FR | NL | PL | SL |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 0.74 | 0.69 | 0.96 | 0.67 | 0.63 | 0.83 | 0.65 | 0.69 | 0.8 | 0.69 |

## Demo

[Space for this model.](https://huggingface.co/spaces/jvdzwaan/ocrpostcorrection-task1-demo)

## Code

* [OCR post correction package](https://github.com/jvdzwaan/ocrpostcorrection)
* [Notebooks](https://github.com/jvdzwaan/ocrpostcorrection-notebooks)
  - [Jupyter notebook used for generating the training data](https://github.com/jvdzwaan/ocrpostcorrection-notebooks/blob/main/local/icdar-create-hf-dataset.ipynb)
  - [Jupyter notebook used for training the model](https://github.com/jvdzwaan/ocrpostcorrection-notebooks/blob/main/colab/icdar-task1-hf-train.ipynb)
  - [Jupyter notebook used for evaluating the model](https://github.com/jvdzwaan/ocrpostcorrection-notebooks/blob/main/colab/icdar-task1-hf-evaluation.ipynb)
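
The training data description above (overlapping 150-token sequences, kept only when the normalized edit distance is below 0.3) can be illustrated with a minimal sketch. This is not the code from the linked notebooks. The helper names, the step of 75 tokens between windows, and the choice to normalize the Levenshtein distance by the length of the longer string are all assumptions made for illustration; the model card does not specify them.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_edit_distance(ocr: str, gold: str) -> float:
    """Assumed normalization: distance divided by the longer string's length."""
    longest = max(len(ocr), len(gold))
    return levenshtein(ocr, gold) / longest if longest else 0.0


def windows(tokens: List[str], size: int = 150, step: int = 75) -> List[List[str]]:
    """Split tokens into overlapping windows; the step value is hypothetical."""
    if not tokens:
        return []
    if len(tokens) <= size:
        return [tokens]
    starts = range(0, len(tokens) - size + step, step)
    return [tokens[s:s + size] for s in starts]


def keep_sequence(ocr: str, gold: str, threshold: float = 0.3) -> bool:
    """Mirror the card's filter: keep only sequences below the threshold."""
    return normalized_edit_distance(ocr, gold) < threshold
```

For example, `keep_sequence("Tbe qnick brown fox", "The quick brown fox")` returns `True`, because two substitutions over 19 characters give a normalized distance of about 0.105.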
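
The three classes listed above (0, 1, 2) follow a begin/inside scheme, so per-token predictions can be grouped into mistake spans. The sketch below shows one possible way to do that grouping. The function name `mistake_spans` is hypothetical, and treating a stray class 2 token without a preceding class 1 as the start of a new span is an assumption, not documented behavior of the model or the linked package.

```python
from typing import List, Tuple


def mistake_spans(labels: List[int]) -> List[Tuple[int, int]]:
    """Group per-token labels into (start, end) index pairs, end exclusive."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, label in enumerate(labels):
        if label == 1:
            # A new mistake begins; close any mistake that is still open.
            if start is not None:
                spans.append((start, i))
            start = i
        elif label == 2:
            # Continuation; assumed to open a span if no start token preceded it.
            if start is None:
                start = i
        else:
            # Class 0 closes any open mistake.
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans


tokens = ["Tbe", "qn", "ick", "brown", "fox"]
labels = [1, 1, 2, 0, 0]
for begin, end in mistake_spans(labels):
    print(tokens[begin:end])
```

With the hand-written labels in this example, the loop prints `['Tbe']` and then `['qn', 'ick']`.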