---
language:
- bg
- cs
- de
- en
- es
- fi
- fr
- nl
- pl
- sl
- multilingual
tags:
- post-ocr correction
- ocr postcorrection
metrics:
- loss
- F1
---
# OCR postcorrection task 1
This is a `BertForTokenClassification` model that predicts, for each token, whether it is an OCR
mistake. It is based on [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased)
and fine-tuned on the dataset of the
[2019 ICDAR competition on post-OCR correction](https://sites.google.com/view/icdar2019-postcorrectionocr).
The dataset contains texts in the following languages:
- BG
- CZ
- DE
- EN
- ES
- FI
- FR
- NL
- PL
- SL
10% of the texts (stratified by language) were selected for validation. The test set is used as provided.
The training data consists of (partially overlapping) sequences of 150 tokens. Only
sequences with a normalized edit distance below 0.3 were included in the training and
validation sets. The test set was not filtered on edit distance.
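
The windowing and filtering described above can be sketched as follows. This is a minimal illustration, not the code from the training-data notebook: the function names, the step size of 75 tokens, and the choice to normalize by the length of the longer string are all assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_edit_distance(ocr: str, gold: str) -> float:
    # Assumption: divide by the length of the longer string.
    longest = max(len(ocr), len(gold))
    return levenshtein(ocr, gold) / longest if longest else 0.0


def overlapping_windows(tokens, size=150, step=75):
    """Split tokens into windows of at most `size` tokens.

    A step smaller than the size makes consecutive windows overlap.
    The step of 75 is an assumed value, not taken from the model card.
    """
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return windows


def keep_sequence(ocr: str, gold: str, threshold: float = 0.3) -> bool:
    """Keep a sequence only if its normalized edit distance is below the threshold."""
    return normalized_edit_distance(ocr, gold) < threshold
```

For example, `keep_sequence("Tbe cat", "The cat")` keeps the sequence, because one substitution over seven characters is well below 0.3.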
There are 3 classes in the data:
- 0: No OCR mistake
- 1: Start token of an OCR mistake
- 2: Inside token of an OCR mistake
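
To make the labeling scheme concrete, the sketch below groups a sequence of class labels into OCR mistake spans. The function name `mistake_spans` is hypothetical, and treating a stray class 2 label (one with no preceding class 1) as the start of a mistake is an assumption.

```python
def mistake_spans(labels):
    """Return (start, end) token index pairs, end exclusive, one per OCR mistake."""
    spans = []
    start = None  # index where the currently open mistake began, if any
    for i, label in enumerate(labels):
        if label == 1:
            # A start token closes any open mistake and opens a new one.
            if start is not None:
                spans.append((start, i))
            start = i
        elif label == 2:
            # An inside token continues the open mistake.
            # Assumption: without an open mistake, it starts one.
            if start is None:
                start = i
        else:
            # Class 0 closes any open mistake.
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans
```

With labels `[0, 1, 2, 0, 1, 1, 0]`, tokens 1 and 2 form one mistake, and tokens 4 and 5 are two separate mistakes because each carries a start label.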
## Results
| Set | Loss |
| -- | -- |
| Train | 0.224500 |
| Val | 0.285791 |
| Test | 0.417836 |
Average F1 by language:
| BG | CZ | DE | EN | ES | FI | FR | NL | PL | SL |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 0.74 | 0.69 | 0.96 | 0.67 | 0.63 | 0.83 | 0.65 | 0.69 | 0.80 | 0.69 |
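
One way to arrive at a per-language average F1 is sketched below. It is an illustration only; the evaluation notebook is the authoritative source for how the numbers above were computed. The sketch assumes a binary view of detection (any non-zero class counts as a mistake), one F1 score per text, and an unweighted mean per language. Both function names are hypothetical.

```python
from collections import defaultdict


def detection_f1(gold, pred):
    """F1 for OCR mistake detection on one text, treating classes 1 and 2 as positive."""
    tp = sum(1 for g, p in zip(gold, pred) if g != 0 and p != 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p != 0)
    fn = sum(1 for g, p in zip(gold, pred) if g != 0 and p == 0)
    if tp == 0:
        return 0.0  # also avoids division by zero when nothing was detected
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_f1_by_language(texts):
    """texts: iterable of (language, gold_labels, predicted_labels) tuples."""
    scores = defaultdict(list)
    for language, gold, pred in texts:
        scores[language].append(detection_f1(gold, pred))
    # Unweighted mean of the per-text scores for each language.
    return {language: sum(values) / len(values) for language, values in scores.items()}
```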
## Demo
[Space for this model.](https://huggingface.co/spaces/jvdzwaan/ocrpostcorrection-task1-demo)
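
To run the model locally instead of in the Space, something like the following sketch may work. The value of `MODEL_ID` is an assumption and should be checked against this model's page on the Hub. Taking the label of the first subword as the label of the whole word is also an assumption, as are the function names.

```python
MODEL_ID = "jvdzwaan/ocrpostcorrection-task-1"  # assumption: verify the exact id on the Hub


def word_level_labels(word_ids, subword_labels):
    """Keep the label of the first subword of every word.

    `word_ids` follows the convention of `BatchEncoding.word_ids()`: one entry
    per subword, with None for special tokens such as [CLS] and [SEP].
    """
    labels = []
    previous = None
    for word_id, label in zip(word_ids, subword_labels):
        if word_id is not None and word_id != previous:
            labels.append(label)
        previous = word_id
    return labels


def detect_ocr_mistakes(words, model_id=MODEL_ID):
    """Return one class label (0, 1 or 2) per input word."""
    # Imported here so the helper above works without the heavy dependencies.
    import torch
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(model_id)
    model.eval()

    encoding = tokenizer(
        words, is_split_into_words=True, truncation=True, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(**encoding).logits  # shape: (1, number of subwords, 3)
    subword_labels = logits.argmax(dim=-1)[0].tolist()
    return word_level_labels(encoding.word_ids(0), subword_labels)
```

Calling `detect_ocr_mistakes("Tbis is a sentcnce".split())` downloads the model on first use and returns a list with one label per word. Since the model was trained on sequences of 150 tokens, longer texts would likely need to be split into windows first.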
## Code
* [OCR post correction package](https://github.com/jvdzwaan/ocrpostcorrection)
* [Notebooks](https://github.com/jvdzwaan/ocrpostcorrection-notebooks)
  - [Jupyter notebook used for generating the training data](https://github.com/jvdzwaan/ocrpostcorrection-notebooks/blob/main/local/icdar-create-hf-dataset.ipynb)
  - [Jupyter notebook used for training the model](https://github.com/jvdzwaan/ocrpostcorrection-notebooks/blob/main/colab/icdar-task1-hf-train.ipynb)
  - [Jupyter notebook used for evaluating the model](https://github.com/jvdzwaan/ocrpostcorrection-notebooks/blob/main/colab/icdar-task1-hf-evaluation.ipynb)