|
--- |
|
language: |
|
- bg |
|
- cs |
|
- de |
|
- en |
|
- es |
|
- fi |
|
- fr |
|
- nl |
|
- pl |
|
- sl |
|
- multilingual |
|
tags: |
|
- post-ocr correction |
|
- ocr postcorrection |
|
metrics: |
|
- loss |
|
- F1 |
|
--- |
|
|
|
# OCR postcorrection task 1 |
|
|
|
This is a `BertForTokenClassification` model that predicts, for each token, whether it is |
|
part of an OCR mistake. It is based on [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) |
|
and fine-tuned on the dataset of the |
|
[2019 ICDAR competition on post-OCR correction](https://sites.google.com/view/icdar2019-postcorrectionocr). |
|
The dataset contains texts in the following languages: |
|
|
|
- BG |
|
- CZ |
|
- DE |
|
- EN |
|
- ES |
|
- FI |
|
- FR |
|
- NL |
|
- PL |
|
- SL |
|
|
|
For validation, 10% of the texts were held out, stratified by language. The test set is used as provided. |
|
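A minimal sketch of a language-stratified 10% hold-out, to illustrate what the split above means. The file names, the `stratified_split` helper, and the seed are hypothetical; the notebooks linked under Code contain the procedure actually used, which may differ.

```python
import random
from collections import defaultdict


def stratified_split(items, key, val_fraction=0.1, seed=42):
    """Split items into (train, val), holding out val_fraction of each group.

    Hypothetical helper: `key` maps an item to its stratum (here, a language).
    """
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)

    rng = random.Random(seed)
    train, val = [], []
    for _, members in sorted(groups.items()):
        members = members[:]  # copy so the caller's list is not shuffled
        rng.shuffle(members)
        # Hold out at least one text per language, even for tiny groups.
        n_val = max(1, round(len(members) * val_fraction))
        val.extend(members[:n_val])
        train.extend(members[n_val:])
    return train, val


# Made-up file names: 20 texts for each of three languages.
texts = [(f"{lang}/{i}.txt", lang) for lang in ("DE", "EN", "FR") for i in range(20)]
train, val = stratified_split(texts, key=lambda t: t[1])
```

Splitting per language keeps the validation set's language mix close to that of the training set, which matters here because the per-language scores differ considerably.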
|
|
The training data consists of (partially overlapping) sequences of 150 tokens. Only |
|
sequences with a normalized edit distance below 0.3 were included in the training and |
|
validation sets. The test set was not filtered on edit distance. |
|
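The windowing and filtering described above can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: the step of 75 tokens is a guess (the card only says the sequences partially overlap), the distance is computed on characters, and it is normalized by the length of the longer string.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance, computed row by row."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(ocr: str, gold: str) -> float:
    """Assumed normalization: divide by the length of the longer string."""
    longest = max(len(ocr), len(gold))
    return levenshtein(ocr, gold) / longest if longest else 0.0


def windows(tokens, size=150, step=75):
    """Cut tokens into overlapping windows; step=75 is an assumption."""
    if not tokens:
        return []
    if len(tokens) <= size:
        return [tokens]
    starts = list(range(0, len(tokens) - size + 1, step))
    if starts[-1] + size < len(tokens):
        starts.append(len(tokens) - size)  # last window covers the tail
    return [tokens[s:s + size] for s in starts]


def keep_for_training(ocr: str, gold: str, threshold: float = 0.3) -> bool:
    """Apply the < 0.3 filter used for the training and validation sets."""
    return normalized_edit_distance(ocr, gold) < threshold


chunks = windows([f"tok{i}" for i in range(400)])
keep = keep_for_training("Tbe qnick brown fox", "The quick brown fox")
```

Filtering on edit distance drops sequences where the OCR text and the ground truth diverge so much that token-level labels would be unreliable.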
|
|
Each token is labeled with one of three classes: |
|
|
|
- 0: No OCR mistake |
|
- 1: Start token of an OCR mistake |
|
- 2: Inside token of an OCR mistake |
|
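To show how the three classes encode multi-token mistakes, here is a small sketch that groups a sequence of per-token labels into mistake spans. The `mistake_spans` helper is hypothetical and not part of the linked package; treating a stray label 2 as the start of a new span is a choice made for this example.

```python
def mistake_spans(labels):
    """Group per-token labels (0, 1, 2) into (start, end) spans, end exclusive."""
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label == 1:
            # A start token closes any open span and opens a new one.
            if start is not None:
                spans.append((start, i))
            start = i
        elif label == 2:
            # An inside token extends the open span; a stray one opens a span.
            if start is None:
                start = i
        else:
            # Label 0 closes the open span, if any.
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans


spans = mistake_spans([0, 1, 2, 2, 0, 1, 1, 0])
# spans == [(1, 4), (5, 6), (6, 7)]
```

The separate start class is what lets two adjacent mistakes (tokens 5 and 6 above) be told apart from a single two-token mistake.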
|
|
## Results |
|
|
|
| Set | Loss | |
|
| -- | -- | |
|
| Train | 0.224500 | |
|
| Val | 0.285791 | |
|
| Test | 0.417836 | |
|
|
|
Average F1 by language: |
|
|
|
| BG | CZ | DE | EN | ES | FI | FR | NL | PL | SL | |
|
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | |
|
| 0.74 | 0.69 | 0.96 | 0.67 | 0.63 | 0.83 | 0.65 | 0.69 | 0.80 | 0.69 | |
|
|
|
## Demo |
|
|
|
[Space for this model.](https://huggingface.co/spaces/jvdzwaan/ocrpostcorrection-task1-demo) |
|
|
|
## Code |
|
|
|
* [OCR post correction package](https://github.com/jvdzwaan/ocrpostcorrection) |
|
* [Notebooks](https://github.com/jvdzwaan/ocrpostcorrection-notebooks) |
|
- [Jupyter notebook used for generating the training data](https://github.com/jvdzwaan/ocrpostcorrection-notebooks/blob/main/local/icdar-create-hf-dataset.ipynb) |
|
- [Jupyter notebook used for training the model](https://github.com/jvdzwaan/ocrpostcorrection-notebooks/blob/main/colab/icdar-task1-hf-train.ipynb) |
|
- [Jupyter notebook used for evaluating the model](https://github.com/jvdzwaan/ocrpostcorrection-notebooks/blob/main/colab/icdar-task1-hf-evaluation.ipynb) |
|
|