---
language:
- bg
- cs
- de
- en
- es
- fi
- fr
- nl
- pl
- sl
- multilingual
tags:
- post-ocr correction
- ocr postcorrection
metrics:
- loss
- F1
---

# OCR postcorrection task 1

This is a BertForTokenClassification model that predicts, for each token, whether it is 
part of an OCR mistake. It is based on [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) 
and fine-tuned on the dataset of the 
[2019 ICDAR competition on post-OCR correction](https://sites.google.com/view/icdar2019-postcorrectionocr). 
The dataset contains texts in the following languages:

- BG
- CZ 
- DE 
- EN 
- ES 
- FI 
- FR 
- NL 
- PL 
- SL

10% of the texts (stratified by language) were selected for validation. The test set is used as provided.
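
A stratified split of this kind can be sketched in plain Python as shown here. This is only an illustration, not the code used to build the dataset: the function name `stratified_split`, the `(text_id, language)` input format, the seed, and the rounding of the per-language validation size are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(texts, val_fraction=0.1, seed=0):
    """Split (text_id, language) pairs into train and validation ids.

    Roughly `val_fraction` of the texts of each language go to validation.
    Hypothetical helper; the rounding rule is an assumption.
    """
    by_language = defaultdict(list)
    for text_id, language in texts:
        by_language[language].append(text_id)

    rng = random.Random(seed)
    train_ids, val_ids = [], []
    for language in sorted(by_language):
        ids = sorted(by_language[language])
        rng.shuffle(ids)
        # Number of validation texts for this language.
        n_val = round(len(ids) * val_fraction)
        val_ids.extend(ids[:n_val])
        train_ids.extend(ids[n_val:])
    return train_ids, val_ids


# Toy corpus: 20 made-up text ids for each of three languages.
texts = [(f"{lang}/{i}", lang) for lang in ("DE", "EN", "NL") for i in range(20)]
train_ids, val_ids = stratified_split(texts)
```

With the toy corpus above, each language contributes the same share of its texts to the validation set, so no language is missing from validation by chance.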

The training data consists of (partially overlapping) sequences of 150 tokens. Only 
sequences with a normalized edit distance below 0.3 were included in the training and 
validation sets. The test set was not filtered on edit distance.
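
The windowing and filtering step might look roughly like the sketch below. It rests on several assumptions that the description above does not settle: the edit distance is taken to be character-level Levenshtein distance, normalization divides by the length of the longer string, and the windows advance by a step of 75 tokens. All function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def normalized_edit_distance(ocr_text, gold_text):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(ocr_text), len(gold_text))
    return levenshtein(ocr_text, gold_text) / longest if longest else 0.0


def windows(tokens, size=150, step=75):
    """Yield overlapping windows of at most `size` tokens (step is assumed)."""
    if len(tokens) <= size:
        yield tokens
        return
    for start in range(0, len(tokens) - size + step, step):
        yield tokens[start:start + size]


def keep_sequence(ocr_text, gold_text, threshold=0.3):
    """Keep a sequence only if its normalized edit distance is below the threshold."""
    return normalized_edit_distance(ocr_text, gold_text) < threshold


chunks = list(windows(list(range(400))))
```

Filtering on edit distance removes sequences where the OCR text and the ground truth differ so much that token-level labels become unreliable, for example when the two texts are badly aligned.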

There are 3 classes in the data:

- 0: No OCR mistake
- 1: Start token of an OCR mistake
- 2: Inside token of an OCR mistake
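
To turn per-token predictions in this scheme into mistake spans, consecutive labels can be grouped as in the following sketch. The function name `mistake_spans` is hypothetical, and treating an inside label (2) that has no preceding start label as the start of a new span is an assumption made here for robustness, not documented model behavior.

```python
def mistake_spans(labels):
    """Group per-token labels (0, 1, 2) into (start, end) spans, end exclusive."""
    spans = []
    start = None
    for index, label in enumerate(labels):
        if label == 1:
            # A start label closes any open span and opens a new one.
            if start is not None:
                spans.append((start, index))
            start = index
        elif label == 2:
            # Assumption: a stray inside label opens a span of its own.
            if start is None:
                start = index
        else:
            # Label 0 closes the open span, if any.
            if start is not None:
                spans.append((start, index))
                start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans


example = mistake_spans([0, 1, 2, 0, 1, 1, 0, 2])
```

Separate start and inside labels make it possible to distinguish two adjacent mistakes from one mistake that covers several tokens.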

## Results

| Set | Loss |
| -- | -- |
| Train | 0.224500 |
| Val | 0.285791 |
| Test | 0.417836 |

Average F1 by language:

| BG | CZ | DE | EN | ES | FI | FR | NL | PL | SL |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 0.74 | 0.69 | 0.96 | 0.67 | 0.63 | 0.83 | 0.65 | 0.69 | 0.80 | 0.69 |

## Demo

[Space for this model.](https://huggingface.co/spaces/jvdzwaan/ocrpostcorrection-task1-demo)

## Code

* [OCR post correction package](https://github.com/jvdzwaan/ocrpostcorrection)
* [Notebooks](https://github.com/jvdzwaan/ocrpostcorrection-notebooks)
  - [Jupyter notebook used for generating the training data](https://github.com/jvdzwaan/ocrpostcorrection-notebooks/blob/main/local/icdar-create-hf-dataset.ipynb)
  - [Jupyter notebook used for training the model](https://github.com/jvdzwaan/ocrpostcorrection-notebooks/blob/main/colab/icdar-task1-hf-train.ipynb)
  - [Jupyter notebook used for evaluating the model](https://github.com/jvdzwaan/ocrpostcorrection-notebooks/blob/main/colab/icdar-task1-hf-evaluation.ipynb)