jvdzwaan committed
Commit 8b7fca7
1 parent: dfb587e

Add model card (README.md)

  1. README.md +79 -0
README.md ADDED
@@ -0,0 +1,79 @@
+ ---
+ language:
+ - bg
+ - cs
+ - de
+ - en
+ - es
+ - fi
+ - fr
+ - nl
+ - pl
+ - sl
+ tags:
+ - "post-ocr correction"
+ - "ocr postcorrection"
+ metrics:
+ - loss
+ - F1
+ ---
+
+ # OCR postcorrection task 1
+
+ This is a BertForTokenClassification model that predicts whether a token is an OCR
+ mistake or not. It is based on [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased)
+ and fine-tuned on the dataset of the
+ [2019 ICDAR competition on post-OCR correction](https://sites.google.com/view/icdar2019-postcorrectionocr).
+ The dataset contains texts in the following languages:
+
+ - BG
+ - CZ
+ - DE
+ - EN
+ - ES
+ - FI
+ - FR
+ - NL
+ - PL
+ - SL
+
+ 10% of the texts (stratified by language) were selected for validation. The test set is used as provided.
+
+ The training data consists of (partially overlapping) sequences of 150 tokens. Only
+ sequences with a normalized edit distance below 0.3 were included in the training and
+ validation sets. The test set was not filtered on edit distance.
+
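The windowing and filtering described above can be sketched roughly as follows. This is a minimal illustration, not the code from the linked notebooks: the stride of 75 tokens, the normalization by the longer string, and all function names are assumptions chosen for the example.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def normalized_edit_distance(ocr: str, gold: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(ocr), len(gold))
    return levenshtein(ocr, gold) / longest if longest else 0.0


def windows(tokens: Sequence[str], size: int = 150, step: int = 75) -> List[Sequence[str]]:
    """Overlapping windows of at most `size` tokens; `step` is a hypothetical stride."""
    if len(tokens) <= size:
        return [tokens]
    starts = range(0, len(tokens) - size + step, step)
    return [tokens[s:s + size] for s in starts]


def keep_for_training(ocr: str, gold: str, threshold: float = 0.3) -> bool:
    """Keep a sequence only if its normalized edit distance is below the threshold."""
    return normalized_edit_distance(ocr, gold) < threshold
```

With these definitions, `keep_for_training("Tlie cat", "The cat")` keeps the pair (distance 2 over 8 characters), while a heavily garbled sequence would be dropped from the training and validation sets but kept in the test set.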
+ There are 3 classes in the data:
+
+ - 0: No OCR mistake
+ - 1: Start token of an OCR mistake
+ - 2: Inside token of an OCR mistake
+
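To illustrate how the three classes relate to each other, the sketch below groups per-token labels into mistake spans. The function name and the choice to tolerate an "inside" label without a preceding "start" label are assumptions made for this example, not behavior documented for the model.

```python
from typing import List, Tuple


def mistake_spans(labels: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) token index pairs, end exclusive, for each OCR mistake."""
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label == 1:  # a new mistake starts here
            if start is not None:
                spans.append((start, i))
            start = i
        elif label == 2:  # continuation; tolerate a missing start label
            if start is None:
                start = i
        else:  # label 0 closes any open mistake
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:  # mistake runs to the end of the sequence
        spans.append((start, len(labels)))
    return spans
```

For the label sequence `[0, 1, 2, 2, 0]`, this yields a single mistake covering tokens 1 through 3.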
+ ## Results
+
+ Loss per data split and F1 per language.
+
+ | Set | Loss |
+ | --- | --- |
+ | Train | 0.224500 |
+ | Val | 0.285791 |
+ | Test | 0.417836 |
+
+ Average F1 by language:
+
+ | BG | CZ | DE | EN | ES | FI | FR | NL | PL | SL |
+ | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+ | 0.74 | 0.69 | 0.96 | 0.67 | 0.63 | 0.83 | 0.65 | 0.69 | 0.80 | 0.69 |
+
+ ## Demo
+
+ [Space for this model.](https://huggingface.co/spaces/jvdzwaan/ocrpostcorrection-task1-demo)
+
+ ## Code
+
+ * [OCR post correction package](https://github.com/jvdzwaan/ocrpostcorrection)
+ * [Notebooks](https://github.com/jvdzwaan/ocrpostcorrection-notebooks)
+   - [Jupyter notebook used for generating the training data](https://github.com/jvdzwaan/ocrpostcorrection-notebooks/blob/main/local/icdar-create-hf-dataset.ipynb)
+   - [Jupyter notebook used for training the model](https://github.com/jvdzwaan/ocrpostcorrection-notebooks/blob/main/colab/icdar-task1-hf-train.ipynb)
+   - [Jupyter notebook used for evaluating the model](https://github.com/jvdzwaan/ocrpostcorrection-notebooks/blob/main/colab/icdar-task1-hf-evaluation.ipynb)
+