
OCR postcorrection task 1

This is a BertForTokenClassification model that predicts whether a token is part of an OCR mistake or not. It is based on bert-base-multilingual-cased and fine-tuned on the dataset of the 2019 ICDAR competition on post-OCR text correction. The dataset contains texts in the following languages:

  • BG
  • CZ
  • DE
  • EN
  • ES
  • FI
  • FR
  • NL
  • PL
  • SL

10% of the texts (stratified on language) were selected for validation. The test set is used as provided.
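A stratified split like the one described above can be sketched as follows (a minimal, hypothetical helper; it assumes each text is a dict carrying a "language" key, which is not part of the released code):

```python
import random
from collections import defaultdict

def stratified_split(texts, val_fraction=0.1, seed=42):
    """Hold out val_fraction of the texts per language (stratified split sketch)."""
    by_lang = defaultdict(list)
    for text in texts:
        by_lang[text["language"]].append(text)

    rng = random.Random(seed)
    train, val = [], []
    for items in by_lang.values():
        rng.shuffle(items)
        # Each language contributes at least one validation text.
        n_val = max(1, int(len(items) * val_fraction))
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val
```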

The training data consists of (partially overlapping) sequences of 150 tokens. Only sequences with a normalized edit distance of < 0.3 were included in the train and validation sets. The test set was not filtered on edit distance.
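The normalized-edit-distance filter can be sketched like this (a hypothetical reimplementation, assuming the distance is the Levenshtein distance divided by the length of the longer string; the released preprocessing code may differ):

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(ocr, gold):
    """Edit distance scaled to [0, 1] by the length of the longer string."""
    if not ocr and not gold:
        return 0.0
    return edit_distance(ocr, gold) / max(len(ocr), len(gold))

def keep_sequence(ocr, gold, threshold=0.3):
    """Keep only sequences whose OCR text is reasonably close to the gold text."""
    return normalized_edit_distance(ocr, gold) < threshold
```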

There are 3 classes in the data:

  • 0: No OCR mistake
  • 1: Start token of an OCR mistake
  • 2: Inside token of an OCR mistake
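Under this scheme, each OCR mistake is a run of tokens starting with label 1 and continued by label 2. A minimal sketch of turning per-token predictions into error spans (a hypothetical helper, not part of the released code):

```python
def labels_to_spans(labels):
    """Group per-token labels (0 = no mistake, 1 = start, 2 = inside)
    into (start, end) token-index spans, with end exclusive."""
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label == 1:              # a new OCR mistake starts here
            if start is not None:
                spans.append((start, i))
            start = i
        elif label == 2:            # continue the current mistake
            if start is None:       # tolerate a stray inside-label
                start = i
        else:                       # label 0 closes any open span
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans

labels_to_spans([0, 1, 2, 0, 1, 0])  # → [(1, 3), (4, 5)]
```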

Results

| Set   | Loss   |
|-------|--------|
| Train | 0.2245 |
| Val   | 0.2858 |
| Test  | 0.4178 |

Average F1 by language:

| BG   | CZ   | DE   | EN   | ES   | FI   | FR   | NL   | PL   | SL   |
|------|------|------|------|------|------|------|------|------|------|
| 0.74 | 0.69 | 0.96 | 0.67 | 0.63 | 0.83 | 0.65 | 0.69 | 0.80 | 0.69 |

Demo

A Hugging Face Space is available as a demo for this model.

