jvdzwaan/ocrpostcorrection-task-1


	---
	language:
	- bg
	- cs
	- de
	- en
	- es
	- fi
	- fr
	- nl
	- pl
	- sl
	- multilingual
	tags:
	- post-ocr correction
	- ocr postcorrection
	metrics:
	- loss
	- F1
	---

	# OCR postcorrection task 1

	This is a BertForTokenClassification model that predicts, for each token, whether it is
	an OCR mistake. It is based on [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased)
	and fine-tuned on the dataset of the
	[2019 ICDAR competition on post-OCR correction](https://sites.google.com/view/icdar2019-postcorrectionocr).
	The dataset contains texts in the following languages:

	- BG
	- CZ
	- DE
	- EN
	- ES
	- FI
	- FR
	- NL
	- PL
	- SL

	10% of the texts (stratified on language) were selected for validation. The test set is as provided.
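
A language-stratified 10% validation split can be sketched as below. This is only an illustration: the function name `stratified_split`, the `(text_id, language)` input format, and the fixed seed are assumptions, not the code used to build this model's data (see the notebooks linked under Code for that).

```python
import random
from collections import defaultdict


def stratified_split(texts, val_fraction=0.1, seed=0):
    """Split (text_id, language) pairs into train and validation ids.

    Hypothetical helper: every language contributes roughly
    `val_fraction` of its texts to the validation set.
    """
    by_language = defaultdict(list)
    for text_id, language in texts:
        by_language[language].append(text_id)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val = [], []
    for language in sorted(by_language):
        ids = by_language[language]
        rng.shuffle(ids)
        # Keep at least one validation text per language.
        n_val = max(1, round(len(ids) * val_fraction))
        val.extend(ids[:n_val])
        train.extend(ids[n_val:])
    return train, val
```

Splitting per language, rather than over the pooled texts, keeps small languages from disappearing from the validation set by chance.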

	The training data consists of (partially overlapping) sequences of 150 tokens. Only
	sequences with a normalized edit distance below 0.3 were included in the training and
	validation sets. The test set was not filtered on edit distance.
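
The windowing and filtering described above might look roughly like the following sketch. Two details are assumptions, because the card does not state them: the overlap (`step=75`) and the normalization (edit distance divided by the length of the longer string). The helper names are hypothetical and this is not the implementation from the ocrpostcorrection package.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(ocr: str, gold: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(ocr), len(gold))
    return levenshtein(ocr, gold) / longest if longest else 0.0


def windows(tokens, size=150, step=75):
    """Yield partially overlapping windows of at most `size` tokens.

    `step` is an assumed value; the card only says the sequences
    partially overlap.
    """
    if len(tokens) <= size:
        yield tokens
        return
    for start in range(0, len(tokens) - size + step, step):
        yield tokens[start:start + size]


def keep_for_training(ocr: str, gold: str, threshold: float = 0.3) -> bool:
    """Apply the < 0.3 filter to one OCR/gold sequence pair."""
    return normalized_edit_distance(ocr, gold) < threshold
```

Filtering on edit distance drops sequences where the OCR text and the ground truth diverge so much that the token labels are unlikely to be reliable.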

	There are 3 classes in the data:

	- 0: No OCR mistake
	- 1: Start token of an OCR mistake
	- 2: Inside token of an OCR mistake
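
To show how these three classes combine, here is a minimal sketch that groups a sequence of per-token class ids into mistake spans. The function name `mistake_spans` is hypothetical, and treating a stray class 2 with no preceding class 1 as the start of a mistake is an assumption of this sketch, not documented behaviour of the model or its package.

```python
def mistake_spans(labels):
    """Group per-token class ids into (start, end) spans, end exclusive."""
    spans = []
    start = None  # index where the currently open mistake began
    for i, label in enumerate(labels):
        if label == 1:
            # A start token closes any open mistake and opens a new one.
            if start is not None:
                spans.append((start, i))
            start = i
        elif label == 2:
            # An inside token continues the open mistake; if none is
            # open, treat it as a start (assumption of this sketch).
            if start is None:
                start = i
        else:
            # Class 0 closes any open mistake.
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans


print(mistake_spans([0, 1, 2, 0, 1, 1, 0]))  # [(1, 3), (4, 5), (5, 6)]
```

Separating start tokens from inside tokens is what lets two adjacent mistakes be told apart, as in positions 4 and 5 of the example.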

	## Results

	| Set | Loss |
	| --- | --- |
	| Train | 0.224500 |
	| Val | 0.285791 |
	| Test | 0.4178357720375061 |

	Average F1 by language:

	| BG | CZ | DE | EN | ES | FI | FR | NL | PL | SL |
	| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
	| 0.74 | 0.69 | 0.96 | 0.67 | 0.63 | 0.83 | 0.65 | 0.69 | 0.8 | 0.69 |

	## Demo

	[Space for this model.](https://huggingface.co/spaces/jvdzwaan/ocrpostcorrection-task1-demo)

	## Code

	* [OCR post correction package](https://github.com/jvdzwaan/ocrpostcorrection)
	* [Notebooks](https://github.com/jvdzwaan/ocrpostcorrection-notebooks)
	  - [Jupyter notebook used for generating the training data](https://github.com/jvdzwaan/ocrpostcorrection-notebooks/blob/main/local/icdar-create-hf-dataset.ipynb)
	  - [Jupyter notebook used for training the model](https://github.com/jvdzwaan/ocrpostcorrection-notebooks/blob/main/colab/icdar-task1-hf-train.ipynb)
	  - [Jupyter notebook used for evaluating the model](https://github.com/jvdzwaan/ocrpostcorrection-notebooks/blob/main/colab/icdar-task1-hf-evaluation.ipynb)