metadata

license: apache-2.0
language:
  - fi
metrics:
  - cer
pipeline_tag: image-text-to-text
tags:
  - OCR

Training model from the AIDA project

This repository contains the model trained in the AIDA project, in PaddleOCR training format. The model is trained on top of the PaddleOCR-v3 Latin model, using roughly 40 000 line images and 120 000 synthetic line images. The training data is mainly in Finnish, but it also contains some Swedish and English text and a small number of French and German lines.

This repository is intended for finetuning our trained model. The inference model and more information about the model are available at https://github.com/project-AIDA/. The training data used for this model can be found at https://huggingface.co/datasets/Kansallisarkisto/AIDA_ocr_training_data.

Model Training

If you want to finetune the trained model, refer to the PaddleOCR recognition training docs at https://paddlepaddle.github.io/PaddleOCR/en/ppocr/model_train/recognition.html. This repository contains the checkpoints of our best-performing model for Finnish, together with a config file for training the model. The code needed for training can be found at https://github.com/PaddlePaddle/PaddleOCR/blob/main/README_en.md. Download the code and follow the installation instructions there.

First, you need to prepare your data. You need pairs of text line images and their transcriptions, arranged in the PaddleOCR format. After that, download the model from this repository (include all best_accuracy.* files) and place the files in a separate folder on your system. The last step before training is to change the paths in the config file to match your own paths, i.e. the arguments save_model_dir and pretrained_model, and data_dir and label_file_list in both the Train and Eval dataset sections.
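As an illustration of the data preparation step, the sketch below writes a label file in the layout that PaddleOCR recognition training expects as far as we know: one line per image, with the image path and the transcription separated by a tab, and the image path given relative to data_dir. The function name write_label_file, the file names and the example transcriptions are hypothetical; check the PaddleOCR docs linked above for the exact requirements of your version.

```python
from pathlib import Path


def write_label_file(pairs, label_path):
    """Write (image path, transcription) pairs as tab-separated lines.

    Assumption: each line has the form "<image path>\t<transcription>",
    with the image path relative to the data_dir set in the config file.
    """
    lines = []
    for image_path, text in pairs:
        # A tab or newline inside the transcription would break the
        # one-line-per-image layout, so replace them with spaces.
        clean = text.replace("\t", " ").replace("\n", " ")
        lines.append(f"{image_path}\t{clean}")
    Path(label_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


if __name__ == "__main__":
    import tempfile

    # Hypothetical example data, written to a temporary folder.
    example_pairs = [
        ("lines/0001.jpg", "Helsingin kaupunki"),
        ("lines/0002.jpg", "Kansallisarkisto"),
    ]
    target = Path(tempfile.mkdtemp()) / "train_list.txt"
    count = write_label_file(example_pairs, target)
    print(f"wrote {count} lines to {target}")
```

The resulting file is what label_file_list in the Train or Eval section would point to, while data_dir would point to the folder that contains the lines directory.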

After all that is done, you can start training. When run from the root folder of the downloaded GitHub repository, the following command starts training based on the settings in the config file.

python3 tools/train.py -c configs/rec/PP-OCRv3/en_PP-OCRv3_rec.yml
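
If you prefer not to edit the config file by hand, PaddleOCR's tools/train.py also accepts -o overrides of the form Key.subkey=value on the command line, to the best of our knowledge. The sketch below only builds and prints such a command; it does not run training. The helper build_train_command and all paths in the example are hypothetical, and we assume that pretrained_model takes the checkpoint prefix without a file extension (e.g. best_accuracy rather than best_accuracy.pdparams). Verify both points against the PaddleOCR docs for your version.

```python
import shlex


def build_train_command(config_path, checkpoint_prefix, save_model_dir,
                        train_data_dir, train_label_file,
                        eval_data_dir, eval_label_file):
    """Return the training command as an argument list with -o overrides."""
    overrides = [
        # Assumption: checkpoint prefix without ".pdparams".
        f"Global.pretrained_model={checkpoint_prefix}",
        f"Global.save_model_dir={save_model_dir}",
        f"Train.dataset.data_dir={train_data_dir}",
        # label_file_list is a list in the config, so pass a list literal.
        f'Train.dataset.label_file_list=["{train_label_file}"]',
        f"Eval.dataset.data_dir={eval_data_dir}",
        f'Eval.dataset.label_file_list=["{eval_label_file}"]',
    ]
    return ["python3", "tools/train.py", "-c", config_path, "-o", *overrides]


if __name__ == "__main__":
    # All paths below are placeholders; replace them with your own.
    command = build_train_command(
        config_path="configs/rec/PP-OCRv3/en_PP-OCRv3_rec.yml",
        checkpoint_prefix="./aida_model/best_accuracy",
        save_model_dir="./output/finetuned",
        train_data_dir="./train_data",
        train_label_file="./train_data/train_list.txt",
        eval_data_dir="./eval_data",
        eval_label_file="./eval_data/eval_list.txt",
    )
    # shlex.join quotes the arguments so the line can be pasted into a shell.
    print(shlex.join(command))
```

Values given with -o apply only to that run, so the config file itself stays unchanged.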