---
license: apache-2.0
language:
- fi
metrics:
- cer
pipeline_tag: image-text-to-text
tags:
- OCR
---

# Training model from the AIDA project

This repository contains the model trained in the AIDA project, in PaddleOCR training format. The model is trained on top of the PaddleOCR-v3 Latin model, using roughly 40 000 line images and 120 000 synthetic line images. The training data is mainly in Finnish, but it also contains some Swedish and English text and a few French and German lines.

This repository is intended for fine-tuning our trained model. The inference model and more information about the model are available at https://github.com/project-AIDA/.
The training data used for this model is available at https://huggingface.co/datasets/Kansallisarkisto/AIDA_ocr_training_data.

## Model Training

If you want to fine-tune the trained model, refer to the PaddleOCR documentation at https://paddlepaddle.github.io/PaddleOCR/en/ppocr/model_train/recognition.html.
This repository contains the checkpoints of our best-performing model for Finnish, along with a config file for training the model. The code needed for training can be found at https://github.com/PaddlePaddle/PaddleOCR/blob/main/README_en.md. Download the code and follow the installation instructions there.

First, you need to prepare your data. You need pairs of text line images and transcriptions, arranged in the PaddleOCR format. After that, download the model from this repository (include all best_accuracy.* files) and place the files in a separate folder on your system. The last step before training is to change the paths in the config file to match your own paths, i.e. the arguments **save_model_dir** and **pretrained_model**, and, in the Train and Eval datasets, **data_dir** and **label_file_list**.

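
As a rough illustration of the data preparation step, the sketch below writes a label file in the layout the PaddleOCR recognition docs describe: one line per image, with the image path and its transcription separated by a tab. The function names and the example paths are hypothetical, and the assumption that image paths are stored relative to **data_dir** should be checked against your own config.

```python
from pathlib import Path


def format_label_lines(pairs, data_dir):
    """Turn (image_path, transcription) pairs into tab-separated label lines.

    Assumption: image paths are stored relative to data_dir, so that
    PaddleOCR can join data_dir and the path at training time.
    """
    data_dir = Path(data_dir)
    lines = []
    for image_path, text in pairs:
        # Raises ValueError if the image is not located under data_dir.
        rel_path = Path(image_path).relative_to(data_dir).as_posix()
        # A tab or newline inside the transcription would break the
        # one-line-per-image layout, so runs of whitespace are collapsed
        # into single spaces.
        clean_text = " ".join(text.split())
        lines.append(f"{rel_path}\t{clean_text}")
    return lines


def write_label_file(pairs, label_path, data_dir):
    """Write the label lines to label_path as UTF-8 and return them."""
    lines = format_label_lines(pairs, data_dir)
    Path(label_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

You would typically call `write_label_file` once for the training set and once for the evaluation set, and then point **label_file_list** in the Train and Eval sections of the config at the resulting files.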
Once all of that is done, you can start training. When run from the root folder of the downloaded GitHub repository, the following command starts training based on the settings in the config file.

`python3 tools/train.py -c configs/rec/PP-OCRv3/en_PP-OCRv3_rec.yml`
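
Instead of editing the config file by hand, the same paths can probably be passed on the command line, since the PaddleOCR docs show `tools/train.py` accepting `-o key=value` overrides. The sketch below only builds such a command. The helper name is hypothetical, and the dotted key names assume the usual layout of a PaddleOCR recognition config (`Global`, `Train.dataset`, `Eval.dataset`), so compare them with the config file in this repository before relying on them.

```python
import shlex


def build_train_command(config, pretrained_model, save_model_dir,
                        train_data_dir, train_labels,
                        eval_data_dir, eval_labels):
    """Build a tools/train.py command with -o overrides for the paths."""

    def as_list(paths):
        # Lists are written as a bracketed, comma-separated value.
        # Paths that contain commas or brackets are not handled here.
        return "[" + ",".join(paths) + "]"

    overrides = {
        # In the PaddleOCR docs this is the checkpoint prefix without a
        # file extension, e.g. a folder path ending in best_accuracy.
        "Global.pretrained_model": pretrained_model,
        "Global.save_model_dir": save_model_dir,
        "Train.dataset.data_dir": train_data_dir,
        "Train.dataset.label_file_list": as_list(train_labels),
        "Eval.dataset.data_dir": eval_data_dir,
        "Eval.dataset.label_file_list": as_list(eval_labels),
    }
    command = ["python3", "tools/train.py", "-c", config, "-o"]
    command += [f"{key}={value}" for key, value in overrides.items()]
    return command


if __name__ == "__main__":
    # All paths below are placeholders for your own folders.
    cmd = build_train_command(
        "configs/rec/PP-OCRv3/en_PP-OCRv3_rec.yml",
        "./aida_model/best_accuracy",
        "./output/aida_finetune/",
        "./data/",
        ["./data/train.txt"],
        "./data/",
        ["./data/val.txt"],
    )
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

The returned list can also be passed directly to `subprocess.run` from the root folder of the PaddleOCR repository. Keeping the overrides on the command line leaves the config file from this repository unchanged, which makes it easier to compare runs.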