Kansallisarkisto
/

PaddleOCR_training

	---
	license: apache-2.0
	language:
	- fi
	metrics:
	- cer
	pipeline_tag: image-text-to-text
	tags:
	- OCR
	---

	# Training model from the AIDA project

	This repository contains the model trained in the AIDA project, in PaddleOCR training format. The model was trained on top of the PaddleOCR-v3 Latin model, using roughly
	40 000 line images and 120 000 synthetic line images. The training data is mainly in Finnish, but it also contains some Swedish and English text and a few French and German lines.

	This repository is intended for fine-tuning our trained model. The inference model and more information about the model are available at https://github.com/project-AIDA/.
	The training data used for this model is available at https://huggingface.co/datasets/Kansallisarkisto/AIDA_ocr_training_data.

	## Model Training

	If you want to fine-tune the trained model, refer to the PaddleOCR documentation at https://paddlepaddle.github.io/PaddleOCR/en/ppocr/model_train/recognition.html.
	This repository contains the checkpoints of our best-performing model for Finnish, together with a config file for training the model. The code needed for
	training is available at https://github.com/PaddlePaddle/PaddleOCR/blob/main/README_en.md. Download the code and follow the installation instructions
	there.

	First, you need to prepare your data. You need pairs of text line images and transcriptions, arranged in the PaddleOCR format. After that, download the
	model from this repository (include all best_accuracy.* files) and place the files in a separate folder on your system. The last step before training is to change the
	paths in the config file to match your own paths, i.e. the arguments save_model_dir and pretrained_model, and data_dir and
	label_file_list in both the Train and Eval dataset sections.
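The data preparation step can be sketched as follows. This is a minimal example, assuming the label format described in the PaddleOCR recognition documentation, where each line of the label file holds an image path and its transcription separated by a tab. The function name `write_label_file` and all paths are hypothetical and only serve as illustration.

```python
from pathlib import Path


def write_label_file(pairs, label_path):
    """Write (image_path, transcription) pairs as one tab-separated line each.

    Assumption: the label file uses "image_path<TAB>transcription" per line,
    as described in the PaddleOCR recognition documentation.
    """
    label_path = Path(label_path)
    label_path.parent.mkdir(parents=True, exist_ok=True)
    with label_path.open("w", encoding="utf-8") as f:
        for image_path, text in pairs:
            # Tabs or newlines inside a transcription would break the format,
            # so collapse every run of whitespace into a single space.
            clean = " ".join(text.split())
            f.write(f"{image_path}\t{clean}\n")
    return label_path


if __name__ == "__main__":
    # Hypothetical example data; replace with your own line images and texts.
    example_pairs = [
        ("lines/page1_line01.png", "Helsingissä 12 p. toukokuuta 1921"),
        ("lines/page1_line02.png", "Till Kejserliga Senaten"),
    ]
    write_label_file(example_pairs, "train_data/train.txt")
```

The resulting file is what `label_file_list` in the config would point to, with `data_dir` set to the folder that the image paths are relative to.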

	Once all of that is done, you can start training. When run from the root folder of the downloaded GitHub repository, the following command starts training with the
	settings in the config file.

	`python3 tools/train.py -c configs/rec/PP-OCRv3/en_PP-OCRv3_rec.yml`
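As an alternative to editing the config file by hand, the paths can probably be passed on the command line. The sketch below assembles the training command above with `-o` overrides. It assumes that `tools/train.py` accepts `-o key=value` overrides as shown in the PaddleOCR documentation, and that the config follows the usual layout with `Global`, `Train.dataset` and `Eval.dataset` sections. The helper `build_train_command` and every path in the example are hypothetical.

```python
import shlex


def build_train_command(config, pretrained_model, save_model_dir,
                        train_data_dir, train_labels,
                        eval_data_dir, eval_labels):
    """Return the training command as an argument list with path overrides.

    Assumption: the key names follow the usual PaddleOCR recognition config
    layout; check them against the config file in this repository.
    """
    overrides = {
        # Checkpoint prefix without extension, e.g. ".../best_accuracy"
        "Global.pretrained_model": pretrained_model,
        "Global.save_model_dir": save_model_dir,
        "Train.dataset.data_dir": train_data_dir,
        # label_file_list is a list in the config, hence the brackets
        "Train.dataset.label_file_list": f"[{train_labels}]",
        "Eval.dataset.data_dir": eval_data_dir,
        "Eval.dataset.label_file_list": f"[{eval_labels}]",
    }
    cmd = ["python3", "tools/train.py", "-c", config, "-o"]
    cmd += [f"{key}={value}" for key, value in overrides.items()]
    return cmd


# Hypothetical paths; adjust them to your own folder structure.
cmd = build_train_command(
    config="configs/rec/PP-OCRv3/en_PP-OCRv3_rec.yml",
    pretrained_model="./pretrained/best_accuracy",
    save_model_dir="./output/finetuned",
    train_data_dir="./train_data",
    train_labels="./train_data/train.txt",
    eval_data_dir="./train_data",
    eval_labels="./train_data/val.txt",
)
# Print a shell-quoted command line that can be copied into a terminal.
print(shlex.join(cmd))
```

Printing the command instead of running it keeps the sketch free of side effects. To launch training directly, the list could be passed to `subprocess.run(cmd, check=True)` from the root folder of the PaddleOCR repository.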