Fine-tuning text recognition models for other languages

by tprochenka - opened Jan 23, 2023

Jan 23, 2023

Hi, I recently came across doctr, and in my case it gives much better OCR results than Tesseract. However, it makes "silly" mistakes because it doesn't know the Polish language. I would like to fine-tune the recognition model for Polish. How did you solve the dataset problem for French? Did you generate artificial data? How many samples were needed to fine-tune the model for French?

Thanks in advance!
Tomek

Felix92

Owner Jan 23, 2023

edited Jan 23, 2023

Hi Tomek :) ,

Mindee has created an internal dataset of ~500k different real documents (~10M word crops) to train the models from scratch. The Polish vocabulary is very similar to the French one, so it should be possible to fine-tune the pretrained models with less data (~10k-20k samples).
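
To get a rough feel for how close the two vocabularies are, a quick check like the one below can help. It is only a sketch: the character sets are hand-written for illustration (an assumption, not the vocab definitions shipped with the library), and `missing_chars` is a hypothetical helper name.

```python
import string

# Shared base characters: ASCII letters, digits and punctuation.
BASE = set(string.ascii_letters + string.digits + string.punctuation)

# Hand-written accented characters (illustrative assumption, not the
# library's actual vocab definitions).
FRENCH_EXTRA = set("àâçéèêëîïôùûüÿœæ" + "ÀÂÇÉÈÊËÎÏÔÙÛÜŸŒÆ")
POLISH_EXTRA = set("ąćęłńóśźż" + "ĄĆĘŁŃÓŚŹŻ")

french_vocab = BASE | FRENCH_EXTRA
polish_vocab = BASE | POLISH_EXTRA


def missing_chars(target: set, source: set) -> list:
    """Return characters of `target` that `source` does not cover, sorted."""
    return sorted(target - source)


# Characters a French-trained recognition model has never seen.
missing = missing_chars(polish_vocab, french_vocab)

# Share of the Polish character set already covered by the French one.
overlap = len(polish_vocab & french_vocab) / len(polish_vocab)

print("Missing characters:", "".join(missing))
print(f"Overlap: {overlap:.0%}")
```

The characters reported as missing are the ones the fine-tuning samples need to cover well, since the pretrained model has no examples of them.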

Best regards,
Felix

tprochenka

Jan 24, 2023

Thanks Felix,
we will give it a try :)

Felix92 changed discussion status to closed Feb 10, 2023
