Fine-tuning text recognition models for other languages

#4
by tprochenka - opened

Hi, I recently came across docTR, and in my case it gives far better OCR results than Tesseract. However, it makes "silly" mistakes because it doesn't know the Polish language. I would like to fine-tune the recognition model for Polish. How did you solve the dataset problem for French? Did you generate artificial data? How many samples were needed to fine-tune the model for French?

Thanks in advance!
Tomek

Hi Tomek :) ,

Mindee has created an internal real dataset of ~500k different documents (~10M word crops) to train the models from scratch. The Polish vocabulary is very similar to the French one, so it should be possible to fine-tune the pretrained models with less data (~10k-20k samples).
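
To get a rough feel for how close the two vocabularies are, here is a minimal sketch in plain Python that lists the Polish letters a French character set would not cover. The two accented-letter sets and the `missing_chars` helper are written out by hand for illustration and are assumptions; they are not read from docTR, and the actual vocabulary of the pretrained models may differ.

```python
import string

# Hypothetical character sets, typed by hand for illustration only.
# They are NOT loaded from docTR; the real pretrained vocab may differ.
FRENCH_ACCENTED = set("àâæçéèêëîïôœùûüÿ")
POLISH_ACCENTED = set("ąćęłńóśźż")


def missing_chars(source_vocab, target_chars):
    """Return the characters in target_chars that source_vocab lacks, sorted."""
    return sorted(set(target_chars) - set(source_vocab))


base = set(string.ascii_lowercase)
french_vocab = base | FRENCH_ACCENTED
polish_alphabet = base | POLISH_ACCENTED

# Characters the fine-tuning data would need to teach the model.
missing = missing_chars(french_vocab, polish_alphabet)
print("Not covered by the French set:", "".join(missing))
```

Under these assumptions only the Polish diacritics are missing, so the fine-tuning samples should probably contain plenty of words that use them.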

Best regards,
Felix

Thanks Felix,
we will give it a try :)

Felix92 changed discussion status to closed
