Finetuning to another language

#2
by flaviooliveira - opened

Hi. I was wondering whether it would be a good idea to fine-tune this recognition model for another language. I've tested it on Portuguese files and it sometimes works reasonably well. We don't have much data, though. Would it be hard? Thank you, and congrats on the pretty awesome work.

flaviooliveira changed discussion title from Maybe an issue with Stepwise Tool to Finetuning to another language
Swedish National Archives - Riksarkivet org

Hi, thank you for your kind words and appreciation for the demo!

The segmentation and line models are language agnostic. This means they don't specifically rely on the language of the text but rather on the layout and structure of the document. However, they are heavily optimized for documents with layouts similar to our examples. We are actively working on this, and new components/models will be introduced in the future to handle a variety of layouts.

Regarding the transcriber (SATRN) model, it's primarily trained on Swedish data. To achieve optimal performance for Portuguese, this model would benefit from fine-tuning. However, for the best results, pre-training it from scratch on Portuguese data would be ideal.

Could you provide a rough estimate of how much Portuguese data you have in terms of pages?

We'll soon be open-sourcing all the code for preparing and training the SATRN model, as well as for TrOCR. Stay tuned!

Hello, Gabriel. Thank you for your valuable comments. I don't yet have an estimate of how much data will be available to us, but our idea is to run incremental tests to evaluate the results. Experimenting with your model, I got surprisingly decent results on certain occasions, especially when I used gpt-3.5-turbo as a language model in a final post-processing step.
I'm eagerly awaiting the SATRN model training/tuning code. I will certainly at least try to fine-tune it using the data we get here. Regarding the TrOCR model, I have been in contact with Dr. Phillip Ströbel of the University of Zurich, who has done great work using the TrOCR model as part of the Bullinger Digital project for transcribing manuscripts of the correspondence of the Swiss reformer Heinrich Bullinger. We are working on a demo app, although a much simpler one than yours. My initial idea is to fine-tune their model to adapt it to Portuguese, but I'm still having some difficulties and I really don't know whether it will even be possible with the data we have.
I recently found your work here, and I believe trying your model might be another possibility. It's also awesome that you are considering TrOCR. I looked at your GitHub repository some days ago but didn't find anything related to fine-tuning code, so I decided to ask your opinion here. Anyway, I'm impressed with your application in Spaces; it's the most complete application I've seen here. My sincere congratulations. Best regards,
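
On the incremental tests mentioned above: one common way to track whether a model (or an LLM post-processing step) is actually helping is character error rate (CER), the edit distance between the model output and a hand-checked transcription divided by the length of that transcription. The sketch below is not taken from the Riksarkivet code. It is a minimal pure-Python illustration, and the function names `edit_distance` and `cer` and the sample lines are made up for the example. It normalizes text to NFC first so that accented Portuguese characters count as one character each.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(references: list[str], hypotheses: list[str]) -> float:
    """Corpus-level CER: total edits divided by total reference characters."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    # Normalize so that composed and decomposed accents compare equal.
    refs = [unicodedata.normalize("NFC", r) for r in references]
    hyps = [unicodedata.normalize("NFC", h) for h in hypotheses]
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_edits / total_chars


# Hypothetical lines, only to show the call pattern.
ground_truth = ["coração", "não sei"]
raw_output = ["coracao", "nao sei"]        # e.g. straight from the recognizer
post_processed = ["coração", "nao sei"]    # e.g. after a language-model pass

print("raw CER:           ", cer(ground_truth, raw_output))
print("post-processed CER:", cer(ground_truth, post_processed))
```

Scoring the same held-out lines before and after each change (more training data, a different base model, the post-processing step) gives comparable numbers from one test to the next.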

Hello, Gabriel. Are you still planning to open-source the code to prepare and train (fine-tune) your SATRN model? I have approximately 3,000 lines from two documents in Portuguese that I would like to try it on. Thanks.
