How is Donut fine-tuned to convert equation images to LaTeX?
Hi @HamAndCheese82,
First of all, we greatly appreciate your effort in open-sourcing the model. I am particularly interested in understanding more about the training process behind it. Could you provide details about the dataset used during training? Additionally, what are the key limitations you've identified for this model when applied to mathematical OCR tasks (i.e., what it can and can't OCR: linear equations, trigonometric expressions, ...)? Any insights would be highly valuable. Thank you!
Hi AbdulMuqtadir,
Thank you for showing your interest in my model!
This model was developed for our company's service, an educational service for mathematics.
We received several complaints from users that the interface for typing mathematical equations was very difficult to use.
In response to those complaints, the model was developed to provide a more comfortable user experience: users can write an equation by hand, and the model converts the handwriting into LaTeX-formatted text.
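For anyone who wants to try this kind of pipeline, inference with a Donut checkpoint follows the usual vision-encoder-decoder pattern. The sketch below is a minimal illustration, not our exact production code; the repo id, image path, and prompt token are placeholders you would need to adjust.

```python
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

MODEL_ID = "your-username/donut-math-ocr"  # placeholder repo id

processor = DonutProcessor.from_pretrained(MODEL_ID)
model = VisionEncoderDecoderModel.from_pretrained(MODEL_ID)
model.eval()

image = Image.open("handwritten_equation.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Donut decoders are primed with a task-specific start token;
# "<s>" is an assumption here -- use your checkpoint's actual prompt.
decoder_input_ids = processor.tokenizer(
    "<s>", add_special_tokens=False, return_tensors="pt"
).input_ids

with torch.no_grad():
    outputs = model.generate(
        pixel_values,
        decoder_input_ids=decoder_input_ids,
        max_length=256,
    )

latex = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(latex)
```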
The model was trained on the open-source dataset provided by AI-hub and the Handwritten Math Symbols dataset from Kaggle.
AI-hub is a Korean website that provides various datasets to support Korean companies in developing AI services.
Unfortunately, those datasets are only available to residents of the Republic of Korea.
However, on 16 April 2024, Google open-sourced a dataset for handwritten mathematical expression recognition.
I think it will be more than enough to serve as a substitute if you are planning to train the model on your own.
I am also planning to fine-tune the Donut model again with it.
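If you train on it yourself, a single fine-tuning step on an (image, LaTeX) pair looks roughly like the sketch below. This is a minimal illustration built on the stock donut-base checkpoint, not my exact training script; batching, scheduling, and augmentation are omitted.

```python
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# These usually need to be set before training; verify them for your tokenizer.
model.config.pad_token_id = processor.tokenizer.pad_token_id
model.config.decoder_start_token_id = processor.tokenizer.bos_token_id

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(image: Image.Image, latex: str) -> float:
    """Run one gradient step on a single (image, LaTeX) pair."""
    pixel_values = processor(image, return_tensors="pt").pixel_values
    labels = processor.tokenizer(
        latex,
        return_tensors="pt",
        max_length=256,
        padding="max_length",
        truncation=True,
    ).input_ids
    # Mask padding positions so they are ignored by the cross-entropy loss.
    labels[labels == processor.tokenizer.pad_token_id] = -100

    loss = model(pixel_values=pixel_values, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```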
There were a few problems I found while using the model:
The first problem was with the tokenizer that Donut uses.
Since the original Donut model was not built for mathematical equations, its tokenizer lacks many LaTeX commands (e.g. \sigma, \sum, \frac, etc.).
Although this wasn't a big problem, since the model still adapted well during fine-tuning, I think performance could improve further with a tokenizer that includes LaTeX commands, along the lines of the sketch below.
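As an illustration of that idea, the missing commands could be registered with the tokenizer before fine-tuning, roughly as follows; the token list here is only an example, not the full set you would need.

```python
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# Example LaTeX commands the stock tokenizer may split apart or map to <unk>.
latex_tokens = ["\\sigma", "\\sum", "\\frac", "\\int", "\\rightarrow"]
num_added = processor.tokenizer.add_tokens(latex_tokens)

# Grow the decoder's embedding matrix to cover the new vocabulary entries.
if num_added > 0:
    model.decoder.resize_token_embeddings(len(processor.tokenizer))
```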
The second problem was that the tokenizer couldn't tokenize "{1".
Whenever an equation containing "{1" appeared, the tokenizer returned the unknown token.
This was critical because many mathematical equations use "{1" (e.g. "\frac{1}{2}", "2^{1+x}", etc.), which dragged the model's accuracy down.
Despite its impact, the solution was quite simple: convert every "{1" to "{ 1" in the training dataset.
With that fix, the model no longer has an issue with this specific case.
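For anyone who wants to reproduce this, the check and the workaround can be written as below (a minimal sketch; the exact tokenizer behavior may vary by version):

```python
from transformers import DonutProcessor

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
tok = processor.tokenizer

def contains_unk(latex: str) -> bool:
    """Return True if any part of the string tokenizes to <unk>."""
    ids = tok(latex, add_special_tokens=False).input_ids
    return tok.unk_token_id in ids

def fix_labels(latex: str) -> str:
    """Workaround described above: keep "{1" from forming an unknown token."""
    return latex.replace("{1", "{ 1")

print(contains_unk("\\frac{1}{2}"))              # True in the case described above
print(contains_unk(fix_labels("\\frac{1}{2}")))  # False after the fix
```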
The third problem is that the model struggles when it is given only a single symbol.
Unlike other OCR models, Donut converts the input image directly into text.
Because of this, the model can pick up the context of the given image and produce a response consistent with that context.
When only a single symbol is given, there is very little context in the image, so the model struggles to find the correct answer.
For example, when shown only the number 1, the model cannot tell whether the image represents a number, a letter, or a symbol, and may return 1, l, or even (.
This problem is more pronounced for symbols that appear infrequently in the training dataset.
Lastly, the model suffers from a characteristic of LaTeX itself: the same symbol can be written with more than one command.
For example, the arrow "->" can be written as either "\rightarrow" or "\to".
This was a problem when I tried to measure the model's performance.
When the model selects a different LaTeX command from the label, it is marked wrong, even though the rendered results look identical.
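One common way to soften this during evaluation is to normalize synonymous commands to a canonical spelling before comparing the prediction against the label. The mapping below is illustrative, not exhaustive:

```python
import re

# Illustrative synonym table: map each alias to one canonical command.
CANONICAL = {
    r"\\to": r"\\rightarrow",
    r"\\le": r"\\leq",
    r"\\ge": r"\\geq",
    r"\\ne": r"\\neq",
}

def normalize(latex: str) -> str:
    """Rewrite synonym commands so equivalent LaTeX compares equal."""
    for alias, canonical in CANONICAL.items():
        # The word boundary keeps prefixes like \top or \left untouched.
        latex = re.sub(alias + r"\b", canonical, latex)
    return latex

assert normalize("x \\to 0") == normalize("x \\rightarrow 0")
```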
Despite the problems mentioned above, I think the model performs quite well on mathematical OCR tasks overall.
Interestingly, because of the context-understanding behavior mentioned in the third problem, the model's accuracy increases as the given equation gets longer.
The model also showed no particular limitations across different fields of mathematics, such as linear equations, trigonometry, limits, and calculus, and it even handled angle and vector notation.
I hope this response sufficiently answers your questions.
Thank you.