Train dataset and different head

#1
by leoperelli - opened

Hi @nielsr, thanks for another great model. I was wondering what type of data the model was trained on; it seems to have a different dictionary from the Naver Clova one (which has lots of Chinese characters).

I was also wondering what you believe would need to change to add a token classification head, as in the LayoutLM family, for example.
Thanks!

Hi,

This model is exactly the same as https://huggingface.co/naver-clova-ix/donut-base. I just ported this one over to the Naver Clova organization.

> I was also wondering what you believe would need to change to add a token classification head, as in the LayoutLM family, for example.

Donut treats all tasks as a language modeling problem, hence it uses the same head (a language modeling head) for all tasks. There is no need to change the head. The model is just a VisionEncoderDecoderModel, which includes a language modeling head on top of the decoder.
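
A minimal sketch of what that looks like (assuming the MBart-based decoder of the donut-base checkpoint, which exposes an `lm_head` attribute):

```python
from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# One language modeling head for every task: a linear layer whose output
# dimension is the vocabulary size, not a task-specific number of classes.
print(type(model).__name__)   # VisionEncoderDecoderModel
print(model.decoder.lm_head)  # Linear(..., out_features=<vocab size>)
```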

Thanks @nielsr !

Just a clarification on the head: the current output has the dimension of the dictionary, since the model predicts a sequence as output. To obtain a classification at the token level, should I change the output dimension to the number of classes? Or do you mean to handle this with a different prompt?

The model predicts a sequence, which you can turn back into JSON using the token2json method of DonutProcessor.

Note that this model is entirely different from LayoutLM (v1/v2/v3). It doesn't output a classification at the token level; it just outputs a sequence, which can be turned into JSON.
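
A minimal sketch of that decoding step, closely following the example on the donut-base-finetuned-cord-v2 model card (document.png is a placeholder; a custom fine-tuned model would use its own task start token instead of `<s_cord-v2>`):

```python
import re
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

ckpt = "naver-clova-ix/donut-base-finetuned-cord-v2"
processor = DonutProcessor.from_pretrained(ckpt)
model = VisionEncoderDecoderModel.from_pretrained(ckpt)

image = Image.open("document.png").convert("RGB")  # placeholder input
pixel_values = processor(image, return_tensors="pt").pixel_values

# The task start token serves as the decoder prompt.
decoder_input_ids = processor.tokenizer(
    "<s_cord-v2>", add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
)

# Strip special tokens and the task prompt, then convert to JSON.
sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(
    processor.tokenizer.pad_token, ""
)
sequence = re.sub(r"<.*?>", "", sequence, count=1)
print(processor.token2json(sequence))
```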

Hi @nielsr, I reviewed the architecture of the Donut model and the LayoutLM family.
I had the following questions:

  1. Donut's encoder last hidden states have shape (batch_size, 1200, 768). What does the axis of size 1200 correspond to? How can I know which tokens they are? Do you believe it makes sense / is possible to use these encoder outputs to perform token classification? This would help me better understand the model's performance, which is lower than expected; I want to figure out whether something is wrong with the decoder or it's something else. (See the sketch after this list for how I extract these states.)
  2. For the task token, I see some people fine-tuning while keeping the `<s>` token for their task. From my understanding, the sequence should be: `<s><s_start_token> ... </s_task_token></s>`. Is this correct?
  3. The Donut dictionary seems to be quite limited; a lot of common words are split into several subtokens. Do you believe this is a big deal for fine-tuning performance? My model seems to overfit very quickly and doesn't learn much about the OOV tokens.
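
For context, here is a minimal sketch of how I extract the encoder states (point 1) and register a custom task token pair (point 2); document.png and `<s_my-task>` are placeholder names:

```python
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# Point 1: run only the (Swin) encoder and inspect its output.
image = Image.open("document.png").convert("RGB")  # placeholder input
pixel_values = processor(image, return_tensors="pt").pixel_values
with torch.no_grad():
    encoder_outputs = model.encoder(pixel_values)
print(encoder_outputs.last_hidden_state.shape)  # (1, 1200, 768) in my setup

# Point 2: register a task start/end token pair ("<s_my-task>" is made up)
# and resize the decoder embeddings accordingly.
processor.tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<s_my-task>", "</s_my-task>"]}
)
model.decoder.resize_token_embeddings(len(processor.tokenizer))
```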

Thanks a lot!
