Train dataset and different head

#1
by leoperelli - opened

Hi @nielsr, thanks for another great model. I was wondering what type of data the model was trained on; it seems to have a different dictionary from the Naver Clova one (which has lots of Chinese characters).

I was also wondering what you believe would need to change to add a token classification head, as in the LayoutLM family, for example.
Thanks!

Hi,

This model is exactly the same as https://huggingface.co/naver-clova-ix/donut-base. I just ported this one over to the Naver Clova organization.

> I was also wondering what you believe would need to change to add a token classification head, as in the LayoutLM family, for example.

Donut treats all tasks as a language modeling problem, hence it uses the same head (a language modeling head) for all tasks. There is no need to change the head. The model is just a VisionEncoderDecoderModel, which includes a language modeling head on top of the decoder.
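
A minimal sketch of what that looks like (assuming the MBart-based decoder of the donut-base checkpoint, which exposes an `lm_head` attribute):

```python
from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# One language modeling head for every task: a linear layer whose output
# dimension is the vocabulary size, not a task-specific number of classes.
print(type(model).__name__)   # VisionEncoderDecoderModel
print(model.decoder.lm_head)  # Linear(..., out_features=<vocab size>)
```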

Thanks @nielsr !

Just a clarification on the head: the current output has the dimension of the dictionary, since the model predicts a sequence as output. To obtain a classification at the token level, should I change the output dimension to the number of classes? Or do you mean to handle this with a different prompt?

The model predicts a sequence, which you can turn back into JSON using the token2json method of DonutProcessor.

Note that this model is entirely different from LayoutLM (v1/v2/v3). It doesn't output a classification at the token level; it just outputs a sequence, which can be turned into JSON.
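
A minimal sketch of that decoding step, closely following the example on the donut-base-finetuned-cord-v2 model card (document.png is a placeholder; a custom fine-tuned model would use its own task start token instead of `<s_cord-v2>`):

```python
import re
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

ckpt = "naver-clova-ix/donut-base-finetuned-cord-v2"
processor = DonutProcessor.from_pretrained(ckpt)
model = VisionEncoderDecoderModel.from_pretrained(ckpt)

image = Image.open("document.png").convert("RGB")  # placeholder input
pixel_values = processor(image, return_tensors="pt").pixel_values

# The task start token serves as the decoder prompt.
decoder_input_ids = processor.tokenizer(
    "<s_cord-v2>", add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
)

# Strip special tokens and the task prompt, then convert to JSON.
sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(
    processor.tokenizer.pad_token, ""
)
sequence = re.sub(r"<.*?>", "", sequence, count=1)
print(processor.token2json(sequence))
```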

Hi @nielsr, I reviewed the architecture of the Donut model and the LayoutLM family.
I had the following questions:

  1. Donut's encoder last hidden states have shape (batch_size, 1200, 768). What does the axis of size 1200 correspond to? How can I know which tokens they are? Do you believe it makes sense / is possible to use these encoder outputs to perform token classification? This would help me better understand the model's performance, which is lower than expected; I want to figure out whether something is wrong with the decoder or it's something else. (See the sketch after this list for how I extract these states.)
  2. For the task token, I see some people fine-tuning while keeping the `<s>` token for their task. From my understanding, the sequence should be: `<s><s_start_token> ... </s_task_token></s>`. Is this correct?
  3. The Donut dictionary seems to be quite limited; a lot of common words are split into several subtokens. Do you believe this is a big deal for fine-tuning performance? My model seems to overfit very quickly and doesn't learn much about the OOV tokens.
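
For context, here is a minimal sketch of how I extract the encoder states (point 1) and register a custom task token pair (point 2); document.png and `<s_my-task>` are placeholder names:

```python
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# Point 1: run only the (Swin) encoder and inspect its output.
image = Image.open("document.png").convert("RGB")  # placeholder input
pixel_values = processor(image, return_tensors="pt").pixel_values
with torch.no_grad():
    encoder_outputs = model.encoder(pixel_values)
print(encoder_outputs.last_hidden_state.shape)  # (1, 1200, 768) in my setup

# Point 2: register a task start/end token pair ("<s_my-task>" is made up)
# and resize the decoder embeddings accordingly.
processor.tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<s_my-task>", "</s_my-task>"]}
)
model.decoder.resize_token_embeddings(len(processor.tokenizer))
```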

Thanks a lot!
