Is table tag working?

#1
by apolo - opened

Hi everyone.
First of all, thank you so much @pierreguillou. The work you are doing on document layout is amazing.

I would like to ask whether the Table tag is included in this version at paragraph level. I have been testing it with some docs, but no table has been detected.
I don't know whether that is because of the layout of my documents or because the Table tag doesn't work.


Hi @apolo ,
Thanks for your feedback.

First of all, the line and paragraph models have exactly the same labels (the DocLayNet ones), including Table.
Just run the following code to get confirmation of that:

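from transformers import AutoModelForTokenClassification

# model_id must be set beforehand to the Hub ID of the line or paragraph model to inspect
model = AutoModelForTokenClassification.from_pretrained(model_id)
id2label = model.config.id2label  # maps each class index to its label name
print(id2label)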
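
To check for the Table label programmatically rather than by reading the printed mapping, a small helper can be applied to `id2label`. This is only a minimal sketch: `has_label` is a hypothetical helper name, and `sample_id2label` is an illustrative stand-in built from the 11 DocLayNet category names. It assumes the label names are the plain category names; the real names and index order are whatever `model.config.id2label` returns.

```python
def has_label(id2label, name):
    """Return True if `name` matches one of the label names (case-insensitive)."""
    wanted = name.lower()
    return any(str(label).lower() == wanted for label in id2label.values())


# Illustrative stand-in (assumption): in practice, pass model.config.id2label instead.
DOCLAYNET_CATEGORIES = [
    "Caption", "Footnote", "Formula", "List-item", "Page-footer", "Page-header",
    "Picture", "Section-header", "Table", "Text", "Title",
]
sample_id2label = dict(enumerate(DOCLAYNET_CATEGORIES))

print(has_label(sample_id2label, "Table"))
```

Running the same check on both the line model and the paragraph model would confirm that a missing detection is not caused by a missing label.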

About your comment: I just ran a test on a file and you're right: the paragraph model doesn't detect the table, whereas the line model does (at least in my test).

My understanding is that the concepts of Table-line and Table-paragraph are different:

  • all table rows are more or less similar in width and height (i.e. they convey similar information to the model, forming a single Table-line concept): the table rows in the training dataset therefore form a homogeneous subset with enough examples to train the line model well
  • however, table paragraphs (whole tables) vary much more in width and height: this variety (i.e. multiple Table-paragraph concepts) calls for more examples in the training dataset, and apparently the DocLayNet base dataset did not provide enough.

Another point: you can check with the DocLayNet viewer app (https://huggingface.co/spaces/pierreguillou/DocLayNet-image-viewer) that the DocLayNet base training dataset does contain pages with Table-paragraphs.
But, again, their number is certainly very small relative to the diversity of Table-paragraphs.

There is only one way to check my hypothesis: fine-tune a layout model on the DocLayNet large dataset :-)

@apolo ,

Another idea: since the line model detects Table, you can first run the paragraph model to get the list of paragraphs on a page, and then run the line model separately on each Text paragraph (there is only a small chance of finding a table in a Section-header paragraph...). That way, you will get the tables :-)
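
The two-stage idea above could be sketched roughly as follows. This is not code from the models' notebooks: `find_tables`, `paragraph_predict` and `line_predict` are hypothetical names, and the two predictors are assumed to be your own wrappers around the paragraph and line models, each returning `(bbox, label)` pairs. Stub predictors stand in for real inference so the sketch runs on its own.

```python
def find_tables(page, paragraph_predict, line_predict, candidate_labels=("Text",)):
    """Run the paragraph model, then the line model inside candidate paragraphs.

    paragraph_predict(page)  -> list of (bbox, label) for paragraphs on the page
    line_predict(page, bbox) -> list of (bbox, label) for lines inside bbox
    Both callables are hypothetical wrappers around your own inference code.
    """
    table_lines = []
    for para_bbox, para_label in paragraph_predict(page):
        if para_label not in candidate_labels:
            continue  # skip paragraphs that are unlikely to hide a table
        for line_bbox, line_label in line_predict(page, para_bbox):
            if line_label == "Table":
                table_lines.append(line_bbox)
    return table_lines


# Stub predictors (assumptions) standing in for the real models.
def fake_paragraph_predict(page):
    return [((0, 0, 100, 20), "Section-header"), ((0, 30, 100, 90), "Text")]


def fake_line_predict(page, bbox):
    return [
        ((0, 30, 100, 50), "Table"),
        ((0, 50, 100, 70), "Table"),
        ((0, 70, 100, 90), "Text"),
    ]


tables = find_tables("page-1", fake_paragraph_predict, fake_line_predict)
print(tables)
```

Passing the predictors in as arguments keeps the filtering logic independent of how each model is loaded; `candidate_labels` can be widened if tables turn up inside other paragraph types.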

@pierreguillou
Great! Thank you for your answer.

Perfect, I am going to fine-tune several layout models on the DocLayNet large dataset. I will share the results as soon as I can.
In the meantime, I will try the last thing you suggested.
