customization and label-studio backend

#4
by jribault - opened

Hi,

I really like your project; it works great! However, I have some PDFs that need fine-tuning. Could you provide some guidance on how to achieve that?

Additionally, I’m interested in adding more categories, such as "pleading number," which are currently mixed in with the paragraphs. Could you offer some advice on how to handle this as well?

Lastly, would it be easy to build a backend for Label Studio to integrate with this project?

Thanks!

HURIDOCS org

Hi,

Can you be more specific about what you mean by fine-tuning PDFs?

For labels, we use the categories from the DocLayNet dataset, so you could try writing a wrapper that adds new categories on top of them. To find the pleading numbers, you could try applying some heuristics to the "Text" segments.
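As a rough illustration of that wrapper idea, here is a minimal sketch. It assumes each segment is a dict with "type" and "text" keys and that the text keeps one line per newline; both are assumptions about the output format, not something confirmed here. The "Pleading number" category and the function name are hypothetical.

```python
import re

# Assumed pattern: a line starting with a 1-3 digit number followed by whitespace.
LINE_NUMBER = re.compile(r"^\s*(\d{1,3})\s+(?=\S)")


def split_pleading_numbers(segments):
    """Move leading line numbers out of "Text" segments into a new category."""
    out = []
    for seg in segments:
        if seg.get("type") != "Text":
            out.append(seg)  # other categories pass through unchanged
            continue
        numbers, kept = [], []
        for line in seg["text"].splitlines():
            match = LINE_NUMBER.match(line)
            if match:
                numbers.append(match.group(1))
                kept.append(line[match.end():])  # drop the number, keep the rest
            else:
                kept.append(line)
        out.append({**seg, "text": "\n".join(kept)})
        if numbers:
            # Hypothetical extra category, not part of DocLayNet.
            out.append({**seg, "type": "Pleading number", "text": " ".join(numbers)})
    return out
```

A purely textual rule like this will also strip a genuine number that happens to start a line (for example "12 witnesses were heard"), so treat it as a starting point only.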

As for Label Studio, we will look into it and see whether it is something we can consider.

Thank you for your interest in our project!

By fine-tuning, I mean adding some PDFs and retraining the model.

Working at the text level for the pleading number is the approach we currently have, but I don't think it's ideal. Pleading numbers are easily identifiable visually but not so much textually. This can lead to many errors, especially when specific numbers appear at the beginning of a line. In a "visual" mode, it's quite obvious where the pleading numbers are, and they shouldn't be part of the main text (similar to headers or footers).
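To make the "visual" idea concrete, here is a small sketch of a position-based check. It assumes word-level boxes are available as dicts with "text", "left", and "width" keys in page units; the token format, the function name, and the 12% margin threshold are all hypothetical and would need tuning per layout.

```python
def find_margin_numbers(tokens, page_width, margin_ratio=0.12):
    """Split tokens into (margin_numbers, body) using horizontal position."""
    margin_limit = page_width * margin_ratio
    margin, body = [], []
    for tok in tokens:
        right_edge = tok["left"] + tok["width"]
        # A pleading number is purely numeric AND sits entirely in the left margin.
        if tok["text"].strip().isdigit() and right_edge <= margin_limit:
            margin.append(tok)
        else:
            body.append(tok)
    return margin, body
```

With this rule, a number that starts a line of body text stays in the body because it lies outside the margin column, which is exactly the case the text-only approach gets wrong.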

FYI, section 6.1.4 of the DocLayNet_Labeling_Guide_Public.pdf states that line numbering should not be part of the text. If I'm seeing them included in the text, it means the model isn't handling some of my PDF layouts very well. Therefore, it would be useful to label those cases and retrain the model. This is why I'm particularly interested in integrating with Label Studio.
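For the labeling side, one possible starting point is to export the model's predictions as Label Studio pre-annotations and correct them there. The sketch below assumes each page has been rendered to an image, that segments carry "left", "top", "width", "height", and "type" keys in page units, and that the labeling config uses a RectangleLabels control named "label" on an Image object named "image". All of these names are assumptions and must match your own setup.

```python
def to_label_studio_task(image_url, segments, page_width, page_height,
                         from_name="label", to_name="image"):
    """Build one Label Studio task with predicted boxes for a single page."""
    results = []
    for seg in segments:
        results.append({
            "from_name": from_name,
            "to_name": to_name,
            "type": "rectanglelabels",
            "original_width": page_width,
            "original_height": page_height,
            "value": {
                # Label Studio expects coordinates as percentages of the image size.
                "x": 100.0 * seg["left"] / page_width,
                "y": 100.0 * seg["top"] / page_height,
                "width": 100.0 * seg["width"] / page_width,
                "height": 100.0 * seg["height"] / page_height,
                "rotation": 0,
                "rectanglelabels": [seg["type"]],
            },
        })
    return {
        "data": {"image": image_url},
        # "layout-sketch" is a placeholder version string.
        "predictions": [{"model_version": "layout-sketch", "result": results}],
    }
```

A list of such tasks can be written out with `json.dump` and imported into a project. The corrected annotations would then still need converting into whatever training format the model expects, which is not covered here.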

ali6parmak changed discussion status to closed
