About OCR Training Schema

#20

by Cuiunbo - opened 14 days ago

Hi, this is truly wonderful work; your open-source release is exceptionally nice and has helped me understand a lot. However, there's a detail I'd like to ask about further. I noticed that idefics2-8b does not use special tokens to represent coordinates, nor special tokens to enclose coordinates, which is common practice in previous work. I'm curious what the training schema for OCR tasks such as IDL looks like: are positions represented using only "text"?

HugoLaurencon

HuggingFaceM4 org 14 days ago

edited 14 days ago

Thanks for your comment!

Yes, it's true that we haven't worked on bounding boxes.

We trained on IDL and PDFA with the next-token prediction objective.
However, as you noticed, in a PDF it is often ambiguous whether one piece of text comes before another (for example in tables or multi-column layouts).

We simply put all the texts together (not necessarily in the right or ideal order) and let the model predict the result.
We acknowledge that this is not a perfect solution.
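
To make the ordering ambiguity concrete, here is a minimal sketch in Python. It is not the actual Idefics2 data pipeline: the `(text, box)` record layout, the helper names, and the two sort orders are all assumptions chosen for illustration. It flattens OCR lines into a single text-only target (no coordinate tokens) and shows that a two-column page gives two different targets depending on the reading order chosen.

```python
from typing import List, Tuple

# Hypothetical record layout: (text, (x0, y0, x1, y1)). Not the real IDL/PDFA schema.
Box = Tuple[float, float, float, float]
Line = Tuple[str, Box]


def flatten_lines(lines: List[Line]) -> str:
    """Join OCR lines in the order given, discarding all coordinates."""
    return "\n".join(text.strip() for text, _ in lines if text.strip())


def row_major(lines: List[Line]) -> List[Line]:
    """Order by top edge, then left edge (reads straight across both columns)."""
    return sorted(lines, key=lambda item: (item[1][1], item[1][0]))


def column_major(lines: List[Line]) -> List[Line]:
    """Order by left edge, then top edge (reads one column at a time)."""
    return sorted(lines, key=lambda item: (item[1][0], item[1][1]))


# A toy two-column page with two lines per column.
page: List[Line] = [
    ("Left column, line 1", (0, 0, 100, 10)),
    ("Right column, line 1", (120, 0, 220, 10)),
    ("Left column, line 2", (0, 12, 100, 22)),
    ("Right column, line 2", (120, 12, 220, 22)),
]

# Both strings are plain-text targets usable for next-token prediction,
# but they order the same content differently.
row_text = flatten_lines(row_major(page))
col_text = flatten_lines(column_major(page))
```

In this toy case `col_text` matches the order a human would read, while `row_text` interleaves the two columns. Neither sort is correct for every layout, which is the ambiguity described above.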

HugoLaurencon changed discussion status to closed 10 days ago
