About OCR Training Schema

#20
by Cuiunbo - opened

Hi, this is truly wonderful work; your open-source release is exceptionally nice and has helped me understand a lot. However, there's a detail I'd like to ask about further. I noticed that idefics2-8b does not use special tokens to represent coordinates, nor special tokens to enclose coordinates, which is common practice in previous work. I'm curious what the training schema for OCR tasks such as IDL looks like: are positions represented using only "text"?

Thanks for your comment!

Yes, it's true: we haven't worked on bounding boxes.

We trained on IDL and PDFA with the next-token prediction objective.
However, as you noticed, in a PDF it is often ambiguous whether one piece of text comes before another (in tables, multi-column layouts, etc.).

We simply concatenate all the text (not necessarily in the right or ideal order) and let the model predict it.
We acknowledge that this is not a perfect solution.
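
To make this concrete, here is a rough sketch of the idea, not the actual training pipeline. The names `TextSpan`, `build_target` and `next_token_pairs` are hypothetical, the coordinates are only there to show that they get discarded, and whitespace splitting stands in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class TextSpan:
    """A piece of text extracted from a PDF page (hypothetical structure)."""
    text: str
    x: float  # position is kept here only to show that it is dropped below
    y: float


def build_target(spans):
    """Join span texts in the order given, discarding all coordinates."""
    return " ".join(s.text.strip() for s in spans if s.text.strip())


def next_token_pairs(text):
    """Toy next-token prediction pairs: (context tokens, token to predict).

    Assumption: whitespace splitting replaces the real tokenizer.
    """
    tokens = text.split()
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# A two-column layout read column by column; reading row by row would be
# just as plausible, which is the ordering ambiguity mentioned above.
spans = [
    TextSpan("Invoice", 10, 10),
    TextSpan("Total:", 10, 50),
    TextSpan("No. 42", 200, 10),
    TextSpan("$17.00", 200, 50),
]

target = build_target(spans)
pairs = next_token_pairs(target)
print(target)
print(pairs[0])
```

The target is plain text with no coordinate tokens, so whatever order the extraction produced is the order the model learns to predict.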

HugoLaurencon changed discussion status to closed
