Could this model be used to directly extract tables from PDFs into CSV files?

by sergenti

I tried countless Python libraries and online converters, and AI is the best approach so far.

I was thinking of applying this model to each page of the PDF to get the coordinates of the tables and then passing those to another tool that extracts the values. Do you have some ideas on how I should structure the pipeline?

I'm dealing with the following kinds of documents:

  • scientific papers
  • legal documents
  • standard financial filings (10-K, 10-Q, S-1, etc.)
  • confidential documents (reports, transcripts, memorandums)

Hi,

One option you could try is:

  1. use the Table Transformer to detect tables in documents
  2. crop those tables out
  3. feed the table images to a model like Donut or, more generally, a VisionEncoderDecoderModel, which takes an image as input and produces text as output. This class can learn to generate any text you want from a given image, as long as you can train it on (image, text) pairs. Hence you can train it to take a table image as input and produce the text in that table, potentially in JSON format, as output. A rough sketch of steps 1–2 is shown below.
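
A minimal sketch of steps 1–2 (plus rendering the PDF pages to images), assuming the public microsoft/table-transformer-detection checkpoint; the detection threshold and the use of pdf2image are my own choices, not part of the suggestion above:

```python
import torch
from pdf2image import convert_from_path
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

def crop_tables(page_image, threshold=0.9):
    """Return one cropped PIL image per detected table on a rendered page."""
    inputs = processor(images=page_image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    target_sizes = torch.tensor([page_image.size[::-1]])  # (height, width)
    detections = processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes
    )[0]
    return [page_image.crop(tuple(box.tolist())) for box in detections["boxes"]]

# Steps 1-2 over a whole document: render pages, detect tables, crop them.
pages = convert_from_path("document.pdf", dpi=200)   # one PIL image per page
table_crops = [crop for page in pages for crop in crop_tables(page)]
# Step 3: feed each crop to Donut / a fine-tuned VisionEncoderDecoderModel.
```

Each crop in table_crops can then go to step 3, or to a coordinate-based extractor if the PDF has a text layer.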

Hi @nielsr, that's a great idea; however, for the sake of simplicity, I plugged in a standard table extraction library like https://tabula-py.readthedocs.io/ or https://camelot-py.readthedocs.io/. It's not perfect, but it does the job.
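
For reference, a bare-bones sketch of that route with Camelot (the file name is a placeholder; Camelot only handles text-based, non-scanned PDFs, and tabula-py offers a similar read_pdf call):

```python
import camelot

# Extract every table Camelot can find and dump each one to CSV.
tables = camelot.read_pdf("report.pdf", pages="all", flavor="lattice")
for i, table in enumerate(tables):
    table.df.to_csv(f"table_{i}.csv", index=False)
```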

Weirdly enough, extracting tables from PDF is a problem so difficult nobody has solved it yet, lol.

I'm sure the ultimate approach is AI-based. Do you know if there are any models trained for image-to-CSV / image-to-JSON? How would you tackle the training challenge yourself?

Thanks a lot and have a great day!

Donut is the first model trained directly on images to produce JSON as output. So one could simply extend this idea by training on (document image, table JSON) pairs. I'm pretty sure Google is already doing this, but such a model for tables hasn't been open-sourced yet.
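
To make the "(image, JSON) pairs" part concrete: Donut trains a VisionEncoderDecoderModel to emit a flattened token sequence rather than raw JSON. The helper below is only illustrative (the function name, tags, and sample table are mine), but it mirrors Donut's convention of wrapping keys in <s_key>...</s_key> tags:

```python
def json_to_target_sequence(obj) -> str:
    """Flatten a (possibly nested) dict/list into a Donut-style tag sequence."""
    if isinstance(obj, dict):
        return "".join(
            f"<s_{k}>{json_to_target_sequence(v)}</s_{k}>" for k, v in obj.items()
        )
    if isinstance(obj, list):
        return "<sep/>".join(json_to_target_sequence(v) for v in obj)
    return str(obj)

# One hypothetical training pair: a cropped table image plus its JSON content.
table_json = {"rows": [{"item": "Revenue", "2022": "10.4", "2023": "12.1"},
                       {"item": "Net income", "2022": "1.8", "2023": "2.3"}]}
target_text = "<s_table>" + json_to_target_sequence(table_json) + "</s_table>"
# target_text is what the decoder learns to generate for that table image.
```

At inference time, DonutProcessor's token2json can convert the generated sequence back into a Python dict, which is easy to serialize to JSON or CSV.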

Any new findings on this topic? I'm trying to solve this problem myself. I have PDFs with text and tables interspersed, and I need the text extracted with any table structure preserved. Tabula/Camelot/pdfplumber can (sometimes) extract the tables, but they're not accurate enough for my use case.

@nesuter try the Adobe PDF Extract API; it's the best API on the market. I didn't find anything similar that's open source.

@nesuter you can also try the unstructured library.
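
A rough sketch of the unstructured route (the parameter names follow the library's partition_pdf API as I recall it, so double-check the current docs; the file name is a placeholder):

```python
from unstructured.partition.pdf import partition_pdf

# Parse a PDF into elements (text, titles, tables, ...) in reading order.
elements = partition_pdf(
    filename="report.pdf",        # placeholder path
    strategy="hi_res",            # layout-model-based parsing
    infer_table_structure=True,   # keep table structure in element metadata
)

for el in elements:
    if el.category == "Table":
        print(el.metadata.text_as_html)  # table preserved as HTML
    else:
        print(el.text)                   # surrounding prose
```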
