Could this model be used to directly extract tables from PDFs into CSV files?

by sergenti

I tried countless Python libraries and online converters, and AI is the best approach so far.

I was thinking of applying this model to each page of the PDF to get the coordinates of the tables and then passing those to another tool that extracts the values. Do you have some ideas on how I should structure the pipeline?

I'm dealing with the following kinds of documents:

  • scientific papers
  • legal documents
  • standard financial filings (10-K, 10-Q, S-1, etc.)
  • confidential documents (reports, transcripts, memorandums)

Hi,

One option you could try is:

  1. use the Table Transformer to detect tables in documents
  2. crop those tables out
  3. feed the table images to a model like Donut or, more generally, a VisionEncoderDecoderModel, which takes an image as input and produces text as output. This class can learn to generate any text you want from a given image, as long as you can train it on (image, text) pairs. Hence you can train it to take a table image as input and produce the text in that table, potentially in JSON format, as output. A rough sketch of steps 1–2 is shown below.
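
A minimal sketch of steps 1–2 (plus rendering the PDF pages to images), assuming the public microsoft/table-transformer-detection checkpoint; the detection threshold and the use of pdf2image are my own choices, not part of the suggestion above:

```python
import torch
from pdf2image import convert_from_path
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

def crop_tables(page_image, threshold=0.9):
    """Return one cropped PIL image per detected table on a rendered page."""
    inputs = processor(images=page_image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    target_sizes = torch.tensor([page_image.size[::-1]])  # (height, width)
    detections = processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes
    )[0]
    return [page_image.crop(tuple(box.tolist())) for box in detections["boxes"]]

# Steps 1-2 over a whole document: render pages, detect tables, crop them.
pages = convert_from_path("document.pdf", dpi=200)   # one PIL image per page
table_crops = [crop for page in pages for crop in crop_tables(page)]
# Step 3: feed each crop to Donut / a fine-tuned VisionEncoderDecoderModel.
```

Each crop in table_crops can then go to step 3, or to a coordinate-based extractor if the PDF has a text layer.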

Hi @nielsr, that's a great idea; however, for the sake of simplicity, I plugged in a standard table extraction library like https://tabula-py.readthedocs.io/ or https://camelot-py.readthedocs.io/. It's not perfect, but it does the job.
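
For reference, a bare-bones sketch of that route with Camelot (the file name is a placeholder; Camelot only handles text-based, non-scanned PDFs, and tabula-py offers a similar read_pdf call):

```python
import camelot

# Extract every table Camelot can find and dump each one to CSV.
tables = camelot.read_pdf("report.pdf", pages="all", flavor="lattice")
for i, table in enumerate(tables):
    table.df.to_csv(f"table_{i}.csv", index=False)
```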

Weirdly enough, extracting tables from PDF is a problem so difficult nobody has solved it yet, lol.

I'm sure the ultimate approach is AI-based. Do you know if there are any models trained for image-to-CSV / image-to-JSON? How would you tackle the training challenge yourself?

Thanks a lot and have a great day!

Donut is the first model trained directly on images to produce JSON as output. So one could simply extend this idea by training on (document image, table JSON) pairs. I'm pretty sure Google is already doing this, but such a model for tables hasn't been open-sourced yet.
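
To make the "(image, JSON) pairs" part concrete: Donut trains a VisionEncoderDecoderModel to emit a flattened token sequence rather than raw JSON. The helper below is only illustrative (the function name, tags, and sample table are mine), but it mirrors Donut's convention of wrapping keys in <s_key>...</s_key> tags:

```python
def json_to_target_sequence(obj) -> str:
    """Flatten a (possibly nested) dict/list into a Donut-style tag sequence."""
    if isinstance(obj, dict):
        return "".join(
            f"<s_{k}>{json_to_target_sequence(v)}</s_{k}>" for k, v in obj.items()
        )
    if isinstance(obj, list):
        return "<sep/>".join(json_to_target_sequence(v) for v in obj)
    return str(obj)

# One hypothetical training pair: a cropped table image plus its JSON content.
table_json = {"rows": [{"item": "Revenue", "2022": "10.4", "2023": "12.1"},
                       {"item": "Net income", "2022": "1.8", "2023": "2.3"}]}
target_text = "<s_table>" + json_to_target_sequence(table_json) + "</s_table>"
# target_text is what the decoder learns to generate for that table image.
```

At inference time, DonutProcessor's token2json can convert the generated sequence back into a Python dict, which is easy to serialize to JSON or CSV.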

Any new findings on this topic? I'm trying to solve this problem myself. I have PDFs with text and tables interspersed, and I need the text extracted with any table structure preserved. Tabula/Camelot/pdfplumber can (sometimes) extract the tables, but they're not accurate enough for my use case.

@nesuter try the Adobe PDF Extract API; it's the best API on the market. I didn't find anything similar that's open source.

@nesuter you can also try the unstructured library.
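
A rough sketch of the unstructured route (the parameter names follow the library's partition_pdf API as I recall it, so double-check the current docs; the file name is a placeholder):

```python
from unstructured.partition.pdf import partition_pdf

# Parse a PDF into elements (text, titles, tables, ...) in reading order.
elements = partition_pdf(
    filename="report.pdf",        # placeholder path
    strategy="hi_res",            # layout-model-based parsing
    infer_table_structure=True,   # keep table structure in element metadata
)

for el in elements:
    if el.category == "Table":
        print(el.metadata.text_as_html)  # table preserved as HTML
    else:
        print(el.text)                   # surrounding prose
```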
