microsoft/udop-large · How to extract sections from a text

Mar 10, 2024

I'm trying to use the model to extract information from a scientific article in a semi-structured format.
For instance, from the following image, I'm interested in extracting the abstract (section title, section content), the keywords section, the introduction, and so on.

Is there an off-the-shelf way to achieve that?

Many thanks,
Mihai G.

nielsr

Mar 10, 2024

edited Mar 10, 2024

Hi,

This document seems quite easily readable, so my recommendation here would be to run OCR on the image and pass the extracted text to a large language model (like Mistral-7B), with a prompt along these lines (just a quick draft):

You are an expert in parsing text extracted from documents.

Here is all the text we extracted from the document: {text}

Your task is to extract all keywords from the text. Return the result as JSON.
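As a rough sketch of how the draft prompt above could be wired up: the snippet below fills the template with OCR text and pulls JSON out of the model's reply. It assumes `pytesseract` and Pillow are installed for the OCR step; the helper names (`ocr_image`, `build_prompt`, `parse_json_reply`) are made up for illustration, and the call to the LLM itself is left out because it depends on how the model is served.

```python
import json

# The draft prompt from this post, with {text} as the placeholder.
PROMPT_TEMPLATE = (
    "You are an expert in parsing text extracted from documents.\n\n"
    "Here is all the text we extracted from the document: {text}\n\n"
    "Your task is to extract all keywords from the text. "
    "Return the result as JSON."
)


def ocr_image(path: str) -> str:
    """Run OCR on one page image.

    Assumption: pytesseract, Pillow and the Tesseract binary are installed.
    Any other OCR engine could be swapped in here.
    """
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(path))


def build_prompt(text: str) -> str:
    """Insert the OCR text into the prompt template."""
    # str.replace instead of str.format, so stray braces in the OCR
    # output are not interpreted as format fields.
    return PROMPT_TEMPLATE.replace("{text}", text.strip())


def parse_json_reply(reply: str):
    """Return the first JSON object or array found in a model reply.

    Models often wrap the JSON in extra prose, so scan for the first
    position where a valid JSON value starts.
    """
    decoder = json.JSONDecoder()
    for i, ch in enumerate(reply):
        if ch in "{[":
            try:
                value, _ = decoder.raw_decode(reply[i:])
                return value
            except json.JSONDecodeError:
                continue
    raise ValueError("no JSON found in reply")
```

Typical use would be `prompt = build_prompt(ocr_image("page.png"))`, sending `prompt` to whichever LLM you run, and then calling `parse_json_reply` on the answer.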

UDOP is not really able to extract keywords from a given document; it's aimed more at tasks such as document parsing, classification, and DocVQA.

mihaigheorghe

Mar 11, 2024

@nielsr Thank you for your response.

That was my first option:

1. Split the doc into pages and convert each page to an image.
2. Run OCR on each image.
3. Merge the OCR output.
4. Chunk the merged text into sentences.
5. Hierarchically summarize with a low compression ratio (to keep as much info as possible but get rid of OCR "noise"). I tried bart-large-cnn (and other specialized summarization models), but gemma-2b-it (or 7b) does a much better job, although it is computationally more expensive.
6. Extract information / answer questions from the summary with an open-source LLM.
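The sentence chunking and hierarchical summarization steps above could look roughly like the sketch below. It is not the exact code from this pipeline: the sentence splitter is a naive regex, the `max_chars` budget is an arbitrary assumption, and `summarize` is a placeholder callable that would wrap whichever model is used (bart-large-cnn, gemma-2b-it, ...).

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(sentences: List[str], max_chars: int) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars becomes its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks


def hierarchical_summarize(
    text: str,
    summarize: Callable[[str], str],  # hypothetical wrapper around the model
    max_chars: int = 2000,
    max_rounds: int = 5,
) -> str:
    """Summarize chunk by chunk, then repeat on the merged summaries
    until the text fits in max_chars or stops shrinking."""
    for _ in range(max_rounds):
        if len(text) <= max_chars:
            break
        chunks = chunk_sentences(split_sentences(text), max_chars)
        merged = " ".join(summarize(chunk) for chunk in chunks)
        if len(merged) >= len(text):
            break  # the summarizer is not shrinking the text any further
        text = merged
    return text
```

The final string returned by `hierarchical_summarize` is what would then be handed to the LLM for extraction or question answering.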

I was wondering if UDOP might do a better or at least similar job with less hassle.
You answered my concern.

Thanks again
Mihai G.