How to extract sections from a text

#4
by mihaigheorghe - opened

I'm trying to use the model to extract information from a scientific article in a semi-structured format.
For instance, from the following image, I'd like to extract each section separately (section title and section content): the abstract, the keywords, the introduction, and so on.

article_intro.png

Is there an off-the-shelf way to achieve that?

Many thanks,
Mihai G.

Microsoft org
edited Mar 10

Hi,

This document seems quite easily readable, so my recommendation here would be to apply OCR to the image and pass the extracted text to a large language model (like Mistral-7B), where your prompt is as follows (just a quick draft):

You are an expert in parsing text extracted from documents.

Here is all the text we extracted from the document: {text}

Your task is to extract all keywords from the text. Return the result as JSON.

UDOP is not really able to extract keywords from a given document; it's aimed more at tasks such as document parsing, classification, and DocVQA.
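
A minimal Python sketch of the prompt step above, assuming the OCR text is already available as a string. The names `build_prompt`, `parse_json_reply`, `extract_keywords`, and the `generate` callable are hypothetical placeholders introduced for illustration; `generate` stands in for whatever LLM call you use (Mistral-7B or otherwise) and is not a real library API. Since models often wrap JSON in extra prose, the parser looks for the first decodable JSON value in the reply.

```python
import json
import re
from typing import Any, Callable, Optional

# The draft prompt from the reply above, with {text} as the placeholder.
PROMPT_TEMPLATE = (
    "You are an expert in parsing text extracted from documents.\n\n"
    "Here is all the text we extracted from the document: {text}\n\n"
    "Your task is to extract all keywords from the text. "
    "Return the result as JSON."
)


def build_prompt(ocr_text: str) -> str:
    """Fill the OCR output into the prompt template."""
    return PROMPT_TEMPLATE.format(text=ocr_text)


def parse_json_reply(reply: str) -> Optional[Any]:
    """Return the first JSON object or array found in the reply, or None."""
    decoder = json.JSONDecoder()
    for match in re.finditer(r"[\[{]", reply):
        try:
            value, _ = decoder.raw_decode(reply, match.start())
            return value
        except json.JSONDecodeError:
            continue  # not valid JSON from this position, try the next one
    return None


def extract_keywords(ocr_text: str, generate: Callable[[str], str]) -> Optional[Any]:
    """Send the prompt to a caller-supplied (hypothetical) LLM function."""
    return parse_json_reply(generate(build_prompt(ocr_text)))
```

Passing the model call in as `generate` keeps the sketch independent of any particular inference library; swapping models only changes that one function.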

@nielsr Thank you for your response.

That was my first option:

  • split the doc into pages > convert to images
  • OCR
  • merge
  • chunk into sentences
  • hierarchically summarize with a low compression ratio (to keep as much information as possible while removing OCR "noise"). I tried bart-large-cnn and other summarization-specific models, but gemma-2b-it (or 7b) does a much better job, although it is computationally more expensive.
  • extract information / answer questions with an open-source LLM from the summary.
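
The merge, chunk, and hierarchical-summarization steps above could be sketched roughly as follows. This is only an illustration under assumptions: the function names, the `max_chars` and `max_rounds` parameters, and the naive regex sentence splitter are all hypothetical, and `summarize` is a caller-supplied placeholder for whichever model is used (for example a wrapper around gemma-2b-it), not a real library call.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def group_chunks(sentences: List[str], max_chars: int) -> List[str]:
    """Pack consecutive sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def hierarchical_summary(
    pages: List[str],
    summarize: Callable[[str], str],
    max_chars: int = 2000,
    max_rounds: int = 5,
) -> str:
    """Merge per-page OCR text, then summarize chunk by chunk until it fits.

    max_rounds bounds the loop in case summarize does not shrink the text.
    """
    text = " ".join(p.strip() for p in pages if p.strip())
    for _ in range(max_rounds):
        if len(text) <= max_chars:
            break
        chunks = group_chunks(split_sentences(text), max_chars)
        text = " ".join(summarize(chunk) for chunk in chunks)
    return text
```

The resulting summary would then be handed to the question-answering or extraction prompt in the last step of the list.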

I was wondering if UDOP might do a better or at least similar job with less hassle.
You answered my concern.

Thanks again
Mihai G.
