arxiv:2309.10952

LMDX: Language Model-based Document Information Extraction and Localization

Published on Sep 19, 2023
· Featured in Daily Papers on Sep 21, 2023

Abstract

Large Language Models (LLM) have revolutionized Natural Language Processing (NLP), improving state-of-the-art on many existing tasks and exhibiting emergent capabilities. However, LLMs have not yet been successfully applied on semi-structured document information extraction, which is at the core of many document processing workflows and consists of extracting key entities from a visually rich document (VRD) given a predefined target schema. The main obstacles to LLM adoption in that task have been the absence of layout encoding within LLMs, critical for a high quality extraction, and the lack of a grounding mechanism ensuring the answer is not hallucinated. In this paper, we introduce Language Model-based Document Information Extraction and Localization (LMDX), a methodology to adapt arbitrary LLMs for document information extraction. LMDX can do extraction of singular, repeated, and hierarchical entities, both with and without training data, while providing grounding guarantees and localizing the entities within the document. In particular, we apply LMDX to the PaLM 2-S LLM and evaluate it on VRDU and CORD benchmarks, setting a new state-of-the-art and showing how LMDX enables the creation of high quality, data-efficient parsers.

Community

How can I use this model for fact extraction from my own documents?

Can this strategy be used with lower-parameter models like Flan-UL2?

Frustrated that the paper did not give the model size and the data size used in first-stage fine-tuning.

Hi all! Thanks for your interest and for reading the paper! Answering some of the questions below:

How can I use this model for fact extraction from my own documents?

Since the results are based on PaLM 2, we can't release the model, but we tried to describe the methodology in detail so it could be replicated on other LLMs.
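To give a concrete feel for it, here is a minimal Python sketch of the prompt construction, assuming OCR lines with coordinates normalized to [0, 1]. The template wording, bucket count, and segment-identifier syntax are simplified approximations of what the paper describes, not the exact implementation:

```python
import json

NUM_BUCKETS = 100  # illustrative; the paper quantizes coordinates into buckets

def quantize(v):
    """Map a normalized [0, 1] coordinate to an integer bucket."""
    return min(int(v * NUM_BUCKETS), NUM_BUCKETS - 1)

def encode_line(text, x_center, y_center):
    # Each OCR line is followed by a segment identifier built from its
    # quantized center coordinates; the model cites these identifiers back
    # in its answer, which is what gives LMDX grounding and localization.
    return f"{text} {quantize(x_center)}|{quantize(y_center)}"

def build_prompt(ocr_lines, schema):
    # ocr_lines: [(text, x_center, y_center), ...] in reading order.
    doc = "\n".join(encode_line(t, x, y) for t, x, y in ocr_lines)
    task = ("From the document, extract the text values and tags of the "
            "following entities:\n" + json.dumps(schema))
    return f"<Document>\n{doc}\n</Document>\n<Task>\n{task}\n</Task>\n<Extraction>\n"

print(build_prompt(
    [("Invoice #123", 0.12, 0.05), ("Total: $40.00", 0.70, 0.91)],
    {"invoice_id": "", "total": ""},
))
```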

Can this strategy be used with lower-parameter models like Flan-UL2?

We tried with PaLM 2-XXS, and the model did not follow the specified target schema (skipping particular entity types or extracting non-specified entity types) nor the entity value syntax at all (skipping the segment identifiers). Starting at PaLM 2-S, the model followed the target schema and syntax near perfectly. So, in our experience, it requires a certain model size for this methodology to start working well.
Another point is that LMDX requires long input and output token lengths (see Table 5 in the appendix for the stats). We used a 6144 max input token length and a 2048 max output token length for our experiments. I believe Flan-UL2 has a shorter context length, which might not be enough for all types of documents.

Frustrated that the paper did not give the model size and the data size used in first-stage fine-tuning.

Sorry for the frustration! Unfortunately, we can't divulge any model size information. Regarding the data in first-stage fine-tuning, it was a total of 8 government form templates from which we synthesized 1,000 documents using an internal tool (filling the templates with synthetic values). We also included the payment dataset, so the total is on the order of 10k documents.
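The tool itself is internal, but conceptually the synthesis loop just samples fresh values into each template's fields, along the lines of this sketch (the field names and value generators here are made up for illustration):

```python
import random

# Hypothetical field generators; the real tool fills actual form templates.
FIELD_GENERATORS = {
    "applicant_name": lambda rng: rng.choice(["Jane Doe", "John Smith", "Ana Ruiz"]),
    "date": lambda rng: f"{rng.randint(1, 12):02d}/{rng.randint(1, 28):02d}/2023",
    "amount": lambda rng: f"${rng.randint(10, 5000)}.{rng.randint(0, 99):02d}",
}

def synthesize(template_id, rng):
    # One synthetic training document: a template plus freshly sampled values.
    return {"template": template_id,
            "fields": {name: gen(rng) for name, gen in FIELD_GENERATORS.items()}}

rng = random.Random(0)
docs = [synthesize(i % 8, rng) for i in range(1000)]  # 8 templates -> 1000 docs
```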

Could we use this method to create a tool similar to MathPix? I'm building an educational tool that uses LLMs to explain solutions to Physics and Math questions. Here I'm using MathPix to extract questions from previous question papers; these questions and their options include math formulas, so I want to create a tool which handles this and converts it into LaTeX.

Paper author

Could we use this method to create a tool similar to MathPix? I'm building an educational tool that uses LLMs to explain solutions to Physics and Math questions. Here I'm using MathPix to extract questions from previous question papers; these questions and their options include math formulas, so I want to create a tool which handles this and converts it into LaTeX.

No, I don't think it would apply. LMDX is meant for going from a semi-structured document to a structured format containing the specified entities. It does not transform the entity values themselves.

Hello, thanks for the quick reply.
I know this is not the best place to ask this, but how would one approach the problem I mentioned?

Hi, has anybody tried LMDX e.g. with Mistral 7b or Llama 2 7b or 13b?

Hi, has anybody tried LMDX e.g. with Mistral 7b or Llama 2 7b or 13b?

We have been trying to fine-tune Mistral 7b using a similar approach. Unfortunately, it's very challenging to teach Mistral 7b to leverage coordinate tokens. Not sure if it's caused by the lack of such data in base model pretraining. However, we have just discovered a more efficient approach to address the layout challenge.

We have been trying to fine-tune Mistral 7b using a similar approach. Unfortunately, it's very challenging to teach Mistral 7b to leverage coordinate tokens. Not sure if it's caused by the lack of such data in base model pretraining. However, we have just discovered a more efficient approach to address the layout challenge.

Hey @orby-yanan, would you be able to shed some light on your discoveries? What worked, what didn't, and what's the more efficient approach you're talking about?

We have found ASCII art representation to be more effective and efficient in most cases for two reasons:

  1. Most commonly used LLMs have already been trained to understand layout described in ASCII art format. You can expect the model to work well when you format your data that way, even without fine-tuning.
  2. ASCII art is also a token-efficient representation: a run of 80 spaces is just one token with the GPT-4 tokenizer, so you end up using fewer tokens than adding coordinate tokens to each line.

For example,
Model name: [1,5]
Gemini Pro [80,5]

can be converted to
Model name: (40 spaces) Gemini Pro
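A rough Python sketch of that conversion, assuming segments with coordinates normalized to [0, 1] (the grid size is an arbitrary choice here):

```python
def render_ascii(segments, page_width=120, page_height=60):
    # segments: [(text, x, y)]; place each one on a character grid so that
    # horizontal and vertical alignment survive without coordinate tokens.
    grid = [[" "] * page_width for _ in range(page_height)]
    for text, x, y in segments:
        row = min(int(y * page_height), page_height - 1)
        col = min(int(x * page_width), page_width - 1)
        for i, ch in enumerate(text):
            if col + i < page_width:
                grid[row][col + i] = ch  # later segments may overwrite overlaps
    # Keep only non-empty rows and strip trailing spaces to save tokens.
    return "\n".join("".join(r).rstrip() for r in grid if "".join(r).strip())

print(render_ascii([("Model name:", 0.01, 0.05), ("Gemini Pro", 0.66, 0.05)]))
# Both segments land on the same row, separated by a run of spaces.
```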

Hey @orby-yanan, would you be able to shed some light on your discoveries? What worked, what didn't, and what's the more efficient approach you're talking about?

A fine-tuned Mistral-7b is able to extract entities from documents using a prompt similar to what LMDX described.
We also tried adding coordinate tokens to each line and shuffling the lines to see whether the fine-tuned Mistral-7b could still extract entities correctly from shuffled lines. Unfortunately, it didn't work.
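For anyone wanting to reproduce that check, the probe is simple to set up. In this sketch, extract_fn is a hypothetical stand-in for a call to the fine-tuned model:

```python
import random

def shuffle_probe(coord_lines, extract_fn, seed=0):
    # coord_lines: serialized lines that already carry coordinate tokens,
    # e.g. "Total: $40.00 70|91". If the model truly reads the coordinates
    # rather than the serialization order, shuffling the lines should not
    # change what it extracts.
    baseline = extract_fn("\n".join(coord_lines))
    shuffled = list(coord_lines)
    random.Random(seed).shuffle(shuffled)
    return baseline == extract_fn("\n".join(shuffled))
```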

GPT-4, on the other hand, can leverage coordinate tokens with proper prompting. One explanation is that a 7b model could be too small to learn to process coordinate tokens. For the more efficient approach, we will need to run more experiments, and then we can consider uploading a preprint.

Hi, thank you for the nice paper. I was trying to run the PyPI package (https://pypi.org/project/lmdx-flow/) and got stuck at this point:
"answers = P.postprocess_all_chunks(llm_responses)"
What format of "llm_responses" does the pipeline expect? Would it be possible to share a notebook showing how the generated responses from Mistral-7b can be used as input, for example? Thanks in advance.

Hi,

I'm trying to use the lmdx-flow package (https://pypi.org/project/lmdx-flow/) for document information extraction and I'm encountering an issue with the expected format of llm_responses in the P.postprocess_all_chunks function.

answers = P.postprocess_all_chunks(llm_responses)

Unfortunately, neither the documentation nor the error messages provide sufficient clarity on the exact format this function expects for llm_responses. Could you please explain the required structure and data types for this variable?

Thanks in advance!

GPT-4, on the other hand, can leverage coordinate tokens with proper prompting. One explanation is that a 7b model could be too small to learn to process coordinate tokens. For the more efficient approach, we will need to run more experiments, and then we can consider uploading a preprint.

Can you share details about how to design the prompt to utilize coordinate information? Thanks.

It was a few-shot prompt; we just append coarse coordinates at the end of each line.
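Concretely, the formatting looks something like this sketch (the bucket count and the coordinate syntax are illustrative, not the exact prompt we used):

```python
def append_coarse_coords(lines, buckets=20):
    # lines: [(text, x, y)] with x, y normalized to [0, 1]; each OCR line
    # gets its coarse (bucketed) coordinates appended at the end.
    return "\n".join(f"{text} ({int(x * buckets)}, {int(y * buckets)})"
                     for text, x, y in lines)

print(append_coarse_coords([("Invoice #123", 0.10, 0.05),
                            ("Total: $40.00", 0.70, 0.91)]))
# Invoice #123 (2, 1)
# Total: $40.00 (14, 18)
```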

Has anyone got a reference to a dataset that includes line items as well?
