arxiv:2309.10952

LMDX: Language Model-based Document Information Extraction and Localization

Published on Sep 19, 2023
· Featured in Daily Papers on Sep 21, 2023

Abstract

Large Language Models (LLM) have revolutionized Natural Language Processing (NLP), improving state-of-the-art on many existing tasks and exhibiting emergent capabilities. However, LLMs have not yet been successfully applied on semi-structured document information extraction, which is at the core of many document processing workflows and consists of extracting key entities from a visually rich document (VRD) given a predefined target schema. The main obstacles to LLM adoption in that task have been the absence of layout encoding within LLMs, critical for a high quality extraction, and the lack of a grounding mechanism ensuring the answer is not hallucinated. In this paper, we introduce Language Model-based Document Information Extraction and Localization (LMDX), a methodology to adapt arbitrary LLMs for document information extraction. LMDX can do extraction of singular, repeated, and hierarchical entities, both with and without training data, while providing grounding guarantees and localizing the entities within the document. In particular, we apply LMDX to the PaLM 2-S LLM and evaluate it on VRDU and CORD benchmarks, setting a new state-of-the-art and showing how LMDX enables the creation of high quality, data-efficient parsers.

Community

How can I use this model for fact extraction from my own documents?

Can this strategy be used with lower-parameter models like Flan-UL2?

Frustrated that the paper did not give the model size and the data size used in first-stage fine-tuning.

Hi all! Thanks for your interest and for reading the paper! Answering some of the questions below:

How can I use this model for fact extraction from my own documents?

Since the results are based on PaLM 2, we can't release the model, but we tried to describe the methodology in detail so it could be replicated on other LLMs.
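To give a concrete feel for it, here is a minimal Python sketch of the prompt construction, assuming OCR lines with coordinates normalized to [0, 1]. The template wording, bucket count, and segment-identifier syntax are simplified approximations of what the paper describes, not the exact implementation:

```python
import json

NUM_BUCKETS = 100  # illustrative; the paper quantizes coordinates into buckets

def quantize(v):
    """Map a normalized [0, 1] coordinate to an integer bucket."""
    return min(int(v * NUM_BUCKETS), NUM_BUCKETS - 1)

def encode_line(text, x_center, y_center):
    # Each OCR line is followed by a segment identifier built from its
    # quantized center coordinates; the model cites these identifiers back
    # in its answer, which is what gives LMDX grounding and localization.
    return f"{text} {quantize(x_center)}|{quantize(y_center)}"

def build_prompt(ocr_lines, schema):
    # ocr_lines: [(text, x_center, y_center), ...] in reading order.
    doc = "\n".join(encode_line(t, x, y) for t, x, y in ocr_lines)
    task = ("From the document, extract the text values and tags of the "
            "following entities:\n" + json.dumps(schema))
    return f"<Document>\n{doc}\n</Document>\n<Task>\n{task}\n</Task>\n<Extraction>\n"

print(build_prompt(
    [("Invoice #123", 0.12, 0.05), ("Total: $40.00", 0.70, 0.91)],
    {"invoice_id": "", "total": ""},
))
```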

Can this strategy be used with lower-parameter models like Flan-UL2?

We tried with PaLM 2-XXS, and the model did not follow the specified target schema (skipping particular entity types or extracting non-specified entity types) nor the entity value syntax at all (skipping the segment identifiers). Starting at PaLM 2-S, the model followed the target schema and syntax near perfectly. So, in our experience, it requires a certain model size for this methodology to start working well.
Another point is that LMDX requires long input and output token lengths (see Table 5 in the appendix for the stats). We used a 6144 max input token length and a 2048 max output token length for our experiments. I believe Flan-UL2 has a shorter context length, which might not be enough for all types of documents.

Frustrated that the paper did not give the model size and the data size used in first-stage fine-tuning.

Sorry for the frustration! Unfortunately, we can't divulge any model size information. Regarding the data in first-stage fine-tuning, it was a total of 8 government form templates from which we synthesized 1,000 documents using an internal tool (filling the templates with synthetic values). We also included the payment dataset, so the total is on the order of 10k documents.
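The tool itself is internal, but conceptually the synthesis loop just samples fresh values into each template's fields, along the lines of this sketch (the field names and value generators here are made up for illustration):

```python
import random

# Hypothetical field generators; the real tool fills actual form templates.
FIELD_GENERATORS = {
    "applicant_name": lambda rng: rng.choice(["Jane Doe", "John Smith", "Ana Ruiz"]),
    "date": lambda rng: f"{rng.randint(1, 12):02d}/{rng.randint(1, 28):02d}/2023",
    "amount": lambda rng: f"${rng.randint(10, 5000)}.{rng.randint(0, 99):02d}",
}

def synthesize(template_id, rng):
    # One synthetic training document: a template plus freshly sampled values.
    return {"template": template_id,
            "fields": {name: gen(rng) for name, gen in FIELD_GENERATORS.items()}}

rng = random.Random(0)
docs = [synthesize(i % 8, rng) for i in range(1000)]  # 8 templates -> 1000 docs
```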

Could we use this method to create a tool similar to MathPix? I'm building an educational tool that uses LLMs to explain solutions to Physics and Math questions. Here I'm using MathPix to extract questions from previous question papers; these questions and their options include math formulas, so I want to create a tool which handles this and converts it into LaTeX.

Paper author

Could we use this method to create a tool similar to MathPix? I'm building an educational tool that uses LLMs to explain solutions to Physics and Math questions. Here I'm using MathPix to extract questions from previous question papers; these questions and their options include math formulas, so I want to create a tool which handles this and converts it into LaTeX.

No, I don't think it would apply. LMDX is meant for going from a semi-structured document to a structured format containing the specified entities. It does not transform the entity values themselves.

Hello, thanks for the quick reply.
I know this is not the best place to ask this, but how would one approach the problem I mentioned?

Hi, has anybody tried LMDX e.g. with Mistral 7b or Llama 2 7b or 13b?

Hi, has anybody tried LMDX e.g. with Mistral 7b or Llama 2 7b or 13b?

We have been trying to fine-tune Mistral 7b using a similar approach. Unfortunately, it's very challenging to teach Mistral 7b to leverage coordinate tokens. Not sure if it's caused by the lack of such data in base model pretraining. However, we have just discovered a more efficient approach to address the layout challenge.

We have been trying to fine-tune Mistral 7b using a similar approach. Unfortunately, it's very challenging to teach Mistral 7b to leverage coordinate tokens. Not sure if it's caused by the lack of such data in base model pretraining. However, we have just discovered a more efficient approach to address the layout challenge.

Hey @orby-yanan, would you be able to shed some light on your discoveries? What worked, what didn't, and what's the more efficient approach you're talking about?

We have found ASCII art representation to be more effective and efficient in most cases for two reasons:

  1. Most commonly used LLMs have already been trained to understand layout described in ASCII art format. You can expect the model to work well when you format your data that way, even without fine-tuning.
  2. ASCII art is also a token-efficient representation: a run of 80 spaces is just one token with the GPT-4 tokenizer, so you end up using fewer tokens than adding coordinate tokens to each line.

For example,
Model name: [1,5]
Gemini Pro [80,5]

can be converted to
Model name: (40 spaces) Gemini Pro
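A rough Python sketch of that conversion, assuming segments with coordinates normalized to [0, 1] (the grid size is an arbitrary choice here):

```python
def render_ascii(segments, page_width=120, page_height=60):
    # segments: [(text, x, y)]; place each one on a character grid so that
    # horizontal and vertical alignment survive without coordinate tokens.
    grid = [[" "] * page_width for _ in range(page_height)]
    for text, x, y in segments:
        row = min(int(y * page_height), page_height - 1)
        col = min(int(x * page_width), page_width - 1)
        for i, ch in enumerate(text):
            if col + i < page_width:
                grid[row][col + i] = ch  # later segments may overwrite overlaps
    # Keep only non-empty rows and strip trailing spaces to save tokens.
    return "\n".join("".join(r).rstrip() for r in grid if "".join(r).strip())

print(render_ascii([("Model name:", 0.01, 0.05), ("Gemini Pro", 0.66, 0.05)]))
# Both segments land on the same row, separated by a run of spaces.
```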

Hey @orby-yanan, would you be able to shed some light on your discoveries? What worked, what didn't, and what's the more efficient approach you're talking about?

A fine-tuned Mistral-7b is able to extract entities from documents using a prompt similar to what LMDX described.
We also tried adding coordinate tokens to each line and shuffling the lines to see whether the fine-tuned Mistral-7b could still extract entities correctly from shuffled lines. Unfortunately, it didn't work.
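For anyone wanting to reproduce that check, the probe is simple to set up. In this sketch, extract_fn is a hypothetical stand-in for a call to the fine-tuned model:

```python
import random

def shuffle_probe(coord_lines, extract_fn, seed=0):
    # coord_lines: serialized lines that already carry coordinate tokens,
    # e.g. "Total: $40.00 70|91". If the model truly reads the coordinates
    # rather than the serialization order, shuffling the lines should not
    # change what it extracts.
    baseline = extract_fn("\n".join(coord_lines))
    shuffled = list(coord_lines)
    random.Random(seed).shuffle(shuffled)
    return baseline == extract_fn("\n".join(shuffled))
```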

GPT-4, on the other hand, can leverage coordinate tokens with proper prompting. One explanation is that a 7b model could be too small to learn to process coordinate tokens. For the more efficient approach, we will need to run more experiments, and then we can consider uploading a preprint.

Hi, thank you for the nice paper. I was trying to run the PyPI package (https://pypi.org/project/lmdx-flow/) and got stuck at this point:
"answers = P.postprocess_all_chunks(llm_responses)"
What format of "llm_responses" does the pipeline expect? Would it be possible to share a notebook showing how the generated responses from Mistral-7b can be used as input, for example? Thanks in advance.

Hi,

I'm trying to use the lmdx-flow package (https://pypi.org/project/lmdx-flow/) for document information extraction and I'm encountering an issue with the expected format of llm_responses in the P.postprocess_all_chunks function.

answers = P.postprocess_all_chunks(llm_responses)

Unfortunately, neither the documentation nor the error messages provide sufficient clarity on the exact format this function expects for llm_responses. Could you please explain the required structure and data types for this variable?

Thanks in advance!

GPT-4, on the other hand, can leverage coordinate tokens with proper prompting. One explanation is that a 7b model could be too small to learn to process coordinate tokens. For the more efficient approach, we will need to run more experiments, and then we can consider uploading a preprint.

Can you share details about how to design the prompt to utilize coordinate information? Thanks.

It was a few-shot prompt; we just append coarse coordinates at the end of each line.
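Concretely, the formatting looks something like this sketch (the bucket count and the coordinate syntax are illustrative, not the exact prompt we used):

```python
def append_coarse_coords(lines, buckets=20):
    # lines: [(text, x, y)] with x, y normalized to [0, 1]; each OCR line
    # gets its coarse (bucketed) coordinates appended at the end.
    return "\n".join(f"{text} ({int(x * buckets)}, {int(y * buckets)})"
                     for text, x, y in lines)

print(append_coarse_coords([("Invoice #123", 0.10, 0.05),
                            ("Total: $40.00", 0.70, 0.91)]))
# Invoice #123 (2, 1)
# Total: $40.00 (14, 18)
```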

Has anyone got a reference to a dataset that includes line items as well?
