---
datasets:
  - lmms-lab/DocVQA
language:
  - en
library_name: transformers
license: mit
tags:
  - document
---

# LayoutLM-Byne-v0.1

The new SOTA in page retrieval from visually-rich documents.


We're glad to introduce one of the first document page embedding models, LayoutLM-Byne-v0.1.

With the rise of multimodal LLMs, practitioners increasingly apply models directly to documents, without the text-extraction pre-processing step that classic RAG requires. This approach is significantly more robust than text-only RAG on a large subset of documents, especially visually rich ones.

On the other hand, there is little research focused on retrieving the relevant page from a PDF or DOCX document. Most practitioners parse each page into plain text and apply regular text embeddings, losing much of the positional context in the process.

LayoutLM [1] is an excellent fit for this problem: at its core it is a regular BERT-like model, but it is uniquely capable of embedding positional information about the text alongside the text itself.

We have fine-tuned the model on the DocVQA [2] dataset, showing a clear improvement over current SOTA text embedding models [3]:

| Model | HR@3 | HR@5 | HR@10 |
| --- | --- | --- | --- |
| all-mpnet-base-v2 | 0.2500 | 0.2900 | 0.3600 |
| gte-base-en-v1.5 | 0.3454 | 0.3899 | 0.4554 |
| snowflake-arctic-embed-m-v1.5 | 0.3548 | 0.4042 | 0.4573 |
| LayoutLM-Byne (our model) | 0.3491 | 0.4269 | 0.5436 |
| Improvement over best competitor | -1.61% | +5.62% | +18.87% |
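
For reference, HR@k denotes hit rate at k: the share of queries whose gold page appears among the top-k retrieved pages. The sketch below shows our reading of that computation on toy data; it is an illustration, not the evaluation script behind the table.

```python
# Hit Rate@k: fraction of queries whose gold page is in the top-k results.
from typing import Sequence

def hit_rate_at_k(ranked_pages: Sequence[Sequence[int]],
                  gold_pages: Sequence[int],
                  k: int) -> float:
    """ranked_pages[i] is the retrieval ranking for query i; gold_pages[i] is its correct page."""
    hits = sum(gold in ranking[:k] for ranking, gold in zip(ranked_pages, gold_pages))
    return hits / len(gold_pages)

# Toy example: 2 of 3 queries find their gold page in the top 3.
rankings = [[4, 1, 7], [2, 9, 5], [3, 0, 8]]
golds = [1, 6, 3]
print(hit_rate_at_k(rankings, golds, k=3))  # 0.666...
```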

## Usage

Please refer to the Colab workbook or the blog post to learn more!
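
In the meantime, the snippet below sketches one plausible way to embed a page with a LayoutLM-family checkpoint through transformers: OCR'd words and their bounding boxes go in, and a pooled page vector comes out. The checkpoint ID (a stock LayoutLM stand-in, not this model's hub ID) and the mean pooling are assumptions for illustration, not the authors' reference pipeline; see the Colab workbook for the canonical usage.

```python
# A minimal sketch of embedding one document page with LayoutLM via transformers.
import torch
from transformers import LayoutLMModel, LayoutLMTokenizer

# Stand-in checkpoint so the sketch runs; substitute this model's actual hub ID.
MODEL_ID = "microsoft/layoutlm-base-uncased"

tokenizer = LayoutLMTokenizer.from_pretrained(MODEL_ID)
model = LayoutLMModel.from_pretrained(MODEL_ID)

# Words and bounding boxes (normalized to LayoutLM's 0-1000 grid), e.g. from OCR.
words = ["Invoice", "Number", "12345"]
boxes = [[60, 50, 200, 80], [210, 50, 350, 80], [360, 50, 470, 80]]

# Tokenize each word and repeat its box for every sub-token.
tokens, token_boxes = [], []
for word, box in zip(words, boxes):
    word_tokens = tokenizer.tokenize(word)
    tokens.extend(word_tokens)
    token_boxes.extend([box] * len(word_tokens))

# Add special tokens with their conventional boxes.
input_ids = tokenizer.convert_tokens_to_ids(
    [tokenizer.cls_token] + tokens + [tokenizer.sep_token]
)
token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]

input_ids = torch.tensor([input_ids])
bbox = torch.tensor([token_boxes])
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    outputs = model(input_ids=input_ids, bbox=bbox, attention_mask=attention_mask)

# Mean-pool token embeddings into a single page-level vector
# (the pooling choice here is an assumption).
page_embedding = outputs.last_hidden_state.mean(dim=1)
print(page_embedding.shape)  # torch.Size([1, 768])
```

Pages embedded this way can then be ranked against a query embedding (e.g. by cosine similarity) to retrieve the most relevant page.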

## Get in touch

Reach out to borys.nadykto@bynesoft.com if you'd like help with deploying the model in a commercial setting.

[1] Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., & Zhou, M. (2020). LayoutLM: Pre-training of Text and Layout for Document Image Understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1192-1200).

[2] Mathew, M., Karatzas, D., & Jawahar, C. V. (2021). DocVQA: A Dataset for VQA on Document Images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 2200-2209).

[3] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 3982-3992).