LayoutLM-Byne-v0.1
The new SOTA in page retrieval from visually rich documents.
We're glad to introduce one of the first document page embedding models, LayoutLM-Byne-v0.1.
With the rise of multimodal LLMs, practitioners increasingly apply models directly to a document instead of pre-processing it first, as earlier RAG pipelines did. This approach is significantly more robust than text-only RAG on a large subset of documents, especially visually rich ones.
On the other hand, there is little research on retrieving the relevant page from a PDF or DOCX document. Most practitioners parse each page into text and apply regular text embeddings to it, losing much of the positional context in the process.
LayoutLM [1] is an excellent fit for this problem because, at its core, it is a regular BERT-like model, yet it can embed positional information about the text alongside the text itself.
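To make the positional input concrete: LayoutLM takes one bounding box per token alongside the token ids, with coordinates expressed on a 0-1000 grid relative to the page size. The sketch below shows that normalization for word boxes coming from an OCR engine or PDF parser. The function name `normalize_bbox`, the page size, and the sample words are illustrative assumptions, not part of the LayoutLM-Byne code.

```python
from typing import List, Tuple

# (x0, y0, x1, y1) in page pixels, top-left origin
Box = Tuple[int, int, int, int]


def normalize_bbox(box: Box, page_width: int, page_height: int) -> List[int]:
    """Rescale a pixel-space box to the 0-1000 grid LayoutLM expects."""
    x0, y0, x1, y1 = box
    return [
        1000 * x0 // page_width,
        1000 * y0 // page_height,
        1000 * x1 // page_width,
        1000 * y1 // page_height,
    ]


# Hypothetical parser output for one page (assumed values, for illustration only)
words = ["Invoice", "Total"]
boxes = [(85, 110, 170, 132), (425, 990, 510, 1012)]
page_width, page_height = 850, 1100

normalized = [normalize_bbox(b, page_width, page_height) for b in boxes]
for word, box in zip(words, normalized):
    print(word, box)
```

Each word is then paired with its normalized box when the page is tokenized, so two pages with identical text but different layouts yield different inputs.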
We fine-tuned the model on the DocVQA [2] dataset; the results show its potential to improve on the current SOTA:
Model | HR@3 | HR@5 | HR@10 |
---|---|---|---|
all-mpnet-base-v2 | 0.2500 | 0.2900 | 0.3600 |
gte-base-en-v1.5 | 0.3454 | 0.3899 | 0.4554 |
snowflake-arctic-embed-m-v1.5 | 0.3548 | 0.4042 | 0.4573 |
LayoutLM-Byne (our model) | 0.3491 | 0.4269 | 0.5436 |
Improvement over best competitor | -1.61% | +5.62% | +18.87% |
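The table can be read with the following minimal sketch. It assumes that HR@k means hit rate at k, that is, the fraction of queries whose correct page appears among the top k retrieved pages; the model card does not spell out the definition. The helper names and the toy rankings are hypothetical. The last row of the table is the relative change against the best competing score in each column.

```python
from typing import List, Sequence


def hit_rate_at_k(rankings: Sequence[Sequence[int]], targets: Sequence[int], k: int) -> float:
    """Fraction of queries whose target page id is within the top-k ranked page ids."""
    hits = sum(1 for ranking, target in zip(rankings, targets) if target in ranking[:k])
    return hits / len(targets)


def relative_improvement(ours: float, best_competitor: float) -> float:
    """Relative change in percent, as in the last row of the table."""
    return 100 * (ours - best_competitor) / best_competitor


# Toy example (assumed data): three queries, each with a ranked list of page ids
rankings: List[List[int]] = [[3, 1, 7], [2, 5, 9], [4, 8, 6]]
targets = [1, 9, 0]
print(hit_rate_at_k(rankings, targets, k=2))

# Reproducing the last table row from the reported scores
print(round(relative_improvement(0.3491, 0.3548), 2))  # HR@3
print(round(relative_improvement(0.4269, 0.4042), 2))  # HR@5
print(round(relative_improvement(0.5436, 0.4573), 2))  # HR@10
```

Note that the best competitor differs only by score, not by model, per column: here snowflake-arctic-embed-m-v1.5 has the highest competing value at every cutoff.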
It is important to highlight that the model is still in alpha, so further work is required to reveal its potential.
Usage
Please refer to the Colab notebook or the blog post to learn more!
Get in touch
Reach out to borys.nadykto@bynesoft.com if you'd like help with deploying the model in a commercial setting.
[1] Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., & Zhou, M. (2020). LayoutLM: Pre-training of Text and Layout for Document Image Understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1192-1200).
[2] Mathew, M., Karatzas, D., & Jawahar, C. V. (2021). DocVQA: A Dataset for VQA on Document Images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 2200-2209).