Multimodal Augmentation for Documents: Recovering “Comprehension” in the “Reading and Comprehension” Task
We are excited to share an update on the ongoing advancements in document AI at Hugging Face. Following the recent release of our two largest document datasets, IDL and PDFa, we are intensively leveraging the annotations within these datasets, particularly the locations of the text lines. The Document AI team (Pablo Montalvo, Ross Wightman, Dana Aubakirova) is developing a data augmentation pipeline for documents. This pipeline is uniquely designed to modify both visual and textual features, enhancing models' ability to process and understand complex documents at the pre-training stage.
Introduction
The need for data augmentation for documents arises from the differences between real-world documents and their digital scans. Real documents often undergo alterations through processes like printing, scanning, or photocopying. These physical changes can introduce visual distortions—such as folds, wrinkles, tears, ink variations, and annotations—that complicate digital interpretation.
Figure 1: Samples from the IDL dataset featuring inclined orientation, ink variations, handwritten annotations, and noise introduced by scanning.
The performance of many machine learning tasks involving documents, such as document classification and information extraction, is significantly affected by this noise. Despite this, these high-level tasks are often expected to operate effectively on such noisily scanned documents. To address this, numerous methods have been developed to create document images that incorporate realistic noise artifacts and degradations, typically resulting from scanning, photocopying, and other office procedures. These efforts aim to assess and enhance models' robustness and reading ability.
Another research avenue involves generating synthetic data to address the scarcity of annotated data and circumvent the complexity posed by real-world data. For example, WebSight is a synthetic dataset developed to train AI models to process and translate visual web designs into functional HTML code. By focusing on synthetic data, its authors bypassed the noise and complexity often found in real-world HTML, allowing AI models to learn efficiently. This concept extends to the domain of document processing as well. SynthDog, for example, synthesizes documents using backgrounds from ImageNet, textures from a database of paper photographs, and text from Wikipedia. The layout for these documents is generated by a simple rule-based algorithm that arranges the content into randomly stacked grids. To enhance realism, various image rendering techniques are applied, making the documents appear lifelike and easier for models to interpret and analyze. Another example, Genalog, is an open-source, cross-platform Python package designed for generating images of documents with synthetic noise, emulating the look of scanned analog documents. This tool allows for the addition of various text degradations, increasing the challenge for document analysis models.
The standard approach for document understanding tasks involves initially pre-training the model to read text, followed by fine-tuning it to understand the document's content depending on the task. All the data augmentation efforts described previously involve either altering existing images or generating entirely new documents. In the first case, only the images are altered, as illustrated in Figure 2 below, while the text remains unchanged.
Figure 2: A set of random augmentations applied to the document using the image transformations provided by Albumentations.
Our focus is on combining various text augmentation techniques and image rendering processes. This approach allows us to create documents with distorted layouts and semantically similar content, along with appropriately modified annotations. The primary objective is to ensure that the model can fully comprehend complex document layouts, font variations, and modified contexts.
We hypothesize that, similar to humans, the abilities to read and understand should be developed concurrently. Consequently, it is essential to design targeted data augmentation methods that enhance the model's resilience to visual text distortions and enable accurate interpretation of syntax, word order, and overall context, preserving semantic integrity right from the pretraining stage.
A closely related approach is seen in the pre-training stages of Pix2Struct, which involve recovering the masked parts of the parse, illustrated in Figure 3, similar to masked language modeling. Notably, Pix2Struct-Large significantly outperforms the previous visual state of the art, Donut, on DocVQA by 9 points. Unlike top single-task methods like UDOP, which utilize an off-the-shelf OCR system, pre-trained encoders, and additional pre-training, Pix2Struct relies solely on visual representations without in-domain data, yet achieves a competitive ANLS of 76.6. A major distinction in our approach is that we not only mask but also modify the content before re-rendering it.
Figure 3: An example of input-output pairs with masked visual signals in the image during the warmup stage of Pix2Struct, which consists of recovering the masked parts of the parse using the visual context as a cue.
Proposed Method
To augment document images, we begin by randomly selecting lines within the document. A hyperparameter controls the number of lines to be modified. We then identify the text locations using bounding boxes and obscure these areas with a white patch, as shown in Figure 4.
Figure 4: The pipeline starts by randomly selecting and obscuring specific lines of text within each page of the document using white patches, based on a predefined hyperparameter that controls the proportion of text to be modified. After obscuring, each line of text is modified with text augmentation methods chosen at random, and then re-rendered into the original layout. The annotations of the pages are changed correspondingly.
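In practice, the masking step is straightforward. Below is a minimal sketch using Pillow, assuming each page comes with line-level annotations of the form {"text": ..., "bbox": [x0, y0, x1, y1]} (an illustrative schema, not the exact IDL/PDFa annotation format):

```python
import random
from PIL import ImageDraw

def mask_random_lines(image, lines, mask_ratio=0.3, seed=None):
    """Cover a random subset of text lines with white patches.

    `lines` is a list of dicts with "text" and "bbox" keys (illustrative schema).
    Returns the modified image and the masked lines, so that their text can be
    augmented and re-rendered in the next steps.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, int(len(lines) * mask_ratio))
    selected = rng.sample(lines, k=n_to_mask)

    draw = ImageDraw.Draw(image)
    for line in selected:
        x0, y0, x1, y1 = line["bbox"]
        # Paint a white rectangle over the original text.
        draw.rectangle([x0, y0, x1, y1], fill="white")
    return image, selected
```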
Next, we apply to the selected lines one of several text augmentation methods commonly used in text generation tasks. These methods include Random Insertion, Deletion, Swap, and Keyword Replacement. Currently, we employ tools like NLTK and RAKE for these operations. However, there is significant potential for utilizing LLMs to generate these types of text augmentations. Nonetheless, integrating such dynamic augmentations during training could be computationally costly.
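As a rough sketch of what these operations look like, the snippet below implements random deletion, random swap, and a WordNet-based synonym replacement. It is a simplified stand-in for the actual augmentations: random insertion follows the same pattern, and keyword replacement in our pipeline relies on RAKE rather than WordNet.

```python
import random
from nltk.corpus import wordnet  # requires: nltk.download("wordnet")

def random_deletion(words, p=0.1, rng=random):
    # Drop each word with probability p, but never return an empty line.
    kept = [w for w in words if rng.random() > p]
    return kept or words

def random_swap(words, n_swaps=1, rng=random):
    # Swap the positions of two randomly chosen words, n_swaps times.
    words = list(words)
    for _ in range(n_swaps):
        if len(words) < 2:
            break
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def synonym_replacement(words, n_replace=1, rng=random):
    # Replace up to n_replace words with a WordNet synonym.
    words = list(words)
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    for i in rng.sample(candidates, min(n_replace, len(candidates))):
        synonyms = {l.name().replace("_", " ")
                    for s in wordnet.synsets(words[i]) for l in s.lemmas()}
        synonyms.discard(words[i])
        if synonyms:
            words[i] = rng.choice(sorted(synonyms))
    return words

def augment_line(text, rng=random):
    # Pick one operation at random and apply it to the line.
    op = rng.choice([random_deletion, random_swap, synonym_replacement])
    return " ".join(op(text.split(), rng=rng))
```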
After modifying the text, we re-render it onto the document, using the original bounding box size as a proxy for the new text's font size. Corresponding changes are also made to the JSON files. This process results in a dataset with semantically similar textual content and visually distorted images.
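A minimal sketch of this re-rendering step with Pillow is shown below; the font path and the shrink-to-fit heuristic are assumptions for illustration rather than the exact procedure used in the pipeline.

```python
from PIL import ImageDraw, ImageFont

def render_line(image, text, bbox, font_path="DejaVuSans.ttf"):
    """Draw `text` into the masked region `bbox` = [x0, y0, x1, y1].

    The bounding-box height serves as a proxy for the font size; the size is
    reduced until the text also fits horizontally. Returns the updated
    annotation for this line.
    """
    x0, y0, x1, y1 = bbox
    size = max(8, int((y1 - y0) * 0.9))
    draw = ImageDraw.Draw(image)
    font = ImageFont.truetype(font_path, size)
    while size > 8 and draw.textlength(text, font=font) > (x1 - x0):
        size -= 1
        font = ImageFont.truetype(font_path, size)
    draw.text((x0, y0), text, font=font, fill="black")
    new_width = int(draw.textlength(text, font=font))
    return {"text": text, "bbox": [x0, y0, x0 + new_width, y1]}
```

In this sketch, the returned dictionary stands in for the corresponding update to the page's JSON annotations.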
Figure 5: A detailed illustration of the augmented input pairs. The highlighted lines indicate the modified text.
What is next?
As mentioned earlier, reading and thinking are closely intertwined cognitive processes that occur simultaneously. During reading, we engage in multiple cognitive tasks such as word recognition, sentence interpretation, and meaning construction. This involves not only decoding the written words but also activating background knowledge, making inferences, and integrating new information with what we already know. We want the model to identify keywords, understand synonyms, fill in gaps when information is missing, and grasp context despite variations in word order. These capabilities highlight the sophisticated nature of reading comprehension and demonstrate that thinking is not merely adjacent to reading but fundamentally integrated with it, essential for interpreting texts of all types.
A pertinent question then arises regarding the optimal stage in a learning or training process for integrating these cognitive tasks: At what point should thinking be explicitly incorporated? This is crucial for developing a robust model that effectively mimics human reading and comprehension skills.
In this context, we need to evaluate the effectiveness of introducing data augmentation techniques at different stages of model training. Specifically, we are exploring whether it is more beneficial to apply these techniques after the model has been trained on real data, from the very beginning of training, or through a gradual integration throughout the training process. The figure below depicts the expected impact of data augmentation on the model's cognitive task performance under these three scenarios. We are actively analyzing this issue and will provide updates as new findings emerge.
Figure 6: The expected curves for the "cognitive ability" of the model in response to the addition of data augmentation at various pre-training stages.
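For illustration only, the three scenarios can be written down as a simple probability schedule over training steps; this is a hypothetical sketch, not our training code, and the constants are placeholders.

```python
def augmentation_probability(step, total_steps, schedule="gradual", max_p=0.5):
    """Probability of applying the augmentation pipeline to a sample at `step`.

    Three hypothetical schedules: apply augmentation from the start, only after
    a warmup phase on real data, or ramp it in gradually during training.
    """
    if schedule == "from_start":
        return max_p
    if schedule == "after_warmup":
        return max_p if step >= total_steps // 2 else 0.0
    if schedule == "gradual":
        # Linear ramp from 0 to max_p over the first half of training.
        return max_p * min(1.0, step / (0.5 * total_steps))
    raise ValueError(f"unknown schedule: {schedule}")
```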
Figure 7: More examples of the augmented samples. Note the differences in font and font size, which introduce additional challenges.
Acknowledgements
Huge thanks to Pablo Montalvo, Ross Wightman, and Vaibhav Srivastav for their detailed feedback on this blog post.
References
@inproceedings{lee2023pix2struct,
title={Pix2Struct: Screenshot parsing as pretraining for visual language understanding},
author={Lee, Kenton and Joshi, Mandar and Turc, Iulia Raluca and Hu, Hexiang and Liu, Fangyu and Eisenschlos, Julian Martin and Khandelwal, Urvashi and Shaw, Peter and Chang, Ming-Wei and Toutanova, Kristina},
booktitle={International Conference on Machine Learning},
pages={18893--18912},
year={2023},
organization={PMLR}
}
@inproceedings{groleau2023augraphy,
title={Augraphy: A data augmentation library for document images},
author={Groleau, Alexander and Chee, Kok Wei and Larson, Stefan and Maini, Samay and Boarman, Jonathan},
booktitle={International Conference on Document Analysis and Recognition},
pages={384--401},
year={2023},
organization={Springer}
}
@inproceedings{kim2022ocr,
title={OCR-free document understanding transformer},
author={Kim, Geewook and Hong, Teakgyu and Yim, Moonbin and Nam, JeongYeon and Park, Jinyoung and Yim, Jinyeong and Hwang, Wonseok and Yun, Sangdoo and Han, Dongyoon and Park, Seunghyun},
booktitle={European Conference on Computer Vision},
pages={498--517},
year={2022},
organization={Springer}
}
@article{feng2020genaug,
title={GenAug: Data augmentation for finetuning text generators},
author={Feng, Steven Y and Gangal, Varun and Kang, Dongyeop and Mitamura, Teruko and Hovy, Eduard},
journal={arXiv preprint arXiv:2010.01794},
year={2020}
}
@article{laurenccon2024unlocking,
title={Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset},
author={Lauren{\c{c}}on, Hugo and Tronchon, L{\'e}o and Sanh, Victor},
journal={arXiv preprint arXiv:2403.09029},
year={2024}
}