Floating objects (floats) are common in academic literature and formal publications. In LaTeX, a floating object is a container that can hold text, images, tables, code, algorithms, and similar content. Current Document Layout Analysis (DLA) models tend to treat such elements coarsely, which makes fine-grained layout analysis difficult. To address this, models from the YOLO11 series (five sizes: n, s, m, l, x) and the RT-DETR series (sizes l and x) have been trained with the Ultralytics framework on the Floating Layout Detection (FLD) and Floating Structure Analysis (FSA) datasets. These models automatically detect and analyze floating objects in document images. The model weights and training parameters are publicly available to support further research.
Download Methods
You can download the models either via Git or huggingface-cli.
git clone https://huggingface.co/irhawks/floating-det
# or
git clone https://huggingface.co/irhawks/floating-fsa
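The same repositories can also be fetched from Python. The sketch below assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function. The local folder layout (`weights/<repo name>`) and the helper names are arbitrary choices made for this example, not something the repositories prescribe.

```python
from pathlib import Path

# Repository ids taken from the git URLs above.
MODEL_REPOS = ("irhawks/floating-det", "irhawks/floating-fsa")


def local_dir_for(repo_id: str, root: str = "weights") -> Path:
    """Return the (hypothetical) local folder used for a repository."""
    return Path(root) / repo_id.split("/")[-1]


def download_all(root: str = "weights") -> list:
    """Download every model repository and return the local folders."""
    # Imported lazily so the helper above works without huggingface_hub.
    from huggingface_hub import snapshot_download

    paths = []
    for repo_id in MODEL_REPOS:
        target = local_dir_for(repo_id, root)
        snapshot_download(repo_id=repo_id, local_dir=str(target))
        paths.append(target)
    return paths
```

Calling `download_all()` requires network access and places each repository in its own folder under `weights/`.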
Model Overview
Floating objects are frequently used in academic papers and books to organize content. In LaTeX, floating objects can include images, tables, code blocks, or algorithms, and LaTeX positions these containers automatically to fit the page layout. To aid indexing and readability, floating objects are often accompanied by metadata such as type, numbering, and captions. In complex cases, a floating object may contain multiple sub-elements, each with its own label or number.
The following are common types of floating objects in LaTeX:
- Figures: Used to contain images, typically defined within the \begin{figure} and \end{figure} environments.
- Tables: Used to contain tabular data, defined within the \begin{table} and \end{table} environments.
- Algorithms: Used to describe algorithms, typically utilizing packages such as algorithm, algorithm2e, or algorithmicx.
- Code: Used to display code blocks, usually defined with packages like listings or minted.
Traditional DLA tasks have significant limitations, often handling only a few elements such as tables and figures independently. This approach leads to several issues:
- Lack of structural coherence: The relationship between the title and the main content of the floating object is often overlooked.
- Limited element types: Elements such as code blocks and algorithm blocks are often misrecognized as multiple paragraphs, reducing accuracy.
- Poor adaptability: Existing DLA models struggle to handle complex layouts, such as sub-pages and nested elements, reducing their robustness.
To address these challenges, we introduce two tasks:
- Floating Layout Detection (FLD): This task aims to detect the location and type of floating objects in document images, covering five types: figures, tables, algorithms, code, and others.
- Floating Structure Analysis (FSA): This task focuses on detecting the sub-structures within floating objects. When sub-elements such as sub-figures or sub-tables are present, the model can identify their positions and types, along with their corresponding captions. The six types of sub-elements include figures, tables, algorithms, code, captions, and others.
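The two label sets, and the way FSA results relate to a page, can be sketched as follows. This is a minimal illustration, not the released models' actual configuration: the class order, the English label strings, and the assumption that FSA runs on a crop of each detected floating object are all hypothetical, so check the class names shipped with the weights before relying on any index.

```python
from dataclasses import dataclass

# Label sets as listed above; the index order is an assumption.
FLD_CLASSES = ("figure", "table", "algorithm", "code", "other")
FSA_CLASSES = ("figure", "table", "algorithm", "code", "caption", "other")


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixels: (x1, y1) top-left, (x2, y2) bottom-right."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def to_page_coords(sub: Box, parent: Box) -> Box:
    """Shift a box found inside a cropped float back into page coordinates."""
    return Box(sub.label, sub.x1 + parent.x1, sub.y1 + parent.y1,
               sub.x2 + parent.x1, sub.y2 + parent.y1)


def split_captions(subs):
    """Separate caption boxes from content boxes (sub-figures, sub-tables, ...)."""
    captions = [b for b in subs if b.label == "caption"]
    content = [b for b in subs if b.label != "caption"]
    return captions, content


# Example: a figure float at (100, 200) containing one sub-figure and its caption.
float_box = Box("figure", 100, 200, 500, 600)
subs_in_crop = [Box("figure", 10, 10, 190, 300), Box("caption", 10, 310, 190, 340)]
subs_on_page = [to_page_coords(b, float_box) for b in subs_in_crop]
captions, content = split_captions(subs_on_page)
```

Keeping captions separate from content boxes makes it straightforward to pair each sub-element with its caption in a later step, for example by vertical proximity.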
By training the YOLO11 and RT-DETR series models on the FLD and FSA datasets using the Ultralytics framework, we achieve automated floating object detection and analysis. To facilitate further research, the weights and training parameters are also publicly available.
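A training run of this kind might look like the sketch below. It assumes the `ultralytics` package is installed and uses its `YOLO` and `RTDETR` classes with the stock pretrained checkpoints (`yolo11n.pt`, `rtdetr-l.pt`, and so on). The dataset file `fld.yaml`, the epoch count, and the image size are placeholders; the parameters actually used are the ones published alongside the weights.

```python
# Model families and sizes named in this README.
VARIANTS = {"yolo11": ("n", "s", "m", "l", "x"), "rtdetr": ("l", "x")}


def pretrained_name(family: str, size: str) -> str:
    """Return the Ultralytics checkpoint name for a family/size pair."""
    if size not in VARIANTS.get(family, ()):
        raise ValueError(f"unsupported variant: {family} {size}")
    # Ultralytics names checkpoints "yolo11n.pt" but "rtdetr-l.pt".
    sep = "-" if family == "rtdetr" else ""
    return f"{family}{sep}{size}.pt"


def train(family: str, size: str, data_yaml: str = "fld.yaml"):
    """Fine-tune one variant; data_yaml and the settings below are placeholders."""
    # Imported lazily so pretrained_name() works without ultralytics installed.
    from ultralytics import RTDETR, YOLO

    model_cls = RTDETR if family == "rtdetr" else YOLO
    model = model_cls(pretrained_name(family, size))
    return model.train(data=data_yaml, epochs=100, imgsz=1024)
```

For example, `train("yolo11", "n")` would start from `yolo11n.pt`, and `train("rtdetr", "l", data_yaml="fsa.yaml")` would do the same for an RT-DETR model on a hypothetical FSA dataset file.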
The datasets are accessible at the following URLs:
Floating Detection Dataset:
- Hugging Face: https://huggingface.co/datasets/irhawks/floating-det
- ModelScope: https://modelscope.cn/datasets/irhawks/floating-det
Floating Structure Analysis Dataset:
- Hugging Face: https://huggingface.co/datasets/irhawks/floating-fsa
- ModelScope: https://modelscope.cn/datasets/irhawks/floating-fsa
Training Results
Even with the smallest YOLO11n model, satisfactory results can be achieved. Larger models have been released for research purposes.
Additional PR Curves for Other Models
- YOLO11s: