---
library_name: transformers
license: apache-2.0
pipeline_tag: image-segmentation
---
# DiT-base-layout-detection
We present the model cmarkea/dit-base-layout-detection, which extracts layout elements (Text, Picture, Caption, Footnote, etc.) from document images.
It is a fine-tuning of the [dit-base](https://huggingface.co/microsoft/dit-base) model on the [DocLayNet](https://huggingface.co/datasets/ds4sd/DocLayNet)
dataset. The model can jointly predict segmentation masks and bounding boxes for document objects, which makes it well suited for processing document
corpora to be ingested into an ODQA (open-domain question answering) system.
The model extracts 11 entity classes: Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, and Title.
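The same label set (these 11 classes plus a background class) can also be read directly from the checkpoint configuration. The snippet below is a small sketch of how to inspect it, assuming the standard `id2label` mapping is populated in the config.
```python
from transformers import AutoConfig

# Inspect the classes the segmentation head was trained on
# (expected: background plus the 11 DocLayNet entities listed above).
config = AutoConfig.from_pretrained("cmarkea/dit-base-layout-detection")
for idx, label in sorted(config.id2label.items()):
    print(idx, label)
```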
## Performance
In this section, we assess the model's performance on semantic segmentation and object detection separately. No post-processing was applied to the
semantic segmentation output. For object detection, we only applied OpenCV's `findContours`, without any further post-processing.
Semantic segmentation is evaluated with the per-pixel F1-score, and object detection with the
Generalized Intersection over Union (GIoU) and the classification accuracy of the predicted bounding box. The evaluation is conducted on 500 pages from the PDF evaluation
set of DocLayNet.
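As a reminder, GIoU extends the usual IoU with a penalty based on the smallest box enclosing both boxes. The sketch below is a generic implementation of the metric for two boxes in `(x1, y1, x2, y2)` format; it is illustrative only, not the exact evaluation script used here.
```python
def generalized_iou(box_a, box_b):
    """GIoU between two boxes in (x1, y1, x2, y2) format; value in [-1, 1]."""
    # Intersection area
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    # Smallest box enclosing both boxes
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    enclose = (cx2 - cx1) * (cy2 - cy1)

    iou = inter / union
    return iou - (enclose - union) / enclose
```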
| Class | f1-score (x100) | GIoU (x100) | accuracy (x100) |
|:--------------:|:---------------:|:-----------:|:---------------:|
| Background | 94.98 | NA | NA |
| Caption | 75.54 | 55.61 | 72.62 |
| Footnote | 72.29 | 50.08 | 70.97 |
| Formula | 82.29 | 49.91 | 94.48 |
| List-item | 67.56 | 35.19 | 69 |
| Page-footer | 83.93 | 57.99 | 94.06 |
| Page-header | 62.33 | 65.25 | 79.39 |
| Picture | 78.32 | 58.22 | 92.71 |
| Section-header | 69.55 | 56.64 | 78.29 |
| Table | 83.69 | 63.03 | 90.13 |
| Text | 90.94 | 51.89 | 88.09 |
| Title | 61.19 | 52.64 | 70 |
## Benchmark
Now, let's compare the overall performance of this model with that of other models.
| Model | f1-score (x100) | GIoU (x100) | accuracy (x100) |
|:---------------------------------------------------------------------------------------------:|:---------------:|:-----------:|:---------------:|
| cmarkea/dit-base-layout-detection | 90.77 | 56.29 | 85.26 |
| [cmarkea/detr-layout-detection](https://huggingface.co/cmarkea/detr-layout-detection) | 84.23 | 43.84 | 71.98 |
### Direct Use
```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForSemanticSegmentation

img_proc = AutoImageProcessor.from_pretrained(
    "cmarkea/dit-base-layout-detection"
)
model = AutoModelForSemanticSegmentation.from_pretrained(
    "cmarkea/dit-base-layout-detection"
)

# Load the document page to analyze (replace with your own image path).
img = Image.open("page.png").convert("RGB")

with torch.inference_mode():
    input_ids = img_proc(img, return_tensors='pt')
    segmentation = model(**input_ids)
    # One mask per image, with one class id per pixel, resized to the
    # original image size (PIL gives (width, height), hence the reversal).
    segmentation_mask = img_proc.post_process_semantic_segmentation(
        segmentation,
        target_sizes=[img.size[::-1]]
    )
```
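To go from the predicted mask to bounding boxes, the performance section above relies only on OpenCV's `findContours`. The snippet below is a minimal sketch of that post-processing, assuming `segmentation_mask` comes from the code above, that the checkpoint exposes the usual `id2label` mapping, and that class id 0 is the background.
```python
import cv2
import numpy as np

# Class-id map of the first (and only) image, as a NumPy array.
mask = segmentation_mask[0].cpu().numpy().astype(np.uint8)

boxes = []
for class_id in np.unique(mask):
    if class_id == 0:
        # Assumption: id 0 is the background class, which we skip.
        continue
    binary = (mask == class_id).astype(np.uint8)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        boxes.append((model.config.id2label[int(class_id)], (x, y, x + w, y + h)))

print(boxes)  # e.g. [("Text", (x1, y1, x2, y2)), ...]
```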
### Citation
```
@online{DeDitLay,
AUTHOR = {Cyrile Delestre},
URL = {https://huggingface.co/cmarkea/dit-base-layout-detection},
YEAR = {2024},
KEYWORDS = {Image Processing ; Transformers ; Layout},
}
```