---
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- vision
- layout-analysis
- object-detection
datasets:
- ds4sd/DocLayNet-v1.1
base_model:
- microsoft/Florence-2-large-ft
---
# Florence-2-DocLayNet-Fixed

## Model Summary
We finetuned the [Florence-2-large-ft](https://huggingface.co/microsoft/Florence-2-large-ft) model on the [DocLayNet-v1.1](https://huggingface.co/datasets/ds4sd/DocLayNet-v1.1) dataset. To prevent the model from generating hallucinated class names, we re-mapped all class names to single tokens:
| Original Class Names | New Class Names |
|---|---|
| Caption | Cap |
| Footnote | Footnote |
| Formula | Math |
| List-item | List |
| Page-footer | Bottom |
| Page-header | Header |
| Picture | Picture |
| Section-header | Section |
| Table | Table |
| Text | Text |
| Title | Title |
By applying this simple change, we observed a 7% improvement in the mAP50-95 score on the DocLayNet test set. Training and inference were also faster, since the new class names consume fewer tokens.
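For illustration, here is a minimal sketch (our own, not part of the training code) that expresses the re-mapping above as a dict and checks how many tokens each name consumes under the model's tokenizer; the `CLASS_MAP` name is ours:

```python
from transformers import AutoProcessor

# Re-mapping from original DocLayNet class names to the names used by this model
CLASS_MAP = {
    "Caption": "Cap",
    "Footnote": "Footnote",
    "Formula": "Math",
    "List-item": "List",
    "Page-footer": "Bottom",
    "Page-header": "Header",
    "Picture": "Picture",
    "Section-header": "Section",
    "Table": "Table",
    "Text": "Text",
    "Title": "Title",
}

processor = AutoProcessor.from_pretrained(
    "yifeihu/Florence-2-DocLayNet-Fixed", trust_remote_code=True
)

# Compare token counts of the original and re-mapped class names
for old, new in CLASS_MAP.items():
    n_old = len(processor.tokenizer.tokenize(old))
    n_new = len(processor.tokenizer.tokenize(new))
    print(f"{old!r}: {n_old} token(s) -> {new!r}: {n_new} token(s)")
```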
Judging by mAP50-95, this model is far from SOTA on the DocLayNet test set (70%). Much smaller YOLO models ([github.com/ppaanngggg/yolo-doclaynet](https://github.com/ppaanngggg/yolo-doclaynet)) have much better benchmark results (~79%). On the subset of scientific articles, however, this model performed on par with the best YOLO models (87% mAP50-95).
However, after some qualitative analysis (paper coming soon), we found that Florence-2 is much better at drawing bounding boxes with clean edges. YOLO models sometimes cut text in the middle or draw multiple bounding boxes on the same object. These behaviors are not heavily penalized by mAP50-95 but are painful to deal with in real-world use cases. Note that Florence-2 does not emit confidence scores, so when calculating the mAP scores we manually set the confidence to 1 for all Florence-2 outputs.
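As a concrete illustration, a minimal sketch (our own helper, not from the evaluation code) of how Florence-2 `<OD>` output can be turned into score-carrying detections for mAP evaluation, with every detection given a fixed score of 1.0:

```python
def to_detections(parsed_answer: dict) -> list[dict]:
    # parsed_answer comes from processor.post_process_generation with task="<OD>":
    # {'<OD>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': [...]}}
    od = parsed_answer["<OD>"]
    return [
        # Florence-2 outputs no confidence, so we assign a constant 1.0
        {"bbox": bbox, "label": label, "score": 1.0}
        for bbox, label in zip(od["bboxes"], od["labels"])
    ]
```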
We release the finetuned model weights for the community to further investigate related research topics.
## How to Get Started with the Model
Use the code below to get started with the model.
For non-CUDA environments, please check out this post for a simple patch: https://huggingface.co/microsoft/Florence-2-base/discussions/4
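For reference, a minimal sketch of that workaround (adapted from the linked discussion; the helper name `fixed_get_imports` is ours). It strips the `flash_attn` requirement declared by the remote modeling code so the model can load without CUDA:

```python
from unittest.mock import patch
from transformers import AutoModelForCausalLM
from transformers.dynamic_module_utils import get_imports

def fixed_get_imports(filename):
    # Drop the flash_attn import requested by the remote code
    imports = get_imports(filename)
    if "flash_attn" in imports:
        imports.remove("flash_attn")
    return imports

# Patch the import resolver only while loading the remote code
with patch("transformers.dynamic_module_utils.get_imports", fixed_get_imports):
    model = AutoModelForCausalLM.from_pretrained(
        "yifeihu/Florence-2-DocLayNet-Fixed", trust_remote_code=True
    )
```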
```python
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# Load the finetuned model and its processor
model = AutoModelForCausalLM.from_pretrained("yifeihu/Florence-2-DocLayNet-Fixed", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("yifeihu/Florence-2-DocLayNet-Fixed", trust_remote_code=True)

# <OD> is the object-detection task prompt
prompt = "<OD>"

# Example page from an arXiv paper
url = "https://huggingface.co/yifeihu/TF-ID-base/resolve/main/arxiv_2305_10853_5.png?download=true"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt")

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    do_sample=False,
    num_beams=3
)

# Decode the raw text output, then parse it into bounding boxes and labels
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
parsed_answer = processor.post_process_generation(generated_text, task="<OD>", image_size=(image.width, image.height))

print(parsed_answer)
```
To visualize the results, see this tutorial notebook for more details.
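For a quick look without the notebook, a minimal sketch (our own, using PIL) that draws the predicted boxes on the page; it assumes `image` and `parsed_answer` from the snippet above:

```python
from PIL import ImageDraw

# Draw predicted boxes and labels on a copy of the page image
annotated = image.copy()
draw = ImageDraw.Draw(annotated)
od = parsed_answer["<OD>"]
for bbox, label in zip(od["bboxes"], od["labels"]):
    draw.rectangle(bbox, outline="red", width=3)
    draw.text((bbox[0], bbox[1]), label, fill="red")
annotated.save("annotated_page.png")
```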
## BibTex and citation info
```bibtex
@misc{TF-ID,
  author = {Yifei Hu},
  title = {TF-ID: Table/Figure IDentifier for academic papers},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ai8hyf/TF-ID}},
}
```
```bibtex
@article{doclaynet2022,
  title = {DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis},
  doi = {10.1145/3534678.3539043},
  url = {https://arxiv.org/abs/2206.01062},
  author = {Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S and Staar, Peter W J},
  year = {2022}
}
```