shixuanleong committed · verified
Commit 3743383 · 1 Parent(s): 4c1bf84

Update README.md

Files changed (1):
  1. README.md +29 -75
README.md CHANGED
@@ -6,103 +6,57 @@ tags:
  - vision
  - ocr
  - segmentation
- datasets:
- - yifeihu/TF-ID-arxiv-papers
  ---
-
- # TF-ID: Table/Figure IDentifier for academic papers

  ## Model Summary

- TF-ID (Table/Figure IDentifier), created by [Yifei Hu](https://x.com/hu_yifei), is a family of object detection models finetuned to extract tables and figures from academic papers. They come in four versions:
- | Model | Model size | Model Description |
- | ------- | ------------- | ------------- |
- | TF-ID-base[[HF]](https://huggingface.co/yifeihu/TF-ID-base) | 0.23B | Extract tables/figures and their caption text |
- | TF-ID-large[[HF]](https://huggingface.co/yifeihu/TF-ID-large) (Recommended) | 0.77B | Extract tables/figures and their caption text |
- | TF-ID-base-no-caption[[HF]](https://huggingface.co/yifeihu/TF-ID-base-no-caption) | 0.23B | Extract tables/figures without caption text |
- | TF-ID-large-no-caption[[HF]](https://huggingface.co/yifeihu/TF-ID-large-no-caption) (Recommended) | 0.77B | Extract tables/figures without caption text |
- All TF-ID models are finetuned from [microsoft/Florence-2](https://huggingface.co/microsoft/Florence-2-large-ft) checkpoints.

- - The models were finetuned with papers from Hugging Face Daily Papers. All bounding boxes are manually annotated and checked by humans.
- - TF-ID models take an image of a single paper page as the input, and return bounding boxes for all tables and figures in the given page.
- - TF-ID-base and TF-ID-large draw bounding boxes around tables/figures and their caption text.
- - TF-ID-base-no-caption and TF-ID-large-no-caption draw bounding boxes around tables/figures without their caption text.

- **Large models are always recommended!**

- ![image/png](https://huggingface.co/yifeihu/TF-ID-base/resolve/main/td-id-caption.png)
 

- Object Detection results format:
- {'\<OD>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['label1', 'label2', ...]} }

  ## Training Code and Dataset
- - Dataset: [yifeihu/TF-ID-arxiv-papers](https://huggingface.co/datasets/yifeihu/TF-ID-arxiv-papers)
- - Code: [github.com/ai8hyf/TF-ID](https://github.com/ai8hyf/TF-ID)

  ## Benchmarks

- We tested the models on paper pages outside the training dataset. The papers are a subset of Hugging Face Daily Papers.
-
- Correct output - the model draws correct bounding boxes for every table/figure in the given page.
-
- | Model | Total Images | Correct Output | Success Rate |
- |---------------------------------------------------------------|--------------|----------------|--------------|
- | TF-ID-base[[HF]](https://huggingface.co/yifeihu/TF-ID-base) | 258 | 251 | 97.29% |
- | TF-ID-large[[HF]](https://huggingface.co/yifeihu/TF-ID-large) | 258 | 253 | 98.06% |
-
- | Model | Total Images | Correct Output | Success Rate |
- |---------------------------------------------------------------|--------------|----------------|--------------|
- | TF-ID-base-no-caption[[HF]](https://huggingface.co/yifeihu/TF-ID-base-no-caption) | 261 | 253 | 96.93% |
- | TF-ID-large-no-caption[[HF]](https://huggingface.co/yifeihu/TF-ID-large-no-caption) | 261 | 254 | 97.32% |
-
- Depending on the use case, some "incorrect" outputs can still be usable. For example, the model may draw two bounding boxes for one figure with two child components.
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- ```python
- import requests
- from PIL import Image
- from transformers import AutoProcessor, AutoModelForCausalLM
-
- model = AutoModelForCausalLM.from_pretrained("yifeihu/TF-ID-base", trust_remote_code=True)
- processor = AutoProcessor.from_pretrained("yifeihu/TF-ID-base", trust_remote_code=True)
-
- prompt = "<OD>"
-
- url = "https://huggingface.co/yifeihu/TF-ID-base/resolve/main/arxiv_2305_10853_5.png?download=true"
- image = Image.open(requests.get(url, stream=True).raw)
-
- inputs = processor(text=prompt, images=image, return_tensors="pt")
-
- generated_ids = model.generate(
-     input_ids=inputs["input_ids"],
-     pixel_values=inputs["pixel_values"],
-     max_new_tokens=1024,
-     do_sample=False,
-     num_beams=3
- )
- generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
-
- parsed_answer = processor.post_process_generation(generated_text, task="<OD>", image_size=(image.width, image.height))
-
- print(parsed_answer)
- ```
-
- To visualize the results, see [this tutorial notebook](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/how-to-finetune-florence-2-on-detection-dataset.ipynb) for more details.

  ## BibTex and citation info

  ```
- @misc{TF-ID,
-   author = {Yifei Hu},
-   title = {TF-ID: Table/Figure IDentifier for academic papers},
-   year = {2024},
-   publisher = {GitHub},
-   journal = {GitHub repository},
-   howpublished = {\url{https://github.com/ai8hyf/TF-ID}},
- }
  ```
 
  - vision
  - ocr
  - segmentation
  ---
+ # VisualHeist - figure, scheme and table segmentation from PDFs (with captions, headers & footnotes)

  ## Model Summary

+ VisualHeist is an object detection model finetuned to extract tables and figures from PDFs. It comes in two versions:
+ - visualheist-base[[HF]](https://huggingface.co/shixuanleong/visualheist-base) (0.23B)
+ - visualheist-large[[HF]](https://huggingface.co/shixuanleong/visualheist-large) (0.77B)

+ **The base model is recommended if you are running it on low-RAM systems.**

+ The models are finetuned from [microsoft/Florence-2](https://huggingface.co/microsoft/Florence-2-large-ft) checkpoints. VisualHeist is inspired by and adapted from [yifeihu/TF-ID](https://huggingface.co/yifeihu/TF-ID-large).

+ - The models were finetuned with 3435 figures and 1716 tables from 110 PDF articles across various publishers. All bounding boxes were manually annotated using [CoCo Annotator](https://github.com/jsbroks/coco-annotator).
+ - VisualHeist models take an image of a single paper page as the input, and return image files for all figures, schemes and tables in the given page.
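Under the hood, Florence-2-based detectors like these return detections as a dict of the form `{'<OD>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': [...]}}` (the format shown in the TF-ID card this model is adapted from). A minimal sketch of pairing each label with its pixel box before cropping — the coordinate values below are invented for illustration:

```python
def pair_detections(parsed, task="<OD>"):
    """Pair each label with its (x1, y1, x2, y2) box from a
    Florence-2-style object-detection result dict."""
    result = parsed[task]
    return list(zip(result["labels"], (tuple(b) for b in result["bboxes"])))

# Illustrative result for one page (coordinates are made up):
example = {"<OD>": {"bboxes": [[12.0, 40.0, 580.0, 330.0],
                               [30.0, 350.0, 560.0, 700.0]],
                    "labels": ["figure", "table"]}}
for label, box in pair_detections(example):
    print(label, box)
```

Each `(label, box)` pair can then be cropped out of the page image to produce the per-figure image files described above.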

  ## Training Code and Dataset
+ - Dataset: [Zenodo repository](https://doi.org/10.5281/zenodo.14917752)
+ - Code: [github.com/aspuru-guzik-group/MERMaid](https://github.com/aspuru-guzik-group/MERMaid)

  ## Benchmarks

+ We manually curated a diverse evaluation dataset consisting of 121 literature articles covering a range of topics, including
+ organic and inorganic chemistry, atmospheric science, batteries, materials science, metal-organic frameworks (MOFs), biology,
+ and science education. These PDFs, published between 1949 and 2025, include both main articles and supplementary materials.

+ We also curated another collection of 98 literature articles (MERMaid-100) reporting novel reaction methodologies that span
+ three distinct chemical domains: organic electrosynthesis, photocatalysis, and organic synthesis.

+ Additional performance discussion can be found in our [preprint article](XXXXXXX).

+ The full DOI lists can be downloaded from our [Zenodo repository](https://doi.org/10.5281/zenodo.14917752).
+ The evaluation results for visualheist-large are:
+ | Subset | Total Images | F1 score |
+ |---------------------------------------------------------------|--------------|----------------|
+ | All | 1935 | 93% |
+ | Main | 423 | 96% |
+ | pre-2000 | 260 | 93% |
+ | Supplementary Materials | 1252 | 92% |
+ | MERMaid-100 | 100 | 99% |

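For reference, the F1 score used above is the standard harmonic mean of precision and recall over detected regions; a quick sketch of the definition (the counts are placeholders, not numbers from this evaluation):

```python
def f1_score(tp, fp, fn):
    """Standard F1: harmonic mean of precision (tp/(tp+fp))
    and recall (tp/(tp+fn))."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g. 90 true positives, 10 false positives, 10 false negatives
print(round(f1_score(90, 10, 10), 2))
```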
+ ## Running the Model

+ Refer to our [github repository](https://github.com/aspuru-guzik-group/MERMaid) for detailed instructions on how to run the model.

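The MERMaid repository is the supported entry point, but since the checkpoints are Florence-2 finetunes, single-page inference presumably mirrors the snippet in the TF-ID card this model is adapted from. A sketch under that assumption — the model id default, the lazy imports, and the function name are illustrative, and `transformers` plus `Pillow` are required:

```python
PROMPT = "<OD>"  # Florence-2 object-detection task token

def detect_page(image, model_id="shixuanleong/visualheist-large"):
    """Run one page image through a VisualHeist checkpoint and return the
    parsed detection dict: {'<OD>': {'bboxes': [...], 'labels': [...]}}."""
    # Imported lazily so the sketch can be read without transformers installed.
    from transformers import AutoModelForCausalLM, AutoProcessor

    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    inputs = processor(text=PROMPT, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        do_sample=False,
        num_beams=3,
    )
    text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(
        text, task=PROMPT, image_size=(image.width, image.height)
    )
```

Cropping each returned box out of the page image and saving it would then yield the per-figure image files the model summary describes.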

  ## BibTex and citation info

  ```
+ <To be updated with our archive citation>
  ```