metadata

license: cc-by-4.0
datasets:
  - CATMuS/medieval-segmentation
pipeline_tag: object-detection
tags:
  - medieval
  - manuscript

Florence 2 Medieval Zone Object Detection

This is Microsoft's Florence-2 model fine-tuned for 10 epochs on the CATMuS Medieval Segmentation dataset with a learning rate of 1e-6. This model would not be possible without the numerous annotators behind the various datasets available on HTR-United (see the dataset for details). Special thanks to Thibault Clérice, who converted the original CATMuS dataset (for HTR) into a segmentation dataset.

Model Details

Developed by: William J.B. Mattingly
License: CC-BY 4.0
Finetuned from model: Florence-2-base-ft

Labels

The following table lists each label, marks the labels used to train this model, gives the number of instances of each label in the train, validation, and test splits (an image can contain several instances of the same label), and defines each label following the original documentation.

Label	Zone	Line	Train Count	Validation Count	Test Count	Definition
DefaultLine		✓	81702	13554	12209	A line of text that is not distinguished by any particular features and is part of the main text flow.
InterlinearLine		✓	2808	27	2234	A line of text written between two lines of main text, typically containing glosses, translations, or comments.
MainZone	✓		2314	365	275	The main textual zone of a page, usually containing the main body of text.
HeadingLine		✓	1381	701	135	A line of text that functions as a heading or title for a section of the main text.
MarginTextZone	✓		916	146	199	A text zone in the margin of a page, often containing annotations, commentaries, or other secondary information.
DropCapitalZone	✓		1566	102	124	A zone containing a large ornamental initial letter of a paragraph or section, typically extending below the first line of text.
NumberingZone	✓		632	102	94	A zone containing page numbers, folio numbers, or other numerical identifiers for the page.
TironianSignLine			282	0	0	A line containing Tironian notes, an ancient system of shorthand.
DropCapitalLine			1175	105	92	A line of text that begins with a drop capital.
RunningTitleZone	✓		340	91	18	A zone containing a running title, typically located at the top of a page and repeating throughout a section or the entire document.
GraphicZone	✓		300	7	10	A zone containing non-textual elements such as images, drawings, or decorative elements.
DigitizationArtefactZone			28	0	0	A zone containing artefacts from the digitization process, such as color bars or reference marks.
QuireMarksZone	✓		86	9	8	A zone containing marks used to indicate the gathering or quire to which a leaf belongs, often found at the bottom of the page.
StampZone	✓		39	5	4	A zone containing a stamp, such as a library stamp or ownership mark.
DamageZone	✓		12	1	0	A zone indicating an area of the page that has been damaged or is otherwise illegible due to physical deterioration.
MusicZone	✓		179	0	0	A zone containing musical notation.
MusicLine			167	0	0	A line containing musical notation.
TitlePageZone	✓		4	1	1	A zone encompassing the entire title page of a book or document.
SealZone	✓		3	0	0	A zone containing a seal, typically used for authentication or closure of a document.
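
The counts above are dominated by DefaultLine, and several labels have no instances in the test split, so they cannot be evaluated there. As a rough illustration, the sketch below copies the counts from the table into a hypothetical `LABEL_COUNTS` dictionary (the dictionary and both helper names are assumptions made for this example, not part of the model or the dataset tooling). It prints each label's share of the training instances and the labels without test data.

```python
# Counts copied from the label table: label -> (train, validation, test).
LABEL_COUNTS = {
    "DefaultLine": (81702, 13554, 12209),
    "InterlinearLine": (2808, 27, 2234),
    "MainZone": (2314, 365, 275),
    "HeadingLine": (1381, 701, 135),
    "MarginTextZone": (916, 146, 199),
    "DropCapitalZone": (1566, 102, 124),
    "NumberingZone": (632, 102, 94),
    "TironianSignLine": (282, 0, 0),
    "DropCapitalLine": (1175, 105, 92),
    "RunningTitleZone": (340, 91, 18),
    "GraphicZone": (300, 7, 10),
    "DigitizationArtefactZone": (28, 0, 0),
    "QuireMarksZone": (86, 9, 8),
    "StampZone": (39, 5, 4),
    "DamageZone": (12, 1, 0),
    "MusicZone": (179, 0, 0),
    "MusicLine": (167, 0, 0),
    "TitlePageZone": (4, 1, 1),
    "SealZone": (3, 0, 0),
}


def train_share(counts):
    """Return each label's fraction of all training instances, largest first."""
    total = sum(train for train, _, _ in counts.values())
    shares = {label: train / total for label, (train, _, _) in counts.items()}
    return dict(sorted(shares.items(), key=lambda item: item[1], reverse=True))


def labels_without_test_data(counts):
    """Return the labels that have no instances in the test split."""
    return [label for label, (_, _, test) in counts.items() if test == 0]


for label, share in train_share(LABEL_COUNTS).items():
    print(f"{label:26s}{share:7.2%}")
print("No test instances:", ", ".join(labels_without_test_data(LABEL_COUNTS)))
```

Predictions for the rarest labels should be read with this imbalance in mind.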

How to Get Started with the Model

Use the code below to get started with the model. All models are trained with float16.

import os
from unittest.mock import patch

import matplotlib.patches as patches
import matplotlib.pyplot as plt
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
from transformers.dynamic_module_utils import get_imports

# Workaround for machines without flash_attn (e.g. macOS):
# https://huggingface.co/microsoft/Florence-2-large-ft/discussions/4
def fixed_get_imports(filename: str | os.PathLike) -> list[str]:
    """Workaround for https://huggingface.co/microsoft/phi-1_5/discussions/72."""
    imports = get_imports(filename)
    # Drop flash_attn only for the Florence-2 modeling file, and only if it is listed.
    if str(filename).endswith("/modeling_florence2.py") and "flash_attn" in imports:
        imports.remove("flash_attn")
    return imports


# Load the model and processor while the patched import check is active.
with patch("transformers.dynamic_module_utils.get_imports", fixed_get_imports):
    model = AutoModelForCausalLM.from_pretrained("medieval-data/florence2-medieval-bbox-zone-detection", trust_remote_code=True)
    processor = AutoProcessor.from_pretrained("medieval-data/florence2-medieval-bbox-zone-detection", trust_remote_code=True)

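def process_image(url):
    """Download the image at `url` and run object detection on it."""
    prompt = "<OD>"  # object-detection task prompt

    response = requests.get(url, stream=True)
    response.raise_for_status()  # fail early on a bad URL or HTTP error
    # Convert to RGB in case the source image is grayscale or has an alpha channel.
    image = Image.open(response.raw).convert("RGB")

    inputs = processor(text=prompt, images=image, return_tensors="pt")

    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        do_sample=False,
        num_beams=3,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

    # Parse the generated text into boxes and labels scaled to the image size.
    result = processor.post_process_generation(generated_text, task=prompt, image_size=(image.width, image.height))
    return result, image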


url = "https://huggingface.co/datasets/CATMuS/medieval-segmentation/resolve/main/data/train/cambridge-corpus-christi-college-ms-111/page-002-of-003.jpg"

result, image = process_image(url)
fig, ax = plt.subplots(1, figsize=(15, 15))
ax.imshow(image)

# Draw each predicted box (x1, y1, x2, y2) and its label on the plot
for bbox, label in zip(result['<OD>']['bboxes'], result['<OD>']['labels']):
    x1, y1, x2, y2 = bbox
    rect = patches.Rectangle((x1, y1), x2 - x1, y2 - y1, linewidth=2, edgecolor='r', facecolor='none')
    ax.add_patch(rect)
    ax.text(x1, y1, label, fontsize=12, bbox=dict(facecolor='yellow', alpha=0.5))

# Display the plot
plt.show()
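
The plotting loop above relies on `result['<OD>']` holding parallel `bboxes` and `labels` lists, with each box given as `[x1, y1, x2, y2]`. As a small sketch of working with that structure without plotting, the hypothetical helper below groups boxes by label and converts them to `(x, y, width, height)`. The helper name and the sample values are made up for illustration and are not real model output.

```python
from collections import defaultdict


def group_boxes_by_label(result, task="<OD>"):
    """Group [x1, y1, x2, y2] boxes by label, returned as (x, y, width, height)."""
    detections = result[task]
    grouped = defaultdict(list)
    for (x1, y1, x2, y2), label in zip(detections["bboxes"], detections["labels"]):
        grouped[label].append((x1, y1, x2 - x1, y2 - y1))
    return dict(grouped)


# Made-up sample with the same shape as the output of process_image().
sample_result = {
    "<OD>": {
        "bboxes": [
            [100.0, 150.0, 900.0, 1200.0],
            [40.0, 160.0, 90.0, 400.0],
            [950.0, 30.0, 990.0, 60.0],
        ],
        "labels": ["MainZone", "MarginTextZone", "NumberingZone"],
    }
}

zones = group_boxes_by_label(sample_result)
for label, boxes in zones.items():
    print(label, boxes)
```

With real output, `group_boxes_by_label(result).get("MainZone", [])` would list the predicted main text zones and return an empty list if the model found none.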