Handwritten Signature Detection Model

Community Article · Published March 10, 2025

Abstract

This article presents an open-source project for automated signature detection in document processing, structured into four key phases:

  • Dataset Engineering: Curation of a hybrid dataset through aggregation of two public collections.
  • Architecture Benchmarking: Systematic evaluation of state-of-the-art object detection architectures (YOLO series, DETR variants, and YOLOS), focusing on accuracy, computational efficiency, and deployment constraints.
  • Model Optimization: Hyperparameter tuning with Optuna, yielding a 7.94% F1-score improvement over the baseline configuration.
  • Production Deployment: Deployment with Triton Inference Server for OpenVINO CPU-optimized inference.

Experimental results demonstrate a robust balance between precision, recall, and inference speed, validating the solution's practicality for real-world applications.

The full pipeline, including code, models, and data, is publicly available on HuggingFace Hub and GitHub to foster reproducibility and community adaptation.

Table 1: Key Research Features

| Resource | Links / Badges | Details |
|---|---|---|
| Model Files | HF Model (PyTorch, ONNX, TensorRT) | Different formats of the final models |
| Dataset – Original | Roboflow | 2,819 document images annotated with signature coordinates |
| Dataset – Processed | HF Dataset | Augmented and pre-processed version (640px) for model training |
| Notebooks – Model Experiments | Colab, W&B Training | Complete training and evaluation pipeline with selection among different architectures (yolo, detr, rt-detr, conditional-detr, yolos) |
| Notebooks – HP Tuning | Colab, W&B HP Tuning | Optuna trials for optimizing the precision/recall balance |
| Inference Server | GitHub (OpenVINO, Docker, Triton) | Complete deployment and inference pipeline with Triton Inference Server |
| Live Demo | HF Space (Gradio, Plotly) | Graphical interface with real-time inference |

1. Introduction

Automation in document processing has received considerable attention in both industry and academia. In this context, handwritten signature detection poses specific challenges: varying orientation, lighting, and noise; diverse document layouts; cursive handwriting in the document body that can resemble signatures; and signatures appearing in unpredictable positions. This article describes the end-to-end development of a model for this task, covering dataset creation, architecture selection, hyperparameter optimization, and production deployment.


2. Methodology

2.1 Object Detection Architectures

Object detection in computer vision has improved significantly, balancing speed and precision for use cases such as autonomous driving and surveillance applications.

The field has evolved rapidly, with architectures generally falling into certain paradigms:

  • CNN-based Two-Stage Detectors: Models from the R-CNN family first identify potential object regions before classifying them
  • CNN-based One-Stage Detectors: YOLO family and SSD models that predict bounding boxes and classes in one pass
  • Transformer-based Detectors: DETR-based models applying attention mechanisms for end-to-end detection
  • Hybrid Architectures: Systems that combine CNN efficiency with Transformer contextual understanding
  • Zero-shot Detectors: Capable of recognizing unfamiliar object classes without specific training

| Model | Architecture / Design | Latency | Applications | Precision (mAP) – Size Variants | Training Frameworks | License | Parameters (M) |
|---|---|---|---|---|---|---|---|
| YOLOv8 | CNN (one-stage); YOLOv5 adaptation with C2f block, SPPF, and decoupled head | Very low (1.47–14.37 ms) [T4 GPU, TensorRT FP16] | Real-time detection, tracking | 37.4–53.9 [n–x] (COCO) | PyTorch, Ultralytics, KerasCV, TensorFlow | AGPL-3.0 | 3.2–68.2 |
| YOLOv9 | CNN; GELAN (Generalized Efficient Layer Aggregation Network), Programmable Gradient Information (PGI), CSP-ELAN blocks | Low (2.3–16.77 ms) [T4 GPU, TensorRT FP16] | Real-time detection | 38.3–55.6 [t–e] (COCO) | PyTorch, Ultralytics | AGPL-3.0 | 4.5–71.0 |
| YOLOv10 | CNN; consistent dual assignments, holistic efficiency-accuracy design (lightweight head, decoupled downsampling, rank-guided blocks, large-kernel conv, partial self-attention) | Very low (1.84–10.79 ms) [T4 GPU, TensorRT FP16] | Real-time detection | 38.5–54.4 [n–x] (COCO) | PyTorch, Ultralytics | AGPL-3.0 | 2.3–29.5 |
| YOLO11 | CNN + attention; PSA, dual label assignment, C3K2 block | Very low (1.5–11.3 ms) [T4 GPU, TensorRT FP16] | Real-time detection | 43.5–58.5 [n–x] (COCO) | PyTorch, Ultralytics | AGPL-3.0 | 5.9–89.7 |
| RTMDet | CNN (one-stage); CSPDarkNet backbone, CSP-PAFPN neck, soft-label dynamic assignment | Low (2.34–18.8 ms) [T4 GPU, TensorRT FP16] | Real-time detection | 41.1–52.8 [tiny–x] (COCO) | PyTorch, MMDetection | Apache-2.0 | 4.8–94.9 |
| DETR | Transformer (encoder-decoder); CNN backbone (ResNet), bipartite matching loss, set-based prediction | Moderate (12–28 ms) [T4 GPU, TensorRT FP16] | Object detection | 42.0–44.0 [R50–R101] (COCO) | PyTorch, Hugging Face, PaddlePaddle | Apache-2.0 | 41–60 |
| RT-DETR | Transformer (encoder-decoder); efficient hybrid encoder, uncertainty-minimal query selection, CNN backbone (ResNet) | Very low (9.3–13.5 ms, 108–74 FPS) [T4 GPU, TensorRT FP16] | Real-time detection | 53.1–56.2 AP [R50–R101] (COCO) | PyTorch, Hugging Face, PaddlePaddle, Ultralytics | Apache-2.0 | 20.0–67.0 |
| DETA | Transformer (two-stage); CNN backbone, IoU-based label assignment, NMS post-processing | Moderate (~100 ms) [V100 GPU] | High-precision detection | 50.1–62.9 (COCO) | PyTorch, Detectron2, Hugging Face | Apache-2.0 | 48–52 |
| GroundingDINO | Transformer (DINO + grounded pre-training); dual-encoder-single-decoder, Swin-T/Swin-L image backbone, BERT text backbone, feature enhancer, language-guided query selection, cross-modality decoder | ~119 ms (8.37 FPS) [T4 GPU] | Open-set detection | 52.5 AP (COCO zero-shot), 62.6–63.0 AP (COCO fine-tuned) [Swin-T (172M) / Swin-L (341M)] | PyTorch, Hugging Face | Apache-2.0 | 172–341 |
| YOLOS | Vision Transformer (ViT) with detection tokens; encoder-only, bipartite matching loss, minimal 2D inductive biases | Moderate (11.9–370.4 ms) [1080Ti GPU] | Object detection | 28.7–42.0 [Tiny–Base] (COCO) | PyTorch, Hugging Face | Apache-2.0 | 6.5–127 |
Table 2: Main Neural Network Architectures for Object Detection

Check this leaderboard for the latest updates on object detection models: Object Detection Leaderboard.

Background and Evolution

Object detection has evolved from conventional CNN-based models such as Faster R-CNN and SSD to more recent transformer-based models such as DETR, released in 2020 by Facebook AI. This shift aims to improve accuracy by exploiting transformers' superior capacity for modeling global context. Recent models like RT-DETR and YOLO11 continue to refine the balance between real-time performance and precision.

YOLO Series

You Only Look Once (YOLO) was introduced in 2016 by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, revolutionizing object detection with a single-stage detector. Whereas two-stage approaches first generate region proposals and then classify and refine them, YOLO predicts bounding boxes and object classes in a single pass of the network. This design makes YOLO very fast and well suited to edge devices and real-time tasks such as security surveillance, traffic monitoring, and industrial inspection.

Figure 1: YOLO family timeline.
Source: What is YOLO? The Ultimate Guide [2025]

Year after year, the YOLO family has been refined, with each version introducing new methods and advancements to address the weaknesses of its predecessors:

  • YOLOv1: First to remove region proposals by scanning the entire image in one forward pass, grabbing both global and local context at once.
  • YOLOv2 and YOLOv3: Brought forth major advancements like Feature Pyramid Networks (FPN), multi-scale training, and the application of anchor boxes, improving the model's capability to detect objects of different sizes.
  • YOLOv4: Marked a major architectural upgrade by adopting a CSPDarknet53-PANet-SPP backbone and integrating novel techniques like a bag of specials, a bag of freebies, genetic algorithms, and attention modules to improve both accuracy and speed.
  • YOLOv5: While architecturally similar to YOLOv4, its implementation in PyTorch and the streamlined training process within the Ultralytics framework contributed to its widespread adoption.
  • YOLO-R, YOLO-X, YOLOv6, and YOLOv7: These versions explored multitask learning (YOLO-R), anchor-free detection, decoupled heads, reparameterization, knowledge distillation, and quantization—allowing improved scaling and optimization of the models.
  • YOLOv8 and YOLO-NAS: YOLOv8 builds upon YOLOv5 with minor changes, such as replacing the CSPLayer with the C2f module, and incorporates optimized loss functions for small object detection. YOLO-NAS stands out as a model generated through neural architecture search (NAS), employing quantization-aware blocks to achieve an optimal balance between speed and accuracy.

One challenge with the YOLO family is the proliferation of versions: the name can be used freely, and not every new release brings significant improvements. Indeed, when a new model is published under the YOLO name, it usually reports superior metrics, but the differences tend to be marginal.

Figure 2: The Incremental Pursuit of SOTA in mAP.

Moreover, a model achieving superior metrics on a general evaluation dataset, such as COCO, does not necessarily mean it will perform better in your specific use case. There are also differences in model size, inference and training time, library and hardware compatibility, and differences in the characteristics and quality of your data.


Figure 3: YOLO Performance Comparison.
Source: Ultralytics YOLO11

The family is distinguished by its unified and optimized design, making it deployable on low-resource devices and in real-time applications. At the same time, the sheer diversity of YOLO versions makes choosing the right model for a specific use case a challenge in itself.
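
As a concrete illustration of the single-pass workflow, the sketch below runs a pretrained Ultralytics checkpoint on a document image; the image path and confidence threshold are placeholders rather than values from this project.

from ultralytics import YOLO

# Load a pretrained one-stage detector (nano variant for speed)
model = YOLO("yolov8n.pt")

# A single forward pass produces boxes, classes, and scores together
results = model.predict("document.png", conf=0.25)  # hypothetical image path
for box in results[0].boxes:
    print(box.xyxy.tolist(), box.conf.item(), box.cls.item())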

DETR

The DEtection TRansformer (DETR) introduces an innovative approach by combining convolutional neural networks with transformer architecture, eliminating traditional and complex post-processing steps used in other methods, such as Non-Maximum Suppression (NMS).

Figure 4: DETR architecture representation.
Source: End-to-End Object Detection with Transformers

The DETR architecture can be divided into four main components:

  1. Backbone (e.g., ResNet):

    • Function: Extract a feature map rich in semantic and spatial information from the input image.
    • Output: A set of two-dimensional vectors (feature map) that serves as input for the transformer module.
  2. Transformer Encoder-Decoder:

    • Encoder:
      • Function: Process the feature map from the backbone by applying multi-head self-attention and feed-forward layers, capturing global and contextual relationships between different regions of the image.
    • Decoder:
      • Function: Use a fixed set of object queries (embeddings representing candidate objects) to iteratively refine bounding box and class label predictions.
      • Output: A final set of predictions, where each query corresponds to a bounding box and its respective class.
  3. Feed-Forward Network (FFN):

    • Function: For each query processed by the decoder, the FFN generates the final predictions, including normalized box coordinates (x, y, width, and height) and classification scores (including a special class for "no object").
  4. Joint Reasoning:

    • Feature: DETR processes all objects simultaneously, leveraging pairwise relationships between predictions to handle ambiguities such as object overlap, eliminating the need for post-processing techniques like NMS.

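Building on the four components above, the following minimal inference sketch uses the Hugging Face transformers implementation of DETR (facebook/detr-resnet-50, one of the checkpoints referenced in Table 2); the image path is a placeholder.

import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("document.png")  # hypothetical input image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)  # backbone -> encoder-decoder -> FFN heads

# Keep predictions above a confidence threshold; no NMS is required
target_sizes = torch.tensor([image.size[::-1]])
detections = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.7
)[0]
print(detections["boxes"], detections["scores"], detections["labels"])
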
The original DETR faced challenges such as slow convergence and suboptimal performance on small object detection. To address these limitations, improved variants have emerged:

  • Deformable DETR:

    • Deformable Attention: Focuses attention on a fixed number of sampling points near each reference, improving efficiency and enabling the detection of fine details in high-resolution images.
    • Multiscale Module: Incorporates feature maps from different backbone layers, enhancing the model's ability to detect objects of varying sizes.

    Figure 5: Deformable DETR mitigates the slow convergence issues and limited feature spatial resolution of the original DETR.
    Source: Deformable DETR: Deformable Transformers for End-to-End Object Detection

  • Conditional DETR:

    • Convergence Improvement: Introduces conditional cross-attention in the decoder, making the object queries more specific to the input image and accelerating training (with reported convergence up to 6.7 times faster).

    Figure 6: Conditional DETR shows faster convergence.
    Source: Conditional DETR for Fast Training Convergence

Thus, DETR and its variants represent a paradigm shift in object detection, harnessing the power of transformers to capture global and contextual relationships in an end-to-end manner while simplifying the detection pipeline by removing manual post-processing steps.

RT-DETR

RT-DETR (Real-Time DEtection TRansformer) is a transformer-based model designed for real-time object detection, extending the foundational DETR architecture with optimizations for speed and efficiency.

Figure 7: RT-DETR performance comparison.
Source: DETRs Beat YOLOs on Real-time Object Detection
  1. Hybrid Encoder:

    • RT-DETR employs a hybrid encoder that integrates intra-scale interaction and cross-scale fusion. This design enables the model to efficiently process features at different scales, capturing both local details and global relationships.
    • This is particularly advantageous for detecting objects of varying sizes in complex scenes, a challenge for earlier transformer models.
  2. Elimination of NMS:

    • Like DETR, RT-DETR eliminates the need for Non-Maximum Suppression (NMS) by predicting all objects simultaneously. This end-to-end approach simplifies the detection pipeline and reduces post-processing overhead.
  3. Real-time Optimization:

    • RT-DETR achieves low latency on GPU (9.3–13.5 ms on a T4 with TensorRT FP16, per Table 2), making it suitable for real-time applications such as autonomous driving and video surveillance, while maintaining competitive accuracy.

In effect, RT-DETR bridges the gap between high-speed CNN-based models like YOLO and accuracy-focused transformer models.
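
Since RT-DETR is also distributed through the Ultralytics and Hugging Face ecosystems (Table 2), trying it only requires swapping the model class. A minimal sketch with the Ultralytics wrapper, using a placeholder image path:

from ultralytics import RTDETR

# Load the RT-DETR-L checkpoint; the same predict API as YOLO applies
model = RTDETR("rtdetr-l.pt")
results = model.predict("document.png", conf=0.25)  # hypothetical image path
print(results[0].boxes.xyxy)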

YOLOS

YOLOS (You Only Look at One Sequence) introduces a novel transformer-based approach to object detection, drawing inspiration from the Vision Transformer (ViT). Unlike traditional CNN-based models, YOLOS reimagines object detection by treating images as sequences of patches, leveraging the transformer’s sequence-processing strengths.

  1. Vision Transformer (ViT) Backbone:

    • YOLOS uses a ViT backbone, which splits the input image into fixed-size patches (e.g., 16x16 pixels). Each patch is treated as a token, allowing the transformer to process the image as a sequence and capture long-range dependencies.
  2. Patch-based Processing:

    • By converting the image into a sequence of patches, YOLOS efficiently handles high-resolution inputs and detects objects across different scales, making it adaptable to diverse detection tasks.
  3. Object Queries:

    • Similar to DETR, YOLOS employs object queries to predict bounding boxes and class labels directly. This eliminates the need for anchor boxes or NMS, aligning with the end-to-end philosophy of transformer-based detection.
  4. Efficiency Variants:

    • YOLOS offers variants such as yolos-tiny, -small, and -base, allowing users to balance computational cost and performance based on application needs.

Figure 8: Specialization of [DET] tokens across object sizes and image regions in YOLOS.

This figure is particularly illustrative: it shows how the [DET] tokens specialize, each focusing on objects of specific sizes and on distinct regions of the image. This adaptive behavior highlights the model's ability to dynamically allocate its detection capacity based on the spatial and scale characteristics of the objects.
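
The number of [DET] tokens and the patch size are exposed directly in the Hugging Face configuration, which is a quick way to inspect how a given variant tokenizes the image (shown here for hustvl/yolos-tiny; attribute names assume the transformers YolosConfig):

from transformers import YolosConfig

config = YolosConfig.from_pretrained("hustvl/yolos-tiny")
print(config.patch_size)            # size of each square image patch (token)
print(config.num_detection_tokens)  # learnable [DET] tokens appended to the sequence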


2.2. Dataset Composition and Preprocessing

The dataset was constructed by merging two publicly available benchmarks:

  • Tobacco800: A subset of the Complex Document Image Processing (CDIP) Test Collection, comprising scanned documents from the tobacco industry. Ground truth annotations include signatures and corporate logos.
  • Signatures-XC8UP: Part of the Roboflow 100 benchmark, containing 368 annotated images of handwritten signatures.

Roboflow

The Roboflow platform was employed for dataset management, preprocessing, and annotation. Key workflow components included:

  • Dataset Partitioning:

    • Training: 1,980 images (70%)
    • Validation: 420 images (15%)
    • Testing: 419 images (15%)
  • Preprocessing Pipeline:

    • Auto-orientation and bilinear resampling to 640×640 pixels
  • Augmentation Strategy:

    • Rotation (90° fixed and ±10° random)
    • Shear transformations (±4° horizontal, ±3° vertical)
    • Photometric adjustments (±20% brightness/exposure variation)
    • Stochastic blur (σ=1.5) and Gaussian noise (σ=0.05)

These techniques enhance model generalization across diverse document acquisition scenarios.
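
The augmentations above were applied through the Roboflow platform; a rough, hypothetical equivalent of the spatial and photometric recipe expressed with Albumentations (parameter mappings are approximate, not the exact Roboflow configuration) would look like this:

import albumentations as A

# Approximate re-creation of the augmentation recipe listed above
transform = A.Compose(
    [
        A.RandomRotate90(p=0.5),                              # 90° fixed rotations
        A.Rotate(limit=10, p=0.5),                            # ±10° random rotation
        A.Affine(shear={"x": (-4, 4), "y": (-3, 3)}, p=0.5),  # horizontal/vertical shear
        A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.0, p=0.5),  # ±20% brightness
        A.GaussianBlur(blur_limit=(3, 5), p=0.3),             # mild blur
        A.GaussNoise(p=0.3),                                  # Gaussian noise
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# augmented = transform(image=image, bboxes=bboxes, class_labels=class_labels)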

Dataset Availability:

  • Roboflow Universe:
    Download Badge
  • Hugging Face Datasets Hub:
    HF Dataset

Figure 10: Annotated document samples demonstrating signature and logo detection tasks.

2.3. Training Process and Model Selection

Various models were evaluated in experiments run for 35 training epochs, with batch size and learning rate held constant and chosen according to the available computational capacity. The table below summarizes the results for CPU inference time, mAP@50, mAP@50-95, and total training time:

| Model | Inference Time (CPU, ms) | mAP@50 | mAP@50-95 | Training Time |
|---|---|---|---|---|
| rtdetr-l | 583.608 | 0.92709 | 0.622364 | 55m 33s |
| yolos-base | 1706.49 | 0.901154 | 0.583569 | 7h 53m |
| yolos-tiny | 265.346 | 0.869814 | 0.469064 | 1h 42m |
| conditional-detr-resnet-50 | 476.831 | 0.936524 | 0.653321 | 2h 45m |
| detr-resnet-50 | 425.649 | 0.88885 | 0.579428 | 2h 31m |
| yolov8x | 1259.47 | 0.794237 | 0.552919 | 1h 30m |
| yolov8l | 871.329 | 0.800312 | 0.593976 | 1h |
| yolov8m | 401.183 | 0.875322 | 0.665495 | 44m 7s |
| yolov8s | 216.6 | 0.874721 | 0.65457 | 29m 1s |
| yolov8n | 110.442 | 0.816089 | 0.623963 | 22m 51s |
| yolo11x | 1016.68 | 0.667074 | 0.482289 | 1h 28m |
| yolo11l | 518.147 | 0.707409 | 0.499126 | 55m 30s |
| yolo11m | 381.652 | 0.809557 | 0.600797 | 48m 2s |
| yolo11s | 179.792 | 0.835605 | 0.638849 | 30m 37s |
| yolo11n | 106.656 | 0.813799 | 0.617496 | 27m 33s |
| yolov10x | 821.183 | 0.681023 | 0.474535 | 1h 36m |
| yolov10l | 580.767 | 0.726802 | 0.522654 | 1h 8m |
| yolov10b | 473.109 | 0.789835 | 0.578874 | 1h 1m |
| yolov10m | 320.12 | 0.787688 | 0.581259 | 51m 38s |
| yolov10s | 150.076 | 0.663877 | 0.473857 | 34m 17s |
| yolov10n | 73.8596 | 0.734332 | 0.552704 | 29m 2s |
Table 3: Training Results.

Figure 11: Comparison of Inference Time vs. mAP@50

Highlights:

  • Best mAP@50: conditional-detr-resnet-50 (93.65%)
  • Best mAP@50-95: yolov8m (66.55%)
  • Lowest Inference Time: yolov10n (73.86 ms)

The complete experiments are available on Weights & Biases.

Our findings diverge from general benchmarks, where larger model variants typically yield incremental gains. In this experiment, smaller models such as YOLOv8n and YOLO11s achieved satisfactory results, comparable to those of their larger counterparts.

Furthermore, purely convolutional architectures demonstrated faster inference and training times compared to transformer-based models while maintaining similar levels of accuracy.


2.4. Hyperparameter Optimization

Due to its strong initial performance and ease of export, the YOLOv8s model was selected for hyperparameter fine-tuning using Optuna. A total of 20 trials were conducted, with the test set F1-score serving as the objective function. The hyperparameter search space included:

# Search space sampled inside the Optuna objective (trial is an optuna.Trial)
dropout = trial.suggest_float("dropout", 0.0, 0.5, step=0.1)            # dropout regularization
lr0     = trial.suggest_float("lr0", 1e-5, 1e-1, log=True)              # initial learning rate (log scale)
box     = trial.suggest_float("box", 3.0, 7.0, step=1.0)                # box regression loss weight
cls     = trial.suggest_float("cls", 0.5, 1.5, step=0.2)                # classification loss weight
opt     = trial.suggest_categorical("optimizer", ["AdamW", "RMSProp"])  # optimizer choice
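
For context, the sketch below shows how such a search space can be wrapped into an Optuna objective around an Ultralytics training run; the dataset YAML name and per-trial epoch budget are assumptions, and the actual notebook linked below may differ in detail.

import optuna
from ultralytics import YOLO

def objective(trial):
    params = {
        "dropout":   trial.suggest_float("dropout", 0.0, 0.5, step=0.1),
        "lr0":       trial.suggest_float("lr0", 1e-5, 1e-1, log=True),
        "box":       trial.suggest_float("box", 3.0, 7.0, step=1.0),
        "cls":       trial.suggest_float("cls", 0.5, 1.5, step=0.2),
        "optimizer": trial.suggest_categorical("optimizer", ["AdamW", "RMSProp"]),
    }
    model = YOLO("yolov8s.pt")
    model.train(data="signature.yaml", epochs=35, imgsz=640, **params)  # hypothetical data config and epoch budget
    metrics = model.val(split="test")          # evaluate on the test split
    p, r = metrics.box.mp, metrics.box.mr      # mean precision and recall
    return 2 * p * r / (p + r) if (p + r) else 0.0  # F1-score as the objective

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)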

The results of this hypertuning experiment can be viewed here: Hypertuning Experiment.

The figure below illustrates the correlation between the hyperparameters and the achieved F1-score:

Figure 12: Correlation between hyperparameters and the F1-score.

After tuning, the best trial (#10) exhibited the following improvements compared to the base model:

| Metric | Base Model | Best Trial (#10) | Difference |
|---|---|---|---|
| mAP@50 | 87.47% | 95.75% | +8.28% |
| mAP@50-95 | 65.46% | 66.26% | +0.81% |
| Precision | 97.23% | 95.61% | -1.63% |
| Recall | 76.16% | 91.21% | +15.05% |
| F1-score | 85.42% | 93.36% | +7.94% |
Table 4: Comparison between the fine-tuned model and the pre-trained model.

3. Evaluation and Results

The final evaluation of the model was conducted using ONNX format (for CPU inference) and TensorRT (for GPU – T4 inference). The key metrics obtained were:

  • Precision: 94.74%
  • Recall: 89.72%
  • mAP@50: 94.50%
  • mAP@50-95: 67.35%

Regarding inference times:

  • ONNX Runtime (CPU): 171.56 ms
  • TensorRT (GPU – T4): 7.657 ms
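
For reference, CPU-side timing with ONNX Runtime follows the usual pattern shown below; the input tensor name "images" matches the Ultralytics ONNX export convention but should be verified against the exported model, and the file name is a placeholder.

import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("yolov8s.onnx", providers=["CPUExecutionProvider"])  # hypothetical file

# Dummy 640x640 RGB input in NCHW layout, normalized to [0, 1]
x = np.random.rand(1, 3, 640, 640).astype(np.float32)

start = time.perf_counter()
outputs = session.run(None, {"images": x})
print(f"CPU inference: {(time.perf_counter() - start) * 1000:.2f} ms")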

Figure 13 presents a graphical comparison of the evaluated metrics:

Figure 13: Model performance under different training hyperparameter configurations.

The Base Model – Phase 1 (☆) represents the model obtained during the model selection phase described in Section 2.3.


4. Deployment and Publication

In this section, we describe the deployment process of the handwritten signature detection model using the Triton Inference Server and the publication of both the model and dataset on the Hugging Face Hub. The objective was to create an efficient, cost-effective, and secure solution for making the model available in a production environment, while also providing public access to the work in accordance with open science principles.

4.1. Inference Server

The Triton Inference Server was chosen as the deployment platform due to its flexibility, efficiency, and support for multiple machine learning frameworks, including PyTorch, TensorFlow, ONNX, and TensorRT. It supports inference on both CPU and GPU and offers native tools for performance analysis and optimization, making it ideal for scaling model deployment.

The repository containing the server configurations and utility code is available at: https://github.com/tech4ai/t4ai-signature-detect-server

Server Configuration

For the Triton implementation, we opted for an optimized approach using a custom Docker container. Instead of using the default image, which includes all backends and is 17.4 GB in size, we included only the backends needed for this project: Python, ONNX, and OpenVINO. This optimization reduced the image size to 12.54 GB, improving deployment efficiency. For scenarios requiring GPU inference, the TensorRT backend would need to be added and the model exported to the .engine format; in this case, however, we prioritized CPU execution to reduce costs and simplify the infrastructure.

The server was configured using the following startup command, which has been adapted to ensure explicit control over models and enhanced security:

tritonserver \
  --model-repository=${TRITON_MODEL_REPOSITORY} \
  --model-control-mode=explicit \
  --load-model=* \
  --log-verbose=1 \
  --allow-metrics=false \
  --allow-grpc=true \
  --grpc-restricted-protocol=model-repository,model-config,shared-memory,statistics,trace:admin-key=${TRITON_ADMIN_KEY} \
  --http-restricted-api=model-repository,model-config,shared-memory,statistics,trace:admin-key=${TRITON_ADMIN_KEY}

The explicit mode allows models to be dynamically loaded and unloaded via HTTP/GRPC protocols, providing the flexibility to update models without restarting the server.

Access Restriction

Triton is configured to restrict access to sensitive administrative endpoints such as model-repository, model-config, shared-memory, statistics, and trace. These endpoints are accessible only to requests that include the correct admin-key header, thereby safeguarding administrative operations while leaving inference requests open to all users.
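
In practice, this means administrative calls must carry the admin-key header while plain inference calls do not. A small sketch with the Triton Python client (the model name and key value are placeholders):

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Administrative endpoint: requires the restricted-API header configured above
admin_headers = {"admin-key": "<TRITON_ADMIN_KEY>"}
client.load_model("signature_detection", headers=admin_headers)  # hypothetical model name

# Inference and health endpoints remain open to all clients
print(client.is_server_ready())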

Ensemble Model

We developed an Ensemble Model within Triton to integrate the YOLOv8 model (in ONNX format) with dedicated pre- and post-processing scripts, thereby creating a complete inference pipeline executed directly on the server. This approach minimizes latency by eliminating the need for multiple network calls and simplifies client-side integration. The Ensemble Model workflow comprises:

  • Preprocessing (Python backend):

    • Decoding the image from BGR to RGB.
    • Resizing the image to 640×640 pixels.
    • Normalizing pixel values (dividing by 255.0).
    • Transposing the data to the [C, H, W] format.
  • Inference (ONNX model):

    • The YOLOv8 model performs signature detection.
  • Post-processing (Python backend):

    • Transposing the model outputs.
    • Filtering bounding boxes based on a confidence threshold.
    • Applying Non-Maximum Suppression (NMS) with an IoU threshold to eliminate redundant detections.
    • Formatting the results in the [x, y, w, h, score] format.


Figure 14: Sequence Diagram — Ensemble Model

The pre- and post-processing scripts are treated as separate "models" by Triton and are executed on the Python backend, enabling flexibility for adjustments or replacements without modifying the core model. This Ensemble Model encapsulates the entire inference pipeline into a single entry point, thereby optimizing performance and usability.
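
A simplified, framework-agnostic sketch of the pre- and post-processing logic described above is shown below; it mirrors the steps conceptually but is not the actual Triton Python-backend model.py.

import cv2
import numpy as np

def preprocess(image_bgr: np.ndarray) -> np.ndarray:
    """BGR -> RGB, resize to 640x640, scale to [0, 1], HWC -> CHW, add batch dimension."""
    rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
    resized = cv2.resize(rgb, (640, 640))
    tensor = resized.astype(np.float32) / 255.0
    return np.transpose(tensor, (2, 0, 1))[None, ...]

def postprocess(raw_output: np.ndarray, conf_thres=0.25, iou_thres=0.45):
    """Transpose raw predictions, filter by confidence, apply NMS, return [x, y, w, h, score]."""
    preds = raw_output[0].T                    # (num_boxes, 4 + num_classes)
    scores = preds[:, 4:].max(axis=1)
    keep = scores > conf_thres
    boxes_xywh, scores = preds[keep, :4], scores[keep]
    # OpenCV NMS expects [x, y, w, h] with a top-left origin
    xy_tl = boxes_xywh[:, :2] - boxes_xywh[:, 2:] / 2
    nms_boxes = np.hstack([xy_tl, boxes_xywh[:, 2:]]).tolist()
    idx = cv2.dnn.NMSBoxes(nms_boxes, scores.tolist(), conf_thres, iou_thres)
    return [[*map(float, boxes_xywh[i]), float(scores[i])] for i in np.array(idx).flatten()]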

Inference Pipeline

Within the project repository, we developed multiple inference scripts to interact with the Triton Inference Server using various methods:

  • Triton Client: Inference via the Triton SDK.
  • Vertex AI: Integration with a Google Cloud Vertex AI endpoint.
  • HTTP: Direct requests to the Triton server via the HTTP protocol.

These scripts were designed to test the server and facilitate future integrations. In addition, the pipeline includes supplementary tools:

  • Graphical User Interface (GUI): Developed with Gradio, this interface enables interactive model testing with real-time visualization of results. For example:

    python signature-detection/gui/inference_gui.py --triton-url {triton_url}
    

  • Command-Line Interface (CLI): This tool allows for batch inference on datasets, timing metrics calculation, and report generation. For example:

    python signature-detection/inference/inference_pipeline.py
    

The pipeline was designed to be modular and extensible, supporting experimentation and deployment in various environments—from local setups to cloud-based infrastructures.
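
As an example of the Triton Client path, the sketch below sends an image to the ensemble model over HTTP; the model and tensor names are hypothetical and must match the actual ensemble configuration in the repository.

import cv2
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

image = cv2.imread("document.png")           # hypothetical input image (BGR)
batch = preprocess(image)                    # e.g., the preprocess() sketch shown earlier

inputs = [httpclient.InferInput("input_image", list(batch.shape), "FP32")]  # hypothetical tensor name
inputs[0].set_data_from_numpy(batch)
outputs = [httpclient.InferRequestedOutput("detections")]                   # hypothetical tensor name

response = client.infer("signature_ensemble", inputs, outputs=outputs)      # hypothetical model name
print(response.as_numpy("detections"))       # rows of [x, y, w, h, score]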


4.2. Hugging Face Hub

The signature detection model has been published on the Hugging Face Hub to make it accessible to the community and to meet the licensing requirements of Ultralytics YOLO. The model repository includes:

  • Model Card:
    A comprehensive document detailing the training process, performance metrics (such as precision and recall), and usage guidelines. Model files are available in PyTorch, ONNX, and TensorRT formats, offering flexibility in selecting the inference backend.

  • Demo Space:
    An interactive interface developed with Gradio, where users can test the model by uploading images and viewing real-time detections.

  • Dataset:
    The dataset used for training and validation is also hosted on Hugging Face, alongside Roboflow, accompanied by comprehensive documentation and files in accessible formats.


5. Conclusion

This work demonstrated the feasibility of developing an efficient model for handwritten signature detection in documents by combining modern computer vision techniques with robust deployment tools. The comparative analysis of architectures revealed that YOLO-based models—particularly YOLOv8s—offer an ideal balance between speed and accuracy for the studied scenario, achieving 94.74% precision and 89.72% recall following hyperparameter optimization with Optuna. The integration of the Triton Inference Server enabled the creation of a scalable inference pipeline, while publication on the Hugging Face Hub ensured transparency and accessibility, in line with open science principles.

The proposed solution has immediate applications in fields such as legal, financial, and administrative sectors, where automating document verification can reduce operational costs and minimize human error. The model’s inference performance on both CPU (171.56 ms) and GPU (7.65 ms) makes it versatile for deployment across a range of infrastructures, from cloud servers to edge devices.

Limitations include the reliance on the quality of dataset annotations and the need for further validation on documents with complex layouts or partially obscured signatures. Future work may explore:

  • Expanding the dataset to include images with greater cultural and typographic diversity.
  • Adapting the model to detect other document elements (e.g., stamps, seals).

Ultimately, the open availability of the code, model, and dataset on the Hugging Face Hub and GitHub fosters community collaboration, accelerating the development of similar solutions. This project not only validates the efficacy of the tested approaches but also establishes a blueprint for the practical deployment of detection models in production, reinforcing the potential of AI to transform traditional workflows.
