readme updated
README.md
@@ -1,3 +1,49 @@
---
license: cc-by-4.0
tags:
- yolov5
- yolo
- digital-humanities
- object-detection
- computer-vision
- document-layout-analysis
---

# What's YOLOv5

YOLOv5 is an open-source object detection model released by [Ultralytics](https://ultralytics.com/) on [GitHub](https://github.com/ultralytics/yolov5).

# DataCatalogue (or DataCat)

[DataCatalogue](https://github.com/DataCatalogue) is a research project jointly led by Inria, the Bibliothèque nationale de France (National Library of France) and the Institut national d'histoire de l'art (National Institute of Art History).

It aims to restructure OCRed auction sale catalogs held in French national collections into TEI-XML using machine learning.

# DataCat YOLOv5

We trained a YOLOv5 model on custom data to perform document layout analysis on auction sale catalogs.

The training set consists of **581 images**, annotated with **two classes**:
* *title* (585 instances)
* *entry*, a catalog entry (5017 instances)

59 images were used for validation.

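YOLOv5 reads annotations from one text file per image, with one line per box in the form `class x_center y_center width height` and coordinates normalized to the image size. The sketch below shows how per-class instance counts like the ones above could be tallied from such lines. The class order (`0` = title, `1` = entry) and the sample label lines are assumptions made for illustration; they are not taken from the DataCat dataset.

```python
from collections import Counter

# Assumed class order; the real mapping lives in the project's dataset config.
CLASS_NAMES = {0: "title", 1: "entry"}


def parse_label_line(line):
    """Parse one YOLO label line: 'class x_center y_center width height'."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    class_id = int(parts[0])
    x_center, y_center, width, height = (float(p) for p in parts[1:])
    return class_id, (x_center, y_center, width, height)


def count_instances(label_lines):
    """Count boxes per class name over an iterable of label lines."""
    counts = Counter()
    for line in label_lines:
        if not line.strip():
            continue  # skip blank lines
        class_id, _box = parse_label_line(line)
        counts[CLASS_NAMES[class_id]] += 1
    return counts


# Made-up labels for one page: one title followed by three entries.
sample_page = [
    "0 0.50 0.08 0.60 0.05",
    "1 0.50 0.20 0.80 0.10",
    "1 0.50 0.33 0.80 0.12",
    "1 0.50 0.47 0.80 0.11",
]
counts = count_instances(sample_page)
```

With the figures reported above, 585 + 5017 = 5602 boxes over 581 images comes to roughly 9.6 boxes per image, and entries outnumber titles by about 8.6 to 1.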
We reached:

| precision | recall | mAP_0.5 | mAP_0.5:0.95 |
|---|---|---|---|
| 0.99 | 0.99 | 0.98 | 0.75 |

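The two mAP columns differ only in how strictly a predicted box must overlap the ground truth, measured by intersection over union (IoU). `mAP_0.5` counts a detection as correct when its IoU is at least 0.5, while `mAP_0.5:0.95` averages the score over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05, so it rewards tighter boxes. A minimal sketch of the IoU computation follows; the function name and the sample boxes are illustrative and are not taken from the YOLOv5 code base.

```python
def box_iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


# The ten thresholds behind mAP_0.5:0.95.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

# Made-up example: a prediction shifted slightly to the right.
ground_truth = (100, 100, 500, 200)
prediction = (120, 100, 520, 200)
iou = box_iou(ground_truth, prediction)
passed = [t for t in IOU_THRESHOLDS if iou >= t]
```

Here the shifted prediction has an IoU of about 0.90, so it counts as a hit at every threshold except 0.95. Small localization errors of this kind are one reason a model can score much lower on `mAP_0.5:0.95` than on `mAP_0.5`.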
# Dataset

The dataset has not been released yet.

# Demo

An interactive demo is available on the following Hugging Face Space: https://huggingface.co/spaces/HugoSchtr/DataCat_Yolov5

# What's next

The model performs well on our data and now needs to be incorporated into a dedicated pipeline for the research project.

We also plan to train a new model on a larger training set in the near future.
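
As a rough idea of what one pipeline step could look like, the sketch below sorts detected regions from top to bottom, groups each *entry* under the *title* that precedes it, and writes a small XML skeleton. Everything here is hypothetical: the `Region` type, the single-column reading-order rule, and the choice of `body`, `div`, `head`, `list`, and `item` elements are assumptions for illustration, not the DataCatalogue project's actual pipeline or TEI schema.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class Region:
    """A detected layout region (hypothetical structure)."""
    label: str    # "title" or "entry"
    y_top: float  # top edge of the box, in pixels
    text: str     # OCR text inside the box


def regions_to_xml(regions):
    """Group entries under the preceding title, assuming one text column."""
    body = ET.Element("body")
    current_list = None
    for region in sorted(regions, key=lambda r: r.y_top):
        # A title, or an entry seen before any title, opens a new section.
        if region.label == "title" or current_list is None:
            section = ET.SubElement(body, "div")
            if region.label == "title":
                ET.SubElement(section, "head").text = region.text
            current_list = ET.SubElement(section, "list")
        if region.label == "entry":
            ET.SubElement(current_list, "item").text = region.text
    return body


# Made-up detections for one page, deliberately out of order.
page = [
    Region("entry", 300.0, "2. Paysage"),
    Region("title", 50.0, "TABLEAUX"),
    Region("entry", 150.0, "1. Portrait"),
]
xml_string = ET.tostring(regions_to_xml(page), encoding="unicode")
```

Sorting by the top edge alone only works for single-column pages; multi-column catalogs would need a column-aware ordering before grouping.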