cneud Jrglmn committed on
Commit
2ba092f
1 Parent(s): 32a49a2

Added generic model card (#1)

- Added generic model card (26086b5981d0810291db717bb43abc86bc5b8a40)


Co-authored-by: Jörg Lehmann <Jrglmn@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +243 -0
README.md CHANGED
@@ -1,3 +1,246 @@
1
  ---
2
  license: apache-2.0
3
  ---
1
  ---
2
  license: apache-2.0
3
  ---
4
+
5
+ # Model Card for eynollah-binarization
6
+
7
+ <!-- Provide a quick summary of what the model is/does. [Optional] -->
8
+ This model is part of a suite of 13 models. The suite provides an end-to-end pipeline to extract layout, text lines and reading order from historical documents; its output can be used as input for OCR engines.
9
+
10
+ Questions and comments about the models can be directed to Vahid Rezanezhad at Vahid.Rezanezhad@sbb.spk-berlin.de.
11
+
12
+ # Table of Contents
13
+
14
+ - [Model Card for eynollah-binarization](#model-card-for-eynollah-binarization)
17
+ - [Table of Contents](#table-of-contents)
18
+ - [Model Details](#model-details)
19
+ - [Model Description](#model-description)
20
+ - [Uses](#uses)
21
+ - [Direct Use](#direct-use)
22
+ - [Downstream Use](#downstream-use)
23
+ - [Out-of-Scope Use](#out-of-scope-use)
24
+ - [Bias, Risks, and Limitations](#bias-risks-and-limitations)
25
+ - [Recommendations](#recommendations)
26
+ - [Training Details](#training-details)
27
+ - [Training Data](#training-data)
28
+ - [Training Procedure](#training-procedure)
29
+ - [Preprocessing](#preprocessing)
30
+ - [Speeds, Sizes, Times](#speeds-sizes-times)
31
+ - [Evaluation](#evaluation)
32
+ - [Testing Data, Factors and Metrics](#testing-data-factors-and-metrics)
33
+ - [Testing Data](#testing-data)
34
+ - [Metrics](#metrics)
35
+ - [Model Examination](#model-examination)
36
+ - [Environmental Impact](#environmental-impact)
37
+ - [Technical Specifications](#technical-specifications)
38
+ - [Model Architecture and Objective](#model-architecture-and-objective)
39
+ - [Software](#software)
40
+ - [Citation](#citation)
41
+ - [More Information](#more-information)
42
+ - [Model Card Authors](#model-card-authors)
43
+ - [Model Card Contact](#model-card-contact)
44
+ - [How to Get Started with the Model](#how-to-get-started-with-the-model)
45
+
46
+
47
+ # Model Details
48
+
49
+ ## Model Description
50
+
51
+ <!-- Provide a longer summary of what this model is/does. -->
52
+
53
+ This suite of 13 models presents a document layout analysis (DLA) system for historical documents implemented by pixel-wise segmentation using convolutional neural networks. In addition, heuristic methods are applied to detect marginals and to determine the reading order of text regions.
54
+
55
+ The detection and classification of multiple classes of layout elements such as headings, images, tables etc. as part of DLA is required in order to extract and process them in subsequent steps. The combination of image detection, classification and segmentation across the wide variety of documents found in over 400 years of printed cultural heritage makes this a very challenging task. Deep learning models are complemented with heuristics for the detection of text lines, marginals, and reading order. Furthermore, an optional image enhancement step was added for documents that have insufficient pixel density or require scaling, as well as a column classifier for the analysis of multi-column documents. With these additions, DLA performance was improved and a high accuracy in the prediction of the reading order was achieved.
56
+
57
+ Two Arabic/Persian terms form the name of the model suite: عين الله, which can be transcribed as "ain'allah" or "eynollah"; it translates into English as "God's Eye" -- it sees (nearly) everything on the document image.
58
+
59
+
60
+
61
+ - **Developed by:** [Vahid Rezanezhad](mailto:Vahid.Rezanezhad@sbb.spk-berlin.de)
62
+ - **Shared by:** [Staatsbibliothek zu Berlin / Berlin State Library](https://huggingface.co/SBB)
63
+ - **Model type:** Neural Network
64
+ - **Language(s) (NLP):** Irrelevant; works on all languages
65
+ - **License:** apache-2.0
66
+ - **Resources for more information:**
67
+ - GitHub repository: [qurator-spk/eynollah](https://github.com/qurator-spk/eynollah)
68
+ - Associated Paper: [Document Layout Analysis with Deep Learning and Heuristics](https://doi.org/10.1145/3604951.3605513)
69
+
70
+ # Uses
71
+
72
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
73
+
74
+ The intended use of the suite is performing document layout analysis (DLA) on image data. The system returns the results in [PAGE-XML format](https://github.com/PRImA-Research-Lab/PAGE-XML).
75
+
76
+ ## Direct Use
77
+
78
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
79
+ <!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
80
+
81
+ The system performs document layout analysis in a series of steps. First, the image is cropped and the number of columns is determined. Then the pixels-per-inch (ppi) rate is detected; if it is below 300, the image is re-scaled and enhanced. Next, the main region types (text regions, images, separators, background) are detected as the early layout. Marginals are detected with a heuristic method, followed -- optionally -- by headers (or headings or floatings) and drop capitals. Textline segmentation is then performed for text regions, and for each text region the slope used for deskewing is calculated. After deskewing, heuristics are used to determine the bounding boxes (or contours in the case of curved lines) of the textlines in each region. The reading order of text regions is then detected based on separators, headers (or headings or floatings) and the coordinates of columns. Finally, all results are written to a PAGE-XML file.
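The ppi rule in the steps above can be illustrated with a minimal sketch. It assumes that an image below the 300 ppi threshold is enlarged until its effective density reaches that threshold; the model card only states that such images are re-scaled and enhanced, so the target density and both function names are hypothetical and not taken from the eynollah code.

```python
def rescale_factor(detected_ppi: float, threshold_ppi: float = 300.0) -> float:
    """Return the enlargement factor for an image below the ppi threshold.

    Assumption: the image is scaled up so that its effective density
    reaches the threshold. Images at or above the threshold are left as-is.
    """
    if detected_ppi <= 0:
        raise ValueError("detected_ppi must be positive")
    if detected_ppi >= threshold_ppi:
        return 1.0
    return threshold_ppi / detected_ppi


def rescaled_size(width: int, height: int, detected_ppi: float) -> tuple[int, int]:
    """Pixel dimensions after applying the (assumed) rescale rule."""
    factor = rescale_factor(detected_ppi)
    return round(width * factor), round(height * factor)
```

Under this assumption, a 1000 x 1500 pixel scan detected at 150 ppi would be enlarged by a factor of 2, while a 300 ppi scan would pass through unchanged.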
82
+
83
+ **Within the suite, the model *eynollah-binarization/2021-04-25/saved_model.pb* is used for the task of binarization**.
84
+
85
+ This model performs document image binarization, that is, the segmentation of the image into white and black pixels. Binarization contributes significantly to the overall performance of the layout models, particularly when documents are degraded or of poor quality. Robust binarization improves the accuracy and reliability of the subsequent layout analysis.
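As an illustration of what segmentation into white and black pixels means, the following sketch turns a per-pixel foreground score map into a binary image. It assumes the network output is available as a nested list of foreground probabilities and that a fixed threshold of 0.5 is applied; both the input representation and the threshold are assumptions for this example, not details confirmed by the model card.

```python
def binarize(foreground_prob: list[list[float]], threshold: float = 0.5) -> list[list[int]]:
    """Map per-pixel foreground probabilities to a black-and-white image.

    Pixels at or above the threshold become 0 (black, ink);
    all other pixels become 255 (white, background).
    """
    return [
        [0 if p >= threshold else 255 for p in row]
        for row in foreground_prob
    ]
```

A real pipeline would operate on arrays rather than nested lists, but the decision per pixel is the same.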
86
+
87
+
88
+ ## Downstream Use
89
+
90
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
91
+ <!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
92
+
93
+ The intended use of this suite of 13 models is conceived of as a system. Comparable systems are [dhSegment](https://doi.org/10.1109/ICFHR-2018.2018.00011) and [P2PaLA](https://github.com/lquirosd/P2PaLA). However -- as always with a system -- every component can potentially be used on its own. Each model might therefore be used or trained for other downstream purposes.
94
+
95
+
96
+ ## Out-of-Scope Use
97
+
98
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
99
+ <!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
100
+
101
+ This model suite does **NOT** perform any Optical Character Recognition (OCR); it is an image-to-PAGE-XML system only.
102
+
103
+ # Bias, Risks, and Limitations
104
+
105
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
106
+
107
+ The pre-processing of digitised historical and contemporary texts contributes to knowledge creation, both by developing models that facilitate the necessary pre-processing steps of document layout analysis and ultimately by enabling better discoverability of information in the processed works. Since the content of the document images is not altered, no ethical challenges could be identified. The model was developed to improve document layout analysis and not for profit. Though a commercial product based on this model may be developed in the future, it will always remain openly accessible without any commercial interest. As a technical limitation, it has to be noted that performance on historical text could still be improved considerably by adding more historical Ground Truth data.
108
+
109
+ ## Recommendations
110
+
111
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
112
+
113
+ The application of machine learning models to convert a document image into PAGE-XML is a process which can still be improved. The suite of 13 models performs pixel-wise segmentation in patches; it therefore lacks a global understanding of documents, which leaves the document layout analysis system unable to detect some document subcategories such as page numbers. The end-to-end system consists of several stages, each of which uses the output of the previous one, so a poor prediction in an early step may lead to poor final results. Patch-wise segmentation can also make it difficult to segment pixels between text blocks, large-scale drop capitals, headers and tables, since in each patch the model sees only a part of the document element and not all of it. There is therefore a lot to gain by improving the existing model suite.
114
+
115
+ # Training Details
116
+
117
+ ## Training Data
118
+
119
+ <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
120
+
121
+ For model training, Ground Truth in PAGE-XML format was sourced mainly from three datasets. The [IMPACT dataset of historical document images](https://doi.org/10.1145/2501115.2501130) contains a representative sample of historical books and newspapers from Europe’s major libraries. Due to restrictions, only the open data from the National Library of the Netherlands (KB) and the Poznan Supercomputing and Networking Center (PSNC) were used. The [Europeana Newspapers Project (ENP) image and ground truth dataset of historical newspapers](https://doi.org/10.1109/ICDAR.2015.7333898) contains images representative of the newspaper digitisation projects of 12 national and major European libraries. The [OCR-D dataset](https://doi.org/10.1145/3322905.3322916) of German prints between 1500 and 1900 is based on a selection from the holdings of the “German Text Archive” (DTA), the Digitized Collections of the Berlin State Library and the Wolfenbüttel Digital Library.
122
+
123
+ ## Training Procedure
124
+
125
+ All models, except for the column classifier, are designed for pixelwise segmentation. The training of these models follows a patchwise approach, wherein the documents are divided into smaller patches and fed into the models during training. To train these segmentation models, annotated labels are required. Each sub-element in the document is assigned a unique value for identification.
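The patchwise approach described above can be sketched as a tiling of the page into fixed-size windows. This is a minimal illustration only: the patch size in the usage note is hypothetical, and shifting the last patch back so that it stays inside the image is one common choice, not necessarily what the training code does.

```python
def patch_origins(length: int, patch: int) -> list[int]:
    """Start offsets of patches that together cover the range [0, length).

    Patches are laid out side by side; if the last one would run past
    the end, it is shifted back so that it ends exactly at `length`.
    If the image is smaller than one patch, a single origin 0 is
    returned and the caller has to pad or resize the image.
    """
    if length <= patch:
        return [0]
    origins = list(range(0, length - patch + 1, patch))
    if origins[-1] != length - patch:
        origins.append(length - patch)
    return origins


def patch_boxes(width: int, height: int, patch_w: int, patch_h: int) -> list[tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) boxes in row-major order."""
    return [
        (x, y, x + patch_w, y + patch_h)
        for y in patch_origins(height, patch_h)
        for x in patch_origins(width, patch_w)
    ]
```

For example, with a hypothetical patch size of 448 pixels, a 1000 x 1000 image yields three patch positions per axis (the last one overlapping its neighbour), which also shows why a single patch only ever contains part of a large layout element.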
126
+
127
+ To examine the training process, see the repository containing the source code for training an encoder model for document image segmentation [on GitHub](https://github.com/qurator-spk/sbb_pixelwise_segmentation).
128
+
129
+
130
+ ### Preprocessing
131
+
132
+ In order to use this suite of models for historical documents, no preprocessing is needed for the input images.
133
+
134
+ ### Speeds, Sizes, Times
135
+
136
+ The duration of training per epoch varies, typically lasting between 2 and 5 hours, depending on the specific use case, the datasets used, and the extent of applied data augmentation.
137
+ Our ResNet-50-U-Net model has 38.15M parameters, of which only the parameters of the decoder part (14.71M) are fully trained.
138
+ For the column classifier, we used a ResNet-50 with two densely connected layers on top. The column classifier model has 25.6M parameters, of which only the parameters of the dense layers (2.16M) are fully trained.
139
+
140
+ ### Training hyperparameters
141
+
142
+ With the model architecture held constant, our hyperparameters comprise various augmentations, each with its own set of parameters. Further training hyperparameters include the choice of the loss function, the number of batches, the patch size of input images, and the number of epochs.
143
+
144
+ ### Training results
145
+
146
+ Training results are reported in [this paper](https://doi.org/10.1145/3604951.3605513).
147
+
148
+ # Evaluation
149
+
150
+ Because the prevailing segmentation metric does not adequately reflect how well regions are isolated in document segmentation, we evaluated our model using smaller batches of the three above-named datasets used for training. To improve results, we employed an ensemble learning approach by aggregating the best epoch weights.
151
+
152
+ ## Testing Data, Factors and Metrics
153
+
154
+ ### Testing Data
155
+
156
+ Three new datasets were used for evaluation to ensure an unbiased comparison. [The PRImA Layout Analysis Dataset](https://www.primaresearch.org/datasets/Layout_Analysis) contains 478 pages of realistic documents, reflecting various challenges in layout analysis.
157
+ [The German-Brazilian Newspapers (GBN) Dataset](https://web.inf.ufpr.br/vri/databases/gbn/) comprises 152 pages from historical newspapers. We used only 61 pages to keep the complexity of the documents similar. Finally, the (unreleased) Vlaamse Erfgoedbibliotheken (VEB) dataset comprises ground truth for 75 pages from historical Belgian newspapers, split across three sets.
158
+
159
+ ### Metrics
160
+
161
+ In addition to conventional performance metrics such as mean Intersection over Union (mIoU), precision, recall, and F1-score, we used the [PRImA layout evaluation](https://www.primaresearch.org/tools/PerformanceEvaluation) metrics, namely Merge, Split, Miss, and the overall success rate.
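The conventional pixel-level metrics named above follow standard definitions, which the sketch below spells out for small label maps given as nested lists. It is an illustration of those definitions only, not the evaluation code used for the paper, and the function names are hypothetical. The PRImA metrics (Merge, Split, Miss, success rate) are region-based and are not covered here.

```python
def binary_metrics(pred: list[list[bool]], truth: list[list[bool]]) -> dict[str, float]:
    """Precision, recall, F1 and IoU for one class from boolean masks."""
    tp = fp = fn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1      # predicted and present
            elif p:
                fp += 1      # predicted but absent
            elif t:
                fn += 1      # present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


def mean_iou(pred: list[list[int]], truth: list[list[int]], classes: list[int]) -> float:
    """Mean of the per-class IoU values over the given class labels."""
    ious = []
    for c in classes:
        pred_mask = [[v == c for v in row] for row in pred]
        truth_mask = [[v == c for v in row] for row in truth]
        ious.append(binary_metrics(pred_mask, truth_mask)["iou"])
    return sum(ious) / len(ious)
```

For binarization, `classes` would simply hold the two labels for foreground and background.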
162
+
163
+ # Model Examination
164
+
165
+ Examination results are reported in [this paper](https://doi.org/10.1145/3604951.3605513).
166
+
167
+ # Environmental Impact
168
+
169
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
170
+
171
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
172
+
173
+ - **Hardware Type:** GeForce RTX 2070
174
+ - **Hours used:** 2 to 5 hours per epoch
175
+ - **Cloud Provider:** No cloud.
176
+ - **Compute Region:** Germany.
177
+ - **Carbon Emitted:** More information needed.
178
+
179
+ # Technical Specifications
180
+
181
+ ## Model Architecture and Objective
182
+
183
+ See [publication](https://doi.org/10.1145/3604951.3605513).
184
+
185
+ ## Software
186
+
187
+ See the code published on [GitHub](https://github.com/qurator-spk/eynollah).
188
+
189
+ # Citation
190
+
191
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
192
+
193
+
194
+
195
+ **BibTeX:**
196
+
197
+ ```bibtex
198
+ @inproceedings{10.1145/3604951.3605513,
199
+ author = {Rezanezhad, Vahid and Baierer, Konstantin and Gerber, Mike and Labusch, Kai and Neudecker, Clemens},
200
+ title = {Document Layout Analysis with Deep Learning and Heuristics},
201
+ year = {2023},
202
+ isbn = {9798400708411},
203
+ publisher = {Association for Computing Machinery},
204
+ address = {New York, NY, USA},
205
+ url = {https://doi.org/10.1145/3604951.3605513},
206
+ doi = {10.1145/3604951.3605513},
207
+ abstract = {The automated yet highly accurate layout analysis (segmentation) of historical document images remains a key challenge for the improvement of Optical Character Recognition (OCR) results. But historical documents exhibit a wide array of features that disturb layout analysis, such as multiple columns, drop capitals and illustrations, skewed or curved text lines, noise, annotations, etc. We present a document layout analysis (DLA) system for historical documents implemented by pixel-wise segmentation using convolutional neural networks. In addition, heuristic methods are applied to detect marginals and to determine the reading order of text regions. Our system can detect more layout classes (e.g. initials, marginals) and achieves higher accuracy than competitive approaches. We describe the algorithm, the different models and how they were trained and discuss our results in comparison to the state-of-the-art on the basis of three historical document datasets.},
208
+ booktitle = {Proceedings of the 7th International Workshop on Historical Document Imaging and Processing},
209
+ pages = {73–78},
210
+ numpages = {6},
211
+ keywords = {Document layout analysis, Reading order detection, Segmentation},
212
+ location = {San Jose, CA, USA},
213
+ series = {HIP '23}
214
+ }
215
+ ```
216
+
217
+ **APA:**
218
+
219
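+ Rezanezhad, V., Baierer, K., Gerber, M., Labusch, K., & Neudecker, C. (2023). Document layout analysis with deep learning and heuristics. In *Proceedings of the 7th International Workshop on Historical Document Imaging and Processing* (pp. 73–78). Association for Computing Machinery. https://doi.org/10.1145/3604951.3605513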
220
+
221
+ # More Information
222
+
223
+ SHA256 and MD5 hashes for the model *eynollah-binarization/2021-04-25/saved_model.pb*:
224
+
225
+ SHA256: 18dc9879828a42d8f12845f6026d4835acf7ac70f82abda68ad3a5cc17b9e44a
226
+ MD5: 0544acbce4a19868a5a9d62c284a648a
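A downloaded copy of the model file can be checked against the hashes listed above with Python's standard `hashlib` module. The sketch below is one way to do this; the function names and the local file path in the usage note are hypothetical.

```python
import hashlib

# Hashes as published in this model card.
EXPECTED_SHA256 = "18dc9879828a42d8f12845f6026d4835acf7ac70f82abda68ad3a5cc17b9e44a"
EXPECTED_MD5 = "0544acbce4a19868a5a9d62c284a648a"


def file_digests(path: str, chunk_size: int = 1 << 20) -> tuple[str, str]:
    """Return the (SHA256, MD5) hex digests of a file, read in chunks."""
    sha256 = hashlib.sha256()
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            sha256.update(chunk)
            md5.update(chunk)
    return sha256.hexdigest(), md5.hexdigest()


def verify(path: str) -> bool:
    """True if both digests of the file match the published values."""
    sha256, md5 = file_digests(path)
    return sha256 == EXPECTED_SHA256 and md5 == EXPECTED_MD5
```

Calling `verify("saved_model.pb")` on a local copy (path assumed) returns `True` only if both digests match; a mismatch suggests an incomplete download or a different model version.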
227
+
228
+
229
+ # Model Card Authors
230
+
231
+ <!-- This section provides another layer of transparency and accountability. Whose views is this model card representing? How many voices were included in its construction? Etc. -->
232
+
233
+ [Vahid Rezanezhad](mailto:vahid.rezanezhad@sbb.spk-berlin.de), [Clemens Neudecker](mailto:clemens.neudecker@sbb.spk-berlin.de) and [Jörg Lehmann](mailto:joerg.lehmann@sbb.spk-berlin.de)
234
+
235
+ # Model Card Contact
236
+
237
+ Questions and comments about the model can be directed to [Vahid Rezanezhad](mailto:Vahid.Rezanezhad@sbb.spk-berlin.de); questions and comments about the model card can be directed to [Jörg Lehmann](mailto:joerg.lehmann@sbb.spk-berlin.de).
238
+
239
+ # How to Get Started with the Model
240
+
241
+ Instructions for getting started with this model are provided in the README file of the [GitHub repository](https://github.com/qurator-spk/eynollah).
242
+
243
+ &nbsp;
244
+
245
+ Model Card as of August 17th, 2023
246
+