SBB / ark-omikuji-fre-title-content

Text Classification · French · annif · glam · lam

Jrglmn committed 5031e0b (1 parent: e55557d): Model Card for ark-omikuji-fre-title-content

---
language:
- fr
tags:
- annif
- glam
- lam
pipeline_tag: text-classification
license: apache-2.0
datasets:
- "[published on Zenodo](https://zenodo.org/doi/10.5281/zenodo.13301020)"
library_name: annif
---

# Model Card for ark-omikuji-fre-title-content

An [Annif](https://annif.org/) model trained on historical titles and additional catalogue metadata for automatic subject indexing. It classifies a given text into one or more subjects from the “Alter Realkatalog” ([ARK](https://staatsbibliothek-berlin.de/recherche/kataloge-der-staatsbibliothek/alter-realkatalog-und-historische-systematik)) classification system. The model was developed in the research project [Human.Machine.Culture](https://mmk.sbb.berlin/?lang=en) at Staatsbibliothek zu Berlin – Berlin State Library (SBB).

# Table of Contents

* [Model Card for ark-omikuji-fre-title-content](#model-card-for-ark-omikuji-fre-title-content)
* [Table of Contents](#table-of-contents)
* [Model Details](#model-details)
  * [Model Description](#model-description)
* [Uses](#uses)
  * [Direct Use](#direct-use)
  * [Downstream Use](#downstream-use)
  * [Out-of-Scope Use](#out-of-scope-use)
* [Bias, Risks, and Limitations](#bias-risks-and-limitations)
  * [Recommendations](#recommendations)
* [Training Details](#training-details)
  * [Training Data](#training-data)
  * [Training Procedure](#training-procedure)
    * [Preprocessing](#preprocessing)
    * [Speeds, Sizes, Times](#speeds-sizes-times)
    * [Training hyperparameters](#training-hyperparameters)
    * [Training results](#training-results)
* [Evaluation](#evaluation)
  * [Testing Data, Factors and Metrics](#testing-data-factors-and-metrics)
    * [Testing Data](#testing-data)
    * [Metrics](#metrics)
* [Environmental Impact](#environmental-impact)
* [Technical Specifications](#technical-specifications)
  * [Model Architecture and Objective](#model-architecture-and-objective)
  * [Software](#software)
* [Model Card Authors](#model-card-authors)
* [Model Card Contact](#model-card-contact)
* [How to Get Started with the Model](#how-to-get-started-with-the-model)

# Model Details

## Model Description

An [Annif](https://annif.org/) model trained on historical titles and additional catalogue metadata for automatic subject indexing. Subject indexing is a classical library task that aims to describe the content of a resource. The model is intended to automatically classify historical texts using a historical classification system developed in the 19th century, in order to enrich those texts that have not yet been classified manually. For each input text, the model outputs one or more subjects from the [ARK](https://staatsbibliothek-berlin.de/recherche/kataloge-der-staatsbibliothek/alter-realkatalog-und-historische-systematik) classification system. It is part of a collection of 5 models created with the Annif toolkit, which addresses this task of automated subject indexing.

* **Developed by:** [Sophie Schneider](mailto:sophie.schneider@sbb.spk-berlin.de)
* **Shared by:** [Staatsbibliothek zu Berlin – Berlin State Library](https://huggingface.co/SBB)
* **Model type:** tree-based
* **Language(s) (NLP):** fr (French)
* **License:** apache-2.0

# Uses

## Direct Use

This model can be used directly to automatically classify historical texts with the ARK classification scheme. It is intended to be used together with the Annif automated subject indexing toolkit, versions 0.60.0 to 1.1.0.
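
For example, a single title could be classified like the following. This is a sketch assuming a local Annif installation with this project set up under the id `ark-omikuji-fre-title-content`; the input title is invented, and the `annif` call is skipped when Annif is not installed:

```shell
title="Histoire naturelle des oiseaux d'Afrique"   # invented example input

# `annif suggest` reads the document text from stdin and prints
# suggested ARK subjects with confidence scores.
if command -v annif >/dev/null 2>&1; then
    printf '%s' "$title" | annif suggest --limit 5 ark-omikuji-fre-title-content
fi
```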

## Downstream Use

Other/downstream uses outside of the Annif setting described above are not intended, but also not excluded.

## Out-of-Scope Use

The model is not intended for use on contemporary texts: language and concept drift will probably affect the results negatively, and some terms from the vocabulary are not appropriate for more recent publications.

# Bias, Risks, and Limitations

Since we are dealing with historical texts and especially with a historical classification system such as the ARK, the classes suggested for an input text might not be suitable for today’s understanding, or might even be of an unethical nature (for more information, see the [datasheet accompanying the Metadata of the “Alter Realkatalog” (ARK) of Berlin State Library](https://zenodo.org/doi/10.5281/zenodo.12783813) and the [Datasheet for Machine-Readable Vocabulary Files of the ARK (Alter Realkatalog)](https://zenodo.org/doi/10.5281/zenodo.13301020)).

Another limitation when using the ARK as a vocabulary arises from its hierarchical structure: the system contains multiple classes that do not describe the same content (i.e. they have different IDs) but are labeled identically (same name). This is because manually inspecting the whole path to a class, including all the upper-level classes leading to it, provides additional information that allows for contextualization. As duplicate label names are, as expected, a challenge for lexical methods, we decided to focus on statistical rather than lexical algorithms.

## Recommendations

Considering that the ARK classification scheme consists of 225,691 classes in total, that only limited training material is at hand, and that the distribution of classes is overall unbalanced, this task can be described as an Extreme Multi-Label Classification (XMC) problem. We recommend being aware of this limitation and, if available, using additional training data to improve the current model’s performance (e.g. by running `annif learn`, see the [CLI commands documentation](https://annif.readthedocs.io/en/v1.1.0/source/commands.html#annif-learn)).
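
Such an update could be sketched as follows, provided the backend supports online learning. The corpus file name and its single sample line are invented, and the `annif` call runs only when Annif is installed:

```shell
corpus=additional-ark-samples.tsv   # hypothetical extra corpus in Annif TSV format

# One invented sample line: text <TAB> placeholder subject URI (not a real ARK class).
printf 'Voyage pittoresque de la Grèce\t<https://example.org/ark/placeholder>\n' > "$corpus"

# Sanity check: every line must have exactly two tab-separated fields.
bad=$(awk -F'\t' 'NF != 2' "$corpus" | wc -l)

if [ "$bad" -eq 0 ] && command -v annif >/dev/null 2>&1; then
    annif learn ark-omikuji-fre-title-content "$corpus"
fi
```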

# Training Details

## Training Data

The training data include a selection of metadata fields obtained via CBS export:

* Lehmann, J., & Schneider, S. (2024). Metadata of the "Alter Realkatalog" (ARK) of Berlin State Library (SBB) (Version 1) [Data set]. Zenodo. [https://doi.org/10.5281/zenodo.12783813](https://doi.org/10.5281/zenodo.12783813)

The following title and content data fields have been extracted and combined from this dataset:

* "Abweichender Titel" ([4212](https://swbtools.bsz-bw.de/cgi-bin/k10plushelp.pl?cmd=kat&val=4212&katalog=Standard))
* "Abweichender Titel (Sucheinstieg)" ([3260](https://swbtools.bsz-bw.de/cgi-bin/k10plushelp.pl?cmd=kat&val=3260&katalog=Standard))
* "Ansetzungssachtitel" ([3220](https://swbtools.bsz-bw.de/cgi-bin/k10plushelp.pl?cmd=kat&val=3220&katalog=Standard))
* "Einheitssachtitel des beigefügten oder kommentierten Werkes" ([4210](https://swbtools.bsz-bw.de/cgi-bin/k10plushelp.pl?cmd=kat&val=4210&katalog=Standard))
* "Frühere/frühester Haupttitel (nur für fortlaufende und integrierende Ressourcen)" ([4213](https://swbtools.bsz-bw.de/cgi-bin/k10plushelp.pl?cmd=kat&val=4213&katalog=Standard))
* "Gesamttitel der Reproduktion" ([4110](https://swbtools.bsz-bw.de/cgi-bin/k10plushelp.pl?cmd=kat&val=4110&katalog=Standard))
* "Gesamttitel der fortlaufenden Ressource" ([4170](https://swbtools.bsz-bw.de/cgi-bin/k10plushelp.pl?cmd=kat&val=4170&katalog=Standard))
* "Gesamttitel der mehrteiligen Monografie" ([4150](https://swbtools.bsz-bw.de/cgi-bin/k10plushelp.pl?cmd=kat&val=4150&katalog=Standard))
* "Haupttitel, Titelzusatz, Verantwortlichkeitsangabe" ([4000](https://swbtools.bsz-bw.de/cgi-bin/k10plushelp.pl?cmd=kat&val=4000&katalog=Standard))
* "Normierter Zeitschriftenkurztitel" ([3232](https://swbtools.bsz-bw.de/cgi-bin/k10plushelp.pl?cmd=kat&val=3232&katalog=Standard))
* "Paralleltitel, paralleler Titelzusatz, parallele Verantwortlichkeitsangabe" ([4002](https://swbtools.bsz-bw.de/cgi-bin/k10plushelp.pl?cmd=kat&val=4002&katalog=Standard))
* "Titelkonkordanzen" ([4245](https://swbtools.bsz-bw.de/cgi-bin/k10plushelp.pl?cmd=kat&val=4245&katalog=Standard))
* "Titelzusätze und Verantwortlichkeitsangabe zur gesamten Vorlage" ([4011](https://swbtools.bsz-bw.de/cgi-bin/k10plushelp.pl?cmd=kat&val=4011&katalog=Standard))
* "Weitere Titel etc. bei Zusammenstellungen" ([4010](https://swbtools.bsz-bw.de/cgi-bin/k10plushelp.pl?cmd=kat&val=4010&katalog=Standard))
* "Weiterer Werktitel und sonstige unterscheidende Merkmale" ([3211](https://swbtools.bsz-bw.de/cgi-bin/k10plushelp.pl?cmd=kat&val=3211&katalog=Standard))
* "Werktitel und sonstige unterscheidende Merkmale des Werks" ([3210](https://swbtools.bsz-bw.de/cgi-bin/k10plushelp.pl?cmd=kat&val=3210&katalog=Standard))
* "Zusätzliche Sucheinstiege" ([4200](https://swbtools.bsz-bw.de/cgi-bin/k10plushelp.pl?cmd=kat&val=4200&katalog=Standard))
* "Veröffentlichungsart und Inhalt" ([1140](https://swbtools.bsz-bw.de/cgi-bin/k10plushelp.pl?cmd=kat&val=1140&katalog=Standard))
* "Sonstige Anmerkungen" ([4201](https://swbtools.bsz-bw.de/cgi-bin/k10plushelp.pl?cmd=kat&val=4201&katalog=Standard))
* "Zusammenfassende Register" ([4203](https://swbtools.bsz-bw.de/cgi-bin/k10plushelp.pl?cmd=kat&val=4203&katalog=Standard))
* "Inhaltliche Zusammenfassung" ([4207](https://swbtools.bsz-bw.de/cgi-bin/k10plushelp.pl?cmd=kat&val=4207&katalog=Standard) or [9000](https://swbtools.bsz-bw.de/cgi-bin/k10plushelp.pl?cmd=kat&val=9000&katalog=Standard))
* "Einleitender Text" ([7124](https://swbtools.bsz-bw.de/cgi-bin/k10plushelp.pl?cmd=kat&val=7124&katalog=Standard))

The vocabulary files themselves can be found here:

* Schneider, S., & Lehmann, J. (2024). Machine-Readable Vocabulary Files of the "Alter Realkatalog" (ARK) of Berlin State Library (SBB) [Data set]. Zenodo. [https://doi.org/10.5281/zenodo.13301020](https://doi.org/10.5281/zenodo.13301020)
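
The combined fields end up in Annif's simple TSV corpus format: one record per line, with the text and the subject URIs separated by a tab. A minimal sketch, in which the field values and the ARK URI are invented for illustration:

```shell
# Two (invented) title fields merged into one text, as done with the
# catalogue fields listed above; the URI is a placeholder, not a real ARK class.
haupttitel="Description de l'Égypte"
titelzusatz="ou recueil des observations et des recherches"

printf '%s %s\t<https://example.org/ark/placeholder-class>\n' \
    "$haupttitel" "$titelzusatz" > ark-train.tsv

# Each line: text <TAB> space-separated subject URIs in angle brackets.
cat ark-train.tsv
```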

## Training Procedure

The training procedure consists of loading the ARK vocabulary (see the [Datasheet for Machine-Readable Vocabulary Files of the ARK (Alter Realkatalog)](https://zenodo.org/doi/10.5281/zenodo.13301020)) into Annif and training the [Omikuji backend](https://github.com/NatLibFi/Annif/wiki/Backend%3A-Omikuji) on our training data. Further technical details can be found in the section [Training hyperparameters](#training-hyperparameters).
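
As a sketch, these two steps could look like the following on the command line. The file names are assumptions, not the actual paths used in the project, and the calls are skipped when Annif is not installed:

```shell
vocab_file=ark-vocabulary.tsv   # machine-readable ARK vocabulary from Zenodo (assumed name)
train_file=ark-train.tsv        # combined title/content corpus in Annif TSV format (assumed name)

if command -v annif >/dev/null 2>&1; then
    # Register the vocabulary under the id referenced in projects.cfg (vocab=arktsv).
    annif load-vocab arktsv "$vocab_file"
    # Train the Omikuji project on the corpus.
    annif train ark-omikuji-fre-title-content "$train_file"
fi
```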

### Preprocessing

Besides merging and transforming the data described under [Training Data](#training-data) to fit the [corpus formats](https://github.com/NatLibFi/Annif/wiki/Corpus-formats) accepted by Annif, no further preprocessing of natural language or similar has been performed.

### Speeds, Sizes, Times

Training takes from several minutes to a few hours on a V100, depending on the choice of dataset and algorithm as well as on the hyperparameter settings.

### Training hyperparameters

For some of the ARK Annif models, a slight hyperparameter optimization has been conducted to identify the final hyperparameter settings stated below.

Hyperparameter configuration (as it needs to be stated in the Annif `projects.cfg` file):

```
[ark-omikuji-fre-title-content]
name=ARK-DE-18 Omikuji
language=fr
backend=omikuji
analyzer=snowball(french)
vocab=arktsv
cluster_k=2
collapse_every_n_layers=5
```

### Training results

* Precision (`--limit 1`, `--threshold 0`): 0.4228
* Recall (`--limit 1`, `--threshold 0`): 0.4044
* F1 (`--limit 1`, `--threshold 0`): 0.4103
* NDCG (`--limit 1`, `--threshold 0`): 0.4169
* F1@5: 0.1885
* NDCG@5: 0.4939

# Evaluation

## Testing Data, Factors and Metrics

### Testing Data

The dataset is described under [Training Data](#training-data). It was split into training, validation and test subsets (80%/10%/10%).

### Metrics

Model performance has been evaluated with the following metrics: Precision, Recall, F1 and NDCG. These are standard metrics for machine learning, and more specifically for automatic subject indexing, and they are provided directly in Annif by calling `annif eval`. The evaluation parameters (`--limit`, the maximum number of results to return, and `--threshold`, the minimum confidence for a suggestion to be considered) were optimized beforehand on the validation subset and affect the results accordingly. We also state F1@5 and NDCG@5 scores reached without any evaluation parameters.
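
For reference, the scores reported above correspond to an invocation along these lines (a sketch; the test corpus path is an assumption, and the call is skipped when Annif is not installed):

```shell
limit=1       # maximum number of suggested subjects per document
threshold=0   # minimum confidence score for a suggestion

if command -v annif >/dev/null 2>&1; then
    annif eval ark-omikuji-fre-title-content --limit "$limit" --threshold "$threshold" ark-test.tsv
fi
```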

# Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

* **Hardware Type:** V100
* **Hours used:** 0.5 to 5 hours
* **Cloud Provider:** No cloud
* **Compute Region:** Germany

# Technical Specifications

## Model Architecture and Objective

See the [Annif](https://github.com/NatLibFi/Annif) and [Omikuji](https://github.com/tomtung/omikuji) repositories on GitHub. Omikuji is an implementation of Partitioned Label Trees (Prabhu et al., 2018):

* Y. Prabhu, A. Kag, S. Harsola, R. Agrawal, and M. Varma, “Parabel: Partitioned Label Trees for Extreme Classification with Application to Dynamic Search Advertising,” in Proceedings of the 2018 World Wide Web Conference, 2018, pp. 993–1002.

## Software

To run this model, an Annif version between 0.60.0 and 1.1.0 must be installed.

# Model Card Authors

[Sophie Schneider](mailto:sophie.schneider@sbb.spk-berlin.de) and [Jörg Lehmann](mailto:joerg.lehmann@sbb.spk-berlin.de)

# Model Card Contact

Questions and comments about the model can be directed to Sophie Schneider at sophie.schneider@sbb.spk-berlin.de; questions and comments about the model card can be directed to Jörg Lehmann at joerg.lehmann@sbb.spk-berlin.de.

# How to Get Started with the Model

Follow the Annif [Getting Started](https://github.com/NatLibFi/Annif/wiki/Getting-started) page to set up and run Annif. Create a `projects.cfg` file (see the section [Training hyperparameters](#training-hyperparameters) for the specific project configuration), load the ARK vocabulary (see the [Datasheet for Machine-Readable Vocabulary Files of the ARK (Alter Realkatalog)](https://zenodo.org/doi/10.5281/zenodo.13301020)) via the `annif load-vocab` command, and copy the model folder over to `data/projects`.
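
The steps above can be sketched as follows. All paths and file names are assumptions, and the Annif calls are guarded so the sketch is a no-op without a local installation and the downloaded model files:

```shell
# Annif expects its data directory next to projects.cfg.
mkdir -p data/projects

# 1. projects.cfg must contain the [ark-omikuji-fre-title-content] section
#    shown under "Training hyperparameters".
# 2. Load the ARK vocabulary and copy the downloaded model folder:
if command -v annif >/dev/null 2>&1 && [ -d ark-omikuji-fre-title-content ]; then
    annif load-vocab arktsv ark-vocabulary.tsv   # hypothetical vocabulary file name
    cp -r ark-omikuji-fre-title-content data/projects/
fi
```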