Rahka committed on
Commit
93c32a1
1 Parent(s): dd85427

upload read me with all training params (manual)

  1. README.md +25 -35
README.md CHANGED
@@ -48,30 +48,26 @@ model-index:
48
 
49
  # Model Card for Musterdatenkatalog Classifier
50
 
51
- <!-- Provide a quick summary of what the model is/does. -->
52
-
53
  # Model Details
54
 
55
  ## Model Description
56
 
57
- <!-- Provide a longer summary of what this model is. -->
58
-
59
  This model is based on bert-base-german-cased and fine-tuned on and-effect/mdk_gov_data_titles_clf. This model reaches an accuracy of XY on the test set and XY on the validation set
60
 
61
  - **Developed by:** and-effect
62
  - **Shared by:** [More Information Needed]
63
  - **Model type:** Text Classification
64
  - **Language(s) (NLP):** de
65
- - **License:** XY
66
- - **Finetuned from model:** bert-base-german-cased. For more information on the model, see [this model card](https://huggingface.co/bert-base-german-cased)
67
 
68
  ## Model Sources
69
 
70
  <!-- Provide the basic links for the model. -->
71
 
72
- - **Repository:** XY git hub repo?
73
- - **Paper:** website bst?
74
- - **Demo:** XY Spaces?
75
 
76
  # Direct Use
77
 
@@ -136,16 +132,15 @@ print(sentence_embeddings)
136
  <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
137
 
138
  The model is intended to classify open source dataset titles from German municipalities. More information on the taxonomy (classification categories) and the project can be found on XY.
139
- For more information see Github Repo + Spaces
 
140
 
141
  # Bias, Risks, and Limitations
142
 
143
  <!-- This section is meant to convey both technical and sociotechnical limitations. -->
144
 
145
- The model has some limitations in terms of the downstream task.
146
- 1. **Distribution of classes**: The dataset trained on is small, but at the same time the number of classes is very high. Thus, for some classes there are only a few examples (more information about the class distribution of the training data can be found here). Consequently, the performance for smaller classes may not be as good as for the majority classes. Accordingly, the evaluation is also limited.
147
- 2. **Systematic problems**: some subjects could not be correctly classified systematically. One example is the embedding of titles containing 'Corona'. In none of the evaluation cases could the titles be embedded in such a way that they corresponded to their true names. Another systematic example is the embedding and classification of titles related to 'migration'.
148
- 3. **Generalization of the model**: by using semantic search, the model is able to classify titles into new categories that have not been trained, but the model is not tuned for this and therefore the performance of the model for unseen classes is likely to be limited.
149
 
150
  ## Recommendations
151
 
@@ -161,13 +156,21 @@ Users (both direct and downstream) should be made aware of the risks, biases and
161
 
162
  You can find all information about the training data [here](https://huggingface.co/datasets/and-effect/mdk_gov_data_titles_clf). For fine-tuning, we used revision 172e61bb1dd20e43903f4c51e5cbec61ec9ae6e6 of the data, since performance was better with this earlier version.
163
 
164
- ## Training Procedure [optional]
165
-
166
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
167
 
168
  ### Preprocessing
169
 
170
- [More Information Needed]
 
 
 
 
 
 
 
 
 
 
171
 
172
  ## Training Parameter
173
  The model was trained with the parameters:
@@ -181,16 +184,8 @@ The model was trained with the parameters:
181
  Hyperparameter:
182
  ```
183
  {
184
- "epochs": [More Information Needed],
185
- "evaluation_steps": 0,
186
- "evaluator": NoneType,
187
- "max_grad_norm": 1,
188
- "optimizer_class": <class 'torch.optim.adamw.AdamW'>,
189
- "optimizer_params": {'learning rate': 2e-05},
190
- "scheduler": WarmupLinear,
191
- "steps_per_epoch": null,
192
- "warmup_steps": 100,
193
- "weight_decay":0.01
194
  }
195
  ```
196
 
@@ -203,21 +198,16 @@ Hyperparameter:
203
 
204
  # Evaluation
205
 
206
- <!-- This section describes the evaluation protocols and provides the results. -->
207
 
208
  ## Testing Data, Factors & Metrics
209
 
210
  ### Testing Data
211
-
212
- <!-- This should link to a Data Card if possible. -->
213
-
214
  The evaluation data can be found [here](https://huggingface.co/datasets/and-effect/mdk_gov_data_titles_clf). Since the model was trained on revision 172e61bb1dd20e43903f4c51e5cbec61ec9ae6e6, the evaluation metrics rely on the same revision.
215
 
216
 
217
  ### Metrics
218
 
219
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
220
-
221
  Model performance is measured with four metrics: accuracy, precision, recall, and F1 score. Many classes were never predicted, so their precision, recall, and F1 score are set to zero. For these metrics, additional calculations were performed excluding classes with fewer than two predictions for the level 'Bezeichnung' (see 'Bezeichnung II' in the results table). These results should be interpreted with caution, because they do not represent all classes.
222
 
223
  ## Results
@@ -229,7 +219,7 @@ The model performance is tested with fours metrices. Accuracy, Precision, Recall
229
  | Test dataset 'Bezeichnung' II | 0.7004405286343612 | 0.573015873015873 | 0.8207602339181287 | 0.6515010351966875 |
230
  | Validation dataset 'Bezeichnung' I | 0.5445544554455446 | 0.41787439613526567 | 0.39929183135704877 | 0.4010173484686228 |
231
  | Validation dataset 'Thema' I | 0.801980198019802 | 0.6433080808080808 | 0.7039711632453568 | 0.6591710279769981 |
232
- | Validation dataset 'Bezeichnung' II | 0.5445544554455446 | 0.6018518518518517 | 0.6278409090909091 | 0.6066776135741653 |
233
 
234
 
235
  ### Summary
 
48
 
49
  # Model Card for Musterdatenkatalog Classifier
50
 
 
 
51
  # Model Details
52
 
53
  ## Model Description
54
 
 
 
55
  This model is based on bert-base-german-cased and fine-tuned on and-effect/mdk_gov_data_titles_clf. This model reaches an accuracy of XY on the test set and XY on the validation set
56
 
57
  - **Developed by:** and-effect
58
  - **Shared by:** [More Information Needed]
59
  - **Model type:** Text Classification
60
  - **Language(s) (NLP):** de
61
+ - **License:** [More Information Needed]
62
+ - **Finetuned from model:** bert-base-german-cased. For more information on the model, see [this model card](https://huggingface.co/bert-base-german-cased)
63
 
64
  ## Model Sources
65
 
66
  <!-- Provide the basic links for the model. -->
67
 
68
+ - **Repository:** [More Information Needed]
69
+ - **Paper:** [More Information Needed]
70
+ - **Demo:** [More Information Needed]
71
 
72
  # Direct Use
73
 
 
132
  <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
133
 
134
  The model is intended to classify open source dataset titles from German municipalities. More information on the taxonomy (classification categories) and the project can be found on XY.
135
+
136
+ [More Information Needed on downstream_use_demo]
137
 
138
  # Bias, Risks, and Limitations
139
 
140
  <!-- This section is meant to convey both technical and sociotechnical limitations. -->
141
 
142
+ The model has some limitations in terms of the downstream task. \n 1. **Distribution of classes**: The training dataset is small, while the number of classes is very high. As a result, some classes have only a few examples (more information about the class distribution of the training data can be found here). Consequently, performance for smaller classes may not be as good as for the majority classes, and the evaluation is limited accordingly. \n 2. **Systematic problems**: Some subjects were systematically misclassified. One example is the embedding of titles containing 'Corona'. In none of the evaluation cases could the titles be embedded in such a way that they corresponded to their true names. Another systematic example is the embedding and classification of titles related to 'migration'. \n 3. **Generalization of the model**: By using semantic search, the model is able to classify titles into new categories that it was not trained on, but the model is not tuned for this, so its performance on unseen classes is likely to be limited.
143
+
 
 
144
 
145
  ## Recommendations
146
 
 
156
 
157
  You can find all information about the training data [here](https://huggingface.co/datasets/and-effect/mdk_gov_data_titles_clf). For fine-tuning, we used revision 172e61bb1dd20e43903f4c51e5cbec61ec9ae6e6 of the data, since performance was better with this earlier version.
158
 
159
+ ## Training Procedure
 
 
160
 
161
  ### Preprocessing
162
 
163
+ This section describes how the input data for the model is generated. More information on the preprocessing of the data itself can be found [here](https://huggingface.co/datasets/and-effect/mdk_gov_data_titles_clf).
164
+
165
+ The model is fine-tuned with similar and dissimilar pairs. Similar pairs are built from all titles and their true label. Dissimilar pairs are defined as pairs of a title and any label other than its true label. Since the number of possible dissimilar combinations is much higher, a sample of two pairs per title is selected.
166
+
167
+ | pairs | size |
168
+ |-----|-----|
169
+ | train_similar_pairs | 2018 |
170
+ | train_unsimilar_pairs | 1009 |
171
+ | test_similar_pairs | 498 |
172
+ | test_unsimilar_pairs | 249 |
173
+
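The pair construction described above can be sketched roughly as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `build_pairs`, the similarity scores `1.0` and `0.0`, the fixed random seed, and the toy titles and labels are all assumptions made for the example, and the sketch does not attempt to reproduce the pair counts in the table.

```python
import random


def build_pairs(titles_with_labels, all_labels, n_dissimilar=2, seed=0):
    """Build (title, label, score) pairs for fine-tuning.

    Similar pairs combine each title with its true label (score 1.0).
    Dissimilar pairs combine each title with a random sample of
    n_dissimilar other labels (score 0.0). The scores are an assumption.
    """
    rng = random.Random(seed)
    similar, dissimilar = [], []
    for title, true_label in titles_with_labels:
        similar.append((title, true_label, 1.0))
        # Every label except the true one is a candidate for a dissimilar pair.
        wrong_labels = [label for label in all_labels if label != true_label]
        k = min(n_dissimilar, len(wrong_labels))
        for label in rng.sample(wrong_labels, k):
            dissimilar.append((title, label, 0.0))
    return similar, dissimilar


# Hypothetical toy data, not taken from the actual dataset.
labels = ["Verkehr - Parkplatz", "Bildung - Schule", "Umwelt - Luftqualität"]
data = [("Parkhäuser Innenstadt", labels[0]), ("Grundschulen 2021", labels[1])]
similar, dissimilar = build_pairs(data, labels)
```

Sampling a fixed number of wrong labels per title keeps the dissimilar set from growing with the size of the taxonomy, which matches the motivation given in the text.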
174
 
175
  ## Training Parameter
176
  The model was trained with the parameters:
 
184
  Hyperparameter:
185
  ```
186
  {
187
+ "epochs": 3,
188
+ "warmup_steps": [More Information Needed],
 
 
 
 
 
 
 
 
189
  }
190
  ```
191
 
 
198
 
199
  # Evaluation
200
 
201
+ All metrics express the model's ability to classify dataset titles from GOVDATA into the taxonomy described [here](https://huggingface.co/datasets/and-effect/mdk_gov_data_titles_clf). For more information see LINK TO MDK PROJECT.
202
 
203
  ## Testing Data, Factors & Metrics
204
 
205
  ### Testing Data
 
 
 
206
  The evaluation data can be found [here](https://huggingface.co/datasets/and-effect/mdk_gov_data_titles_clf). Since the model was trained on revision 172e61bb1dd20e43903f4c51e5cbec61ec9ae6e6, the evaluation metrics rely on the same revision.
207
 
208
 
209
  ### Metrics
210
 
 
 
211
  Model performance is measured with four metrics: accuracy, precision, recall, and F1 score. Many classes were never predicted, so their precision, recall, and F1 score are set to zero. For these metrics, additional calculations were performed excluding classes with fewer than two predictions for the level 'Bezeichnung' (see 'Bezeichnung II' in the results table). These results should be interpreted with caution, because they do not represent all classes.
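To make the two evaluation variants concrete, here is a small sketch of how such scores could be computed. It assumes macro averaging over classes and assumes that accuracy is always computed over all samples; the model card does not state either choice. The function name `macro_scores`, the `min_predictions` parameter, and the toy labels are hypothetical.

```python
from collections import Counter


def macro_scores(y_true, y_pred, min_predictions=0):
    """Macro-averaged precision, recall and F1 plus overall accuracy.

    Classes that were predicted fewer than min_predictions times are
    left out of the macro average. Classes that remain but have no
    predictions (or no true samples) contribute a score of zero.
    """
    predicted_counts = Counter(y_pred)
    true_counts = Counter(y_true)
    classes = sorted(set(y_true) | set(y_pred))
    kept = [c for c in classes if predicted_counts[c] >= min_predictions]

    precisions, recalls, f1s = [], [], []
    for c in kept:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = predicted_counts[c] - tp
        fn = true_counts[c] - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(kept)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return {
        "accuracy": correct / len(y_true) if y_true else 0.0,
        "precision": sum(precisions) / n if n else 0.0,
        "recall": sum(recalls) / n if n else 0.0,
        "f1": sum(f1s) / n if n else 0.0,
    }


# Hypothetical toy labels: class "c" is never predicted, "b" only once.
y_true = ["a", "a", "b", "c"]
y_pred = ["a", "a", "a", "b"]
scores_all = macro_scores(y_true, y_pred)                         # variant I
scores_filtered = macro_scores(y_true, y_pred, min_predictions=2)  # variant II
```

In the toy data, dropping the rarely predicted classes raises the macro scores considerably, which illustrates why the filtered results do not represent all classes.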
212
 
213
  ## Results
 
219
  | Test dataset 'Bezeichnung' II | 0.7004405286343612 | 0.573015873015873 | 0.8207602339181287 | 0.6515010351966875 |
220
  | Validation dataset 'Bezeichnung' I | 0.5445544554455446 | 0.41787439613526567 | 0.39929183135704877 | 0.4010173484686228 |
221
  | Validation dataset 'Thema' I | 0.801980198019802 | 0.6433080808080808 | 0.7039711632453568 | 0.6591710279769981 |
222
+ | Validation dataset 'Bezeichnung' II | 0.5445544554455446 | 0.6018518518518519 | 0.6278409090909091 | 0.6066776135741653 |
223
 
224
 
225
  ### Summary