Rahka committed on
Commit 9e1935f
1 Parent(s): edad5c3

upload retrained model

.gitattributes ADDED
@@ -0,0 +1 @@
+pytorch_model.bin filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json ADDED
@@ -0,0 +1,7 @@
+{
+  "word_embedding_dimension": 768,
+  "pooling_mode_cls_token": false,
+  "pooling_mode_mean_tokens": true,
+  "pooling_mode_max_tokens": false,
+  "pooling_mode_mean_sqrt_len_tokens": false
+}
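
This pooling config enables only mask-aware mean pooling over the 768-dimensional token embeddings (CLS, max, and sqrt-length modes are all off). For reference, a minimal sketch of the operation this selects; the function name and shapes are illustrative, not taken from this repo:

```python
import torch

def mean_pooling(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mask-aware mean over tokens: (batch, seq, 768) and (batch, seq) -> (batch, 768)."""
    # Expand the mask so padding positions contribute nothing to the sum.
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # guard against division by zero
    return summed / counts
```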
README.md CHANGED
@@ -3,6 +3,7 @@ language: de
 library_name: sentence-transformers
 tags:
 - sentence-similarity
+datasets: and-effect/mdk_gov_data_titles_clf
 widget:
 - source_sentence: "Bebauungspläne, vorhabenbezogene Bebauungspläne (Geltungsbereiche)"
   sentences:
@@ -10,7 +11,6 @@ widget:
   - "Tagespflege Altenhilfe"
   - "Bebauungsplan der Innenentwicklung gem. § 13a BauGB - Ortskern Rütenbrock"
   example_title: "Bebauungsplan"
-datasets: and-effect/mdk_gov_data_titles_clf
 metrics:
 - accuracy
 - precision
@@ -28,30 +28,29 @@ model-index:
       revision: 172e61bb1dd20e43903f4c51e5cbec61ec9ae6e6
     metrics:
     - type: accuracy
-      value: 0.6762295081967213
+      value: 0.7336065573770492
       name: Accuracy 'Bezeichnung'
     - type: precision
-      value: 0.5688091249507292
+      value: 0.6611111111111111
       name: Precision 'Bezeichnung' (macro)
     - type: recall
-      value: 0.5981436148510813
+      value: 0.7056652046783626
       name: Recall 'Bezeichnung' (macro)
     - type: f1
-      value: 0.5693466048057273
+      value: 0.6674970889256604
       name: Recall 'Bezeichnung' (macro)
     - type: accuracy
      value: 0.8934426229508197
       name: Accuracy 'Thema'
     - type: precision
-      value: 0.9258716898716898
+      value: 0.902382942746851
       name: Precision 'Thema' (macro)
     - type: recall
-      value: 0.8669105248121641
+      value: 0.8909340386389567
       name: Recall 'Thema' (macro)
     - type: f1
-      value: 0.8632335412054082
+      value: 0.8768881364249785
       name: Recall 'Thema' (macro)
-pipeline_tag: sentence-similarity
 ---
 
 # Model Card for Musterdatenkatalog Classifier
@@ -82,6 +81,7 @@ This model is based on bert-base-german-cased and fine-tuned on and-effect/mdk_g
 This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
 
 ## Get Started with Sentence Transformers
+
 Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
 
 ```
@@ -100,6 +100,7 @@ print(embeddings)
 ```
 
 ## Get Started with HuggingFace Transformers
+
 Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.
 
 ```python
@@ -147,8 +148,7 @@ The model is intended to classify open source dataset titles from german municip
 
 <!-- This section is meant to convey both technical and sociotechnical limitations. -->
 
-The model has some limititations. The model has some limitations in terms of the downstream task. \n 1. **Distribution of classes**: The dataset trained on is small, but at the same time the number of classes is very high. Thus, for some classes there are only a few examples (more information about the class distribution of the training data can be found here). Consequently, the performance for smaller classes may not be as good as for the majority classes. Accordingly, the evaluation is also limited. \n 2. **Systematic problems**: some subjects could not be correctly classified systematically. One example is the embedding of titles containing 'Corona'. In none of the evaluation cases could the titles be embedded in such a way that they corresponded to their true names. Another systematic example is the embedding and classification of titles related to 'migration'. \n 3. **Generalization of the model**: by using semantic search, the model is able to classify titles into new categories that have not been trained, but the model is not tuned for this and therefore the performance of the model for unseen classes is likely to be limited.
-
+The model has some limitations with respect to the downstream task. 1. **Distribution of classes**: the training dataset is small while the number of classes is high, so some classes have only a few examples (more information about the class distribution of the training data can be found here). Consequently, performance on smaller classes may be worse than on the majority classes, and the evaluation is limited in the same way. 2. **Systematic problems**: some subjects could not be classified correctly in a systematic way. One example is the embedding of titles containing 'Corona': in none of the evaluation cases could these titles be embedded so that they corresponded to their true class names. Another systematic example is the embedding and classification of titles related to 'migration'. 3. **Generalization of the model**: via semantic search the model can assign titles to new categories it was not trained on, but it is not tuned for this, so performance on unseen classes is likely to be limited.
 
 ## Recommendations
 
@@ -181,6 +181,7 @@ The model is fine tuned with similar and dissimilar pairs. Similar pairs are bui
 
 
 ## Training Parameter
+
 The model was trained with the parameters:
 
 **DataLoader**:
@@ -190,6 +191,7 @@ The model was trained with the parameters:
 `sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss`
 
 Hyperparameter:
+
 ```
 {
     "epochs": 3,
@@ -197,7 +199,6 @@ Hyperparameter:
 }
 ```
 
-
 ### Speeds, Sizes, Times
 
 <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
@@ -211,8 +212,8 @@ All metrices express the models ability to classify dataset titles from GOVDATA
 ## Testing Data, Factors & Metrics
 
 ### Testing Data
-The evaluation data can be found [here](https://huggingface.co/datasets/and-effect/mdk_gov_data_titles_clf). Since the model is trained on revision 172e61bb1dd20e43903f4c51e5cbec61ec9ae6e6 for evaluation, the evaluation metrics rely on the same revision.
 
+The evaluation data can be found [here](https://huggingface.co/datasets/and-effect/mdk_gov_data_titles_clf). Since the model was trained on revision 172e61bb1dd20e43903f4c51e5cbec61ec9ae6e6, the evaluation metrics rely on the same revision of the dataset.
 
 ### Metrics
 
@@ -222,12 +223,13 @@ The model performance is tested with fours metrices. Accuracy, Precision, Recall
 
 | ***task*** | ***acccuracy*** | ***precision (macro)*** | ***recall (macro)*** | ***f1 (macro)*** |
 |-----|-----|-----|-----|-----|
-| Test dataset 'Bezeichnung' I | 0.6762295081967213 | 0.5688091249507292 | 0.5981436148510813 | 0.5693466048057273 |
-| Test dataset 'Thema' I | 0.8934426229508197 | 0.9258716898716898 | 0.8669105248121641 | 0.8632335412054082 |
-| Test dataset 'Bezeichnung' II | 0.6762295081967213 | 0.5598761408083442 | 0.7875393612235718 | 0.6306226331603018 |
-| Validation dataset 'Bezeichnung' I | 0.5445544554455446 | 0.41787439613526567 | 0.39929183135704877 | 0.4010173484686228 |
-| Validation dataset 'Thema' I | 0.801980198019802 | 0.6433080808080808 | 0.7039711632453568 | 0.6591710279769981 |
-| Validation dataset 'Bezeichnung' II | 0.5445544554455446 | 0.6018518518518517 | 0.6278409090909091 | 0.6066776135741653 |
+| Test dataset 'Bezeichnung' I | 0.7336065573770492 | 0.6611111111111111 | 0.7056652046783626 | 0.6674970889256604 |
+| Test dataset 'Thema' I | 0.8934426229508197 | 0.902382942746851 | 0.8909340386389567 | 0.8768881364249785 |
+| Test dataset 'Bezeichnung' II | 0.7336065573770492 | 0.5829457364341085 | 0.8229090167278661 | 0.6544072311514172 |
+| Validation dataset 'Bezeichnung' I | 0.5148514851485149 | 0.346125116713352 | 0.3553921568627451 | 0.33252525252525256 |
+| Validation dataset 'Thema' I | 0.7722772277227723 | 0.5908392682586232 | 0.6784524126899494 | 0.5962308463774738 |
+| Validation dataset 'Bezeichnung' II | 0.5148514851485149 | 0.5768253968253969 | 0.6916666666666667 | 0.592808080808081 |
+
 
-### Summary
+### Summary
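
The "Get Started with Sentence Transformers" snippet referenced in the hunks above follows the stock sentence-transformers pattern. A minimal sketch, assuming the Hub id and-effect/musterdatenkatalog_clf (a placeholder inferred from the model card title, not confirmed by this commit); the example titles come from the widget config:

```python
from sentence_transformers import SentenceTransformer

# Placeholder id: substitute the actual Hub id of this repository.
model = SentenceTransformer("and-effect/musterdatenkatalog_clf")

sentences = [
    "Bebauungspläne, vorhabenbezogene Bebauungspläne (Geltungsbereiche)",
    "Tagespflege Altenhilfe",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 768): one dense vector per title
```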
 
config.json ADDED
@@ -0,0 +1,25 @@
+{
+  "_name_or_path": "models/bi_encoder_model/",
+  "architectures": [
+    "BertModel"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "torch_dtype": "float32",
+  "transformers_version": "4.25.1",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 30000
+}
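
config.json describes a standard 12-layer BERT encoder (hidden size 768, vocabulary 30000, 512 positions), matching the bert-base-german-cased base named in the README. The "Get Started with HuggingFace Transformers" route loads it directly and applies the mean pooling sketched above; a hedged sketch with the same placeholder id:

```python
import torch
from transformers import AutoModel, AutoTokenizer

repo = "and-effect/musterdatenkatalog_clf"  # placeholder id
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo)

encoded = tokenizer(["Tagespflege Altenhilfe"], padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
with torch.no_grad():
    out = model(**encoded)

# Mask-aware mean pooling over the last hidden state, mirroring 1_Pooling/config.json.
mask = encoded["attention_mask"].unsqueeze(-1).float()
embedding = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
print(embedding.shape)  # torch.Size([1, 768])
```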
config_sentence_transformers.json ADDED
@@ -0,0 +1,7 @@
+{
+  "__version__": {
+    "sentence_transformers": "2.2.2",
+    "transformers": "4.26.0",
+    "pytorch": "1.13.1"
+  }
+}
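
These pins record the environment the model was exported from. Newer library versions generally load the repo fine, but reproducing the original setup means installing sentence-transformers 2.2.2, transformers 4.26.0, and torch 1.13.1. A quick check, as a sketch:

```python
import sentence_transformers, torch, transformers

# Compare against the versions recorded in config_sentence_transformers.json.
print(sentence_transformers.__version__, transformers.__version__, torch.__version__)
```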
modules.json ADDED
@@ -0,0 +1,14 @@
+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Pooling",
+    "type": "sentence_transformers.models.Pooling"
+  }
+]
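
modules.json wires up the two-stage sentence-transformers pipeline: module 0 is the Transformer encoder at the repo root (path ""), module 1 the pooling head in 1_Pooling/. SentenceTransformer rebuilds this automatically; an equivalent manual construction, sketched under the assumption of a local clone at ./model:

```python
from sentence_transformers import SentenceTransformer, models

# Module 0: the BERT encoder stored at the repository root.
encoder = models.Transformer("./model", max_seq_length=128)
# Module 1: mean pooling, as configured in 1_Pooling/config.json.
pooling = models.Pooling(encoder.get_word_embedding_dimension(), pooling_mode="mean")

model = SentenceTransformer(modules=[encoder, pooling])
```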
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c214c8dc99352a71934c0a2e5d67c690454426eea836c42fa6849b99a1cf8d62
+size 436393773
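
The checked-in pytorch_model.bin is a Git LFS pointer; the roughly 436 MB of actual weights live in LFS storage and are fetched on download. A sketch of resolving it via huggingface_hub, repo id again a placeholder:

```python
from huggingface_hub import hf_hub_download

# Downloads the real weight file that the LFS pointer above refers to.
path = hf_hub_download(repo_id="and-effect/musterdatenkatalog_clf",
                       filename="pytorch_model.bin")
print(path)
```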
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+{
+  "max_seq_length": 128,
+  "do_lower_case": false
+}
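
sentence_bert_config.json caps inputs at 128 tokens and keeps casing, consistent with the cased German base model; longer titles are silently truncated. The limit can be raised at runtime up to BERT's 512-position maximum, though embedding quality on longer inputs is untested here:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("and-effect/musterdatenkatalog_clf")  # placeholder id
print(model.max_seq_length)  # 128, read from sentence_bert_config.json
model.max_seq_length = 256   # must stay <= max_position_embeddings (512)
```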
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,14 @@
+{
+  "cls_token": "[CLS]",
+  "do_lower_case": false,
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "name_or_path": "models/bi_encoder_model/",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "special_tokens_map_file": null,
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff