upload retrained model

Files changed:

- .gitattributes +1 -0
- 1_Pooling/config.json +7 -0
- README.md +22 -20
- config.json +25 -0
- config_sentence_transformers.json +7 -0
- modules.json +14 -0
- pytorch_model.bin +3 -0
- sentence_bert_config.json +4 -0
- special_tokens_map.json +7 -0
- tokenizer.json +0 -0
- tokenizer_config.json +14 -0
- vocab.txt +0 -0
.gitattributes ADDED
@@ -0,0 +1 @@
+pytorch_model.bin filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json ADDED
@@ -0,0 +1,7 @@
+{
+    "word_embedding_dimension": 768,
+    "pooling_mode_cls_token": false,
+    "pooling_mode_mean_tokens": true,
+    "pooling_mode_max_tokens": false,
+    "pooling_mode_mean_sqrt_len_tokens": false
+}
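These flags enable plain mean pooling only (CLS, max, and sqrt-length modes are all off). As a hedged sketch of what that computes, assuming a `(batch, seq_len, 768)` token-embedding tensor and its attention mask (names here are illustrative, not from the repo):

```python
import torch

def mean_pooling(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Masked mean over token embeddings, per pooling_mode_mean_tokens=true."""
    # Broadcast the mask over the hidden dimension: (batch, seq_len) -> (batch, seq_len, 1)
    mask = attention_mask.unsqueeze(-1).float()
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # guard against all-padding rows
    return summed / counts  # (batch, 768) sentence embeddings
```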
README.md CHANGED
@@ -3,6 +3,7 @@ language: de
 library_name: sentence-transformers
 tags:
 - sentence-similarity
+datasets: and-effect/mdk_gov_data_titles_clf
 widget:
 - source_sentence: "Bebauungspläne, vorhabenbezogene Bebauungspläne (Geltungsbereiche)"
   sentences:
@@ -10,7 +11,6 @@ widget:
   - "Tagespflege Altenhilfe"
   - "Bebauungsplan der Innenentwicklung gem. § 13a BauGB - Ortskern Rütenbrock"
   example_title: "Bebauungsplan"
-datasets: and-effect/mdk_gov_data_titles_clf
 metrics:
 - accuracy
 - precision
@@ -28,30 +28,29 @@ model-index:
       revision: 172e61bb1dd20e43903f4c51e5cbec61ec9ae6e6
     metrics:
     - type: accuracy
-      value: 0.
+      value: 0.7336065573770492
      name: Accuracy 'Bezeichnung'
    - type: precision
-      value: 0.
+      value: 0.6611111111111111
      name: Precision 'Bezeichnung' (macro)
    - type: recall
-      value: 0.
+      value: 0.7056652046783626
      name: Recall 'Bezeichnung' (macro)
    - type: f1
-      value: 0.
+      value: 0.6674970889256604
      name: F1 'Bezeichnung' (macro)
    - type: accuracy
      value: 0.8934426229508197
      name: Accuracy 'Thema'
    - type: precision
-      value: 0.
+      value: 0.902382942746851
      name: Precision 'Thema' (macro)
    - type: recall
-      value: 0.
+      value: 0.8909340386389567
      name: Recall 'Thema' (macro)
    - type: f1
-      value: 0.
+      value: 0.8768881364249785
      name: F1 'Thema' (macro)
-pipeline_tag: sentence-similarity
 ---
 
 # Model Card for Musterdatenkatalog Classifier
@@ -82,6 +81,7 @@ This model is based on bert-base-german-cased and fine-tuned on and-effect/mdk_g
 This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
 
 ## Get Started with Sentence Transformers
+
 Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
 
 ```
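The hunk above elides the body of the opened code block. As a hedged sketch of the usage this section describes (the model id below is illustrative, not taken from the diff):

```python
# Hedged sketch only; the actual snippet is elided by the diff context.
from sentence_transformers import SentenceTransformer

sentences = [
    "Bebauungspläne, vorhabenbezogene Bebauungspläne (Geltungsbereiche)",
    "Tagespflege Altenhilfe",
]

model = SentenceTransformer("and-effect/musterdatenkatalog_clf")  # hypothetical model id
embeddings = model.encode(sentences)  # numpy array of shape (2, 768)
print(embeddings)
```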
@@ -100,6 +100,7 @@ print(embeddings)
 ```
 
 ## Get Started with HuggingFace Transformers
+
 Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
 
 ```python
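Again, the block body is elided by the diff. A hedged sketch of the standard transformers-plus-mean-pooling pattern this paragraph describes (model id illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("and-effect/musterdatenkatalog_clf")  # hypothetical id
model = AutoModel.from_pretrained("and-effect/musterdatenkatalog_clf")

encoded = tokenizer(["Tagespflege Altenhilfe"], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    output = model(**encoded)

# Mean pooling over token embeddings, weighted by the attention mask
# (matches 1_Pooling/config.json: pooling_mode_mean_tokens=true).
mask = encoded["attention_mask"].unsqueeze(-1).float()
embeddings = (output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
```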
@@ -147,8 +148,7 @@ The model is intended to classify open source dataset titles from German municipalities
 
 <!-- This section is meant to convey both technical and sociotechnical limitations. -->
 
-The model has some limitations in terms of the downstream task. 1. **Distribution of classes**: the training dataset is small, yet the number of classes is very high, so some classes have only a few examples (more information about the class distribution of the training data can be found here). Consequently, performance on the smaller classes may not be as good as on the majority classes, and the evaluation is limited accordingly. 2. **Systematic problems**: some subjects could not be classified correctly in a systematic way. One example is the embedding of titles containing 'Corona': in none of the evaluation cases could these titles be embedded in a way that matched their true labels.
-
+The model has some limitations in terms of the downstream task. 1. **Distribution of classes**: the training dataset is small, yet the number of classes is very high, so some classes have only a few examples (more information about the class distribution of the training data can be found here). Consequently, performance on the smaller classes may not be as good as on the majority classes, and the evaluation is limited accordingly. 2. **Systematic problems**: some subjects could not be classified correctly in a systematic way. One example is the embedding of titles containing 'Corona': in none of the evaluation cases could these titles be embedded in a way that matched their true labels. Another systematic example is the embedding and classification of titles related to 'migration'. 3. **Generalization of the model**: via semantic search the model can assign titles to new categories it has not been trained on, but it is not tuned for this, so performance on unseen classes is likely to be limited.
 
 ## Recommendations
 
@@ -181,6 +181,7 @@ The model is fine tuned with similar and dissimilar pairs. Similar pairs are bui
 
 
 ## Training Parameters
+
 The model was trained with the parameters:
 
 **DataLoader**:
@@ -190,6 +191,7 @@ The model was trained with the parameters:
 `sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss`
 
 Hyperparameter:
+
 ```
 {
     "epochs": 3,
@@ -197,7 +199,6 @@ Hyperparameter:
 }
 ```
 
-
 ### Speeds, Sizes, Times
 
 <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
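Putting the training parameters above together: a hedged sketch of the fit loop. Only the DataLoader, the CosineSimilarityLoss, and `"epochs": 3` come from the diff; the base model name comes from the model card, and the example pairs and batch size are illustrative.

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("bert-base-german-cased")

# Similar / dissimilar title pairs with a cosine-similarity target in [0, 1]
train_examples = [
    InputExample(texts=["Bebauungsplan Ortskern Rütenbrock", "Bebauungspläne (Geltungsbereiche)"], label=1.0),
    InputExample(texts=["Bebauungsplan Ortskern Rütenbrock", "Tagespflege Altenhilfe"], label=0.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)  # batch size assumed
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=3)
```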
@@ -211,8 +212,8 @@ All metrics express the model's ability to classify dataset titles from GOVDATA
 ## Testing Data, Factors & Metrics
 
 ### Testing Data
-The evaluation data can be found [here](https://huggingface.co/datasets/and-effect/mdk_gov_data_titles_clf). Since the model was trained on revision 172e61bb1dd20e43903f4c51e5cbec61ec9ae6e6 of this dataset, the evaluation metrics rely on the same revision.
 
+The evaluation data can be found [here](https://huggingface.co/datasets/and-effect/mdk_gov_data_titles_clf). Since the model was trained on revision 172e61bb1dd20e43903f4c51e5cbec61ec9ae6e6 of this dataset, the evaluation metrics rely on the same revision.
 
 ### Metrics
 
@@ -222,12 +223,13 @@ The model performance is tested with four metrics: Accuracy, Precision, Recall
 
 | ***task*** | ***accuracy*** | ***precision (macro)*** | ***recall (macro)*** | ***f1 (macro)*** |
 |-----|-----|-----|-----|-----|
-| Test dataset 'Bezeichnung' I | 0.
-| Test dataset 'Thema' I | 0.8934426229508197 | 0.
-| Test dataset 'Bezeichnung' II | 0.
-| Validation dataset 'Bezeichnung' I | 0.
-| Validation dataset 'Thema' I | 0.
-| Validation dataset 'Bezeichnung' II | 0.
+| Test dataset 'Bezeichnung' I | 0.7336065573770492 | 0.6611111111111111 | 0.7056652046783626 | 0.6674970889256604 |
+| Test dataset 'Thema' I | 0.8934426229508197 | 0.902382942746851 | 0.8909340386389567 | 0.8768881364249785 |
+| Test dataset 'Bezeichnung' II | 0.7336065573770492 | 0.5829457364341085 | 0.8229090167278661 | 0.6544072311514172 |
+| Validation dataset 'Bezeichnung' I | 0.5148514851485149 | 0.346125116713352 | 0.3553921568627451 | 0.33252525252525256 |
+| Validation dataset 'Thema' I | 0.7722772277227723 | 0.5908392682586232 | 0.6784524126899494 | 0.5962308463774738 |
+| Validation dataset 'Bezeichnung' II | 0.5148514851485149 | 0.5768253968253969 | 0.6916666666666667 | 0.592808080808081 |
 
 
+
 ### Summary
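For reference, a hedged sketch of how the four reported metrics (accuracy plus macro-averaged precision, recall, and F1) are typically computed; the label lists are illustrative stand-ins for the real predictions:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["Bebauungsplan", "Tagespflege", "Bebauungsplan"]  # illustrative
y_pred = ["Bebauungsplan", "Bebauungsplan", "Bebauungsplan"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
```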
config.json ADDED
@@ -0,0 +1,25 @@
+{
+  "_name_or_path": "models/bi_encoder_model/",
+  "architectures": [
+    "BertModel"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "torch_dtype": "float32",
+  "transformers_version": "4.25.1",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 30000
+}
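The added config declares a standard 12-layer, 768-dimensional cased German BERT encoder. A hedged sketch of inspecting it with transformers, assuming a local checkout of this repo (path illustrative):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("path/to/this/repo")  # directory containing config.json
assert config.model_type == "bert"
print(config.hidden_size, config.num_hidden_layers, config.vocab_size)  # 768 12 30000
```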
config_sentence_transformers.json ADDED
@@ -0,0 +1,7 @@
+{
+  "__version__": {
+    "sentence_transformers": "2.2.2",
+    "transformers": "4.26.0",
+    "pytorch": "1.13.1"
+  }
+}
modules.json ADDED
@@ -0,0 +1,14 @@
+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Pooling",
+    "type": "sentence_transformers.models.Pooling"
+  }
+]
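modules.json chains the transformer encoder (repo root) with the mean-pooling head from 1_Pooling. A hedged sketch of the equivalent explicit construction (the local path is illustrative):

```python
from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer("path/to/this/repo", max_seq_length=128)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),  # 768
    pooling_mode_mean_tokens=True,
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```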
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c214c8dc99352a71934c0a2e5d67c690454426eea836c42fa6849b99a1cf8d62
+size 436393773
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+{
+  "max_seq_length": 128,
+  "do_lower_case": false
+}
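This caps inputs at 128 tokens and keeps casing. A hedged check of the effect at load time (path illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("path/to/this/repo")  # path illustrative
assert model.max_seq_length == 128  # longer titles are truncated
```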
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}
tokenizer.json ADDED
The diff for this file is too large to render; see the raw diff.
tokenizer_config.json ADDED
@@ -0,0 +1,14 @@
+{
+  "cls_token": "[CLS]",
+  "do_lower_case": false,
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "name_or_path": "models/bi_encoder_model/",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "special_tokens_map_file": null,
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
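Together with special_tokens_map.json, this configures a cased BertTokenizer (no lowercasing, no accent stripping). A hedged sketch of loading it (path illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/this/repo")  # path illustrative
print(tokenizer.tokenize("Bebauungsplan der Innenentwicklung"))  # cased WordPiece tokens
```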
vocab.txt ADDED
The diff for this file is too large to render; see the raw diff.