upload retrained model

Files changed:

- .gitattributes +1 -0
- 1_Pooling/config.json +7 -0
- README.md +22 -20
- config.json +25 -0
- config_sentence_transformers.json +7 -0
- modules.json +14 -0
- pytorch_model.bin +3 -0
- sentence_bert_config.json +4 -0
- special_tokens_map.json +7 -0
- tokenizer.json +0 -0
- tokenizer_config.json +14 -0
- vocab.txt +0 -0
.gitattributes ADDED
@@ -0,0 +1 @@
+pytorch_model.bin filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json ADDED
@@ -0,0 +1,7 @@
+{
+    "word_embedding_dimension": 768,
+    "pooling_mode_cls_token": false,
+    "pooling_mode_mean_tokens": true,
+    "pooling_mode_max_tokens": false,
+    "pooling_mode_mean_sqrt_len_tokens": false
+}
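These flags enable plain mean pooling only (CLS, max, and sqrt-length modes are all off). As a hedged sketch of what that computes, assuming a `(batch, seq_len, 768)` token-embedding tensor and its attention mask (names here are illustrative, not from the repo):

```python
import torch

def mean_pooling(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Masked mean over token embeddings, per pooling_mode_mean_tokens=true."""
    # Broadcast the mask over the hidden dimension: (batch, seq_len) -> (batch, seq_len, 1)
    mask = attention_mask.unsqueeze(-1).float()
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # guard against all-padding rows
    return summed / counts  # (batch, 768) sentence embeddings
```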
README.md CHANGED
@@ -3,6 +3,7 @@ language: de
 library_name: sentence-transformers
 tags:
 - sentence-similarity
+datasets: and-effect/mdk_gov_data_titles_clf
 widget:
 - source_sentence: "Bebauungspläne, vorhabenbezogene Bebauungspläne (Geltungsbereiche)"
   sentences:
@@ -10,7 +11,6 @@ widget:
   - "Tagespflege Altenhilfe"
   - "Bebauungsplan der Innenentwicklung gem. § 13a BauGB - Ortskern Rütenbrock"
   example_title: "Bebauungsplan"
-datasets: and-effect/mdk_gov_data_titles_clf
 metrics:
 - accuracy
 - precision
@@ -28,30 +28,29 @@ model-index:
       revision: 172e61bb1dd20e43903f4c51e5cbec61ec9ae6e6
     metrics:
     - type: accuracy
-      value: 0.
+      value: 0.7336065573770492
      name: Accuracy 'Bezeichnung'
    - type: precision
-      value: 0.
+      value: 0.6611111111111111
      name: Precision 'Bezeichnung' (macro)
    - type: recall
-      value: 0.
+      value: 0.7056652046783626
      name: Recall 'Bezeichnung' (macro)
    - type: f1
-      value: 0.
+      value: 0.6674970889256604
      name: F1 'Bezeichnung' (macro)
    - type: accuracy
      value: 0.8934426229508197
      name: Accuracy 'Thema'
    - type: precision
-      value: 0.
+      value: 0.902382942746851
      name: Precision 'Thema' (macro)
    - type: recall
-      value: 0.
+      value: 0.8909340386389567
      name: Recall 'Thema' (macro)
    - type: f1
-      value: 0.
+      value: 0.8768881364249785
      name: F1 'Thema' (macro)
-pipeline_tag: sentence-similarity
 ---
 
 # Model Card for Musterdatenkatalog Classifier
@@ -82,6 +81,7 @@ This model is based on bert-base-german-cased and fine-tuned on and-effect/mdk_g
 This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
 
 ## Get Started with Sentence Transformers
+
 Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
 
 ```
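The hunk above elides the body of the opened code block. As a hedged sketch of the usage this section describes (the model id below is illustrative, not taken from the diff):

```python
# Hedged sketch only; the actual snippet is elided by the diff context.
from sentence_transformers import SentenceTransformer

sentences = [
    "Bebauungspläne, vorhabenbezogene Bebauungspläne (Geltungsbereiche)",
    "Tagespflege Altenhilfe",
]

model = SentenceTransformer("and-effect/musterdatenkatalog_clf")  # hypothetical model id
embeddings = model.encode(sentences)  # numpy array of shape (2, 768)
print(embeddings)
```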
@@ -100,6 +100,7 @@ print(embeddings)
 ```
 
 ## Get Started with HuggingFace Transformers
+
 Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
 
 ```python
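Again, the block body is elided by the diff. A hedged sketch of the standard transformers-plus-mean-pooling pattern this paragraph describes (model id illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("and-effect/musterdatenkatalog_clf")  # hypothetical id
model = AutoModel.from_pretrained("and-effect/musterdatenkatalog_clf")

encoded = tokenizer(["Tagespflege Altenhilfe"], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    output = model(**encoded)

# Mean pooling over token embeddings, weighted by the attention mask
# (matches 1_Pooling/config.json: pooling_mode_mean_tokens=true).
mask = encoded["attention_mask"].unsqueeze(-1).float()
embeddings = (output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
```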
@@ -147,8 +148,7 @@ The model is intended to classify open source dataset titles from German municipalities
 
 <!-- This section is meant to convey both technical and sociotechnical limitations. -->
 
-The model has some limitations in terms of the downstream task. 1. **Distribution of classes**: the training dataset is small, yet the number of classes is very high, so some classes have only a few examples (more information about the class distribution of the training data can be found here). Consequently, performance on the smaller classes may not be as good as on the majority classes, and the evaluation is limited accordingly. 2. **Systematic problems**: some subjects could not be classified correctly in a systematic way. One example is the embedding of titles containing 'Corona': in none of the evaluation cases could these titles be embedded in a way that matched their true labels.
-
+The model has some limitations in terms of the downstream task. 1. **Distribution of classes**: the training dataset is small, yet the number of classes is very high, so some classes have only a few examples (more information about the class distribution of the training data can be found here). Consequently, performance on the smaller classes may not be as good as on the majority classes, and the evaluation is limited accordingly. 2. **Systematic problems**: some subjects could not be classified correctly in a systematic way. One example is the embedding of titles containing 'Corona': in none of the evaluation cases could these titles be embedded in a way that matched their true labels. Another systematic example is the embedding and classification of titles related to 'migration'. 3. **Generalization of the model**: via semantic search the model can assign titles to new categories it has not been trained on, but it is not tuned for this, so performance on unseen classes is likely to be limited.
 
 ## Recommendations
 
@@ -181,6 +181,7 @@ The model is fine tuned with similar and dissimilar pairs. Similar pairs are bui
 
 
 ## Training Parameters
+
 The model was trained with the parameters:
 
 **DataLoader**:
@@ -190,6 +191,7 @@ The model was trained with the parameters:
 `sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss`
 
 Hyperparameter:
+
 ```
 {
     "epochs": 3,
@@ -197,7 +199,6 @@ Hyperparameter:
 }
 ```
 
-
 ### Speeds, Sizes, Times
 
 <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
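Putting the training parameters above together: a hedged sketch of the fit loop. Only the DataLoader, the CosineSimilarityLoss, and `"epochs": 3` come from the diff; the base model name comes from the model card, and the example pairs and batch size are illustrative.

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("bert-base-german-cased")

# Similar / dissimilar title pairs with a cosine-similarity target in [0, 1]
train_examples = [
    InputExample(texts=["Bebauungsplan Ortskern Rütenbrock", "Bebauungspläne (Geltungsbereiche)"], label=1.0),
    InputExample(texts=["Bebauungsplan Ortskern Rütenbrock", "Tagespflege Altenhilfe"], label=0.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)  # batch size assumed
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=3)
```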
@@ -211,8 +212,8 @@ All metrics express the model's ability to classify dataset titles from GOVDATA
 ## Testing Data, Factors & Metrics
 
 ### Testing Data
-The evaluation data can be found [here](https://huggingface.co/datasets/and-effect/mdk_gov_data_titles_clf). Since the model was trained on revision 172e61bb1dd20e43903f4c51e5cbec61ec9ae6e6 of this dataset, the evaluation metrics rely on the same revision.
 
+The evaluation data can be found [here](https://huggingface.co/datasets/and-effect/mdk_gov_data_titles_clf). Since the model was trained on revision 172e61bb1dd20e43903f4c51e5cbec61ec9ae6e6 of this dataset, the evaluation metrics rely on the same revision.
 
 ### Metrics
 
@@ -222,12 +223,13 @@ The model performance is tested with four metrics: Accuracy, Precision, Recall
 
 | ***task*** | ***accuracy*** | ***precision (macro)*** | ***recall (macro)*** | ***f1 (macro)*** |
 |-----|-----|-----|-----|-----|
-| Test dataset 'Bezeichnung' I | 0.
-| Test dataset 'Thema' I | 0.8934426229508197 | 0.
-| Test dataset 'Bezeichnung' II | 0.
-| Validation dataset 'Bezeichnung' I | 0.
-| Validation dataset 'Thema' I | 0.
-| Validation dataset 'Bezeichnung' II | 0.
+| Test dataset 'Bezeichnung' I | 0.7336065573770492 | 0.6611111111111111 | 0.7056652046783626 | 0.6674970889256604 |
+| Test dataset 'Thema' I | 0.8934426229508197 | 0.902382942746851 | 0.8909340386389567 | 0.8768881364249785 |
+| Test dataset 'Bezeichnung' II | 0.7336065573770492 | 0.5829457364341085 | 0.8229090167278661 | 0.6544072311514172 |
+| Validation dataset 'Bezeichnung' I | 0.5148514851485149 | 0.346125116713352 | 0.3553921568627451 | 0.33252525252525256 |
+| Validation dataset 'Thema' I | 0.7722772277227723 | 0.5908392682586232 | 0.6784524126899494 | 0.5962308463774738 |
+| Validation dataset 'Bezeichnung' II | 0.5148514851485149 | 0.5768253968253969 | 0.6916666666666667 | 0.592808080808081 |
 
 
+
 ### Summary
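For reference, a hedged sketch of how the four reported metrics (accuracy plus macro-averaged precision, recall, and F1) are typically computed; the label lists are illustrative stand-ins for the real predictions:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["Bebauungsplan", "Tagespflege", "Bebauungsplan"]  # illustrative
y_pred = ["Bebauungsplan", "Bebauungsplan", "Bebauungsplan"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
```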
config.json ADDED
@@ -0,0 +1,25 @@
+{
+  "_name_or_path": "models/bi_encoder_model/",
+  "architectures": [
+    "BertModel"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "torch_dtype": "float32",
+  "transformers_version": "4.25.1",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 30000
+}
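The added config declares a standard 12-layer, 768-dimensional cased German BERT encoder. A hedged sketch of inspecting it with transformers, assuming a local checkout of this repo (path illustrative):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("path/to/this/repo")  # directory containing config.json
assert config.model_type == "bert"
print(config.hidden_size, config.num_hidden_layers, config.vocab_size)  # 768 12 30000
```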
config_sentence_transformers.json ADDED
@@ -0,0 +1,7 @@
+{
+  "__version__": {
+    "sentence_transformers": "2.2.2",
+    "transformers": "4.26.0",
+    "pytorch": "1.13.1"
+  }
+}
modules.json ADDED
@@ -0,0 +1,14 @@
+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Pooling",
+    "type": "sentence_transformers.models.Pooling"
+  }
+]
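modules.json chains the transformer encoder (repo root) with the mean-pooling head from 1_Pooling. A hedged sketch of the equivalent explicit construction (the local path is illustrative):

```python
from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer("path/to/this/repo", max_seq_length=128)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),  # 768
    pooling_mode_mean_tokens=True,
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```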
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c214c8dc99352a71934c0a2e5d67c690454426eea836c42fa6849b99a1cf8d62
+size 436393773
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+{
+  "max_seq_length": 128,
+  "do_lower_case": false
+}
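This caps inputs at 128 tokens and keeps casing. A hedged check of the effect at load time (path illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("path/to/this/repo")  # path illustrative
assert model.max_seq_length == 128  # longer titles are truncated
```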
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}
tokenizer.json ADDED
The diff for this file is too large to render; see the raw diff.
tokenizer_config.json ADDED
@@ -0,0 +1,14 @@
+{
+  "cls_token": "[CLS]",
+  "do_lower_case": false,
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "name_or_path": "models/bi_encoder_model/",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "special_tokens_map_file": null,
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
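Together with special_tokens_map.json, this configures a cased BertTokenizer (no lowercasing, no accent stripping). A hedged sketch of loading it (path illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/this/repo")  # path illustrative
print(tokenizer.tokenize("Bebauungsplan der Innenentwicklung"))  # cased WordPiece tokens
```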
vocab.txt ADDED
The diff for this file is too large to render; see the raw diff.