---
language: de
library_name: sentence_transformers
tags:
- text-classification
datasets: and-effect/mdk_gov_data_titles_clf
metrics:
- accuracy
- precision
- recall
- f1
model-index:
- name: musterdatenkatalog_clf
  results:
  - task:
      type: text-classification
    dataset:
      name: and-effect/mdk_gov_data_titles_clf
      type: and-effect/mdk_gov_data_titles_clf
      split: test
      revision: 172e61bb1dd20e43903f4c51e5cbec61ec9ae6e6
    metrics:
    - type: accuracy
      value: 0.7004405286343612
      name: Accuracy 'Bezeichnung'
    - type: precision
      value: 0.5717666948436179
      name: Precision 'Bezeichnung' (macro)
    - type: recall
      value: 0.6127063220180629
      name: Recall 'Bezeichnung' (macro)
    - type: f1
      value: 0.5805958812647776
      name: F1 'Bezeichnung' (macro)
    - type: accuracy
      value: 0.9162995594713657
      name: Accuracy 'Thema'
    - type: precision
      value: 0.9318954248366014
      name: Precision 'Thema' (macro)
    - type: recall
      value: 0.9122380952380952
      name: Recall 'Thema' (macro)
    - type: f1
      value: 0.8984289453766925
      name: F1 'Thema' (macro)
---
# Model Card for Musterdatenkatalog Classifier
# Model Details
## Model Description
This model is based on bert-base-german-cased and fine-tuned on and-effect/mdk_gov_data_titles_clf. On the test set it reaches an accuracy of 0.70 for 'Bezeichnung' and 0.92 for 'Thema'; on the validation set it reaches 0.54 and 0.80, respectively (see the Results table).
- **Developed by:** and-effect
- **Shared by:** [More Information Needed]
- **Model type:** Text Classification
- **Language(s) (NLP):** de
- **License:** [More Information Needed]
- **Finetuned from model:** bert-base-german-cased. For more information on the base model, see [its model card](https://huggingface.co/bert-base-german-cased).
## Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** [More Information Needed]
- **Paper:** [More Information Needed]
- **Demo:** [More Information Needed]
# Direct Use
This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.
## Get Started with Sentence Transformers
Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
```
pip install -U sentence-transformers
```
Then you can use the model like this:
```python
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]
model = SentenceTransformer('{MODEL_NAME}')
embeddings = model.encode(sentences)
print(embeddings)
```
## Get Started with HuggingFace Transformers
Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
```python
from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('{MODEL_NAME}')
model = AutoModel.from_pretrained('{MODEL_NAME}')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
```
# Downstream Use
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
The model is intended to classify the titles of open-source datasets from German municipalities. More information on the taxonomy (classification categories) and the project can be found on XY.
[More Information Needed on downstream_use_demo]
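As a rough illustration of how an embedding model can be used for this kind of classification, the sketch below assigns a title to the label whose embedding is most similar to it. It assumes that classification works by nearest-label semantic search; the label names and the 3-dimensional toy vectors are hypothetical stand-ins for the 768-dimensional embeddings that `model.encode` would return.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(title_embedding, label_embeddings):
    """Return the (label, score) pair with the highest cosine similarity."""
    scored = (
        (label, cosine_similarity(title_embedding, embedding))
        for label, embedding in label_embeddings.items()
    )
    return max(scored, key=lambda item: item[1])


# Hypothetical labels and toy embeddings (assumption: real ones come from model.encode)
label_embeddings = {
    "Verkehr - Parkplatz": [0.9, 0.1, 0.0],
    "Bildung - Schule": [0.0, 0.2, 0.9],
}
title_embedding = [0.8, 0.3, 0.1]

label, score = classify(title_embedding, label_embeddings)
print(label, round(score, 3))
```

Because the label set is just a dictionary of embeddings, new categories can be added without retraining, although performance on such unseen categories is likely to be limited.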
# Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
The model has some limitations in terms of the downstream task.

1. **Distribution of classes**: The training dataset is small, while the number of classes is very high. As a result, some classes have only a few examples (see the training data documentation for the class distribution). Performance for smaller classes may therefore not be as good as for the majority classes, and the evaluation is correspondingly limited.
2. **Systematic problems**: Some subjects were systematically misclassified. One example is the embedding of titles containing 'Corona': in none of the evaluation cases were these titles embedded in such a way that they matched their true label. Another systematic example is the embedding and classification of titles related to 'migration'.
3. **Generalization of the model**: By using semantic search, the model can classify titles into new categories it was not trained on. However, the model is not tuned for this, so its performance on unseen classes is likely to be limited.
## Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
# Training Details
## Training Data
<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
You can find all information about the training data [here](https://huggingface.co/datasets/and-effect/mdk_gov_data_titles_clf). For fine-tuning we used revision 172e61bb1dd20e43903f4c51e5cbec61ec9ae6e6 of the data, since performance was better with this earlier version.
## Training Procedure
### Preprocessing
This section describes how the input data for the model is generated. More information on the preprocessing of the data itself can be found [here](https://huggingface.co/datasets/and-effect/mdk_gov_data_titles_clf).
The model is fine-tuned with similar and dissimilar pairs. Similar pairs are built from each title and its true label. Dissimilar pairs are pairs of a title and any label other than its true label. Since the number of possible dissimilar combinations is much higher, a sample of two pairs per title is selected.
| pairs | size |
|-----|-----|
| train_similar_pairs | 2018 |
| train_unsimilar_pairs | 1009 |
| test_similar_pairs | 498 |
| test_unsimilar_pairs | 249 |
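The pair construction described above can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing code: the function name `build_pairs`, the example titles and labels, and the similarity scores of 1.0 and 0.0 are assumptions.

```python
import random


def build_pairs(titles_with_labels, all_labels, n_dissimilar=2, seed=0):
    """Build (title, label, score) pairs for fine-tuning.

    Each title yields one similar pair with its true label and a random
    sample of `n_dissimilar` pairs with other labels.
    """
    rng = random.Random(seed)
    similar, dissimilar = [], []
    for title, true_label in titles_with_labels:
        similar.append((title, true_label, 1.0))  # assumed score for similar pairs
        wrong_labels = [label for label in all_labels if label != true_label]
        sample_size = min(n_dissimilar, len(wrong_labels))
        for label in rng.sample(wrong_labels, sample_size):
            dissimilar.append((title, label, 0.0))  # assumed score for dissimilar pairs
    return similar, dissimilar


# Hypothetical titles and labels, for illustration only
data = [
    ("Parkhaeuser Innenstadt 2021", "Verkehr - Parkplatz"),
    ("Grundschulen Standorte", "Bildung - Schule"),
    ("Wahlergebnisse Kommunalwahl", "Wahlen - Kommunalwahl"),
]
labels = sorted({label for _, label in data})

similar_pairs, dissimilar_pairs = build_pairs(data, labels)
print(len(similar_pairs), len(dissimilar_pairs))
```

Sampling with a fixed seed keeps the dissimilar pairs reproducible between runs.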
### Training Parameters
The model was trained with the parameters:
**DataLoader**:
`torch.utils.data.dataloader.DataLoader`
**Loss**:
`sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss`
**Hyperparameters**:
```
{
    "epochs": 3,
    "warmup_steps": [More Information Needed],
}
```
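To make the loss named above more concrete: with its default settings, `CosineSimilarityLoss` in sentence-transformers compares the cosine similarity of the two embeddings in a pair with the pair's target score using mean squared error. The pure-Python sketch below reproduces that calculation on toy 2-dimensional vectors; the vectors and target scores are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_similarity_loss(pairs):
    """Mean squared error between cosine similarity and target score.

    `pairs` is a list of (embedding_a, embedding_b, target_score) tuples.
    """
    errors = [(cosine_similarity(u, v) - target) ** 2 for u, v, target in pairs]
    return sum(errors) / len(errors)


toy_pairs = [
    ([1.0, 0.0], [1.0, 0.0], 1.0),  # identical vectors, target 1.0: no error
    ([1.0, 0.0], [0.0, 1.0], 0.0),  # orthogonal vectors, target 0.0: no error
    ([1.0, 0.0], [1.0, 0.0], 0.0),  # identical vectors, target 0.0: squared error 1
]
loss = cosine_similarity_loss(toy_pairs)
print(loss)
```

Minimizing this loss pushes titles toward their true label in the embedding space and away from the sampled wrong labels.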
### Speeds, Sizes, Times
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
[More Information Needed]
# Evaluation
All metrics express the model's ability to classify dataset titles from GOVDATA into the taxonomy described [here](https://huggingface.co/datasets/and-effect/mdk_gov_data_titles_clf). For more information, see the MDK project [More Information Needed].
## Testing Data, Factors & Metrics
### Testing Data
The evaluation data can be found [here](https://huggingface.co/datasets/and-effect/mdk_gov_data_titles_clf). Since the model was trained on revision 172e61bb1dd20e43903f4c51e5cbec61ec9ae6e6, the evaluation metrics rely on the same revision.
### Metrics
Model performance is measured with four metrics: accuracy, precision, recall, and F1 score. Many classes were never predicted; their precision, recall, and F1 score are set to zero in the calculation. For the 'Bezeichnung' level, these metrics were additionally calculated excluding classes with fewer than two predictions (see 'Bezeichnung II' in the results table). These additional results should be interpreted with caution because they do not represent all classes.
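The two ways of averaging described above can be sketched as follows. This is an illustrative reconstruction, not the evaluation script itself: the function name `macro_scores`, the `min_predictions` parameter, and the toy labels are assumptions, and the average is taken over all classes that appear in either the true or the predicted labels.

```python
from collections import Counter


def macro_scores(y_true, y_pred, min_predictions=0):
    """Macro-averaged precision, recall and F1.

    Undefined per-class scores (e.g. precision of a class that was never
    predicted) are set to 0. Classes predicted fewer than `min_predictions`
    times are left out of the average.
    """
    predicted = Counter(y_pred)
    actual = Counter(y_true)
    classes = sorted(set(y_true) | set(y_pred))
    kept = [c for c in classes if predicted[c] >= min_predictions]
    if not kept:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}

    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for c in kept:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        precision = tp / predicted[c] if predicted[c] else 0.0
        recall = tp / actual[c] if actual[c] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        totals["precision"] += precision
        totals["recall"] += recall
        totals["f1"] += f1
    return {name: value / len(kept) for name, value in totals.items()}


# Toy example: class "C" is never predicted, class "B" is predicted once
y_true = ["A", "A", "B", "C"]
y_pred = ["A", "A", "A", "B"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores_all = macro_scores(y_true, y_pred)                          # variant I
scores_filtered = macro_scores(y_true, y_pred, min_predictions=2)  # variant II
print(accuracy, scores_all, scores_filtered)
```

Accuracy is computed over all examples in both variants, which matches the identical accuracy values for 'Bezeichnung' I and II in the results table.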
## Results
| ***task*** | ***accuracy*** | ***precision (macro)*** | ***recall (macro)*** | ***f1 (macro)*** |
|-----|-----|-----|-----|-----|
| Test dataset 'Bezeichnung' I | 0.7004405286343612 | 0.5717666948436179 | 0.6127063220180629 | 0.5805958812647776 |
| Test dataset 'Thema' I | 0.9162995594713657 | 0.9318954248366014 | 0.9122380952380952 | 0.8984289453766925 |
| Test dataset 'Bezeichnung' II | 0.7004405286343612 | 0.573015873015873 | 0.8207602339181287 | 0.6515010351966875 |
| Validation dataset 'Bezeichnung' I | 0.5445544554455446 | 0.41787439613526567 | 0.39929183135704877 | 0.4010173484686228 |
| Validation dataset 'Thema' I | 0.801980198019802 | 0.6433080808080808 | 0.7039711632453568 | 0.6591710279769981 |
| Validation dataset 'Bezeichnung' II | 0.5445544554455446 | 0.6018518518518519 | 0.6278409090909091 | 0.6066776135741653 |
### Summary