and-effect/musterdatenkatalog_clf

@@ -51,132 +51,98 @@ model-index:
     - type: f1
       value: 0.88
       name: F1 'Thema' (macro)
 ---
 # Model Card for Musterdatenkatalog Classifier
-# Model Details
 ## Model Description
-This model is based on [bert-base-german-cased](https://huggingface.co/bert-base-cased) and fine-tuned on [and-effect/mdk_gov_data_titles_clf](https://huggingface.co/datasets/and-effect/mdk_gov_data_titles_clf).
-It was created as part of the Bertelsmann Foundation's Musterdatenkatalog (MDK) project (See their website [here](https://www.bertelsmann-stiftung.de/de/unsere-projekte/smart-country/musterdatenkatalog)).
-The main intent of the MDK project was to classify open data into a taxonomy to help give an overview of already published data.
-It can help municipalities in Germany, as well as data analysts and journalists, to see which cities have already published data sets and what might be missing.
-The project uses a taxonomy to classify the data and the model was specifically trained for the project and the classification task. It thus has a clear intended downstream task and should be used with the mentioned taxonomy.
-**Information about the underlying taxonomy:**
-The used taxonomy 'Musterdatenkatalog' has two levels: 'Thema' and 'Bezeichnung' which roughly translates to topic and label. There are 25 entries for the top level ranging from topics such as 'Finanzen' (finance) to 'Gesundheit' (health).
-The second level, 'Bezeichnung' (label) goes into more detail and would for example contain 'Krankenhaus' (hospital) in the case of the topic being health. The second level contains 241 labels. The combination of topic and label (Thema + Bezeichnung) creates a 'Musterdatensatz'.
-One can classify the data into the topics or the labels, results for both are presented down below. Although matching to other taxonomies is provdided in the published rdf version of the taxonomy (todo), the model is tailored to this taxonomy.
-- **Developed by:** and-effect
 - **Model type:** Text Classification
 - **Language(s) (NLP):** de
-- **Finetuned from model:** "bert-base-german-case. For more information one the model check on [this model card](https://huggingface.co/bert-base-german-cased)"
-## Model Sources
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Demo:** [More Information Needed]
-# Direct Use
-This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
-## Get Started with Sentence Transformers
-Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
-```
-pip install -U sentence-transformers
-```
-Then you can use the model like this:
-```python
-from sentence_transformers import SentenceTransformer
-sentences = ["This is an example sentence", "Each sentence is converted"]
-model = SentenceTransformer('{MODEL_NAME}')
-embeddings = model.encode(sentences)
-print(embeddings)
 ```
-## Get Started with HuggingFace Transformers
-Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.
 ```python
-from transformers import AutoTokenizer, AutoModel
-import torch
-#Mean Pooling - Take attention mask into account for correct averaging
-def mean_pooling(model_output, attention_mask):
-    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
-    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
-    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
-# Sentences we want sentence embeddings for
-sentences = ['This is an example sentence', 'Each sentence is converted']
-# Load model from HuggingFace Hub
-tokenizer = AutoTokenizer.from_pretrained('{MODEL_NAME}')
-model = AutoModel.from_pretrained('{MODEL_NAME}')
-# Tokenize sentences
-encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
-# Compute token embeddings
-with torch.no_grad():
-    model_output = model(**encoded_input)
-# Perform pooling. In this case, mean pooling.
-sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
-print("Sentence embeddings:")
-print(sentence_embeddings)
 ```
-# Downstream Use
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-The model is intended to classify open source dataset titles from german municipalities. The model is specifically tailored for this task and uses a specific taxonomy.
-More information on the taxonomy (classification categories) and the Project can be found on the [project website](https://www.bertelsmann-stiftung.de/de/unsere-projekte/smart-country/musterdatenkatalog).
-# Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-The model has some limititations. The model has some limitations in terms of the downstream task.
 1. **Distribution of classes**: The dataset trained on is small, but at the same time the number of classes is very high. Thus, for some classes there are only a few examples (more information about the class distribution of the training data can be found here). Consequently, the performance for smaller classes may not be as good as for the majority classes. Accordingly, the evaluation is also limited.
-2. **Systematic problems**: some subjects could not be correctly classified systematically. One example is the embedding of titles containing 'Corona'. In none of the evaluation cases could the titles be embedded in such a way that they corresponded to their true names. Another systematic example is the embedding and classification of titles related to 'migration'.
 3. **Generalization of the model**: by using semantic search, the model is able to classify titles into new categories that have not been trained, but the model is not tuned for this and therefore the performance of the model for unseen classes is likely to be limited.
-## Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-# Training Details
 ## Training Data
-<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-You can find all information about the training data [here](https://huggingface.co/datasets/and-effect/mdk_gov_data_titles_clf). For the Fine Tuning we used the revision 172e61bb1dd20e43903f4c51e5cbec61ec9ae6e6 of the data, since the performance was better with this previous version of the data.
 ## Training Procedure
@@ -193,7 +159,7 @@ The model is fine tuned with similar and dissimilar pairs. Similar pairs are bui
 | test_similar_pairs | 498 |
 | test_unsimilar_pairs | 249 |
 ## Training Parameter
 The model was trained with the parameters:
@@ -206,20 +172,14 @@ The model was trained with the parameters:
 Hyperparameter:
-```
 {
     "epochs": 3,
     "warmup_steps": 100,
 }
 ```
-### Speeds, Sizes, Times
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-# Evaluation
 All metrics express the model's ability to classify dataset titles from GOVDATA into the taxonomy described [here](https://huggingface.co/datasets/and-effect/mdk_gov_data_titles_clf). For more information see VERLINKUNG MDK Projekt.
@@ -242,7 +202,6 @@ The tasks denoted with 'I' include all classes.
 The tasks are split not only into either including all classes ('I') or not ('II'), they are also divided into a task on 'Bezeichnung' or 'Thema'.
 As previously mentioned this has to do with the underlying taxonomy. The task on 'Thema' is performed on the first level of the taxonomy with 25 classes, the task on 'Bezeichnung' is performed on the second level which has 241 classes.
 ## Results
 | ***task*** | ***accuracy*** | ***precision (macro)*** | ***recall (macro)*** | ***f1 (macro)*** |
@@ -255,12 +214,6 @@ As previously mentioned this has to do with the underlying taxonomy. The task on
 | Validation dataset 'Bezeichnung' II | 0.51 | 0.58 | 0.69 | 0.59 |
 \* the accuracy in brackets was calculated with a manual analysis. This was done to check for data entries that could for example be part of more than one class and thus were actually correctly classified by the algorithm.
-In this step the correct labeling of the test data was also checked again for possible mistakes and resulted in a better performance.
-The validation dataset was created manually to check certain classes
-## Additional Information
-### Licensing Information
-CC BY 4.0

     - type: f1
       value: 0.88
       name: F1 'Thema' (macro)
+license: cc-by-4.0
 ---
 # Model Card for Musterdatenkatalog Classifier
 ## Model Description
+- **Developed by:** [and-effect](https://www.and-effect.com/)
+- **Project by**: [Bertelsmann Stiftung](https://www.bertelsmann-stiftung.de/de/startseite)
 - **Model type:** Text Classification
 - **Language(s) (NLP):** de
+- **Finetuned from model:** bert-base-german-cased. For more information on the model, see [this model card](https://huggingface.co/bert-base-german-cased).
+- **license**: cc-by-4.0
+## Model Sources
+- **Repository**:
+- **Demo**: [Spaces App](https://huggingface.co/spaces/and-effect/Musterdatenkatalog)
+This model is based on [bert-base-german-cased](https://huggingface.co/bert-base-german-cased) and fine-tuned on [and-effect/mdk_gov_data_titles_clf](https://huggingface.co/datasets/and-effect/mdk_gov_data_titles_clf). It was created as part of the [Bertelsmann Foundation's Musterdatenkatalog (MDK)](https://www.bertelsmann-stiftung.de/de/unsere-projekte/smart-country/musterdatenkatalog) project. The model is intended to classify the titles of open datasets published by German municipalities. This can help municipalities in Germany, as well as data analysts and journalists, see which cities have already published datasets and what might be missing. The model is tailored to this task and to a specific taxonomy, so it has a clear intended downstream use and should be used with that taxonomy.
+**Information about the underlying taxonomy:**
+The taxonomy, 'Musterdatenkatalog', has two levels: 'Thema' and 'Bezeichnung', which roughly translate to topic and label. The top level has 25 entries, with topics such as 'Finanzen' (finance) and 'Gesundheit' (health).
+The second level, 'Bezeichnung' (label), goes into more detail; for the topic health it contains, for example, 'Krankenhaus' (hospital). The second level contains 241 labels. The combination of topic and label (Thema + Bezeichnung) forms a 'Musterdatensatz'. Data can be classified into either topics or labels; results for both are presented below. Although a mapping to other taxonomies is provided in the published RDF version of the taxonomy, the model is tailored to this taxonomy. You can find the taxonomy in RDF format [here](https://huggingface.co/datasets/and-effect/MDK_taxonomy). Also have a look at our visualization of the taxonomy [here](https://huggingface.co/spaces/and-effect/Musterdatenkatalog).
+## Use model for classification
+Please make sure that you have installed the following packages:
+```bash
+pip install sentence-transformers huggingface_hub
 ```
+In order to run the algorithm use the following code:
 ```python
+import sys
+from huggingface_hub import snapshot_download
+path = snapshot_download(
+    cache_dir="tmp/",
+    repo_id="and-effect/musterdatenkatalog_clf",
+    revision="main",
+)
+sys.path.append(path)
+from pipeline import PipelineWrapper
+pipeline = PipelineWrapper(path=path)
+queries = [
+    {"id": "1", "title": "Spielplätze"},
+    {"id": "2", "title": "Berliner Weihnachtsmärkte 2022"},
+    {"id": "3", "title": "Hochschulwechslerquoten zum Masterstudium nach Bundesländern"},
+]
+output = pipeline(queries)
 ```
+The input data must be a list of dictionaries. Each dictionary must contain the keys 'id' and 'title'; the value of 'title' is the input for the pipeline. The output is again a list of dictionaries, each containing the id, the title and a 'prediction' key holding the prediction of the algorithm.
+## Classification Process
+The classification is implemented as semantic search. Both the taxonomy and the queries (here, dataset titles) are embedded with the model. The label with the highest cosine similarity to the query is then selected.
+![](assets/semantic_search.png)
+## Direct Use
+Direct use of the model is possible with Sentence Transformers or Hugging Face Transformers. Since this model was developed only for classifying dataset titles from GOV Data into the taxonomy described above, we do not recommend using the model as an embedder for other domains.
+## Bias, Risks, and Limitations
+The model has some limitations with respect to the downstream task.
 1. **Distribution of classes**: The dataset trained on is small, but at the same time the number of classes is very high. Thus, for some classes there are only a few examples (more information about the class distribution of the training data can be found here). Consequently, the performance for smaller classes may not be as good as for the majority classes. Accordingly, the evaluation is also limited.
+2. **Systematic problems**: some subjects were systematically misclassified. One example is the embedding and classification of titles related to 'migration': in none of the evaluation cases were the titles embedded in such a way that they matched their true labels.
 3. **Generalization of the model**: by using semantic search, the model is able to classify titles into new categories that have not been trained, but the model is not tuned for this and therefore the performance of the model for unseen classes is likely to be limited.
+## Training Details
 ## Training Data
+You can find all information about the training data [here](https://huggingface.co/datasets/and-effect/mdk_gov_data_titles_clf). For fine-tuning, we used revision 172e61bb1dd20e43903f4c51e5cbec61ec9ae6e6 of the data, since performance was better with this earlier version. We additionally applied [AugmentedSBERT](https://www.sbert.net/examples/training/data_augmentation/README.html) to extend the dataset for better performance.
 ## Training Procedure
 | test_similar_pairs | 498 |
 | test_unsimilar_pairs | 249 |
+We trained a CrossEncoder on this data and used it to generate new samples based on the dataset titles (silver data). Using both the original and the silver data, we then fine-tuned a bi-encoder, which is the resulting model.
 ## Training Parameter
 The model was trained with the parameters:
 Hyperparameter:
+```json
 {
     "epochs": 3,
     "warmup_steps": 100
 }
 ```
+## Evaluation
 All metrics express the model's ability to classify dataset titles from GOVDATA into the taxonomy described [here](https://huggingface.co/datasets/and-effect/mdk_gov_data_titles_clf). For more information see VERLINKUNG MDK Projekt.
 The tasks are split not only into either including all classes ('I') or not ('II'), they are also divided into a task on 'Bezeichnung' or 'Thema'.
 As previously mentioned this has to do with the underlying taxonomy. The task on 'Thema' is performed on the first level of the taxonomy with 25 classes, the task on 'Bezeichnung' is performed on the second level which has 241 classes.
 ## Results
 | ***task*** | ***accuracy*** | ***precision (macro)*** | ***recall (macro)*** | ***f1 (macro)*** |
 | Validation dataset 'Bezeichnung' II | 0.51 | 0.58 | 0.69 | 0.59 |
 \* the accuracy in brackets was calculated with a manual analysis. This was done to check for data entries that could for example be part of more than one class and thus were actually correctly classified by the algorithm.
+During this step, the labels of the test data were also rechecked for possible mistakes, which resulted in better performance.
+The validation dataset was created manually to check certain classes.
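
Note on the 'Classification Process' section added in this diff: the semantic-search step it describes (embed the taxonomy labels and the query titles, then pick the label with the highest cosine similarity) can be illustrated with a minimal sketch. This is not the code in `pipeline.py`. The function names `toy_embed`, `cosine_similarity` and `classify` are hypothetical, the bag-of-words embedding is only a stand-in for the fine-tuned model, and the label strings (including 'Haushalt' and the "Thema - Bezeichnung" format) are illustrative assumptions.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for the model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(queries, labels, embed=toy_embed):
    """Assign each query title the label with the highest cosine similarity."""
    # Embed the taxonomy labels once, then compare every query against them.
    label_vectors = [(label, embed(label)) for label in labels]
    results = []
    for query in queries:
        query_vector = embed(query["title"])
        best_label, _ = max(
            label_vectors,
            key=lambda item: cosine_similarity(query_vector, item[1]),
        )
        results.append(
            {"id": query["id"], "title": query["title"], "prediction": best_label}
        )
    return results


# Illustrative labels only; the real taxonomy has 25 topics and 241 labels.
labels = ["Gesundheit - Krankenhaus", "Finanzen - Haushalt"]
queries = [
    {"id": "1", "title": "Krankenhaus Standorte"},
    {"id": "2", "title": "Haushalt 2022"},
]
output = classify(queries, labels)
```

The sketch mirrors the documented input and output shape (a list of dictionaries with 'id' and 'title' in, the same plus 'prediction' out). In the actual model, the embedding function would presumably be the fine-tuned bi-encoder rather than a word-count vector.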

assets/semantic_search.png ADDED

pipeline.py CHANGED

@@ -1,4 +1,3 @@
-from typing import Any, Dict, List
 from sentence_transformers import SentenceTransformer
 from sentence_transformers import util
 import torch

 from sentence_transformers import SentenceTransformer
 from sentence_transformers import util
 import torch