Rahkakavee Baskaran committed
Commit 66bfffb
1 Parent(s): 3faa039
Files changed (1):
  1. README.md +7 -6
README.md CHANGED
@@ -65,7 +65,6 @@ license: cc-by-4.0
 - **Finetuned from model:** "bert-base-german-cased. For more information on the model, see [this model card](https://huggingface.co/bert-base-german-cased)"
 - **license**: cc-by-4.0
 
-
 ## Model Sources
 
 - **Repository**:
@@ -120,11 +119,13 @@ output = pipeline(queries)
 
 The input data must be a list of dictionaries. Each dictionary must contain the keys 'id' and 'title'. The value of 'title' is the input to the pipeline. The output is again a list of dictionaries, each containing the 'id', the 'title', and an additional key 'prediction' that holds the prediction of the algorithm.
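
As a rough illustration of that contract, here is a minimal sketch; the ids and titles are invented, and `pipeline` is the object built in the usage snippet referenced in the hunk header above:

```python
# Minimal sketch of the expected input/output shapes (example values are made up).
queries = [
    {"id": "1", "title": "Baumkataster der Stadt Bonn"},
    {"id": "2", "title": "Radverkehrsnetz Köln"},
]

# output = pipeline(queries)
# Each result mirrors its input entry and gains a 'prediction' key, e.g.:
# [
#   {"id": "1", "title": "Baumkataster der Stadt Bonn", "prediction": "<taxonomy label>"},
#   {"id": "2", "title": "Radverkehrsnetz Köln", "prediction": "<taxonomy label>"},
# ]
```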
 
+ If you want to predict only a few titles or test the model, you can also take a look at our algorithm demo [here](https://huggingface.co/spaces/and-effect/Musterdatenkatalog).
+
 ## Classification Process
 
 The classification is realized using semantic search. For this purpose, both the taxonomy and the queries, in this case dataset titles, are embedded with the model. Using cosine similarity, the label with the highest similarity to the query is determined.
 
- ![](assets/semantic_search.png)
+ ![Semantic Search](assets/semantic_search.png)
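
A minimal sketch of that lookup with `sentence-transformers`; the taxonomy labels and the model id below are placeholders for illustration, not the authoritative values:

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder taxonomy labels and query titles; use the full taxonomy in practice.
labels = ["Umwelt - Baumkataster", "Verkehr - Radverkehr", "Raumplanung - Bauleitplanung"]
titles = ["Baumkataster der Stadt Bonn"]

model = SentenceTransformer("and-effect/musterdatenkatalog_clf")  # assumed model id of this card

# Embed both the taxonomy and the queries, then pick the label with the
# highest cosine similarity for each title.
label_embeddings = model.encode(labels, convert_to_tensor=True)
title_embeddings = model.encode(titles, convert_to_tensor=True)
scores = util.cos_sim(title_embeddings, label_embeddings)
predictions = [labels[int(i)] for i in scores.argmax(dim=1)]
```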
 
 ## Direct Use
 
@@ -140,11 +141,10 @@ The model has some limitations. The model has some limitations in terms of the
 
 ## Training Details
 
- ## Training Data
+ ### Training Data
 
 You can find all information about the training data [here](https://huggingface.co/datasets/and-effect/mdk_gov_data_titles_clf). For the fine-tuning we used revision 172e61bb1dd20e43903f4c51e5cbec61ec9ae6e6 of the data, since the performance was better with this previous version of the data. We additionally applied [AugmentedSBERT](https://www.sbert.net/examples/training/data_augmentation/README.html) to extend the dataset for better performance.
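
To pin that exact revision when reproducing the setup, something along these lines should work with the `datasets` library (whether a config name is additionally required is not shown here, so treat the call as an assumption):

```python
from datasets import load_dataset

# Load the training data at the revision referenced above.
data = load_dataset(
    "and-effect/mdk_gov_data_titles_clf",
    revision="172e61bb1dd20e43903f4c51e5cbec61ec9ae6e6",
)
print(data)
```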
 
- ## Training Procedure
 
 ### Preprocessing
 
@@ -160,7 +160,8 @@ The model is fine-tuned with similar and dissimilar pairs. Similar pairs are bui
 | test_unsimilar_pairs | 249 |
 
 We trained a CrossEncoder on this data and used it to generate new samples from the dataset titles (silver data). Using both, we then fine-tuned a bi-encoder, which is the resulting model.
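
A condensed sketch of that final bi-encoder step; the pairs, scores, and training settings below are placeholders, not the values actually used (those follow in the hyperparameter section):

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# Placeholder gold/silver pairs: (dataset title, taxonomy label) with a similarity score in [0, 1].
train_examples = [
    InputExample(texts=["Baumkataster der Stadt Bonn", "Umwelt - Baumkataster"], label=1.0),
    InputExample(texts=["Baumkataster der Stadt Bonn", "Verkehr - Radverkehr"], label=0.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Bi-encoder initialised from the base model named at the top of the card.
model = SentenceTransformer("bert-base-german-cased")
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```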
- ## Training Parameter
+
+ ### Training Parameter
 
 The model was trained with the parameters:
 
@@ -170,7 +171,7 @@ The model was trained with the parameters:
 **Loss**:
 `sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss`
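
For orientation: this loss embeds both texts of a pair, takes the cosine similarity of the two embeddings $u$ and $v$, and in its default configuration penalizes the squared difference to the gold score $y$:

$$
\mathcal{L}(u, v, y) = \bigl(\cos(u, v) - y\bigr)^{2}
$$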
 
- Hyperparameter:
+ Hyperparameters:
 
 ```json
 {
 