<!-- Provide a quick summary of what the model is/does. -->

The [SwissBERT](https://huggingface.co/ZurichNLP/swissbert) model was finetuned via unsupervised [SimCSE](http://dx.doi.org/10.18653/v1/2021.emnlp-main.552) (Gao et al., EMNLP 2021) for sentence embeddings, using ~1 million Swiss news articles published in 2022 from [Swissdox@LiRI](https://t.uzh.ch/1hI). Following the [Sentence Transformers](https://huggingface.co/sentence-transformers) approach (Reimers and Gurevych, 2019), the average of the last hidden states (`pooler_type=avg`) is used as the sentence representation.
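As an illustration of this pooling scheme, here is a minimal sketch of how such a sentence embedding can be computed with `transformers`. It loads the base SwissBERT checkpoint as a stand-in (the identifier of the finetuned checkpoint is not shown in this card), and the helper name is illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Sketch only: uses the base SwissBERT checkpoint; swap in the finetuned model's ID.
tokenizer = AutoTokenizer.from_pretrained("ZurichNLP/swissbert")
model = AutoModel.from_pretrained("ZurichNLP/swissbert")
model.set_default_language("de_CH")  # X-MOD models route inputs through per-language adapters
model.eval()

def embed(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        last_hidden = model(**inputs).last_hidden_state       # (1, seq_len, hidden_size)
    mask = inputs["attention_mask"].unsqueeze(-1)             # exclude padding from the average
    return (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)  # average of the last hidden states

embedding = embed("Der neue Zug ist pünktlich abgefahren.")
```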
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6564ab8d113e2baa55830af0/zUUu7WLJdkM2hrIE5ev8L.png)
## Model Details

- **Developed by:** [Juri Grosjean](https://huggingface.co/jgrosjean)
- **Model type:** [XMOD](https://huggingface.co/facebook/xmod-base)
- **Language(s) (NLP):** de_CH, fr_CH, it_CH, rm_CH
- **License:** Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
- **Finetuned from model:** [SwissBERT](https://huggingface.co/ZurichNLP/swissbert)

## Use

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

German, French, Italian and Romansh documents published in 2022 in the [Swissdox@LiRI database](https://t.uzh.ch/1hI).

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

This model was finetuned via unsupervised [SimCSE](http://dx.doi.org/10.18653/v1/2021.emnlp-main.552): the same sequence is passed to the encoder twice and, because dropout is active during training, the two passes produce slightly different embeddings of the same sentence. The training objective minimizes the distance between these two embeddings.
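In code, one such training step looks roughly as follows. This is a schematic of the published unsupervised SimCSE objective, not the authors' script (which is linked below); the temperature value and helper names are assumptions taken from Gao et al. (2021). Note that besides pulling the two dropout views of a sentence together, the published loss also pushes them away from the other sentences in the batch (in-batch negatives):

```python
import torch
import torch.nn.functional as F

def mean_pool(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    mask = mask.unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def simcse_step(model, batch: dict, temperature: float = 0.05) -> torch.Tensor:
    """One unsupervised SimCSE loss computation on a batch of tokenized sentences."""
    model.train()  # dropout must be active: it provides the two different "views"
    z1 = mean_pool(model(**batch).last_hidden_state, batch["attention_mask"])
    z2 = mean_pool(model(**batch).last_hidden_state, batch["attention_mask"])
    # Cosine similarity of every sentence in view 1 against every sentence in view 2.
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
    # The matching dropout view (the diagonal) is the positive; all others are negatives.
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)
```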
The fine-tuning script can be accessed [here](Link).
#### Training Hyperparameters

Batch size: 512

### Testing Data, Factors & Metrics

#### Baselines

The first baseline is [distiluse-base-multilingual-cased](https://www.sbert.net/examples/training/multilingual/README.html), a high-performing multilingual Sentence Transformer model that supports German, French and Italian, among other languages.

The second baseline uses mean-pooled embeddings from the last hidden state of the original SwissBERT model.

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

The two evaluation tasks use the [20 Minuten dataset](https://www.zora.uzh.ch/id/eprint/234387/) compiled by Kew et al. (2023), which contains Swiss news articles with topic tags and summaries. Parts of the dataset were automatically translated to French and Italian using a Google Cloud API.

#### Evaluation via Semantic Textual Similarity

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

Embeddings are computed for the summary and the content of each document. Each summary embedding is then matched to the content embedding with which it has the highest cosine similarity.

Performance is measured via accuracy, i.e. the proportion of summaries matched to the correct content.
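A sketch of this matching step, assuming two aligned NumPy arrays of precomputed embeddings (the array names are illustrative):

```python
import numpy as np

def matching_accuracy(summary_embs: np.ndarray, content_embs: np.ndarray) -> float:
    """Share of summaries whose most cosine-similar content is the correct one."""
    s = summary_embs / np.linalg.norm(summary_embs, axis=1, keepdims=True)
    c = content_embs / np.linalg.norm(content_embs, axis=1, keepdims=True)
    sim = s @ c.T                   # (n_docs, n_docs) cosine similarity matrix
    predicted = sim.argmax(axis=1)  # row i: index of the best-matching content
    return float((predicted == np.arange(len(s))).mean())
```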

#### Evaluation via Text Classification

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

Articles with the topic tags "movies/tv series", "corona" and "football" (or related tags) are filtered from the corpus and split into training data (80%) and test data (20%). Embeddings are computed for both splits, and each test article is then classified via a k-nearest-neighbors search over the training embeddings (see the sketch below).

Note: For French and Italian, the training data remains in German, while the test data consists of translations. This provides insight into the model's cross-lingual transfer abilities.
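The classification step could be sketched as follows, assuming precomputed embeddings and topic labels for both splits (the array names and the choice of k are illustrative, not taken from the evaluation):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_topic_accuracy(train_embs: np.ndarray, train_labels: np.ndarray,
                       test_embs: np.ndarray, test_labels: np.ndarray,
                       k: int = 5) -> float:
    """Classify each test embedding by its k nearest training embeddings."""
    clf = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    clf.fit(train_embs, train_labels)
    return float(clf.score(test_embs, test_labels))
```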
### Results
[More Information Needed]