Update README.md
README.md CHANGED
@@ -8,7 +8,7 @@ language:

 <!-- Provide a quick summary of what the model is/does. -->

-The [SwissBERT](https://huggingface.co/ZurichNLP/swissbert) model was finetuned via
+The [SwissBERT](https://huggingface.co/ZurichNLP/swissbert) model was finetuned via self-supervised [SimCSE](http://dx.doi.org/10.18653/v1/2021.emnlp-main.552) (Gao et al., EMNLP 2021) for sentence embeddings, using ~1 million Swiss news articles published in 2022 from [Swissdox@LiRI](https://t.uzh.ch/1hI). Following the [Sentence Transformers](https://huggingface.co/sentence-transformers) approach (Reimers and Gurevych,
 2019), the average of the last hidden states (pooler_type=avg) is used as sentence representation.

 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6564ab8d113e2baa55830af0/zUUu7WLJdkM2hrIE5ev8L.png)
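For illustration, a minimal sketch of the pooling described in the updated summary, i.e. averaging the last hidden states over non-padding tokens; the checkpoint id, language code and adapter call below are assumptions rather than details taken from this card:

```python
# Sketch: mean pooling of the last hidden states as the sentence representation.
# MODEL_ID is a placeholder; substitute the sentence-embedding checkpoint of this repo.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "ZurichNLP/swissbert"  # placeholder, not necessarily this model's repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def embed(sentences, lang="de_CH"):
    # SwissBERT is X-MOD based; selecting the language adapter like this is an assumption.
    if hasattr(model, "set_default_language"):
        model.set_default_language(lang)
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (batch, tokens, hidden)
    mask = batch["attention_mask"].unsqueeze(-1).float() # (batch, tokens, 1)
    return (hidden * mask).sum(1) / mask.sum(1)          # average over real tokens only

embeddings = embed(["Der neue Bahnhof wurde eröffnet.", "Ein Film feiert Premiere."])
print(embeddings.shape)  # (2, hidden_size)
```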
@@ -115,13 +115,13 @@ The sentence swissBERT model has been trained on news articles only. Hence, it m

 <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

-German, French, Italian and Romansh documents in the [Swissdox@LiRI database](https://t.uzh.ch/1hI)
+German, French, Italian and Romansh documents in the [Swissdox@LiRI database](https://t.uzh.ch/1hI) up to 2023.

 ### Training Procedure

 <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

-This model was finetuned via
+This model was finetuned via self-supervised [SimCSE](http://dx.doi.org/10.18653/v1/2021.emnlp-main.552). The positive sequence pairs consist of the article body vs. its title and lead, without any hard negatives.

 The fine-tuning script can be accessed [here](https://github.com/jgrosjean-mathesis/swissbert-for-sentence-embeddings/tree/main).

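For illustration, a SimCSE-style contrastive loss with in-batch negatives, sketching the objective described above: each article body is paired with its title and lead as the positive, and the other pairs in the batch act as negatives. The temperature matches the hyperparameter added in the next hunk; the linked fine-tuning script remains the authoritative implementation.

```python
# Illustrative sketch of a SimCSE-style objective with in-batch negatives (not the
# linked fine-tuning script). body_emb and title_lead_emb hold one embedding per
# article body and per title+lead, with matching rows forming the positive pairs.
import torch
import torch.nn.functional as F

def simcse_style_loss(body_emb: torch.Tensor, title_lead_emb: torch.Tensor,
                      temperature: float = 0.05) -> torch.Tensor:
    body_emb = F.normalize(body_emb, dim=-1)
    title_lead_emb = F.normalize(title_lead_emb, dim=-1)
    # Cosine similarity of every body against every title+lead in the batch.
    sim = body_emb @ title_lead_emb.T / temperature      # (batch, batch)
    # The matching pair sits on the diagonal; treat its index as the target class.
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)
```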
@@ -130,6 +130,7 @@ The fine-tuning script can be accessed [here](https://github.com/jgrosjean-mathe
 - Number of epochs: 1
 - Learning rate: 1e-5
 - Batch size: 512
+- Temperature: 0.05

 ## Evaluation

@@ -139,24 +140,24 @@ The fine-tuning script can be accessed [here](https://github.com/jgrosjean-mathe

 <!-- This should link to a Dataset Card if possible. -->

-The two evaluation tasks make use of the [20 Minuten dataset](https://www.zora.uzh.ch/id/eprint/234387/) compiled by Kew et al. (2023), which contains Swiss news articles with topic tags and summaries. Parts of the dataset were automatically translated to French
+The two evaluation tasks make use of the [20 Minuten dataset](https://www.zora.uzh.ch/id/eprint/234387/) compiled by Kew et al. (2023), which contains Swiss news articles with topic tags and summaries. Parts of the dataset were automatically translated to French and Italian using a Google Cloud API and to Romansh via a [Textshuttle](https://textshuttle.com/en) API.

 #### Evaluation via Semantic Textual Similarity

 <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

-Embeddings are computed for the summary and content of each document. Subsequently, the embeddings are matched by
+Embeddings are computed for the summary and content of each document. Subsequently, the embeddings are matched by maximizing cosine similarity scores between each summary and content embedding pair.

-The performance is measured via accuracy, i.e. the ratio of correct vs.
+The performance is measured via accuracy, i.e. the ratio of correct vs. total matches. The script can be found [here](https://github.com/jgrosjean-mathesis/swissbert-for-sentence-embeddings/tree/main).


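A rough sketch of the matching and accuracy computation described above, assuming row i of the summary and content matrices belongs to the same document; the linked script is the authoritative version:

```python
# Sketch: assign each summary embedding to the content embedding with the highest
# cosine similarity, and report the share of summaries matched to their own article.
import numpy as np

def sts_matching_accuracy(summary_emb: np.ndarray, content_emb: np.ndarray) -> float:
    """Both arrays have shape (n_documents, hidden); row i of each belongs to document i."""
    s = summary_emb / np.linalg.norm(summary_emb, axis=1, keepdims=True)
    c = content_emb / np.linalg.norm(content_emb, axis=1, keepdims=True)
    best_match = (s @ c.T).argmax(axis=1)  # most similar content for each summary
    return float((best_match == np.arange(len(s))).mean())
```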
 #### Evaluation via Text Classification

 <!-- These are the evaluation metrics being used, ideally with a description of why. -->

-Articles with the topic tags "movies/tv series", "corona" and "football" (or related) are filtered from the corpus and split into training data (80%) and test data (20%). Subsequently, embeddings are set up for the train and test data. The test data is then classified using the training data via a k-nearest
+Articles with the topic tags "movies/tv series", "corona" and "football" (or related) are filtered from the corpus and split into training data (80%) and test data (20%). Subsequently, embeddings are set up for the train and test data. The test data is then classified using the training data via a k-nearest neighbors approach. The script can be found [here](https://github.com/jgrosjean-mathesis/swissbert-for-sentence-embeddings/tree/main).

-Note: For French and
+Note: For French, Italian and Romansh, the training data remains in German, while the test data consists of translations. This provides insight into the model's cross-lingual transfer abilities.

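A sketch of the k-nearest neighbors classification described above; the value of k, the distance metric and the data handling are assumptions, since the linked script defines the actual setup. For the cross-lingual variants, the training embeddings would come from the German articles and the test embeddings from the translated test split.

```python
# Sketch: classify test embeddings by the labels of their nearest training embeddings.
# k and metric are assumptions, not taken from this card or the linked script.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def knn_accuracy(train_emb, train_labels, test_emb, test_labels, k: int = 5) -> float:
    clf = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    clf.fit(train_emb, train_labels)
    return accuracy_score(test_labels, clf.predict(test_emb))
```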
 ### Results

@@ -169,11 +170,11 @@ Making use of an unsupervised training approach, Swissbert for Sentence Embeddin
 | Semantic Similarity FR | 82.30 | - |**92.90** | - | 91.10 | - |
 | Semantic Similarity IT | 83.00 | - |**91.20** | - | 89.80 | - |
 | Semantic Similarity RM | 78.80 | - |**90.80** | - | 67.90 | - |
-| Text Classification DE | 95.76 |
-| Text Classification FR | 94.55 | 88.52 |
-| Text Classification IT | 93.48 | 88.29 |
+| Text Classification DE | 95.76 | 91.99 | 96.36 |**92.11**| 96.37 | 96.34 |
+| Text Classification FR | 94.55 | 88.52 | 95.76 |**90.94**| 99.35 | 99.35 |
+| Text Classification IT | 93.48 | 88.29 | 95.44 | 90.44 | 95.91 |**92.05**|
 | Text Classification RM | | | | | | |

 #### Baseline

-The baseline uses mean pooling embeddings from the last hidden state of the original swissbert model.
+The baseline uses mean pooling embeddings from the last hidden state of the original SwissBERT model and the currently best-performing Sentence-BERT model, [distiluse-base-multilingual-cased-v1](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1).