---
language:
- de
- fr
- it
- rm
---

<!-- Provide a quick summary of what the model is/does. -->

The [SwissBERT](https://huggingface.co/ZurichNLP/swissbert) model was fine-tuned via self-supervised [SimCSE](http://dx.doi.org/10.18653/v1/2021.emnlp-main.552) (Gao et al., EMNLP 2021) for sentence embeddings, using ~1.5 million Swiss news articles published up to 2023 and retrieved via [Swissdox@LiRI](https://t.uzh.ch/1hI). Following the [Sentence Transformers](https://huggingface.co/sentence-transformers) approach (Reimers and Gurevych, 2019), the average of the last hidden states (pooler_type=avg) is used as the sentence representation.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6564ab8d113e2baa55830af0/owPx_nx0evzbl8aGm-UzQ.png)

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** [Juri Grosjean](https://huggingface.co/jgrosjean)
- **Model type:** [XMOD](https://huggingface.co/facebook/xmod-base)
- **Language(s) (NLP):** de_CH, fr_CH, it_CH, rm_CH
- **License:** Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
- **Finetuned from model:** [SwissBERT](https://huggingface.co/ZurichNLP/swissbert)

## Use

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

```python
import torch

from transformers import AutoModel, AutoTokenizer

# Load the Sentence SwissBERT model and its tokenizer
model_name = "jgrosjean-mathesis/sentence-swissbert"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def generate_sentence_embedding(sentence, language):

    # Set adapter to specified language
    if "de" in language:
        model.set_default_language("de_CH")
    if "fr" in language:
        model.set_default_language("fr_CH")
    if "it" in language:
        model.set_default_language("it_CH")
    if "rm" in language:
        model.set_default_language("rm_CH")

    # Tokenize input sentence
    inputs = tokenizer(sentence, padding=True, truncation=True, return_tensors="pt", max_length=512)

    # Take tokenized input and pass it through the model
    with torch.no_grad():
        outputs = model(**inputs)

    # Extract sentence embeddings via mean pooling
    token_embeddings = outputs.last_hidden_state
    attention_mask = inputs['attention_mask'].unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * attention_mask, 1)
    sum_mask = torch.clamp(attention_mask.sum(1), min=1e-9)
    embedding = sum_embeddings / sum_mask

    return embedding

# Try it out
sentence_0 = "Wir feiern am 1. August den Schweizer Nationalfeiertag."
sentence_0_embedding = generate_sentence_embedding(sentence_0, language="de")
print(sentence_0_embedding)
```
Output:
```
tensor([[ 5.6306e-02, -2.8375e-01, -4.1495e-02,  7.4393e-02, -3.1552e-01,
          1.5213e-01, -1.0258e-01,  2.2790e-01, -3.5968e-02,  3.1769e-01,
          1.9354e-01,  1.9748e-02, -1.5236e-01, -2.2657e-01,  1.3345e-02,
        ...]])
```

### Semantic Textual Similarity

```python
from sklearn.metrics.pairwise import cosine_similarity

# Define two sentences
sentence_1 = ["Der Zug kommt um 9 Uhr in Zürich an."]
sentence_2 = ["Le train arrive à Lausanne à 9h."]

# Compute embedding for both
embedding_1 = generate_sentence_embedding(sentence_1, language="de")
embedding_2 = generate_sentence_embedding(sentence_2, language="fr")

# Compute cosine-similarity
cosine_score = cosine_similarity(embedding_1, embedding_2)

# Output the score
print("The cosine score for", sentence_1, "and", sentence_2, "is", cosine_score)
```
Output:
```
The cosine score for ['Der Zug kommt um 9 Uhr in Zürich an.'] and ['Le train arrive à Lausanne à 9h.'] is [[0.85555995]]
```

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->
The Sentence SwissBERT model has been trained on news articles only. Hence, it might not perform as well on other types of text. Furthermore, it is specific to a Switzerland-related context, so it probably does not perform as well on text outside that context. Additionally, the model has neither been trained nor evaluated for machine translation tasks.

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

German, French, Italian and Romansh documents in the [Swissdox@LiRI database](https://t.uzh.ch/1hI) up to 2023.

### Training Procedure 

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

This model was fine-tuned via self-supervised [SimCSE](http://dx.doi.org/10.18653/v1/2021.emnlp-main.552). Each positive pair consists of an article body and the same article's title and lead; no hard negatives are used.
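The contrastive objective behind this setup can be illustrated with a minimal sketch. It assumes the standard SimCSE in-batch loss, where every other article in the batch serves as a negative. The function name `simcse_loss` and the toy data are hypothetical, and NumPy is used only for readability; the linked training script may differ in its details.

```python
import numpy as np

def simcse_loss(body_emb, title_emb, temperature=0.05):
    """In-batch contrastive (InfoNCE) loss over paired embeddings.

    body_emb, title_emb: arrays of shape (batch, dim); row i of each
    array comes from the same article and forms the positive pair.
    """
    # L2-normalize so that the dot product equals cosine similarity
    a = body_emb / np.linalg.norm(body_emb, axis=1, keepdims=True)
    b = title_emb / np.linalg.norm(title_emb, axis=1, keepdims=True)

    # (batch, batch) similarity matrix, scaled by the temperature
    logits = a @ b.T / temperature

    # Numerically stable log-softmax over each row
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The correct "class" for row i is column i (its own title and lead)
    return -np.mean(np.diag(log_probs))

# Toy data (hypothetical): aligned pairs should give a lower loss than mismatched ones
rng = np.random.default_rng(0)
bodies = rng.normal(size=(4, 8))
aligned = simcse_loss(bodies, bodies + 0.01 * rng.normal(size=(4, 8)))
shuffled = simcse_loss(bodies, bodies[::-1].copy())
print(aligned < shuffled)
```

A low temperature such as 0.05 sharpens the softmax, so small differences in cosine similarity between the positive and the in-batch negatives translate into large differences in the loss.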

The fine-tuning script can be accessed [here](https://github.com/jgrosjean-mathesis/sentence-swissbert/tree/main/training).

#### Training Hyperparameters

- Number of epochs: 1
- Learning rate: 1e-5
- Batch size: 512
- Temperature: 0.05

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

The two evaluation tasks make use of the [20 Minuten dataset](https://www.zora.uzh.ch/id/eprint/234387/) compiled by Kew et al. (2023), which contains Swiss news articles with topic tags and summaries. Parts of the dataset were automatically translated to French and Italian using a Google Cloud API and to Romansh via a [Textshuttle](https://textshuttle.com/en) API.

#### Evaluation via Semantic Textual Similarity

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

Embeddings are computed for the summary and content of each document. Subsequently, the embeddings are matched by maximizing cosine similarity scores between each summary and content embedding pair.

The performance is measured via accuracy, i.e. the ratio of correct vs. total matches. The script can be found [here](https://github.com/jgrosjean-mathesis/sentence-swissbert/tree/main/evaluation).
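As a rough illustration of this matching procedure, the sketch below assumes that each summary is assigned independently to the content embedding with the highest cosine similarity. The function name `matching_accuracy` and the toy vectors are hypothetical; the linked evaluation script is the authoritative version.

```python
import numpy as np

def matching_accuracy(summary_emb, content_emb):
    """Share of summaries whose most similar content is their own document.

    Row i of summary_emb and row i of content_emb belong to the same document.
    """
    s = summary_emb / np.linalg.norm(summary_emb, axis=1, keepdims=True)
    c = content_emb / np.linalg.norm(content_emb, axis=1, keepdims=True)
    sims = s @ c.T                    # sims[i, j] = cos(summary i, content j)
    predicted = sims.argmax(axis=1)   # best-matching content for each summary
    return float(np.mean(predicted == np.arange(len(s))))

# Toy data (hypothetical): three documents with roughly aligned embeddings
summaries = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]])
contents = np.array([[0.9, 0.2, 0.1], [0.1, 0.8, 0.2], [0.2, 0.1, 0.9]])
print(matching_accuracy(summaries, contents))
```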


#### Evaluation via Text Classification

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

Articles with the topic tags "movies/tv series", "corona" and "football" (or related) are filtered from the corpus and split into training data (80%) and test data (20%). Subsequently, embeddings are computed for the training and test data. The test data is then classified using the training data via a k-nearest neighbors approach. The script can be found [here](https://github.com/jgrosjean-mathesis/sentence-swissbert/tree/main/evaluation).

Note: For French, Italian and Romansh, the training data remains in German, while the test data consists of translations. This provides insight into the model's cross-lingual transfer abilities.
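The classification step can be sketched as follows. This is a minimal illustration that assumes cosine similarity as the distance measure and a majority vote among `k=3` neighbors; both choices, the function name `knn_predict` and the toy embeddings are assumptions, not settings taken from the evaluation script.

```python
from collections import Counter

import numpy as np

def knn_predict(train_emb, train_labels, test_emb, k=3):
    """Label each test embedding by majority vote among its k nearest training embeddings."""
    tr = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    te = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    sims = te @ tr.T                             # cosine similarities, shape (n_test, n_train)
    nearest = np.argsort(-sims, axis=1)[:, :k]   # indices of the k most similar training items
    return [Counter(train_labels[i] for i in row).most_common(1)[0][0] for row in nearest]

# Toy data (hypothetical): two topics that are well separated in embedding space
train_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                      [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]])
train_labels = ["football"] * 3 + ["corona"] * 3
test_emb = np.array([[0.95, 0.05], [0.05, 0.95]])
predictions = knn_predict(train_emb, train_labels, test_emb, k=3)
print(predictions)
```

In the cross-lingual setting described in the note above, `train_emb` would hold embeddings of German articles and `test_emb` embeddings of the translated articles.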

### Results

Sentence SwissBERT achieves results comparable to or better than those of the best-performing multilingual Sentence-BERT model on these tasks (distiluse-base-multilingual-cased). It outperforms that model in all evaluation tasks except text classification in Italian.

| Evaluation task        |Swissbert |           |Sentence Swissbert|           |Sentence-BERT|           |
|------------------------|----------|-----------|------------------|-----------|-------------|-----------|
|                        |accuracy  |f1-score   |accuracy          |f1-score   |accuracy     |f1-score   |
| Semantic Similarity DE | 87.20 %  |    --     |  **93.40 %**     |    --     |  91.80 %    |    --     |
| Semantic Similarity FR | 84.97 %  |    --     |  **93.99 %**     |    --     |  93.19 %    |    --     |
| Semantic Similarity IT | 84.17 %  |    --     |  **92.18 %**     |    --     |  91.58 %    |    --     |
| Semantic Similarity RM | 83.17 %  |    --     |  **91.58 %**     |    --     |  73.35 %    |    --     |
| Text Classification DE |   --     |           |       --         |**78.49 %**|     --      |  77.23 %  |
| Text Classification FR |   --     |           |       --         |**77.18 %**|     --      |  76.83 %  |
| Text Classification IT |   --     |           |       --         |  76.65 %  |     --      |**76.90 %**|
| Text Classification RM |   --     |           |       --         |**77.20 %**|     --      |  65.35 %  |

#### Baseline

The baselines are mean-pooled embeddings from the last hidden state of the original SwissBERT model and the (in these tasks) best-performing Sentence-BERT model, [distiluse-base-multilingual-cased-v1](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1).