Update README.md
README.md CHANGED
@@ -8,7 +8,7 @@ language:

 <!-- Provide a quick summary of what the model is/does. -->

-The [SwissBERT](https://huggingface.co/ZurichNLP/swissbert) model was finetuned via
+The [SwissBERT](https://huggingface.co/ZurichNLP/swissbert) model was finetuned via self-supervised [SimCSE](http://dx.doi.org/10.18653/v1/2021.emnlp-main.552) (Gao et al., EMNLP 2021) for sentence embeddings, using ~1 million Swiss news articles published in 2022 from [Swissdox@LiRI](https://t.uzh.ch/1hI). Following the [Sentence Transformers](https://huggingface.co/sentence-transformers) approach (Reimers and Gurevych,
 2019), the average of the last hidden states (pooler_type=avg) is used as sentence representation.

 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6564ab8d113e2baa55830af0/zUUu7WLJdkM2hrIE5ev8L.png)
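For illustration, a minimal sketch of the pooling described in the updated summary, i.e. averaging the last hidden states over non-padding tokens; the checkpoint id, language code and adapter call below are assumptions rather than details taken from this card:

```python
# Sketch: mean pooling of the last hidden states as the sentence representation.
# MODEL_ID is a placeholder; substitute the sentence-embedding checkpoint of this repo.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "ZurichNLP/swissbert"  # placeholder, not necessarily this model's repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def embed(sentences, lang="de_CH"):
    # SwissBERT is X-MOD based; selecting the language adapter like this is an assumption.
    if hasattr(model, "set_default_language"):
        model.set_default_language(lang)
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (batch, tokens, hidden)
    mask = batch["attention_mask"].unsqueeze(-1).float() # (batch, tokens, 1)
    return (hidden * mask).sum(1) / mask.sum(1)          # average over real tokens only

embeddings = embed(["Der neue Bahnhof wurde eröffnet.", "Ein Film feiert Premiere."])
print(embeddings.shape)  # (2, hidden_size)
```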
@@ -115,13 +115,13 @@ The sentence swissBERT model has been trained on news articles only. Hence, it m

 <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

-German, French, Italian and Romansh documents in the [Swissdox@LiRI database](https://t.uzh.ch/1hI)
+German, French, Italian and Romansh documents in the [Swissdox@LiRI database](https://t.uzh.ch/1hI) up to 2023.

 ### Training Procedure

 <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

-This model was finetuned via
+This model was finetuned via self-supervised [SimCSE](http://dx.doi.org/10.18653/v1/2021.emnlp-main.552). The positive sequence pairs consist of the article body vs. its title and lead, without any hard negatives.

 The fine-tuning script can be accessed [here](https://github.com/jgrosjean-mathesis/swissbert-for-sentence-embeddings/tree/main).

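For illustration, a SimCSE-style contrastive loss with in-batch negatives, sketching the objective described above: each article body is paired with its title and lead as the positive, and the other pairs in the batch act as negatives. The temperature matches the hyperparameter added in the next hunk; the linked fine-tuning script remains the authoritative implementation.

```python
# Illustrative sketch of a SimCSE-style objective with in-batch negatives (not the
# linked fine-tuning script). body_emb and title_lead_emb hold one embedding per
# article body and per title+lead, with matching rows forming the positive pairs.
import torch
import torch.nn.functional as F

def simcse_style_loss(body_emb: torch.Tensor, title_lead_emb: torch.Tensor,
                      temperature: float = 0.05) -> torch.Tensor:
    body_emb = F.normalize(body_emb, dim=-1)
    title_lead_emb = F.normalize(title_lead_emb, dim=-1)
    # Cosine similarity of every body against every title+lead in the batch.
    sim = body_emb @ title_lead_emb.T / temperature      # (batch, batch)
    # The matching pair sits on the diagonal; treat its index as the target class.
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)
```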
@@ -130,6 +130,7 @@ The fine-tuning script can be accessed [here](https://github.com/jgrosjean-mathe
 - Number of epochs: 1
 - Learning rate: 1e-5
 - Batch size: 512
+- Temperature: 0.05

 ## Evaluation

@@ -139,24 +140,24 @@ The fine-tuning script can be accessed [here](https://github.com/jgrosjean-mathe

 <!-- This should link to a Dataset Card if possible. -->

-The two evaluation tasks make use of the [20 Minuten dataset](https://www.zora.uzh.ch/id/eprint/234387/) compiled by Kew et al. (2023), which contains Swiss news articles with topic tags and summaries. Parts of the dataset were automatically translated to French
+The two evaluation tasks make use of the [20 Minuten dataset](https://www.zora.uzh.ch/id/eprint/234387/) compiled by Kew et al. (2023), which contains Swiss news articles with topic tags and summaries. Parts of the dataset were automatically translated to French and Italian using a Google Cloud API and to Romansh via a [Textshuttle](https://textshuttle.com/en) API.

 #### Evaluation via Semantic Textual Similarity

 <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

-Embeddings are computed for the summary and content of each document. Subsequently, the embeddings are matched by
+Embeddings are computed for the summary and content of each document. Subsequently, the embeddings are matched by maximizing cosine similarity scores between each summary and content embedding pair.

-The performance is measured via accuracy, i.e. the ratio of correct vs.
+The performance is measured via accuracy, i.e. the ratio of correct vs. total matches. The script can be found [here](https://github.com/jgrosjean-mathesis/swissbert-for-sentence-embeddings/tree/main).


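A rough sketch of the matching and accuracy computation described above, assuming row i of the summary and content matrices belongs to the same document; the linked script is the authoritative version:

```python
# Sketch: assign each summary embedding to the content embedding with the highest
# cosine similarity, and report the share of summaries matched to their own article.
import numpy as np

def sts_matching_accuracy(summary_emb: np.ndarray, content_emb: np.ndarray) -> float:
    """Both arrays have shape (n_documents, hidden); row i of each belongs to document i."""
    s = summary_emb / np.linalg.norm(summary_emb, axis=1, keepdims=True)
    c = content_emb / np.linalg.norm(content_emb, axis=1, keepdims=True)
    best_match = (s @ c.T).argmax(axis=1)  # most similar content for each summary
    return float((best_match == np.arange(len(s))).mean())
```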
 #### Evaluation via Text Classification

 <!-- These are the evaluation metrics being used, ideally with a description of why. -->

-Articles with the topic tags "movies/tv series", "corona" and "football" (or related) are filtered from the corpus and split into training data (80%) and test data (20%). Subsequently, embeddings are set up for the train and test data. The test data is then classified using the training data via a k-nearest
+Articles with the topic tags "movies/tv series", "corona" and "football" (or related) are filtered from the corpus and split into training data (80%) and test data (20%). Subsequently, embeddings are set up for the train and test data. The test data is then classified using the training data via a k-nearest neighbors approach. The script can be found [here](https://github.com/jgrosjean-mathesis/swissbert-for-sentence-embeddings/tree/main).

-Note: For French and
+Note: For French, Italian and Romansh, the training data remains in German, while the test data consists of translations. This provides insight into the model's cross-lingual transfer abilities.

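A sketch of the k-nearest neighbors classification described above; the value of k, the distance metric and the data handling are assumptions, since the linked script defines the actual setup. For the cross-lingual variants, the training embeddings would come from the German articles and the test embeddings from the translated test split.

```python
# Sketch: classify test embeddings by the labels of their nearest training embeddings.
# k and metric are assumptions, not taken from this card or the linked script.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def knn_accuracy(train_emb, train_labels, test_emb, test_labels, k: int = 5) -> float:
    clf = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    clf.fit(train_emb, train_labels)
    return accuracy_score(test_labels, clf.predict(test_emb))
```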
 ### Results

@@ -169,11 +170,11 @@ Making use of an unsupervised training approach, Swissbert for Sentence Embeddin
 | Semantic Similarity FR | 82.30 | - |**92.90** | - | 91.10 | - |
 | Semantic Similarity IT | 83.00 | - |**91.20** | - | 89.80 | - |
 | Semantic Similarity RM | 78.80 | - |**90.80** | - | 67.90 | - |
-| Text Classification DE | 95.76 |
-| Text Classification FR | 94.55 | 88.52 |
-| Text Classification IT | 93.48 | 88.29 |
+| Text Classification DE | 95.76 | 91.99 | 96.36 |**92.11**| 96.37 | 96.34 |
+| Text Classification FR | 94.55 | 88.52 | 95.76 |**90.94**| 99.35 | 99.35 |
+| Text Classification IT | 93.48 | 88.29 | 95.44 | 90.44 | 95.91 |**92.05**|
 | Text Classification RM | | | | | | |

 #### Baseline

-The baseline uses mean pooling embeddings from the last hidden state of the original swissbert model.
+The baseline uses mean pooling embeddings from the last hidden state of the original SwissBERT model and the currently best-performing Sentence-BERT model, [distiluse-base-multilingual-cased-v1](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1).