<!-- Provide a quick summary of what the model is/does. -->

The [SwissBERT](https://huggingface.co/ZurichNLP/swissbert) model was finetuned via unsupervised [SimCSE](http://dx.doi.org/10.18653/v1/2021.emnlp-main.552) (Gao et al., EMNLP 2021) for sentence embeddings, using ~1 million Swiss news articles published in 2022 from [Swissdox@LiRI](https://t.uzh.ch/1hI). Following the [Sentence Transformers](https://huggingface.co/sentence-transformers) approach (Reimers and Gurevych, 2019), the average of the last hidden states (`pooler_type=avg`) is used as the sentence representation.
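As an illustration of this pooling scheme, here is a minimal sketch of how such a sentence embedding can be computed with `transformers`. It loads the base SwissBERT checkpoint as a stand-in (the identifier of the finetuned checkpoint is not shown in this card), and the helper name is illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Sketch only: uses the base SwissBERT checkpoint; swap in the finetuned model's ID.
tokenizer = AutoTokenizer.from_pretrained("ZurichNLP/swissbert")
model = AutoModel.from_pretrained("ZurichNLP/swissbert")
model.set_default_language("de_CH")  # X-MOD models route inputs through per-language adapters
model.eval()

def embed(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        last_hidden = model(**inputs).last_hidden_state       # (1, seq_len, hidden_size)
    mask = inputs["attention_mask"].unsqueeze(-1)             # exclude padding from the average
    return (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)  # average of the last hidden states

embedding = embed("Der neue Zug ist pünktlich abgefahren.")
```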
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6564ab8d113e2baa55830af0/zUUu7WLJdkM2hrIE5ev8L.png)
## Model Details

- **Developed by:** [Juri Grosjean](https://huggingface.co/jgrosjean)
- **Model type:** [XMOD](https://huggingface.co/facebook/xmod-base)
- **Language(s) (NLP):** de_CH, fr_CH, it_CH, rm_CH
- **License:** Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
- **Finetuned from model:** [SwissBERT](https://huggingface.co/ZurichNLP/swissbert)

## Use

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

German, French, Italian and Romansh documents published in 2022 in the [Swissdox@LiRI database](https://t.uzh.ch/1hI).

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

This model was finetuned via unsupervised [SimCSE](http://dx.doi.org/10.18653/v1/2021.emnlp-main.552): the same sequence is passed to the encoder twice and, because dropout is active during training, the two passes produce slightly different embeddings of the same sentence. The training objective minimizes the distance between these two embeddings.
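In code, one such training step looks roughly as follows. This is a schematic of the published unsupervised SimCSE objective, not the authors' script (which is linked below); the temperature value and helper names are assumptions taken from Gao et al. (2021). Note that besides pulling the two dropout views of a sentence together, the published loss also pushes them away from the other sentences in the batch (in-batch negatives):

```python
import torch
import torch.nn.functional as F

def mean_pool(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    mask = mask.unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def simcse_step(model, batch: dict, temperature: float = 0.05) -> torch.Tensor:
    """One unsupervised SimCSE loss computation on a batch of tokenized sentences."""
    model.train()  # dropout must be active: it provides the two different "views"
    z1 = mean_pool(model(**batch).last_hidden_state, batch["attention_mask"])
    z2 = mean_pool(model(**batch).last_hidden_state, batch["attention_mask"])
    # Cosine similarity of every sentence in view 1 against every sentence in view 2.
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
    # The matching dropout view (the diagonal) is the positive; all others are negatives.
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)
```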
The fine-tuning script can be accessed [here](Link).
#### Training Hyperparameters

Batch size: 512

### Testing Data, Factors & Metrics

#### Baselines

The first baseline is [distiluse-base-multilingual-cased](https://www.sbert.net/examples/training/multilingual/README.html), a high-performing multilingual Sentence Transformer model that supports German, French and Italian, among other languages.

The second baseline uses mean-pooled embeddings from the last hidden state of the original SwissBERT model.

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

The two evaluation tasks use the [20 Minuten dataset](https://www.zora.uzh.ch/id/eprint/234387/) compiled by Kew et al. (2023), which contains Swiss news articles with topic tags and summaries. Parts of the dataset were automatically translated to French and Italian using a Google Cloud API.

#### Evaluation via Semantic Textual Similarity

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

Embeddings are computed for the summary and the content of each document. Each summary embedding is then matched to the content embedding with which it has the highest cosine similarity.

Performance is measured via accuracy, i.e. the proportion of summaries matched to the correct content.
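A sketch of this matching step, assuming two aligned NumPy arrays of precomputed embeddings (the array names are illustrative):

```python
import numpy as np

def matching_accuracy(summary_embs: np.ndarray, content_embs: np.ndarray) -> float:
    """Share of summaries whose most cosine-similar content is the correct one."""
    s = summary_embs / np.linalg.norm(summary_embs, axis=1, keepdims=True)
    c = content_embs / np.linalg.norm(content_embs, axis=1, keepdims=True)
    sim = s @ c.T                   # (n_docs, n_docs) cosine similarity matrix
    predicted = sim.argmax(axis=1)  # row i: index of the best-matching content
    return float((predicted == np.arange(len(s))).mean())
```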

#### Evaluation via Text Classification

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

Articles with the topic tags "movies/tv series", "corona" and "football" (or related tags) are filtered from the corpus and split into training data (80%) and test data (20%). Embeddings are computed for both splits, and each test article is then classified via a k-nearest-neighbors search over the training embeddings (see the sketch below).

Note: For French and Italian, the training data remains in German, while the test data consists of translations. This provides insight into the model's cross-lingual transfer abilities.
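The classification step could be sketched as follows, assuming precomputed embeddings and topic labels for both splits (the array names and the choice of k are illustrative, not taken from the evaluation):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_topic_accuracy(train_embs: np.ndarray, train_labels: np.ndarray,
                       test_embs: np.ndarray, test_labels: np.ndarray,
                       k: int = 5) -> float:
    """Classify each test embedding by its k nearest training embeddings."""
    clf = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    clf.fit(train_embs, train_labels)
    return float(clf.score(test_embs, test_labels))
```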
### Results
[More Information Needed]