---
license: cc-by-nc-4.0
language:
- gsw
- multilingual
widget:
- text: I cha etz au Schwiizerdütsch. <mask> zäme! 😊
---
This model is based on xlm-roberta-base (Conneau et al., ACL 2020) and was adapted to written Swiss German via continued pre-training.
## Training Data
For continued pre-training, we used the following two datasets of written Swiss German:
- SwissCrawl (Linder et al., LREC 2020), a collection of Swiss German web text (forum discussions, social media).
- A custom dataset of Swiss German tweets.
In addition, we trained the model on an equal amount of Standard German data. We used news articles retrieved from Swissdox@LiRI.
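Continued pre-training keeps XLM-R's masked-language-modelling objective: a fraction of the input tokens (15% by default in RoBERTa-style training) is replaced with `<mask>`, and the model learns to reconstruct the originals. A toy sketch of that masking step in plain Python (illustrative only, not the actual training code; `mask_tokens` and the example sentence are assumptions for the demo):

```python
import random

def mask_tokens(tokens, mask_token="<mask>", mask_prob=0.15, seed=0):
    """Randomly hide tokens, as in masked-language-model pre-training.

    Returns the masked sequence and per-position labels: the original
    token where it was masked, None elsewhere (no loss at those positions).
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)   # model must predict the hidden token
            labels.append(tok)
        else:
            masked.append(tok)          # left unchanged, no training signal
            labels.append(None)
    return masked, labels

tokens = "I cha etz au Schwiizerdütsch".split()
masked, labels = mask_tokens(tokens, mask_prob=0.5, seed=1)
print(masked)
```

In real training the masking operates on subword tokens from the XLM-R tokenizer rather than whitespace-split words, and the masked positions are re-sampled each epoch.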
## License
Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).
## Citation

```bibtex
@inproceedings{vamvas-etal-2024-modular,
    title={Modular Adaptation of Multilingual Encoders to Written Swiss German Dialect},
    author={Jannis Vamvas and No{\"e}mi Aepli and Rico Sennrich},
    booktitle={First Workshop on Modular and Open Multilingual NLP},
    year={2024},
}
```