---
license: cc-by-nc-4.0
language:
- gsw
- multilingual
widget:
- text: "I cha etz au Schwiizerdütsch. zäme! 😊"
---

The [**xlm-roberta-base**](https://huggingface.co/xlm-roberta-base) model ([Conneau et al., ACL 2020](https://aclanthology.org/2020.acl-main.747/)) trained on Swiss German text data via continued pre-training.

## Training Data

For continued pre-training, we used the following two datasets of written Swiss German:

1. [SwissCrawl](https://icosys.ch/swisscrawl) ([Linder et al., LREC 2020](https://aclanthology.org/2020.lrec-1.329)), a collection of Swiss German web text (forum discussions, social media).
2. A custom dataset of Swiss German tweets.

In addition, we trained the model on an equal amount of Standard German data. We used news articles retrieved from [Swissdox@LiRI](https://t.uzh.ch/1hI).

## License

Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).

## Citation

```bibtex
@inproceedings{vamvas-etal-2024-modular,
      title={Modular Adaptation of Multilingual Encoders to Written Swiss German Dialect},
      author={Jannis Vamvas and No{\"e}mi Aepli and Rico Sennrich},
      booktitle={First Workshop on Modular and Open Multilingual NLP},
      year={2024},
}
```
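
## Usage

Since the model is a continued pre-training of xlm-roberta-base, it can presumably be queried as a masked language model. The following is a minimal sketch, not an official recipe. It assumes the `transformers` library with its `fill-mask` pipeline, and it assumes the mask token is `<mask>`, as in xlm-roberta-base. The helper names `mask_word` and `top_predictions` are hypothetical and introduced only for illustration. This card does not state the model's repository ID, so `model_id` is left as a parameter.

```python
def mask_word(text: str, word: str, mask_token: str = "<mask>") -> str:
    """Replace the first occurrence of `word` in `text` with the mask token.

    Assumption: `<mask>` is the mask token, as in xlm-roberta-base.
    """
    if word not in text:
        raise ValueError(f"{word!r} does not occur in {text!r}")
    return text.replace(word, mask_token, 1)


def top_predictions(model_id: str, masked_text: str, top_k: int = 5):
    """Return (token, score) pairs predicted for the masked position.

    `model_id` is the Hugging Face repository ID of the model to load;
    it is not given in this card and must be supplied by the caller.
    """
    # Imported lazily so that mask_word works without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_id)
    predictions = fill_mask(masked_text, top_k=top_k)
    return [(p["token_str"], p["score"]) for p in predictions]
```

To try it, build a masked sentence with `mask_word("I cha etz au Schwiizerdütsch.", "Schwiizerdütsch")` and pass the result to `top_predictions` together with the repository ID of this model. Passing `"xlm-roberta-base"` instead gives a baseline for comparison.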