This model is an xlm-roberta-base model (Conneau et al., ACL 2020) adapted to written Swiss German via continued pre-training.
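Since the model keeps the xlm-roberta-base architecture and tokenizer, it can be used as a drop-in replacement in Hugging Face Transformers, e.g. for masked language modeling. A minimal sketch — the repository ID and the example sentence below are placeholders, not taken from this model card:

```python
from transformers import pipeline

# Placeholder: substitute the actual Hugging Face repository ID of this model.
model_id = "path/to/swiss-german-xlm-roberta-base"

# The standard fill-mask pipeline works because the model retains
# the xlm-roberta-base architecture and tokenizer.
fill_mask = pipeline("fill-mask", model=model_id)

# XLM-R uses <mask> as its mask token.
for prediction in fill_mask("Das isch es <mask> Buech."):
    print(prediction["token_str"], prediction["score"])
```

The same checkpoint can also be loaded with `AutoModel.from_pretrained(model_id)` to obtain contextual embeddings for downstream tasks.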

Training Data

For continued pre-training, we used the following two datasets of written Swiss German:

  1. SwissCrawl (Linder et al., LREC 2020), a collection of Swiss German web text (forum discussions, social media).
  2. A custom dataset of Swiss German tweets.

In addition, we trained the model on an equal amount of Standard German data: news articles retrieved from Swissdox@LiRI.

License

Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).

Citation

@inproceedings{vamvas-etal-2024-modular,
  title = {Modular Adaptation of Multilingual Encoders to Written Swiss German Dialect},
  author = {Jannis Vamvas and No{\"e}mi Aepli and Rico Sennrich},
  booktitle = {First Workshop on Modular and Open Multilingual NLP},
  year = {2024},
}
Model size: 278M parameters (F32, Safetensors)