berteus-base-cased / README.md
julien-c's picture
julien-c HF staff
Migrate model card from transformers-repo
02977b6
metadata
language: eu

BERTeus base cased

This is the Basque language pretrained model presented in Give your Text Representation Models some Love: the Case for Basque. This model has been trained on a Basque corpus comprising Basque crawled news articles from online newspapers and the Basque Wikipedia. The training corpus contains 224.6 million tokens, of which 35 million come from the Wikipedia.

BERTeus has been tested on four different downstream tasks for Basque: part-of-speech (POS) tagging, named entity recognition (NER), sentiment analysis and topic classification; improving the state of the art for all tasks. See summary of results below:

Downstream task BERTeus mBERT Previous SOTA
Topic Classification 76.77 68.42 63.00
Sentiment 78.10 71.02 74.02
POS 97.76 96.37 96.10
NER 87.06 81.52 76.72

If using this model, please cite the following paper:

@inproceedings{agerri2020give,
  title={Give your Text Representation Models some Love: the Case for Basque},
  author={Rodrigo Agerri and I{\~n}aki San Vicente and Jon Ander Campos and Ander Barrena and Xabier Saralegi and Aitor Soroa and Eneko Agirre},
  booktitle={Proceedings of the 12th International Conference on Language Resources and Evaluation},
  year={2020}
}