---
license: cc-by-4.0
language: eu
tags:
- basque
- bert
---
# ElhBERTeu
This is a BERT model for Basque introduced in [BasqueGLUE: A Natural Language Understanding Benchmark for Basque]().
To train ElhBERTeu, we collected corpora from several domains: up-to-date (2021) national and local news sources, Basque Wikipedia, as well as new news sources and texts from other domains, such as science (both academic and popular), literature and subtitles. The following table lists the corpora used and their sizes. As in the training of BERTeus, texts from news sources were oversampled (duplicated). In total, 575M tokens were used to pre-train ElhBERTeu.
| Domain     | Size (tokens) |
|------------|---------------|
| News       | 2 x 224M      |
| Wikipedia  | 40M           |
| Science    | 58M           |
| Literature | 24M           |
| Others     | 7M            |
| Total      | 575M          |
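
The effect of oversampling on the corpus mix can be checked with a few lines of arithmetic. The sketch below copies the per-domain figures from the table and applies the 2x factor to news only; the variable names are illustrative. The per-domain figures sum to 577M, slightly above the reported 575M total, presumably because the individual sizes are rounded.

```python
# Token counts in millions, copied from the table above.
corpus_mtokens = {
    "News": 224,
    "Wikipedia": 40,
    "Science": 58,
    "Literature": 24,
    "Others": 7,
}

# News is the only oversampled domain: every news text is used twice.
oversampling = {"News": 2}

# Effective size of each domain after oversampling.
effective = {
    domain: size * oversampling.get(domain, 1)
    for domain, size in corpus_mtokens.items()
}

total = sum(effective.values())
news_share = effective["News"] / total

print(f"Effective total: {total}M tokens")
print(f"News share after oversampling: {news_share:.1%}")
```

After oversampling, news accounts for roughly three quarters of the pre-training tokens.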
ElhBERTeu is a base-sized, uncased, monolingual BERT model for Basque, with a vocabulary size of 50K.
ElhBERTeu was trained following the design decisions for [BERTeus](https://huggingface.co/ixa-ehu/berteus-base-cased). The tokenizer and the hyper-parameter settings remained the same; the only difference is that the full pre-training of the model (1M steps) was performed with a sequence length of 512 on a v3-8 TPU.
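
For readers who want to try the model, here is a minimal sketch of loading it for masked-token prediction with the Hugging Face `transformers` library. It assumes the library is installed and that the weights are available on the Hub. The repository id `orai-nlp/ElhBERTeu` is an assumption that this card does not state, so replace it with the id of the repository you are using. The helper names `load_fill_mask` and `mask_word` are illustrative.

```python
# Assumption: Hub repository id, not stated in this card; adjust as needed.
MODEL_ID = "orai-nlp/ElhBERTeu"


def load_fill_mask(model_id: str = MODEL_ID):
    """Build a fill-mask pipeline (downloads the weights on first call)."""
    # Imported lazily so that defining this helper does not require
    # transformers to be installed.
    from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForMaskedLM.from_pretrained(model_id)
    return pipeline("fill-mask", model=model, tokenizer=tokenizer)


def mask_word(sentence: str, word: str, mask_token: str = "[MASK]") -> str:
    """Replace the first occurrence of `word` with the mask token."""
    return sentence.replace(word, mask_token, 1)


# Example usage (requires network access to download the model):
#   fill = load_fill_mask()
#   text = mask_word("Euskara hizkuntza bat da.", "hizkuntza",
#                    fill.tokenizer.mask_token)
#   for pred in fill(text):
#       print(pred["token_str"], round(pred["score"], 3))
```

Reading the mask token from the loaded tokenizer, rather than hard-coding it, keeps the sketch valid if the tokenizer uses a different mask symbol.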
The model has been evaluated on the recently created BasqueGLUE NLU benchmark:
| Model     | AVG   | NERC (F1) | F_intent (F1) | F_slot (F1) | BHTC (F1) | BEC (F1) | Vaxx (MF1) | QNLI (acc) | WiC (acc) | coref (acc) |
|-----------|:-----:|:-----:|:---------:|:-------:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
| BERTeus | 73.23 | 81.92 | 82.52 | 74.34 | 78.26 | 69.43 | 59.30 | 74.26 | 70.71 | 68.31 |
| ElhBERTeu | 73.71 | 82.30 | 82.24 | 75.64 | 78.05 | 69.89 | 63.81 | 73.84 | 71.71 | 65.93 |
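
The AVG column is consistent with an unweighted mean of the nine per-task scores. The short check below recomputes it from the values in the table above; the dictionary layout is illustrative.

```python
from statistics import mean

# Per-task scores copied from the table above, in column order:
# NERC, F_intent, F_slot, BHTC, BEC, Vaxx, QNLI, WiC, coref.
scores = {
    "BERTeus": [81.92, 82.52, 74.34, 78.26, 69.43, 59.30, 74.26, 70.71, 68.31],
    "ElhBERTeu": [82.30, 82.24, 75.64, 78.05, 69.89, 63.81, 73.84, 71.71, 65.93],
}

# Unweighted mean over the nine tasks, rounded to two decimals as in the table.
averages = {model: round(mean(task_scores), 2) for model, task_scores in scores.items()}

for model, avg in averages.items():
    print(f"{model}: {avg:.2f}")
```

Because the average is unweighted, a large gain on a single task (for example Vaxx) moves it as much as the same gain on any other task.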
If you use this model, please cite the following paper:
- G. Urbizu, I. San Vicente, X. Saralegi, R. Agerri, A. Soroa. BasqueGLUE: A Natural Language Understanding Benchmark for Basque. In Proceedings of the 13th Language Resources and Evaluation Conference (LREC 2022). June 2022. Marseille, France.
@inproceedings{urbizu2022basqueglue,
title={BasqueGLUE: A Natural Language Understanding Benchmark for Basque},
author={Gorka Urbizu and I{\~n}aki San Vicente and Xabier Saralegi and Rodrigo Agerri and Aitor Soroa},
booktitle={Proceedings of the 13th Language Resources and Evaluation Conference},
year={2022}
}
License:
CC BY 4.0