---
license: cc-by-4.0
language: eu
tags:
- bert
- basque
- euskara

---

# ElhBERTeu

This is a BERT model for Basque introduced in [BasqueGLUE: A Natural Language Understanding Benchmark for Basque](https://aclanthology.org/2022.lrec-1.172/).

To train ElhBERTeu, we collected corpora from several domains: updated (2021) national and local news sources, the Basque Wikipedia, new news sources, and texts from other domains such as science (both academic and popular), literature and subtitles. The corpora used and their sizes are shown in the table below. As was done when training BERTeus, texts from news sources were oversampled (duplicated). In total, 575M tokens were used to pre-train ElhBERTeu.

|Domain     | Size (tokens) |
|-----------|----------|
|News       | 2 x 224M |
|Wikipedia  | 40M      |
|Science    | 58M      |
|Literature | 24M      |
|Others     | 7M       |
|Total      | 575M     |

ElhBERTeu is a base-size, cased, monolingual BERT model for Basque, with a 50K vocabulary and 124M parameters in total.

A medium-size model is also available: [ElhBERTeu-medium](https://huggingface.co/orai-nlp/ElhBERTeu-medium)
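
The model can be loaded with the standard `transformers` APIs. The sketch below assumes the checkpoint is hosted as `orai-nlp/ElhBERTeu` (matching the organization of the medium-size variant linked above); adjust the identifier if it differs.

```python
# Minimal usage sketch: masked-token prediction with ElhBERTeu.
# The repository id "orai-nlp/ElhBERTeu" is an assumption; replace it with
# the actual checkpoint name if needed.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="orai-nlp/ElhBERTeu")

# "Euskara [MASK] hizkuntza da." -> "Basque is a [MASK] language."
for prediction in fill_mask("Euskara [MASK] hizkuntza da."):
    print(f"{prediction['token_str']!r}  {prediction['score']:.3f}")
```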

ElhBERTeu was trained following the design decisions of [BERTeus](https://huggingface.co/ixa-ehu/berteus-base-cased). The tokenizer and hyper-parameter settings remained the same (batch_size=256); the only difference is that the full pre-training of the model (1M steps) was performed with a sequence length of 512 on a v3-8 TPU.
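
As a point of reference, the architecture described above is a standard BERT-base shape with an enlarged 50K vocabulary; written out as a `transformers` configuration (a sketch, not the actual training setup), the parameter count comes out at roughly the reported 124M:

```python
# Sketch of the model shape described above: BERT-base defaults plus a 50K
# vocabulary and 512-token positions. Not the actual training configuration.
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=50000,             # 50K wordpiece vocabulary
    hidden_size=768,              # BERT-base dimensions
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,  # full pre-training at sequence length 512
)
model = BertForMaskedLM(config)
print(f"{model.num_parameters() / 1e6:.0f}M parameters")  # ~124M
```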

The model has been evaluated on the recently created [BasqueGLUE](https://github.com/Elhuyar/BasqueGLUE) NLU benchmark:

| Model     |  AVG      |  NERC     |  F_intent | F_slot  |  BHTC   |  BEC    |  Vaxx   |  QNLI   |  WiC    |  Coref  |
|-----------|:---------:|:---------:|:---------:|:-------:|:-------:|:-------:|:-------:|:-------:|:-------:|:-------:|
|           |           |   F1      |    F1     |   F1    |   F1    |   F1    |  MF1    |  acc    |  acc    |  acc    |
| BERTeus   |   73.23   |   81.92   | **82.52** |  74.34  |**78.26**|  69.43  |  59.30  |**74.26**|  70.71  |**68.31**|
| ElhBERTeu | **73.71** | **82.30** |   82.24   |**75.64**|  78.05  |**69.89**|**63.81**|  73.84  |**71.71**|  65.93  |
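
For downstream evaluation, the model is fine-tuned per task in the usual sequence-classification setup. A hedged sketch of such a run follows; the dataset identifier `orai-nlp/basqueGLUE`, the task/column names, and the hyper-parameters are assumptions, so check the BasqueGLUE repository for the actual ones.

```python
# Hedged fine-tuning sketch for a BasqueGLUE classification task (e.g. BEC).
# Identifiers, column names and hyper-parameters below are assumptions,
# not the settings used in the paper.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "orai-nlp/ElhBERTeu"                       # assumed checkpoint id
dataset = load_dataset("orai-nlp/basqueGLUE", "bec")  # assumed dataset id/task

tokenizer = AutoTokenizer.from_pretrained(model_id)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=dataset["train"].features["label"].num_classes
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="elhberteu-bec",
        per_device_train_batch_size=32,  # typical values, not the paper's
        learning_rate=3e-5,
        num_train_epochs=3,
    ),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
print(trainer.evaluate())
```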

If you use this model, please cite the following paper:

- G. Urbizu, I. San Vicente, X. Saralegi, R. Agerri, A. Soroa. BasqueGLUE: A Natural Language Understanding Benchmark for Basque. In Proceedings of the 13th Language Resources and Evaluation Conference (LREC 2022), June 2022, Marseille, France.

```bibtex
@InProceedings{urbizu2022basqueglue,
  author    = {Urbizu, Gorka and San Vicente, Iñaki and Saralegi, Xabier and Agerri, Rodrigo and Soroa, Aitor},
  title     = {BasqueGLUE: A Natural Language Understanding Benchmark for Basque},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference},
  month     = {June},
  year      = {2022},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {1603--1612},
  abstract  = {Natural Language Understanding (NLU) technology has improved significantly over the last few years and multitask benchmarks such as GLUE are key to evaluate this improvement in a robust and general way. These benchmarks take into account a wide and diverse set of NLU tasks that require some form of language understanding, beyond the detection of superficial, textual clues. However, they are costly to develop and language-dependent, and therefore they are only available for a small number of languages. In this paper, we present BasqueGLUE, the first NLU benchmark for Basque, a less-resourced language, which has been elaborated from previously existing datasets and following similar criteria to those used for the construction of GLUE and SuperGLUE. We also report the evaluation of two state-of-the-art language models for Basque on BasqueGLUE, thus providing a strong baseline to compare upon. BasqueGLUE is freely available under an open license.},
  url       = {https://aclanthology.org/2022.lrec-1.172}
}
```

License: CC BY 4.0