Gorka Urbizu Garmendia committed
Commit
80bff58
1 Parent(s): 7676be5

Update README.md

Files changed (1):
  1. README.md +3 -3
README.md CHANGED

@@ -13,7 +13,7 @@ This is a BERT model for Basque introduced in [BasqueGLUE: A Natural Language Un
 To train ElhBERTeu, we collected different corpora sources from several domains: updated (2021) national and local news sources, Basque Wikipedia, as well as novel news sources and texts from other domains, such as science (both academic and divulgative), literature or subtitles. More details about the corpora used and their sizes are shown in the following table. Texts from news sources were oversampled (duplicated) as done during the training of BERTeus. In total 575M tokens were used for pre-training ElhBERTeu.
 
 |Domain | Size |
-|:----------|----------|
+|-----------|----------|
 |News | 2 x 224M |
 |Wikipedia | 40M |
 |Science | 58M |
@@ -27,9 +27,9 @@ ElhBERTeu was trained following the design decisions for [BERTeus](https://huggi
 
 The model has been evaluated on the recently created BasqueGLUE NLU benchmark:
 
-| | AVG | NERC | F_intent | F_slot | BHTC | BEC | Vaxx | QNLI | WiC | coref |
+| Model | AVG | NERC | F_intent | F_slot | BHTC | BEC | Vaxx | QNLI | WiC | coref |
 |-----------|:-----:|:-----:|:---------:|:-------:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
-| Model | | F1 | F1 | F1 | F1 | F1 | MF1 | acc | acc | acc |
+| | | F1 | F1 | F1 | F1 | F1 | MF1 | acc | acc | acc |
 | BERTeus | 73.23 | 81.92 | 82.52 | 74.34 | 78.26 | 69.43 | 59.30 | 74.26 | 70.71 | 68.31 |
 | ElhBERTeu | 73.71 | 82.30 | 82.24 | 75.64 | 78.05 | 69.89 | 63.81 | 73.84 | 71.71 | 65.93 |
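For readers of the model card being edited above, here is a minimal usage sketch of the BERT model it describes, using the Hugging Face transformers library. The hub id `orai-nlp/ElhBERTeu` and the example sentence are illustrative assumptions, not taken from this commit; adjust the id to the actual repository path if it differs.

```python
# Minimal fill-mask sketch for ElhBERTeu (a BERT model for Basque).
# MODEL_ID is an assumed hub path, not stated in this commit.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_ID = "orai-nlp/ElhBERTeu"  # assumption; replace with the real repo id if needed

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

# Basque example: "Bilbo Euskal Herriko [MASK] bat da." ("Bilbao is a [MASK] of the Basque Country.")
text = f"Bilbo Euskal Herriko {tokenizer.mask_token} bat da."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and print the highest-scoring prediction.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```

The same checkpoint could also be loaded with `AutoModel` to extract contextual embeddings or fine-tuned on the BasqueGLUE tasks listed in the table above.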