Pablogps committed on
Commit 984e69d
1 Parent(s): a701d85

Update README.md

Files changed (1): README.md +13 -1
README.md CHANGED
@@ -15,7 +15,7 @@ widget:
  # Motivation
  According to [Wikipedia](https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers), Spanish is the second most-spoken language in the world by native speakers (>470 million speakers, second only to Chinese, and the fourth including those who speak it as a second language). However, most NLP research is still mainly available in English. Relevant contributions like BERT, XLNet or GPT2 sometimes take years to become available in Spanish and, when they do, it is often via multilingual versions that are not as performant as their English counterparts.
 
- At the time of the event there were no RoBERTa models available in Spanish. Therefore, releasing one such model was the primary goal of our project. During the Flax/JAX Community Event we released a beta version of our model, which was the first for the Spanish language. Then, on the last day of the event, the Barcelona Supercomputing Center released their own [RoBERTa] (https://arxiv.org/pdf/2107.07253.pdf) model. The precise timing suggests our work precipitated this publication, and such an increase in competition is a desired outcome of our project. We are grateful for their efforts to include BERTIN in their paper, as discussed further below, and we recognize the value of their contribution, which we also acknowledge in our experiments.
+ At the time of the event there were no RoBERTa models available in Spanish. Therefore, releasing one such model was the primary goal of our project. During the Flax/JAX Community Event we released a beta version of our model, which was the first for the Spanish language. Then, on the last day of the event, the Barcelona Supercomputing Center released their own [RoBERTa](https://arxiv.org/pdf/2107.07253.pdf) model. The precise timing suggests our work precipitated this publication, and such an increase in competition is a desired outcome of our project. We are grateful for their efforts to include BERTIN in their paper, as discussed further below, and we recognize the value of their contribution, which we also acknowledge in our experiments.
 
  Models in Spanish are hard to come by and, when they are available, they are often trained on proprietary datasets and with massive resources. In practice, this means that many relevant algorithms and techniques remain exclusive to large technological corporations. This motivates the second goal of our project, which is to bring the training of large models like RoBERTa one step closer to smaller groups. We want to explore techniques that make training these architectures easier and faster, thus contributing to the democratization of Deep Learning.
 
@@ -190,6 +190,18 @@ All of our models attained good accuracy values, in the range of 0.65, as can be
  We are currently in the process of applying our language models to downstream tasks.
 
 
+ <figure>
+
+ <caption>Table x.</caption>
+
+ | Dataset | Metric | BERT-m | BERT-wwm | BSC-BNE | Beta | Random | Stepwise | Gaussian | Random-512 | Gaussian-512 |
+ |---------|--------|--------|----------|---------|------|--------|----------|----------|------------|--------------|
+ | CoNLL 2002-POS | F1 | 0.9629 | 0.9642 | 0.9659 | 0.9638 | 0.9656 | 0.9656 | **0.9662** | 0.9660 | **0.9662** |
+ | CoNLL 2002-POS | F1 | 0.9687 | 0.9700 | 0.9707 | 0.9690 | 0.9704 | 0.9707 | 0.9709 | 0.9707 | **0.9714** |
+
+ </figure>
+
+
  ## SQUAD-es
  Using sequence length 128 we have achieved exact match 50.96 and F1 68.74.
 
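The SQuAD-es exact-match and F1 figures quoted in the diff are the standard SQuAD metrics. As a rough, hedged illustration (not the authors' actual evaluation script), the sketch below shows how such numbers can be computed with 🤗 Transformers and the `squad` metric; the fine-tuned checkpoint name is hypothetical, and the `squad_es` dataset ID, its `v1.1.0` config and the small evaluation subset are assumptions made only to keep the example cheap.

```python
# Hedged sketch: computing SQuAD-style exact match / F1 on SQuAD-es.
# The model ID below is hypothetical; the dataset ID "squad_es" and its
# "v1.1.0" config are assumptions, not taken from the commit above.
from datasets import load_dataset
from evaluate import load as load_metric
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="bertin-project/bertin-base-squad-es",  # hypothetical fine-tuned checkpoint
)

squad_es = load_dataset("squad_es", "v1.1.0", split="validation")
metric = load_metric("squad")

predictions, references = [], []
for example in squad_es.select(range(100)):  # small subset to keep the sketch cheap
    answer = qa(
        question=example["question"],
        context=example["context"],
        max_seq_len=128,  # the sequence length mentioned in the README
    )
    predictions.append({"id": example["id"], "prediction_text": answer["answer"]})
    references.append({"id": example["id"], "answers": example["answers"]})

# Returns a dict like {"exact_match": ..., "f1": ...}
print(metric.compute(predictions=predictions, references=references))
```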
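The CoNLL 2002 rows in the table added by this commit come from fine-tuning the pre-trained checkpoints for token classification. The sketch below is only a minimal illustration of how such a POS-tagging fine-tune could be set up with 🤗 Transformers; the model ID, the `conll2002` dataset ID, the 128-token limit and all hyperparameters are assumptions rather than the authors' actual configuration.

```python
# Hedged sketch: fine-tuning a Spanish RoBERTa checkpoint for POS tagging on CoNLL 2002.
# The model ID, dataset ID and hyperparameters are assumptions, not the authors' setup.
from datasets import load_dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

MODEL_ID = "bertin-project/bertin-roberta-base-spanish"  # assumed checkpoint name

dataset = load_dataset("conll2002", "es")  # Spanish split with word-level POS tags
label_names = dataset["train"].features["pos_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID, num_labels=len(label_names))

def tokenize_and_align(batch):
    # Tokenize pre-split words and align word-level tags to sub-word tokens,
    # labelling only the first sub-word of each word and masking the rest with -100.
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True, max_length=128)
    labels = []
    for i, tags in enumerate(batch["pos_tags"]):
        previous = None
        row = []
        for word_id in enc.word_ids(batch_index=i):
            row.append(-100 if word_id is None or word_id == previous else tags[word_id])
            previous = word_id
        labels.append(row)
    enc["labels"] = labels
    return enc

tokenized = dataset.map(tokenize_and_align, batched=True, remove_columns=dataset["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bertin-pos-es", learning_rate=3e-5,
                           per_device_train_batch_size=16, num_train_epochs=3),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
# Reproducing table-style F1 scores would additionally require a compute_metrics
# function (e.g. seqeval) passed to the Trainer; it is omitted here for brevity.
```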