Fix typos and grammar
README.md
# Motivation
According to [Wikipedia](https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers), Spanish is the second most-spoken language in the world by number of native speakers (>470 million, behind only Chinese) and the fourth when second-language speakers are included. However, most NLP research is still published mainly in English. Relevant contributions like BERT, XLNet or GPT-2 sometimes take years to become available in Spanish and, when they do, it is often via multilingual versions that are not as performant as their English counterparts.
At the time of the event there were no RoBERTa models available in Spanish. Therefore, releasing one such model was the primary goal of our project. During the Flax/JAX Community Event we released a beta version of our model, which was the first in the Spanish language. Then, on the last day of the event, the Barcelona Supercomputing Center released their own [RoBERTa](https://arxiv.org/pdf/2107.07253.pdf) model. The precise timing suggests our work precipitated this publication, and such an increase in competition is a desired outcome of our project. We are grateful for their efforts to include BERTIN in their paper, as discussed further below, and recognize the value of their contribution, which we also acknowledge in our experiments.
Models in Spanish are hard to come by and, when they are available, they are often trained on proprietary datasets and with massive resources. In practice, this means that many relevant algorithms and techniques remain exclusive to large technology corporations. This motivates the second goal of our project: to bring the training of large models like RoBERTa one step closer to smaller groups. We want to explore techniques that make training these architectures easier and faster, thus contributing to the democratization of Deep Learning.
## Spanish mC4
The large amount of text in mC4-es makes it problematic to train a language model within the time constraints of the Flax/JAX Community Event organized by HuggingFace. This motivated the exploration of sampling methods, with the goal of creating a subset of the dataset that allows training a well-performing model with roughly one eighth of the data (~50M samples) and in approximately half the training steps.
In order to build this subset of data efficiently, we decided to leverage a technique we call *perplexity sampling*, whose origin can be traced to the construction of CCNet (Wenzek et al., 2020) and their work on extracting high-quality monolingual datasets from web-crawl data. In their work, they suggest the possibility of applying fast language models trained on high-quality data such as Wikipedia to filter out texts that deviate too much from correct expressions of a language (see Figure 1). They also released Kneser-Ney models for 100 languages (Spanish included), as implemented in the KenLM library (Heafield, 2011) and trained on their respective Wikipedias.
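
As a rough illustration of the filtering idea (a minimal sketch, not our actual pipeline code), the snippet below scores a document with a Kneser-Ney model through the `kenlm` Python bindings. The model path `es.arpa.bin` is a placeholder for a locally downloaded Spanish model, and the threshold logic is only indicative.

```python
# Sketch of perplexity-based filtering, assuming the `kenlm` Python bindings
# are installed and `es.arpa.bin` is a locally downloaded Spanish Kneser-Ney
# model (the path is a placeholder, not the project's actual file name).
import kenlm

model = kenlm.Model("es.arpa.bin")

def perplexity(text: str) -> float:
    """Word-level perplexity of `text` under the KenLM model."""
    words = text.split()
    # model.score returns the total log10 probability of the sentence,
    # including the end-of-sentence token.
    log10_prob = model.score(" ".join(words), bos=True, eos=True)
    return 10.0 ** (-log10_prob / (len(words) + 1))

# Documents whose perplexity is far above that of clean, Wikipedia-like text
# deviate from "correct" Spanish and can be filtered out or down-weighted
# when sampling the training subset.
doc = "Los modelos de lenguaje necesitan grandes cantidades de texto para entrenarse."
print(perplexity(doc))
```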
<figure>