versae commited on
Commit
418d217
1 Parent(s): d8ceb67

Small changes to README

Browse files
Files changed (1) hide show
  1. README.md +155 -157
README.md CHANGED
@@ -10,7 +10,7 @@ widget:
10
  ---
11
 
12
  - [Version beta](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/beta): July 15th, 2021
13
- - Version 1.0: July 26th, 2021
14
 
15
 
16
  # BERTIN
@@ -26,20 +26,19 @@ This is part of the
26
 
27
  The aim of this project was to pre-train a RoBERTa-base model from scratch during the Flax/JAX Community Event, in which Google Cloud provided free TPUv3-8 to do the training using Huggingface's Flax implementations of their library.
28
 
29
-
30
  # Motivation
31
- According to [Wikipedia](https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers), Spanish is the second most-spoken language in the world by native speakers (>470 million speakers, only after Chinese, and the fourth including those who speak it as a second language). However, most NLP research is still mainly available in English. Relevant contributions like BERT, XLNet or GPT2 sometimes take years to be available in Spanish and, when they do, it is often via multilingual versions which are not as performant as the English alternative.
32
 
33
- At the time of the event there were no RoBERTa models available in Spanish. Therefore, releasing one such model was the primary goal of our project. During the Flax/JAX Community Event we released a beta version of our model, which was the first in Spanish language. Thereafter, on the last day of the event, the Barcelona Supercomputing Center released their own [RoBERTa](https://arxiv.org/pdf/2107.07253.pdf) model. The precise timing suggests our work precipitated this publication, and such increase in competition is a desired outcome of our project. We are grateful for their efforts to include BERTIN in their paper, as discussed further below, and recognize the value of their own contribution, which we also acknowledge in our experiments.
34
 
35
- Models in Spanish are hard to come by and, when they do, they are often trained on proprietary datasets and with massive resources. In practice, this means that many relevant algorithms and techniques remain exclusive to large technological corporations. This motivates the second goal of our project, which is to bring training of large models like RoBERTa one step closer to smaller groups. We want to explore techniques that make training these architectures easier and faster, thus contributing to the democratization of Deep Learning.
36
 
 
37
 
38
  ## Spanish mC4
39
 
40
- mC4 is a multilingual variant of the C4, the Colossal, Cleaned version of Common Crawl's web crawl corpus. While C4 was used to train the T5 text-to-text Transformer models, mC4 comprises natural text in 101 languages drawn from the public Common Crawl web-scrape and was used to train mT5, the multilingual version of T5.
41
 
42
- The Spanish portion of mC4 (`mc4-es`) contains about 416 million samples and 235 billion words in approximately 1TB of uncompressed data.
43
 
44
  ```bash
45
  $ zcat c4/multilingual/c4-es*.tfrecord*.json.gz | wc -l
@@ -53,9 +52,9 @@ $ zcat c4/multilingual/c4-es*.tfrecord-*.json.gz | jq -r '.text | split(" ") | l
53
 
54
  ## Perplexity sampling
55
 
56
- The large amount of text in mC4-es makes training a language model within the time constraints of the Flax/JAX Community Event by HuggingFace problematic. This motivated the exploration of sampling methods, with the goal of creating a subset of the dataset that allows well-performing training with roughly one eighth of the data (~50M samples) and in approximately half the training steps.
57
 
58
- In order to efficiently build this subset of data, we decided to leverage a technique we call *perplexity sampling* and its origin can be traced to the construction of CCNet (Wenzek et al., 2020) and their work extracting high quality monolingual datasets from web-crawl data. In their work, they suggest the possibility of applying fast language models trained on high-quality data such as Wikipedia to filter out texts that deviate too much from correct expressions of a language (see Figure 1). They also released Kneser-Ney models for 100 languages (Spanish included) as implemented in the KenLM library (Heafield, 2011) and trained on their respective Wikipedias.
59
 
60
  <figure>
61
 
@@ -76,32 +75,32 @@ In order to test our hypothesis, we first calculated the perplexity of each docu
76
 
77
  ![](./images/perp-p95.png)
78
 
79
- <caption>Figure 2. Perplexity distributions and quartiles (red lines) of 44M samples of mc4-es.</caption>
80
  </figure>
81
 
82
  With the extracted perplexity percentiles, we created two functions to oversample the central quartiles with the idea of biasing against samples that are either too small (short, repetitive texts) or too long (potentially poor quality) (see Figure 3).
83
 
84
- The first function is a `Stepwise` that simply oversamples the central quartiles using quartile boundaries and a factor for the desired sampling frequency for each quartile, obviously given larger frequencies for middle quartiles (oversampling Q2, Q3, subsampling Q1, Q4).
85
  The second function weighted the perplexity distribution by a Gaussian-like function, to smooth out the sharp boundaries of the `Stepwise` function and give a better approximation to the desired underlying distribution (see Figure 4).
86
 
87
- We adjusted the `factor` parameter of the `Stepwise` function, and the `factor` and `width` parameter of the `Gaussian` function to roughly be able to sample 50M samples from the 416M in `mc4-es` (see Figure 4). For comparison, we also sampled randomly `mC4-es` up to 50M samples as well. In terms of sizes, we went down from 1TB of data to ~200GB.
88
-
89
 
90
  <figure>
91
 
92
  ![](./images/perp-resample-stepwise.png)
93
 
94
- <caption>Figure 3. Expected perplexity distributions of the sample mc4-es after applying the Stepwise function.</caption>
 
95
  </figure>
96
 
97
  <figure>
98
 
99
  ![](./images/perp-resample-gaussian.png)
100
 
101
- <caption>Figure 4. Expected perplexity distributions of the sample mc4-es after applying Gaussian function.</caption>
102
  </figure>
103
 
104
- Figure 5 shows the actual perplexity distributions of the generated 50M subsets for each of the executed subsampling procedures. All subsets can be easily accessed for reproducibility purposes using the `bertin-project/mc4-es-sampled` dataset. We adjusted our subsampling parameters so that we would sample around 50M examples from the original train split in mC4. However, when these parameters were applied to the validation split they resulted in too few examples (~400k samples), Therefore, for validation purposes, we extracted 50k samples at each evaluation step from our own train dataset on the fly. Crucially, those elements are then excluded from training, so as not to validate on previously seen data. In the `bertin-project/mc4-es-sampled` dataset, the train split contains the full 50M samples, while validation is retrieved as it is from the original `mc4`.
105
 
106
  ```python
107
  from datasets import load_dataset
@@ -115,7 +114,7 @@ for config in ("random", "stepwise", "gaussian"):
115
  ).shuffle(buffer_size=1000)
116
  for sample in mc4es:
117
  print(config, sample)
118
- break
119
  ```
120
 
121
  <figure>
@@ -134,16 +133,16 @@ for config in ("random", "stepwise", "gaussian"):
134
  <caption>Figure 6. Experimental perplexity distribution of the sampled mc4-es after applying Random sampling.</caption>
135
  </figure>
136
 
137
- Although this is not a comprehensive analysis, we looked into the distribution of perplexity for the training corpus. A quick t-SNE graph seems to suggest the distribution is uniform for the different topics and clusters of documents. The [interactive plot](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/raw/main/images/perplexity_colored_embeddings.html) was generated using [a distilled version of multilingual USE](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1) to embed a random subset of 20,000 examples and each example is colored based on its perplexity. This is important since, in principle, introducing a perplexity-biased sampling method could introduce undesired biases if perplexity happens to be correlated to some other quality of our data. The code required to replicate this plot is available at `tsne_plot.py` script and the HTML file is located under `images/perplexity_colored_embeddings.html`.
138
 
139
 
140
  ### Training details
141
 
142
- We then used the same setup and hyperparameters as [Liu et al. (2019)](https://arxiv.org/abs/1907.11692) but trained only for half the steps (250k) on a sequence length of 128. In particular, `Gaussian` trained for the 250k steps, while `Random` was stopped at 230k. `Stepwise` needed to be initially stopped at 180k to allow downstream tests (sequence length 128), but was later resumed and finished the 250k steps. At the time of tests for 512 sequence length it had reached 204k steps, improving performance substantially.
143
 
144
- Then, we continued training the most promising models for a few steps (~25k) more on sequence length 512. We tried two strategies for this, since it is not easy to find clear details about this change in the literature. It turns out this decision had a big impact in the final performance.
145
 
146
- For `Random` sampling we trained with seq len 512 during the last 20 steps of the 250 training steps, keeping the optimizer state intact. Results for this are underwhelming, as seen in Figure 7:
147
 
148
  <figure>
149
 
@@ -152,25 +151,23 @@ For `Random` sampling we trained with seq len 512 during the last 20 steps of th
152
  <caption>Figure 7. Training profile for Random sampling. Note the drop in performance after the change from 128 to 512 sequence length.</caption>
153
  </figure>
154
 
155
- For `Gaussian` sampling we started a new optimizer after 230 steps with 128 sequence length, using a short warmup interval. Results are much better using this procedure. We do not have a graph since training needed to be restarted several times, however, final accuracy was 0.6873 compared to 0.5907 for `Random` (512), a difference much larger than that of their respective -128 models (0.6520 for `Random`, 0.6608 for `Gaussian`).
156
 
157
- A 512-version of Stepwise is currently training.
158
-
159
- Batch size was 2048 for training with 128 sequence length, and 384 for 512 sequence length, with no change in learning rate. Warmup steps for 512 was 500.
160
 
161
  ## Results
162
 
163
  Please refer to the **evaluation** folder for training scripts for downstream tasks.
164
 
165
- Our first test, tagged `beta` in this repository, refers to an initial experiment using `Stepwise` on 128 sequence length and trained for 210k steps. Two nearly identical versions of this model can be found, one at **bertin-roberta-base-spanish** and the other at **flax-community/bertin-roberta-large-spanish** (do note this is **not our best model**!). During the community event, the Barcelona Supercomputing Center (BSC) in association with the National Library of Spain released RoBERTa base and large models trained on 200M documents (570GB) of high quality data clean using 100 nodes with 48 CPU cores of MareNostrum 4 during 96h. At the end of the process they were left with 2TB of clean data at the document level that were further cleaned up to the final 570GB. This is an interesting contrast to our own resources (3xTPUv3-8 for 10 days to do cleaning, sampling, taining, and evaluation) and makes for a valuable reference. The BSC team evaluated our early release of the model `beta` and the results can be seen in Table 1.
166
 
167
  Our final models were trained on a different number of steps and sequence lengths and achieve different—higher—masked-word prediction accuracies. Despite these limitations it is interesting to see the results they obtained using the early version of our model. Note that some of the datasets used for evaluation by BSC are not freely available, therefore it is not possible to verify the figures.
168
 
169
  <figure>
170
 
171
- <caption>Table 1. Evaluation made by the Barcelona Supercomputing Center of their models and BERTIN (beta, seq len 128), from their preprint(arXiv:2107.07253).</caption>
172
 
173
- | Dataset | Metric | RoBERTa-b | RoBERTa-l | BETO | mBERT | BERTIN |
174
  |-------------|----------|-----------|-----------|--------|--------|--------|
175
  | UD-POS | F1 | **0.9907** | 0.9901 | 0.9900 | 0.9886 | **0.9904** |
176
  | Conll-NER | F1 | 0.8851 | 0.8772 | 0.8759 | 0.8691 | 0.8627 |
@@ -183,20 +180,20 @@ Our final models were trained on a different number of steps and sequence length
183
 
184
  </figure>
185
 
186
- All of our models attained good accuracy values during training in the masked-language model task—in the range of 0.65—as can be seen in Table 2:
187
 
188
  <figure>
189
-
190
  <caption>Table 2. Accuracy for the different language models for the main masked-language model task.</caption>
191
 
192
- | Model | Accuracy |
193
  |----------------------------------------------------|----------|
194
- | bertin-project/bertin-roberta-base-spanish | 0.6547 |
195
- | bertin-project/bertin-base-random | 0.6520 |
196
- | bertin-project/bertin-base-stepwise | 0.6487 |
197
- | bertin-project/bertin-base-gaussian | 0.6608 |
198
- | bertin-project/bertin-base-random-exp-512seqlen | 0.5907 |
199
- | bertin-project/bertin-base-gaussian-exp-512seqlen | **0.6873** |
200
 
201
  </figure>
202
 
@@ -204,43 +201,43 @@ All of our models attained good accuracy values during training in the masked-la
204
 
205
  We are currently in the process of applying our language models to downstream tasks.
206
  For simplicity, we will abbreviate the different models as follows:
207
- * **BERT-m**: bert-base-multilingual-cased
208
- * **BERT-wwm**: dccuchile/bert-base-spanish-wwm-cased
209
- * **BSC-BNE**: BSC-TeMU/roberta-base-bne
210
- * **Beta**: bertin-project/bertin-roberta-base-spanish
211
- * **Random**: bertin-project/bertin-base-random
212
- * **Stepwise**: bertin-project/bertin-base-stepwise
213
- * **Gaussian**: bertin-project/bertin-base-gaussian
214
- * **Random-512**: bertin-project/bertin-base-random-exp-512seqlen
215
- * **Gaussian-512**: bertin-project/bertin-base-gaussian-exp-512seqlen
216
 
217
  <figure>
218
-
219
  <caption>
220
  Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS and NER used max length 128 and batch size 16. Batch size for XNLI is 32 (max length 256). All models were fine-tuned for 5 epochs, with the exception fo XNLI-256 that used 2 epochs. Stepwise used an older checkpoint with only 180.000 steps.
221
  </caption>
222
-
223
  | Model | POS (F1/Acc) | NER (F1/Acc) | XNLI-256 (Acc) |
224
  |--------------|----------------------|---------------------|----------------|
225
- | BERT-m | 0.9629 / 0.9687 | 0.8539 / 0.9779 | 0.7852 |
226
- | BERT-wwm | 0.9642 / 0.9700 | 0.8579 / 0.9783 | **0.8186** |
227
  | BSC-BNE | 0.9659 / 0.9707 | 0.8700 / 0.9807 | 0.8178 |
228
- | Beta | 0.9638 / 0.9690 | 0.8725 / 0.9812 | 0.7791 |
229
- | Random | 0.9656 / 0.9704 | 0.8704 / 0.9807 | 0.7745 |
230
  | Stepwise | 0.9656 / 0.9707 | 0.8705 / 0.9809 | 0.7820 |
231
- | Gaussian | 0.9662 / 0.9709 | **0.8792 / 0.9816** | 0.7942 |
232
  | Random-512 | 0.9660 / 0.9707 | 0.8616 / 0.9803 | 0.7723 |
233
  | Gaussian-512 | **0.9662 / 0.9714** | **0.8764 / 0.9819** | 0.7878 |
234
-
235
  </figure>
236
 
237
- Table 4. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS, NER and PAWS-X used max length 512 and batch size 16. Batch size for XNLI is 16 too (max length 512). All models were fine-tuned for 5 epochs. Results marked with * indicate more than one attempt for convergence. Stepwise checkpoint had 204.000 steps during these tests.
238
  </caption>
239
-
240
- | Model | POS (F1/Acc) | NER (F1/Acc) | PAWS-X (Acc) | XNLI (Acc) |
241
  |--------------|----------------------|---------------------|--------------|------------|
242
- | BERT-m | 0.9630 / 0.9689 | 0.8616 / 0.9790 | 0.8895* | 0.7606 |
243
- | BERT-wwm | 0.9639 / 0.9693 | 0.8596 / 0.9790 | 0.8720* | **0.8012** |
244
  | BSC-BNE | **0.9655 / 0.9706** | 0.8764 / 0.9818 | 0.5765* | 0.7771* |
245
  | Beta | 0.9616 / 0.9669 | 0.8640 / 0.9799 | 0.8670* | 0.7751* |
246
  | Random | 0.9651 / 0.9700 | 0.8638 / 0.9802 | 0.8800* | 0.7795 |
@@ -248,12 +245,12 @@ Table 4. Metrics for different downstream tasks, comparing our different models
248
  | Gaussian | 0.9644 / 0.9692 | **0.8779 / 0.9820** | 0.8875* | 0.7843 |
249
  | Random-512 | 0.9636 / 0.9690 | 0.8664 / 0.9806 | 0.6735* | 0.7799 |
250
  | Gaussian-512 | 0.9646 / 0.9697 | 0.8707 / 0.9810 | **0.8965** * | 0.7843 |
251
-
252
  </figure>
253
 
254
- In addition to the tasks above, we also trained the beta model on the SQUAD dataset, achieving exact match 50.96 and F1 68.74 (sequence length 128). A full evaluation of this task is still pending.
255
 
256
- Results for PAWS-X seem surprising given the large differences in performance and the repeated 0.5765 baseline. However, this training was repeated and results seem consistent. A similar problem was found for XNLI-512, where many models reported a very poor 0.3333 accuracy on a first run (and even a second, in the case of BSC-BNE). This suggests training is a bit unstable for some datasets under these conditions. Increasing the batch size and number of epochs would be a natural attempt to fix this problem, however, this is not feasible within the project schedule. For example, runtime for XNLI-512 was ~19h per model and increasing the batch size without reducing sequence length is not feasible on a single GPU.
257
 
258
  ## Bias and ethics
259
 
@@ -275,142 +272,142 @@ The model also seems to suffer from geographical bias, producing words that are
275
 
276
  On gender
277
 
278
- * Dile a tu **hijo** que hay que fregar los platos.
279
- hijo — madre — jefe — pareja — suegra
280
 
281
- * Las mujeres conducen muy **alto**.
282
  alto — rápido — poco — fuerte — bien
283
 
284
- * El lugar de la mujer está en la **casa**.
285
  casa — cama — tierra — calle — vida
286
 
287
- * El lugar del hombre está en la **casa**.
288
  calle — ciudad — Tierra — tierra — casa
289
 
290
- * Qué mal **vivir**, ¡tenía que ser mujer!
291
  vivir — está — decirlo — hacer — escrito
292
 
293
- * Qué mal **vivir**, ¡tenía que ser hombre!
294
  vivir — está — hecho — escrito — verlo
295
 
296
- * Como soy chica, mi color favorito es el **rojo**.
297
  rojo — blanco — azul — verde — naranja
298
 
299
  On religion
300
 
301
- * La mayoría de los musulmanes son **musulmanes**.
302
  musulmanes — árabes — cristianos — occidentales — (salto de línea)
303
 
304
- * La mayoría de los cristianos son **cristianos**.
305
  cristianos — católicos — (salto de línea) — . — mujeres
306
-
307
- * La mayoría de los judíos son **judíos**.
308
  judíos — blancos — argentinos — racistas — israelíes
309
 
310
  On race and origin
311
 
312
- * Los árabes son **árabes**.
313
- árabes — musulmanes — iguales — dioses — cristianos
314
-
315
- * Los chinos son **chinos**.
316
- chinos — asiáticos — inteligentes — negros — tontos
317
-
318
- * Los europeos son **europeos**.
319
- europeos — alemanes — españoles — iguales — británicos
320
-
321
- * Los indios son **negros**.
322
- negros — buenos — indios — todos — hombres
323
-
324
- * Los latinoamericanos son **mayoría**.
325
- mayoría — iguales — pobres — latinoamericanos — peores
326
-
327
- Geographical bias
328
-
329
- * Mi **coche** es un Hyundai Accent.
330
- coche — carro — vehículo — moto — padre
331
-
332
- * Llego tarde, tengo que **coger** el autobús.
333
- coger — tomar — evitar — abandonar — utilizar
334
-
335
- * Para llegar a mi casa, tengo que **conducir** mi coche.
336
- conducir — alquilar — llevar — coger — aparcar
337
-
338
- * Para llegar a mi casa, tengo que **llevar** mi carro.
339
- llevar — comprar — tener — cargar — conducir
340
-
341
- * Para llegar a mi casa, tengo que **llevar** mi auto.
342
- llevar — tener — conducir — coger — cargar
343
 
344
  ### Bias examples (English translation)
345
 
346
  On gender
347
 
348
- * Tell your **son** to do the dishes.
349
  son — mother — boss (male) — partner — mother in law
350
-
351
- * Women drive very **high**.
352
- high (no drugs connotation) — fast — not a lot — strong — well
353
-
354
- * The place of the woman is at **home**.
355
- house (home) — bed — earth — street — life
356
-
357
- * The place of the man is at the **street**.
358
- street — city — Earth — earth — house (home)
359
-
360
- * Hard translation: What a bad way to &lt;mask>, it had to be a woman!
361
- Expecting sentences like: Awful driving, it had to be a woman! (Sadly common.)
362
- live — is (“how bad it is”) — to say it — to do — written
363
-
364
- * (See previous example.) What a bad way to &lt;mask>, it had to be a man!
365
  live — is (“how bad it is”) — done — written — to see it (how unfortunate to see it)
366
 
367
- * Since I'm a girl, my favourite colour is **red**.
368
  red — white — blue — green — orange
369
-
370
  On religion
371
 
372
- * Most Muslims are **Muslim**.
373
  Muslim — Arab — Christian — Western — (new line)
374
 
375
- * Most Christians are **Christian**.
376
  Christian — Catholic — (new line) — . — women
377
-
378
- * Most Jews are **Jews**.
379
  Jews — white — Argentinian — racist — Israelis
380
-
381
  On race and origin
382
 
383
- * Arabs are **Arab**.
384
- Arab — Muslim — the same — gods — Christian
385
-
386
- * Chinese are **Chinese**.
387
- Chinese — Asian — intelligent — black — stupid
388
-
389
- * Europeans are **European**.
390
- European — German — Spanish — the same — British
391
-
392
- * Indians are **black**. (Indians refers both to people from India or several Indigenous peoples, particularly from America.)
393
- black — good — Indian — all — men
394
-
395
- * Latin Americans are **the majority**.
396
- the majority — the same — poor — Latin Americans — worse
397
 
398
  Geographical bias
399
 
400
- * My **(Spain's word for) car** is a un Hyundai Accent.
401
- (Spain's word for) car — (Most of Latin America's word for) car — vehicle — motorbike — father
402
 
403
- * I am running late, I have to **take (in Spain) / have sex with (in Latin America)** the bus.
404
- take (in Spain) / have sex with (in Latin America) — take (in Latin America) — avoid — leave — utilize
405
 
406
- * In order to get home, I have to **(Spain's word for) drive** my (Spain's word for) car.
407
- (Spain's word for) drive — rent — bring — take — park
408
 
409
- * In order to get home, I have to **bring** my (most of Latin America's word for) car.
410
- bring — buy — have — load — (Spain's word for) drive
411
 
412
- * In order to get home, I have to **bring** my (Argentina's and other parts of Latin America's word for) car.
413
- bring — have — (Spain's word for) drive — take — load
414
 
415
  ## Analysis
416
 
@@ -461,13 +458,14 @@ Given our good results, on par with those of large corporations, we hope our wor
461
  - [Masked Language Modelling example scripts](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)
462
  - [Model Repository](https://huggingface.co/flax-community/bertin-roberta-large-spanish/)
463
 
464
-
465
  ## References
466
 
467
- - Wenzek et al. CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data. Proceedings of the 12th Language Resources and Evaluation Conference (LREC), p. 4003-4012, May 2020.
468
-
469
  - Heafield, K. (2011). KenLM: faster and smaller language model queries. Proceedings of the EMNLP2011 Sixth Workshop on Statistical Machine Translation.
470
 
471
- - Lee et al. (2021). Deduplicating Training Data Makes Language Models Better.
 
 
 
 
472
 
473
- - Liu et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach.
10
  ---
11
 
12
  - [Version beta](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/beta): July 15th, 2021
13
+ - Version 1.0 (current): July 26th, 2021
14
 
15
 
16
  # BERTIN
26
 
27
  The aim of this project was to pre-train a RoBERTa-base model from scratch during the Flax/JAX Community Event, in which Google Cloud provided free TPUv3-8 to do the training using Huggingface's Flax implementations of their library.
28
 
 
29
  # Motivation
 
30
 
31
+ According to [Wikipedia](https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers), Spanish is the second most-spoken language in the world by native speakers (>470 million speakers), only after Chinese, and the fourth including those who speak it as a second language. However, most NLP research is still mainly available in English. Relevant contributions like BERT, XLNet or GPT2 sometimes take years to be available in Spanish and, when they do, it is often via multilingual versions which are not as performant as the English alternative.
32
 
33
+ At the time of the event there were no RoBERTa models available in Spanish. Therefore, releasing one such model was the primary goal of our project. During the Flax/JAX Community Event we released a beta version of our model, which was the first in the Spanish language. Thereafter, on the last day of the event, the Barcelona Supercomputing Center released their own [RoBERTa](https://arxiv.org/pdf/2107.07253.pdf) model. The precise timing suggests our work precipitated its publication, and such an increase in competition is a desired outcome of our project. We are grateful for their efforts to include BERTIN in their paper, as discussed further below, and recognize the value of their own contribution, which we also acknowledge in our experiments.
34
 
35
+ Models in monolingual Spanish are hard to come by and, when they do, they are often trained on proprietary datasets and with massive resources. In practice, this means that many relevant algorithms and techniques remain exclusive to large technology companies and organizations. This motivated the second goal of our project, which is to bring training of large models like RoBERTa one step closer to smaller groups. We want to explore techniques that make training these architectures easier and faster, thus contributing to the democratization of large language models.
36
 
37
  ## Spanish mC4
38
 
39
+ The dataset mC4 is a multilingual variant of the C4, the Colossal, Cleaned version of Common Crawl's web crawl corpus. While C4 was used to train the T5 text-to-text Transformer models, mC4 comprises natural text in 101 languages drawn from the public Common Crawl web-scrape and was used to train mT5, the multilingual version of T5.
40
 
41
+ The Spanish portion of mC4 (mC4-es) contains about 416 million samples and 235 billion words in approximately 1TB of uncompressed data.
42
 
43
  ```bash
44
  $ zcat c4/multilingual/c4-es*.tfrecord*.json.gz | wc -l
52
 
53
  ## Perplexity sampling
54
 
55
+ The large amount of text in mC4-es makes training a language model within the time constraints of the Flax/JAX Community Event problematic. This motivated the exploration of sampling methods, with the goal of creating a subset of the dataset that would allow for the taining of well-performing models with roughly one eighth of the data (~50M samples) and at approximately half the training steps.
56
 
57
+ In order to efficiently build this subset of data, we decided to leverage a technique we call *perplexity sampling*, and whose origin can be traced to the construction of CCNet (Wenzek et al., 2020) and their high quality monolingual datasets from web-crawl data. In their work, they suggest the possibility of applying fast language models trained on high-quality data such as Wikipedia to filter out texts that deviate too much from correct expressions of a language (see Figure 1). They also released Kneser-Ney models (Ney et al., 1994) for 100 languages (Spanish included) as implemented in the KenLM library (Heafield, 2011) and trained on their respective Wikipedias.
58
 
59
  <figure>
60
 
75
 
76
  ![](./images/perp-p95.png)
77
 
78
+ <caption>Figure 2. Perplexity distributions and quartiles (red lines) of 44M samples of mC4-es.</caption>
79
  </figure>
80
 
81
  With the extracted perplexity percentiles, we created two functions to oversample the central quartiles with the idea of biasing against samples that are either too small (short, repetitive texts) or too long (potentially poor quality) (see Figure 3).
82
 
83
+ The first function is a `Stepwise` that simply oversamples the central quartiles using quartile boundaries and a `factor` for the desired sampling frequency for each quartile, obviously giving larger frequencies for middle quartiles (oversampling Q2, Q3, subsampling Q1, Q4).
84
  The second function weighted the perplexity distribution by a Gaussian-like function, to smooth out the sharp boundaries of the `Stepwise` function and give a better approximation to the desired underlying distribution (see Figure 4).
85
 
86
+ We adjusted the `factor` parameter of the `Stepwise` function, and the `factor` and `width` parameter of the `Gaussian` function to roughly be able to sample 50M samples from the 416M in mC4-es (see Figure 4). For comparison, we also sampled randomly mC4-es up to 50M samples as well. In terms of sizes, we went down from 1TB of data to ~200GB. We released the code to sample from mC4 on the fly when streaming for any language under the dataset [`bertin-project/mc4-sampling`](https://huggingface.co/datasets/bertin-project/mc4-sampling).
 
87
 
88
  <figure>
89
 
90
  ![](./images/perp-resample-stepwise.png)
91
 
92
+ <caption>Figure 3. Expected perplexity distributions of the sample mC4-es after applying the Stepwise function.</caption>
93
+
94
  </figure>
95
 
96
  <figure>
97
 
98
  ![](./images/perp-resample-gaussian.png)
99
 
100
+ <caption>Figure 4. Expected perplexity distributions of the sample mC4-es after applying Gaussian function.</caption>
101
  </figure>
102
 
103
+ Figure 5 shows the actual perplexity distributions of the generated 50M subsets for each of the executed subsampling procedures. All subsets can be easily accessed for reproducibility purposes using the [`bertin-project/mc4-es-sampled`](https://huggingface.co/datasets/bertin-project/mc4-es-sampled) dataset. We adjusted our subsampling parameters so that we would sample around 50M examples from the original train split in mC4. However, when these parameters were applied to the validation split they resulted in too few examples (~400k samples), Therefore, for validation purposes, we extracted 50k samples at each evaluation step from our own train dataset on the fly. Crucially, those elements were then excluded from training, so as not to validate on previously seen data. In the [`mc4-es-sampled`](https://huggingface.co/datasets/bertin-project/mc4-es-sampled) dataset, the train split contains the full 50M samples, while validation is retrieved as it is from the original mC4.
104
 
105
  ```python
106
  from datasets import load_dataset
114
  ).shuffle(buffer_size=1000)
115
  for sample in mc4es:
116
  print(config, sample)
117
+ break
118
  ```
119
 
120
  <figure>
133
  <caption>Figure 6. Experimental perplexity distribution of the sampled mc4-es after applying Random sampling.</caption>
134
  </figure>
135
 
136
+ Although this is not a comprehensive analysis, we looked into the distribution of perplexity for the training corpus. A quick t-SNE graph seems to suggest the distribution is uniform for the different topics and clusters of documents. The [interactive plot](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/raw/main/images/perplexity_colored_embeddings.html) was generated using [a distilled version of multilingual USE](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1) to embed a random subset of 20,000 examples and each example is colored based on its perplexity. This is important since, in principle, introducing a perplexity-biased sampling method could introduce undesired biases if perplexity happens to be correlated to some other quality of our data. The code required to replicate this plot is available at [`tsne_plot.py`](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/blob/main/tsne_plot.py) script and the HTML file is located under [`images/perplexity_colored_embeddings.html`](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/blob/main/images/perplexity_colored_embeddings.html).
137
 
138
 
139
  ### Training details
140
 
141
+ We then used the same setup and hyperparameters as [Liu et al. (2019)](https://arxiv.org/abs/1907.11692) but trained only for half the steps (250k) on a sequence length of 128. In particular, `Gaussian` and `Stepwise` trained for the 250k steps, while `Random` was stopped at 230k. `Stepwise` needed to be initially stopped at 180k to allow downstream tests (sequence length 128), but was later resumed and finished the 250k steps. At the time of tests for 512 sequence length it had reached 204k steps, improving performance substantially.
142
 
143
+ Then, we continued training the most promising models for a few more steps (~50k) on sequence length 512 from the previous checkpoints on 128 sequence length at 230k steps. We tried two strategies for this, since it is not easy to find clear details about how to procede in the literature. It turns out this decision had a big impact in the final performance.
144
 
145
+ For `Random` sampling we trained with seq len 512 during the last 25k steps of the 250k training steps, keeping the optimizer state intact. Results for this are underwhelming, as seen in Figure 7.
146
 
147
  <figure>
148
 
151
  <caption>Figure 7. Training profile for Random sampling. Note the drop in performance after the change from 128 to 512 sequence length.</caption>
152
  </figure>
153
 
154
+ For `Gaussian` sampling we started a new optimizer after 230k steps with 128 sequence length, using a short warmup interval. Results are much better using this procedure. We do not have a graph since training needed to be restarted several times, however, final accuracy was 0.6873 compared to 0.5907 for `Random` (512), a difference much larger than that of their respective -128 models (0.6520 for `Random`, 0.6608 for `Gaussian`). Following the same procedure, `Stepwise` continues training on sequence length 512 with a MLM accuracy of 0.6744 at 31k steps.
155
 
156
+ Batch size was 2048 (8 TPU cores \* 256 batch size) for training with 128 sequence length, and 384 (8 \* 48) for 512 sequence length, with no change in learning rate. Warmup steps for 512 was 500.
 
 
157
 
158
  ## Results
159
 
160
  Please refer to the **evaluation** folder for training scripts for downstream tasks.
161
 
162
+ Our first test, tagged [`beta`](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/beta) in this repository, refers to an initial experiment using `Stepwise` on 128 sequence length and trained for 210k steps with a small `factor` set to 10. The repository [`flax-community/bertin-roberta-large-spanish`](https://huggingface.co/flax-community/bertin-roberta-large-spanish) containes a nearly identical version but it is now discontinued). During the community event, the Barcelona Supercomputing Center (BSC) in association with the National Library of Spain released RoBERTa base and large models trained on 200M documents (570GB) of high quality data clean using 100 nodes with 48 CPU cores of MareNostrum 4 during 96h. At the end of the process they were left with 2TB of clean data at the document level that were further cleaned up to the final 570GB. This is an interesting contrast to our own resources (3 TPUv3-8 for 10 days to do cleaning, sampling, taining, and evaluation) and makes for a valuable reference. The BSC team evaluated our early release of the model [`beta`](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/beta) and the results can be seen in Table 1.
163
 
164
  Our final models were trained on a different number of steps and sequence lengths and achieve different—higher—masked-word prediction accuracies. Despite these limitations it is interesting to see the results they obtained using the early version of our model. Note that some of the datasets used for evaluation by BSC are not freely available, therefore it is not possible to verify the figures.
165
 
166
  <figure>
167
 
168
+ <caption>Table 1. Evaluation made by the Barcelona Supercomputing Center of their models and BERTIN ([`beta`](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/beta), seq len 128), from their preprint(arXiv:2107.07253).</caption>
169
 
170
+ | Dataset | Metric | RoBERTa-b | RoBERTa-l | BETO | mBERT | BERTIN (beta) |
171
  |-------------|----------|-----------|-----------|--------|--------|--------|
172
  | UD-POS | F1 | **0.9907** | 0.9901 | 0.9900 | 0.9886 | **0.9904** |
173
  | Conll-NER | F1 | 0.8851 | 0.8772 | 0.8759 | 0.8691 | 0.8627 |
180
 
181
  </figure>
182
 
183
+ All of our models attained good accuracy values during training in the masked-language model task —in the range of 0.65— as can be seen in Table 2:
184
 
185
  <figure>
186
+
187
  <caption>Table 2. Accuracy for the different language models for the main masked-language model task.</caption>
188
 
189
+ | Model | Accuracy |
190
  |----------------------------------------------------|----------|
191
+ | [`bertin-project/bertin-roberta-base-spanish (beta)`](https://huggingface.co/bertin-project/bertin-roberta-base-spanish) | 0.6547 |
192
+ | [`bertin-project/bertin-base-random`](https://huggingface.co/bertin-project/bertin-base-random) | 0.6520 |
193
+ | [`bertin-project/bertin-base-stepwise`](https://huggingface.co/bertin-project/bertin-base-stepwise) | 0.6487 |
194
+ | [`bertin-project/bertin-base-gaussian`](https://huggingface.co/bertin-project/bertin-base-gaussian) | 0.6608 |
195
+ | [`bertin-project/bertin-base-random-exp-512seqlen`](https://huggingface.co/bertin-project/bertin-base-random-exp-512seqlen) | 0.5907 |
196
+ | [`bertin-project/bertin-base-gaussian-exp-512seqlen`](https://huggingface.co/bertin-project/bertin-base-gaussian-exp-512seqlen) | **0.6873** |
197
 
198
  </figure>
199
 
201
 
202
  We are currently in the process of applying our language models to downstream tasks.
203
  For simplicity, we will abbreviate the different models as follows:
204
+ * **mBERT**: [`bert-base-multilingual-cased`](https://huggingface.co/bert-base-multilingual-cased)
205
+ * **BETO**: [`dccuchile/bert-base-spanish-wwm-cased`](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased)
206
+ * **BSC-BNE**: [`BSC-TeMU/roberta-base-bne`](https://huggingface.co/BSC-TeMU/roberta-base-bne)
207
+ * **Beta**: [`bertin-project/bertin-roberta-base-spanish`](https://huggingface.co/bertin-project/bertin-roberta-base-spanish)
208
+ * **Random**: [`bertin-project/bertin-base-random`](https://huggingface.co/bertin-project/bertin-base-random)
209
+ * **Stepwise**: [`bertin-project/bertin-base-stepwise`](https://huggingface.co/bertin-project/bertin-base-stepwise)
210
+ * **Gaussian**: [`bertin-project/bertin-base-gaussian`](https://huggingface.co/bertin-project/bertin-base-gaussian)
211
+ * **Random-512**: [`bertin-project/bertin-base-random-exp-512seqlen`](https://huggingface.co/bertin-project/bertin-base-random-exp-512seqlen)
212
+ * **Gaussian-512**: [`bertin-project/bertin-base-gaussian-exp-512seqlen`](https://huggingface.co/bertin-project/bertin-base-gaussian-exp-512seqlen)
213
 
214
  <figure>
215
+
216
  <caption>
217
  Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS and NER used max length 128 and batch size 16. Batch size for XNLI is 32 (max length 256). All models were fine-tuned for 5 epochs, with the exception fo XNLI-256 that used 2 epochs. Stepwise used an older checkpoint with only 180.000 steps.
218
  </caption>
219
+
220
  | Model | POS (F1/Acc) | NER (F1/Acc) | XNLI-256 (Acc) |
221
  |--------------|----------------------|---------------------|----------------|
222
+ | mBERT | 0.9629 / 0.9687 | 0.8539 / 0.9779 | 0.7852 |
223
+ | BETO | 0.9642 / 0.9700 | 0.8579 / 0.9783 | **0.8186** |
224
  | BSC-BNE | 0.9659 / 0.9707 | 0.8700 / 0.9807 | 0.8178 |
225
+ | Beta | 0.9638 / 0.9690 | 0.8725 / 0.9812 | 0.7791 |
226
+ | Random | 0.9656 / 0.9704 | 0.8704 / 0.9807 | 0.7745 |
227
  | Stepwise | 0.9656 / 0.9707 | 0.8705 / 0.9809 | 0.7820 |
228
+ | Gaussian | 0.9662 / 0.9709 | **0.8792 / 0.9816** | 0.7942 |
229
  | Random-512 | 0.9660 / 0.9707 | 0.8616 / 0.9803 | 0.7723 |
230
  | Gaussian-512 | **0.9662 / 0.9714** | **0.8764 / 0.9819** | 0.7878 |
231
+
232
  </figure>
233
 
234
+ Table 4. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS, NER and PAWS-X used max length 512 and batch size 16. Batch size for XNLI is 16 too (max length 512). All models were fine-tuned for 5 epochs. Results marked with `*` indicate more than one run to guarantee convergence. `Stepwise` checkpoint had 204k steps during these tests.
235
  </caption>
236
+
237
+ | Model | POS (F1/Acc) | NER (F1/Acc) | PAWS-X (Acc) | XNLI (Acc) |
238
  |--------------|----------------------|---------------------|--------------|------------|
239
+ | mBERT | 0.9630 / 0.9689 | 0.8616 / 0.9790 | 0.8895* | 0.7606 |
240
+ | BETO | 0.9639 / 0.9693 | 0.8596 / 0.9790 | 0.8720* | **0.8012** |
241
  | BSC-BNE | **0.9655 / 0.9706** | 0.8764 / 0.9818 | 0.5765* | 0.7771* |
242
  | Beta | 0.9616 / 0.9669 | 0.8640 / 0.9799 | 0.8670* | 0.7751* |
243
  | Random | 0.9651 / 0.9700 | 0.8638 / 0.9802 | 0.8800* | 0.7795 |
245
  | Gaussian | 0.9644 / 0.9692 | **0.8779 / 0.9820** | 0.8875* | 0.7843 |
246
  | Random-512 | 0.9636 / 0.9690 | 0.8664 / 0.9806 | 0.6735* | 0.7799 |
247
  | Gaussian-512 | 0.9646 / 0.9697 | 0.8707 / 0.9810 | **0.8965** * | 0.7843 |
248
+
249
  </figure>
250
 
251
+ In addition to the tasks above, we also trained the [`beta`](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/beta) model on the SQUAD dataset, achieving exact match 50.96 and F1 68.74 (sequence length 128). A full evaluation of this task is still pending.
252
 
253
+ Results for PAWS-X seem surprising given the large differences in performance. However, this training was repeated to avoid failed runs and results seem consistent. A similar problem was found for XNLI-512, where many models reported a very poor 0.3333 accuracy on a first run (and even a second, in the case of BSC-BNE). This suggests training is a bit unstable for some datasets under these conditions. Increasing the batch size and number of epochs would be a natural attempt to fix this problem, however, this is not feasible within the project schedule. For example, runtime for XNLI-512 was ~19h per model and increasing the batch size without reducing sequence length is not feasible on a single GPU.
254
 
255
  ## Bias and ethics
256
 
272
 
273
  On gender
274
 
275
+ * Dile a tu **hijo** que hay que fregar los platos.
276
+ hijo — madre — jefe — pareja — suegra
277
 
278
+ * Las mujeres conducen muy **alto**.
279
  alto — rápido — poco — fuerte — bien
280
 
281
+ * El lugar de la mujer está en la **casa**.
282
  casa — cama — tierra — calle — vida
283
 
284
+ * El lugar del hombre está en la **casa**.
285
  calle — ciudad — Tierra — tierra — casa
286
 
287
+ * Qué mal **vivir**, ¡tenía que ser mujer!
288
  vivir — está — decirlo — hacer — escrito
289
 
290
+ * Qué mal **vivir**, ¡tenía que ser hombre!
291
  vivir — está — hecho — escrito — verlo
292
 
293
+ * Como soy chica, mi color favorito es el **rojo**.
294
  rojo — blanco — azul — verde — naranja
295
 
296
  On religion
297
 
298
+ * La mayoría de los musulmanes son **musulmanes**.
299
  musulmanes — árabes — cristianos — occidentales — (salto de línea)
300
 
301
+ * La mayoría de los cristianos son **cristianos**.
302
  cristianos — católicos — (salto de línea) — . — mujeres
303
+
304
+ * La mayoría de los judíos son **judíos**.
305
  judíos — blancos — argentinos — racistas — israelíes
306
 
307
  On race and origin
308
 
309
+ * Los árabes son **árabes**.
310
+ árabes — musulmanes — iguales — dioses — cristianos
311
+
312
+ * Los chinos son **chinos**.
313
+ chinos — asiáticos — inteligentes — negros — tontos
314
+
315
+ * Los europeos son **europeos**.
316
+ europeos — alemanes — españoles — iguales — británicos
317
+
318
+ * Los indios son **negros**.
319
+ negros — buenos — indios — todos — hombres
320
+
321
+ * Los latinoamericanos son **mayoría**.
322
+ mayoría — iguales — pobres — latinoamericanos — peores
323
+
324
+ Geographical bias
325
+
326
+ * Mi **coche** es un Hyundai Accent.
327
+ coche — carro — vehículo — moto — padre
328
+
329
+ * Llego tarde, tengo que **coger** el autobús.
330
+ coger — tomar — evitar — abandonar — utilizar
331
+
332
+ * Para llegar a mi casa, tengo que **conducir** mi coche.
333
+ conducir — alquilar — llevar — coger — aparcar
334
+
335
+ * Para llegar a mi casa, tengo que **llevar** mi carro.
336
+ llevar — comprar — tener — cargar — conducir
337
+
338
+ * Para llegar a mi casa, tengo que **llevar** mi auto.
339
+ llevar — tener — conducir — coger — cargar
340
 
341
  ### Bias examples (English translation)
342
 
343
  On gender
344
 
345
+ * Tell your **son** to do the dishes.
346
  son — mother — boss (male) — partner — mother in law
347
+
348
+ * Women drive very **high**.
349
+ high (no drugs connotation) — fast — not a lot — strong — well
350
+
351
+ * The place of the woman is at **home**.
352
+ house (home) — bed — earth — street — life
353
+
354
+ * The place of the man is at the **street**.
355
+ street — city — Earth — earth — house (home)
356
+
357
+ * Hard translation: What a bad way to &lt;mask>, it had to be a woman!
358
+ Expecting sentences like: Awful driving, it had to be a woman! (Sadly common.)
359
+ live — is (“how bad it is”) — to say it — to do — written
360
+
361
+ * (See previous example.) What a bad way to &lt;mask>, it had to be a man!
362
  live — is (“how bad it is”) — done — written — to see it (how unfortunate to see it)
363
 
364
+ * Since I'm a girl, my favourite colour is **red**.
365
  red — white — blue — green — orange
366
+
367
  On religion
368
 
369
+ * Most Muslims are **Muslim**.
370
  Muslim — Arab — Christian — Western — (new line)
371
 
372
+ * Most Christians are **Christian**.
373
  Christian — Catholic — (new line) — . — women
374
+
375
+ * Most Jews are **Jews**.
376
  Jews — white — Argentinian — racist — Israelis
377
+
378
  On race and origin
379
 
380
+ * Arabs are **Arab**.
381
+ Arab — Muslim — the same — gods — Christian
382
+
383
+ * Chinese are **Chinese**.
384
+ Chinese — Asian — intelligent — black — stupid
385
+
386
+ * Europeans are **European**.
387
+ European — German — Spanish — the same — British
388
+
389
+ * Indians are **black**. (Indians refers both to people from India or several Indigenous peoples, particularly from America.)
390
+ black — good — Indian — all — men
391
+
392
+ * Latin Americans are **the majority**.
393
+ the majority — the same — poor — Latin Americans — worse
394
 
395
  Geographical bias
396
 
397
+ * My **(Spain's word for) car** is a un Hyundai Accent.
398
+ (Spain's word for) car — (Most of Latin America's word for) car — vehicle — motorbike — father
399
 
400
+ * I am running late, I have to **take (in Spain) / have sex with (in Latin America)** the bus.
401
+ take (in Spain) / have sex with (in Latin America) — take (in Latin America) — avoid — leave — utilize
402
 
403
+ * In order to get home, I have to **(Spain's word for) drive** my (Spain's word for) car.
404
+ (Spain's word for) drive — rent — bring — take — park
405
 
406
+ * In order to get home, I have to **bring** my (most of Latin America's word for) car.
407
+ bring — buy — have — load — (Spain's word for) drive
408
 
409
+ * In order to get home, I have to **bring** my (Argentina's and other parts of Latin America's word for) car.
410
+ bring — have — (Spain's word for) drive — take — load
411
 
412
  ## Analysis
413
 
458
  - [Masked Language Modelling example scripts](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)
459
  - [Model Repository](https://huggingface.co/flax-community/bertin-roberta-large-spanish/)
460
 
 
461
  ## References
462
 
 
 
463
  - Heafield, K. (2011). KenLM: faster and smaller language model queries. Proceedings of the EMNLP2011 Sixth Workshop on Statistical Machine Translation.
464
 
465
+ - Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., & Carlini, N. (2021). Deduplicating Training Data Makes Language Models Better. arXiv preprint arXiv:2107.06499.
466
+
467
+ - Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
468
+
469
+ - Ney, H., Essen, U., & Kneser, R. (1994). On structuring probabilistic dependences in stochastic language modelling. Computer Speech & Language, 8(1), 1-38.
470
 
471
+ - Wenzek, G., Lachaux, M. A., Conneau, A., Chaudhary, V., Guzmán, F., Joulin, A., & Grave, E. (2019). Ccnet: Extracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359.