File size: 2,055 Bytes
d988382
02513c3
d988382
 
 
 
 
 
3f4b8d4
d988382
 
af3cb2c
d988382
e14e482
af3cb2c
 
 
 
d988382
 
 
 
 
 
 
 
 
 
 
 
 
 
af3cb2c
d988382
02513c3
 
 
 
 
 
af3cb2c
 
d988382
af3cb2c
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
---
language: es
license: CC-BY 4.0
tags:
- spanish
- roberta
pipeline_tag: fill-mask
widget:
- text: "Fui a la librería a comprar un <mask>."
---

# BERTIN

BERTIN is a series of BERT-based models for Spanish. This one is a RoBERTa-large model trained from scratch on the Spanish portion of mC4 using [Flax](https://github.com/google/flax), including training scripts.

This is part of the
[Flax/Jax Community Week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104), organised by [HuggingFace](https://huggingface.co/) and TPU usage sponsored by Google.

## Spanish mC4

The Spanish portion of mC4 containes about 416 million records and 235 billion words.

```bash
$ zcat c4/multilingual/c4-es*.tfrecord*.json.gz | wc -l
416057992
```

```bash
$ zcat c4/multilingual/c4-es*.tfrecord-*.json.gz | jq -r '.text | split(" ") | length' | paste -s -d+ - | bc
235303687795
```

## Team members

- Javier de la Rosa ([versae](https://huggingface.co/versae))
- Eduardo González ([edugp](https://huggingface.co/edugp))
- Paulo Villegas ([paulo](https://huggingface.co/paulo))
- Pablo González de Prado ([Pablogps](https://huggingface.co/Pablogps))
- Manu Romero ([mrm8488](https://huggingface.co/))
- María Grandury ([mariagrandury](https://huggingface.co/))

## Useful links

- [Community Week timeline](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104#summary-timeline-calendar-6)
- [Community Week README](https://github.com/huggingface/transformers/blob/master/examples/research_projects/jax-projects/README.md)
- [Community Week thread](https://discuss.huggingface.co/t/bertin-pretrain-roberta-large-from-scratch-in-spanish/7125)
- [Community Week channel](https://discord.com/channels/858019234139602994/859113060068229190)
- [Masked Language Modelling example scripts](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)
- [Model Repository](https://huggingface.co/flax-community/bertin-roberta-large-spanish/)