---
language:
- es
license: apache-2.0
tags:
- "national library of spain"
- "spanish"
- "bne"
- "gpt2-large-bne"
widget:
- text: "El modelo del lenguaje GPT es capaz de"
- text: "La Biblioteca Nacional de España es una entidad pública y sus fines son"
---

# GPT2-large trained with data from National Library of Spain (BNE)

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Model Description](#model-description)
- [Intended Uses and Limitations](#intended-uses-and-limitations)
- [How to Use](#how-to-use)
- [Limitations and bias](#limitations-and-bias)
- [Training corpora and preprocessing](#training-corpora-and-preprocessing)
- [Tokenization and pre-training](#tokenization-and-pre-training)
- [Citation Information](#citing)
- [Licensing Information](#licensing-information)
- [Copyright](#copyright)
- [Funding](#funding)
- [Disclaimer](#disclaimer)

</details>

## Model Description
|
45 |
+
**GPT2-large-bne** is a transformer-based model for the Spanish language. It is based on the [GPT-2](http://www.persagen.com/files/misc/radford2019language.pdf) model and has been pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work, compiled from the web crawlings performed by the [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) from 2009 to 2019.
|
46 |
+
|
47 |
+
|
## Intended Uses and Limitations

You can use the raw model for text generation or fine-tune it to a downstream task; a fine-tuning sketch follows the examples below.

### How to Use

You can use this model directly with a pipeline for text generation. Since generation relies on some randomness, we set a seed for reproducibility:

```python
>>> from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed
>>> tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
>>> model = AutoModelForCausalLM.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
>>> generator = pipeline('text-generation', tokenizer=tokenizer, model=model)
>>> set_seed(42)
>>> generator("La Biblioteca Nacional de España es una entidad pública y sus fines son", num_return_sequences=5)

[{'generated_text': 'La Biblioteca Nacional de España es una entidad pública y sus fines son servir como herramienta básica en la difusión de la cultura. '},
{'generated_text': 'La Biblioteca Nacional de España es una entidad pública y sus fines son el desarrollo de la educación, la cultura y el conocimiento, promoviendo actividades a través de Internet con la información que recibe del acceso a los fondos que en ella se almacenan. '},
{'generated_text': 'La Biblioteca Nacional de España es una entidad pública y sus fines son la publicación y difusión cultural. '},
{'generated_text': 'La Biblioteca Nacional de España es una entidad pública y sus fines son preservar y difundir los fondos y colecciones de la Biblioteca Nacional, así como servir de punto de encuentro para toda la comunidad científica, la academia y para la sociedad civil. '},
{'generated_text': 'La Biblioteca Nacional de España es una entidad pública y sus fines son la conservación, estudio y difusión del Patrimonio Bibliográfico en cualquiera de sus formas así como la formación y perfeccionamiento de los especialistas e investigadores en el campo de la información y de las bibliotecas.'}]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
>>> from transformers import AutoTokenizer, GPT2Model
>>> tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
>>> model = GPT2Model.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
>>> text = "La Biblioteca Nacional de España es una entidad pública y sus fines son"
>>> encoded_input = tokenizer(text, return_tensors='pt')
>>> output = model(**encoded_input)
>>> print(output.last_hidden_state.shape)
torch.Size([1, 14, 1280])
```

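Beyond direct generation, the model can be fine-tuned on a downstream corpus. The following is a minimal sketch of causal language model fine-tuning with the 🤗 Transformers `Trainer`, assuming a plain-text training file; the file name `train.txt` and all hyperparameters are illustrative placeholders, not settings used by the authors:

```python
# A minimal fine-tuning sketch; "train.txt" and the hyperparameters below
# are placeholders, not the authors' settings.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
model = AutoModelForCausalLM.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

dataset = load_dataset("text", data_files={"train": "train.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-large-bne-finetuned",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=tokenized["train"],
    # mlm=False gives the autoregressive (causal LM) objective
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
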
### Limitations and bias

The training data used for this model has not been released as a dataset one can browse. We know it contains a lot of unfiltered content from the internet, which is far from neutral. Here are examples of how the model can produce biased predictions:

```python
>>> from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed
>>> tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
>>> model = AutoModelForCausalLM.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
>>> generator = pipeline('text-generation', tokenizer=tokenizer, model=model)
>>> set_seed(42)
>>> generator("El hombre se dedica a", num_return_sequences=5)
[{'generated_text': 'El hombre se dedica a comprar móviles a sus padres, pero les paga por ellos y luego les devuelve la pasta a ella. '},
{'generated_text': 'El hombre se dedica a la venta ambulante ilegal en la zona de la Alameda, con puestos del rastro callejero o de supermercados a los que luego roba. '},
{'generated_text': 'El hombre se dedica a la venta ambulante en el Paseo de Melilla. '},
{'generated_text': 'El hombre se dedica a los tatuajes y los dibujos en el cuerpo con su apariencia física y no da a basto en las tareas domésticas. '},
{'generated_text': 'El hombre se dedica a la caza indiscriminada de animales. '}]

>>> set_seed(42)
>>> generator("La mujer se dedica a", num_return_sequences=5)
[{'generated_text': 'La mujer se dedica a comprar móviles a sus padres, pero les paga por ellos y luego no paga la factura." '},
{'generated_text': 'La mujer se dedica a la venta ambulante y su pareja vende cupones en el mercadillo navideño. '},
{'generated_text': 'La mujer se dedica a la venta al por mayor de perfumes, cosmética, complementos, y otros bienes de consumo. '},
{'generated_text': 'La mujer se dedica a los servicios sexuales y se aprovecha de los servicios religiosos. '},
{'generated_text': 'La mujer se dedica a la prostitución y tiene dos hijas del matrimonio y la propia familia de la víctima. '}]
```

## Training corpora and preprocessing
The [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) crawls all .es domains once a year. The training corpus consists of 59TB of WARC files from these crawls, carried out from 2009 to 2019.

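For readers unfamiliar with WARC, the snippet below is a minimal sketch of iterating over the records of a web-archive file with the `warcio` library; the file name is a placeholder, and this only illustrates the input format, not the actual preprocessing pipeline used to build this corpus:

```python
# A minimal sketch of reading a WARC file with warcio; "crawl.warc.gz" is a
# placeholder name. This illustrates the input format only.
from warcio.archiveiterator import ArchiveIterator

with open("crawl.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":  # skip request/metadata records
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()
            print(url, len(html))
```
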
## Tokenization and pre-training
The training corpus has been tokenized using the byte-level version of Byte-Pair Encoding (BPE) used in the original [GPT-2](http://www.persagen.com/files/misc/radford2019language.pdf) model, with a vocabulary size of 50,262 tokens. The GPT2-large-bne pre-training follows the autoregressive language modeling approach of GPT-2. Training lasted a total of 10 days on 32 computing nodes, each with 4 NVIDIA V100 GPUs with 16GB of VRAM.

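As a quick sanity check of the statement above, you can load the tokenizer and inspect its vocabulary size; the expected count in the comment is taken from this model card:

```python
# Sanity check: the loaded tokenizer should report the vocabulary size
# stated above (50,262 tokens, per this model card).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
print(len(tokenizer))  # expected: 50262

# Byte-level BPE never falls back to an <unk> token: any Spanish string
# can be encoded and decoded losslessly.
ids = tokenizer("Biblioteca Nacional de España").input_ids
print(tokenizer.decode(ids))  # "Biblioteca Nacional de España"
```
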
## Citing
If you use this model, please cite our [paper](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405):
```