joanllop committed
Commit f23e259
1 Parent(s): 2ce8fbb

Update README.md

Files changed (1): README.md (+97, -8)
README.md CHANGED
@@ -1,22 +1,114 @@
---
language:
- es
license: apache-2.0
tags:
- "national library of spain"
- "spanish"
- "bne"
- datasets:
- - "bne"
- metrics:
- - "ppl"

---

# GPT2-large trained with data from National Library of Spain (BNE)

## Model Description
- GPT2-large-bne is a transformer-based model for the Spanish language. It is based on the [GPT-2](http://www.persagen.com/files/misc/radford2019language.pdf) model and has been pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work, compiled from the web crawls performed by the [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) from 2009 to 2019.

## Training corpora and preprocessing
The [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) crawls all .es domains once a year. The training corpus consists of 59TB of WARC files from these crawls, carried out from 2009 to 2019.
@@ -32,9 +124,6 @@ Some of the statistics of the corpus:
## Tokenization and pre-training
The training corpus has been tokenized using the byte-level version of Byte-Pair Encoding (BPE) used in the original [GPT-2](http://www.persagen.com/files/misc/radford2019language.pdf) model, with a vocabulary size of 50,262 tokens. The GPT2-large-bne pre-training consists of autoregressive language model training following the approach of GPT-2. The training lasted a total of 10 days on 32 computing nodes, each with 4 NVIDIA V100 GPUs with 16GB of VRAM.

- ## Evaluation and results
- For evaluation details visit our [GitHub repository](https://github.com/PlanTL-GOB-ES/lm-spanish).
-
## Citing
If you use this model, please cite our [paper](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405):
```
 
---
language:
+
- es
+
license: apache-2.0
+
tags:
+
- "national library of spain"
+
- "spanish"
+
- "bne"
+
+ - "gpt2-large-bne"
+
+ widget:
+ - text: "El modelo del lenguaje GPT es capaz de"
+ - text: "La Biblioteca Nacional de España es una entidad pública y sus fines son"

---

# GPT2-large trained with data from National Library of Spain (BNE)

+ ## Table of Contents
+ <details>
+ <summary>Click to expand</summary>
+
+ - [Model Description](#model-description)
+ - [Intended Uses and Limitations](#intended-uses-and-limitations)
+ - [How to Use](#how-to-use)
+ - [Limitations and bias](#limitations-and-bias)
+ - [Training corpora and preprocessing](#training-corpora-and-preprocessing)
+ - [Tokenization and pre-training](#tokenization-and-pre-training)
+ - [Citation Information](#citing)
+ - [Licensing Information](#licensing-information)
+ - [Copyright](#copyright)
+ - [Funding](#funding)
+ - [Disclaimer](#disclaimer)
+
+ </details>
+
## Model Description
+ **GPT2-large-bne** is a transformer-based model for the Spanish language. It is based on the [GPT-2](http://www.persagen.com/files/misc/radford2019language.pdf) model and has been pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work, compiled from the web crawls performed by the [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) from 2009 to 2019.
+
+ ## Intended Uses and Limitations
+
+ You can use the raw model for text generation or fine-tune it to a downstream task (a rough fine-tuning sketch follows the usage examples below).
+
+ ### How to Use
+
+ You can use this model directly with a pipeline for text generation. Since generation relies on some randomness, we set a seed for reproducibility:
+
+ ```python
+ >>> from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed
+ >>> tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
+ >>> model = AutoModelForCausalLM.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
+ >>> generator = pipeline('text-generation', tokenizer=tokenizer, model=model)
+ >>> set_seed(42)
+ >>> generator("La Biblioteca Nacional de España es una entidad pública y sus fines son", num_return_sequences=5)
+
+ [{'generated_text': 'La Biblioteca Nacional de España es una entidad pública y sus fines son servir como herramienta básica en la difusión de la cultura. '},
+ {'generated_text': 'La Biblioteca Nacional de España es una entidad pública y sus fines son el desarrollo de la educación, la cultura y el conocimiento, promoviendo actividades a través de Internet con la información que recibe del acceso a los fondos que en ella se almacenan. '},
+ {'generated_text': 'La Biblioteca Nacional de España es una entidad pública y sus fines son la publicación y difusión cultural. '},
+ {'generated_text': 'La Biblioteca Nacional de España es una entidad pública y sus fines son preservar y difundir los fondos y colecciones de la Biblioteca Nacional, así como servir de punto de encuentro para toda la comunidad científica, la academia y para la sociedad civil. '},
+ {'generated_text': 'La Biblioteca Nacional de España es una entidad pública y sus fines son la conservación, estudio y difusión del Patrimonio Bibliográfico en cualquiera de sus formas así como la formación y perfeccionamiento de los especialistas e investigadores en el campo de la información y de las bibliotecas.'}]
+ ```
+
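The text-generation pipeline forwards generation keyword arguments to the underlying `generate()` call, so sampling can be tuned. A minimal sketch in the same style; the parameter values below are illustrative choices, not settings recommended by the model card:

```python
>>> from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed
>>> tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
>>> model = AutoModelForCausalLM.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
>>> generator = pipeline('text-generation', tokenizer=tokenizer, model=model)
>>> set_seed(42)
>>> # Illustrative sampling settings: cap the output length and sample from the 50 most likely tokens.
>>> generator("El modelo del lenguaje GPT es capaz de", max_length=50, do_sample=True, top_k=50, num_return_sequences=3)
```
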
+ Here is how to use this model to get the features of a given text in PyTorch:
+
+ ```python
+ >>> from transformers import AutoTokenizer, GPT2Model
+ >>> tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
+ >>> model = GPT2Model.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
+ >>> text = "La Biblioteca Nacional de España es una entidad pública y sus fines son"
+ >>> encoded_input = tokenizer(text, return_tensors='pt')
+ >>> output = model(**encoded_input)
+ >>> print(output.last_hidden_state.shape)
+ torch.Size([1, 14, 1280])
+ ```
+
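The model card does not prescribe a fine-tuning recipe. Purely as a rough sketch of causal language model fine-tuning with the `transformers` `Trainer`, assuming a hypothetical line-per-example text file `train.txt` and illustrative hyperparameters:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
model = AutoModelForCausalLM.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")

# GPT-2 tokenizers define no padding token by default; reuse the end-of-text token.
tokenizer.pad_token = tokenizer.eos_token

# Hypothetical corpus: one training example per line in train.txt.
dataset = load_dataset("text", data_files={"train": "train.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# For causal (non-masked) language modelling the collator builds the labels from the inputs.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="gpt2-large-bne-finetuned",   # illustrative output directory
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=5e-5,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```

GPT2-large has roughly 774M parameters, so fine-tuning it generally requires a GPU with substantial memory; the small per-device batch size with gradient accumulation above reflects that.
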
+ ### Limitations and bias
+
+ The training data used for this model has not been released as a dataset one can browse. We know it contains a lot of unfiltered content from the internet, which is far from neutral. Here is an example of how the model can produce biased predictions:
+
+ ```python
+ >>> from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed
+ >>> tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
+ >>> model = AutoModelForCausalLM.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
+ >>> generator = pipeline('text-generation', tokenizer=tokenizer, model=model)
+ >>> set_seed(42)
+ >>> generator("El hombre se dedica a", num_return_sequences=5)
+ [{'generated_text': 'El hombre se dedica a comprar móviles a sus padres, pero les paga por ellos y luego les devuelve la pasta a ella. '},
+ {'generated_text': 'El hombre se dedica a la venta ambulante ilegal en la zona de la Alameda, con puestos del rastro callejero o de supermercados a los que luego roba. '},
+ {'generated_text': 'El hombre se dedica a la venta ambulante en el Paseo de Melilla. '},
+ {'generated_text': 'El hombre se dedica a los tatuajes y los dibujos en el cuerpo con su apariencia física y no da a basto en las tareas domésticas. '},
+ {'generated_text': 'El hombre se dedica a la caza indiscriminada de animales. '}]
+
+ >>> set_seed(42)
+ >>> generator("La mujer se dedica a", num_return_sequences=5)
+ [{'generated_text': 'La mujer se dedica a comprar móviles a sus padres, pero les paga por ellos y luego no paga la factura." '},
+ {'generated_text': 'La mujer se dedica a la venta ambulante y su pareja vende cupones en el mercadillo navideño. '},
+ {'generated_text': 'La mujer se dedica a la venta al por mayor de perfumes, cosmética, complementos, y otros bienes de consumo. '},
+ {'generated_text': 'La mujer se dedica a los servicios sexuales y se aprovecha de los servicios religiosos. '},
+ {'generated_text': 'La mujer se dedica a la prostitución y tiene dos hijas del matrimonio y la propia familia de la víctima. '}]
+ ```

## Training corpora and preprocessing
The [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) crawls all .es domains once a year. The training corpus consists of 59TB of WARC files from these crawls, carried out from 2009 to 2019.
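WARC (Web ARChive) is the standard container format for web crawls: each file holds a sequence of records such as archived HTTP responses. Purely as an illustration of the format, and not the preprocessing pipeline used for this model, WARC records can be read with the `warcio` library:

```python
# Illustration only: this is NOT the preprocessing pipeline used to build the BNE corpus.
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.gz", "rb") as stream:        # hypothetical crawl file
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":            # archived HTTP responses
            url = record.rec_headers.get_header("WARC-Target-URI")
            payload = record.content_stream().read() # raw payload, typically HTML
            print(url, len(payload))
```
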
 
## Tokenization and pre-training
The training corpus has been tokenized using the byte-level version of Byte-Pair Encoding (BPE) used in the original [GPT-2](http://www.persagen.com/files/misc/radford2019language.pdf) model, with a vocabulary size of 50,262 tokens. The GPT2-large-bne pre-training consists of autoregressive language model training following the approach of GPT-2. The training lasted a total of 10 days on 32 computing nodes, each with 4 NVIDIA V100 GPUs with 16GB of VRAM.
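The released tokenizer can be inspected directly to check the vocabulary size and the byte-level behaviour; a small sketch (the example sentence is arbitrary):

```python
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
>>> # Expected to match the reported vocabulary size of 50,262 tokens.
>>> print(len(tokenizer))
>>> # Byte-level BPE works on raw bytes, so accented Spanish text is split into
>>> # subword tokens without ever producing an unknown-token symbol.
>>> print(tokenizer.tokenize("La Biblioteca Nacional de España es una entidad pública"))
```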
 
 
 
 
## Citing
If you use this model, please cite our [paper](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405):
```