---
language:
- es
license: apache-2.0
tags:
- "national library of spain"
- "spanish"
- "bne"
- "gpt2-large-bne"
widget:
- text: "El modelo del lenguaje GPT es capaz de"
- text: "La Biblioteca Nacional de España es una entidad pública y sus fines son"
---

# GPT2-large trained with data from National Library of Spain (BNE)

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Model Description](#model-description)
- [Intended Uses and Limitations](#intended-uses-and-limitations)
- [How to Use](#how-to-use)
- [Limitations and bias](#limitations-and-bias)
- [Training corpora and preprocessing](#training-corpora-and-preprocessing)
- [Tokenization and pre-training](#tokenization-and-pre-training)
- [Citation Information](#citing)
- [Licensing Information](#licensing-information)
- [Copyright](#copyright)
- [Funding](#funding)
- [Disclaimer](#disclaimer)

</details>

## Model Description
|
45 |
+
**GPT2-large-bne** is a transformer-based model for the Spanish language. It is based on the [GPT-2](http://www.persagen.com/files/misc/radford2019language.pdf) model and has been pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work, compiled from the web crawlings performed by the [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) from 2009 to 2019.
|
46 |
+
|
47 |
+
|
## Intended Uses and Limitations

You can use the raw model for text generation or fine-tune it to a downstream task; a fine-tuning sketch follows the examples below.

### How to Use

You can use this model directly with a pipeline for text generation. Since generation relies on some randomness, we set a seed for reproducibility:

```python
>>> from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed
>>> tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
>>> model = AutoModelForCausalLM.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
>>> generator = pipeline('text-generation', tokenizer=tokenizer, model=model)
>>> set_seed(42)
>>> generator("La Biblioteca Nacional de España es una entidad pública y sus fines son", num_return_sequences=5)

[{'generated_text': 'La Biblioteca Nacional de España es una entidad pública y sus fines son servir como herramienta básica en la difusión de la cultura. '},
{'generated_text': 'La Biblioteca Nacional de España es una entidad pública y sus fines son el desarrollo de la educación, la cultura y el conocimiento, promoviendo actividades a través de Internet con la información que recibe del acceso a los fondos que en ella se almacenan. '},
{'generated_text': 'La Biblioteca Nacional de España es una entidad pública y sus fines son la publicación y difusión cultural. '},
{'generated_text': 'La Biblioteca Nacional de España es una entidad pública y sus fines son preservar y difundir los fondos y colecciones de la Biblioteca Nacional, así como servir de punto de encuentro para toda la comunidad científica, la academia y para la sociedad civil. '},
{'generated_text': 'La Biblioteca Nacional de España es una entidad pública y sus fines son la conservación, estudio y difusión del Patrimonio Bibliográfico en cualquiera de sus formas así como la formación y perfeccionamiento de los especialistas e investigadores en el campo de la información y de las bibliotecas.'}]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
>>> from transformers import AutoTokenizer, GPT2Model
>>> tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
>>> model = GPT2Model.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
>>> text = "La Biblioteca Nacional de España es una entidad pública y sus fines son"
>>> encoded_input = tokenizer(text, return_tensors='pt')
>>> output = model(**encoded_input)
>>> print(output.last_hidden_state.shape)
torch.Size([1, 14, 1280])
```

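Beyond direct generation, the model can be fine-tuned on a downstream corpus. The following is a minimal sketch of causal language model fine-tuning with the 🤗 Transformers `Trainer`, assuming a plain-text training file; the file name `train.txt` and all hyperparameters are illustrative placeholders, not settings used by the authors:

```python
# A minimal fine-tuning sketch; "train.txt" and the hyperparameters below
# are placeholders, not the authors' settings.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
model = AutoModelForCausalLM.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

dataset = load_dataset("text", data_files={"train": "train.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-large-bne-finetuned",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=tokenized["train"],
    # mlm=False gives the autoregressive (causal LM) objective
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
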
### Limitations and bias

The training data used for this model has not been released as a dataset one can browse. We know it contains a lot of unfiltered content from the internet, which is far from neutral. Here are examples of how the model can produce biased predictions:

```python
>>> from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed
>>> tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
>>> model = AutoModelForCausalLM.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
>>> generator = pipeline('text-generation', tokenizer=tokenizer, model=model)
>>> set_seed(42)
>>> generator("El hombre se dedica a", num_return_sequences=5)
[{'generated_text': 'El hombre se dedica a comprar móviles a sus padres, pero les paga por ellos y luego les devuelve la pasta a ella. '},
{'generated_text': 'El hombre se dedica a la venta ambulante ilegal en la zona de la Alameda, con puestos del rastro callejero o de supermercados a los que luego roba. '},
{'generated_text': 'El hombre se dedica a la venta ambulante en el Paseo de Melilla. '},
{'generated_text': 'El hombre se dedica a los tatuajes y los dibujos en el cuerpo con su apariencia física y no da a basto en las tareas domésticas. '},
{'generated_text': 'El hombre se dedica a la caza indiscriminada de animales. '}]

>>> set_seed(42)
>>> generator("La mujer se dedica a", num_return_sequences=5)
[{'generated_text': 'La mujer se dedica a comprar móviles a sus padres, pero les paga por ellos y luego no paga la factura." '},
{'generated_text': 'La mujer se dedica a la venta ambulante y su pareja vende cupones en el mercadillo navideño. '},
{'generated_text': 'La mujer se dedica a la venta al por mayor de perfumes, cosmética, complementos, y otros bienes de consumo. '},
{'generated_text': 'La mujer se dedica a los servicios sexuales y se aprovecha de los servicios religiosos. '},
{'generated_text': 'La mujer se dedica a la prostitución y tiene dos hijas del matrimonio y la propia familia de la víctima. '}]
```

## Training corpora and preprocessing
The [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) crawls all .es domains once a year. The training corpus consists of 59TB of WARC files from these crawls, carried out from 2009 to 2019.

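For readers unfamiliar with WARC, the snippet below is a minimal sketch of iterating over the records of a web-archive file with the `warcio` library; the file name is a placeholder, and this only illustrates the input format, not the actual preprocessing pipeline used to build this corpus:

```python
# A minimal sketch of reading a WARC file with warcio; "crawl.warc.gz" is a
# placeholder name. This illustrates the input format only.
from warcio.archiveiterator import ArchiveIterator

with open("crawl.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":  # skip request/metadata records
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()
            print(url, len(html))
```
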
## Tokenization and pre-training
The training corpus has been tokenized using the byte-level version of Byte-Pair Encoding (BPE) used in the original [GPT-2](http://www.persagen.com/files/misc/radford2019language.pdf) model, with a vocabulary size of 50,262 tokens. The GPT2-large-bne pre-training follows the autoregressive language modeling approach of GPT-2. Training lasted a total of 10 days on 32 computing nodes, each with 4 NVIDIA V100 GPUs with 16GB of VRAM.

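As a quick sanity check of the statement above, you can load the tokenizer and inspect its vocabulary size; the expected count in the comment is taken from this model card:

```python
# Sanity check: the loaded tokenizer should report the vocabulary size
# stated above (50,262 tokens, per this model card).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
print(len(tokenizer))  # expected: 50262

# Byte-level BPE never falls back to an <unk> token: any Spanish string
# can be encoded and decoded losslessly.
ids = tokenizer("Biblioteca Nacional de España").input_ids
print(tokenizer.decode(ids))  # "Biblioteca Nacional de España"
```
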
## Citing
If you use this model, please cite our [paper](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405):
```