gonzalez-agirre committed
Commit 1abca30
1 Parent(s): 55c25fb

Update README.md

Files changed (1)
README.md +8 -5
README.md CHANGED
````diff
@@ -15,13 +15,17 @@ tags:
 
 - "gpt2-base-bne"
 
+datasets:
+
+- "bne"
+
 widget:
 - text: "El modelo del lenguaje GPT es capaz de"
 - text: "La Biblioteca Nacional de España es una entidad pública y sus fines son"
 
 ---
 
-# GPT2-base (gpt2-base-bne) trained with data from National Library of Spain (BNE)
+# GPT2-base (gpt2-base-bne) trained with data from the National Library of Spain (BNE)
 
 ## Table of Contents
 <details>
@@ -48,7 +52,7 @@ widget:
 
 ## Overview
 
-- **Architecture:** gpt2-base-bne
+- **Architecture:** gpt2-base
 - **Language:** Spanish
 - **Task:** text-generation
 - **Data:** BNE
@@ -96,8 +100,7 @@ torch.Size([1, 14, 768])
 
 ## Limitations and bias
 
-The training data used for this model has not been released as a dataset one can browse. We know it contains a lot of
-unfiltered content from the internet, which is far from neutral. Here's an example of how the model can have biased predictions:
+At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated. Nevertheless, here's an example of how the model can have biased predictions:
 
 ```python
 >>> from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed
@@ -128,7 +131,7 @@ unfiltered content from the internet, which is far from neutral. Here's an examp
 
 The [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) crawls all .es domains once a year. The training corpus consists of 59TB of WARC files from these crawls, carried out from 2009 to 2019.
 
-To obtain a high-quality training corpus, the corpus has been preprocessed with a pipeline of operations, including among the others, sentence splitting, language detection, filtering of bad-formed sentences and deduplication of repetitive contents. During the process document boundaries are kept. This resulted into 2TB of Spanish clean corpus. Further global deduplication among the corpus is applied, resulting into 570GB of text.
+To obtain a high-quality training corpus, the corpus has been preprocessed with a pipeline of operations, including among others, sentence splitting, language detection, filtering of bad-formed sentences, and deduplication of repetitive contents. During the process, document boundaries are kept. This resulted in 2TB of Spanish clean corpus. Further global deduplication among the corpus is applied, resulting in 570GB of text.
 
 Some of the statistics of the corpus:
 
````
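The Limitations and bias text rewritten in the hunk above refers to a `transformers` example that the hunk context only begins to show. A minimal text-generation sketch in the same spirit follows; it assumes the model is published on the Hugging Face Hub as `PlanTL-GOB-ES/gpt2-base-bne` (an identifier not stated in this diff) and uses the standard `pipeline` API rather than reproducing the README's exact snippet.

```python
# Illustrative sketch only: generate text with gpt2-base-bne via the standard
# transformers pipeline API. The Hub identifier below is an assumption; the diff
# only names the model "gpt2-base-bne".
from transformers import pipeline, set_seed

MODEL_ID = "PlanTL-GOB-ES/gpt2-base-bne"  # assumed Hub id; adjust if different

set_seed(42)  # make sampling reproducible
generator = pipeline("text-generation", model=MODEL_ID)

# One of the widget prompts from the model card's front matter.
prompt = "La Biblioteca Nacional de España es una entidad pública y sus fines son"
for output in generator(prompt, max_length=50, num_return_sequences=3, do_sample=True):
    print(output["generated_text"])
```

Re-running the snippet with different seeds, or with the other widget prompt, is one informal way to inspect the kind of biased completions the updated card warns about.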
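The training-data paragraph updated in the last hunk names the cleaning steps only at a high level: sentence splitting, language detection, filtering of badly formed sentences, and global deduplication with document boundaries kept. The sketch below is a hypothetical, standard-library illustration of such a pipeline; the heuristics, thresholds, and function names are placeholders, not the preprocessing code actually used for the BNE corpus.

```python
# Hypothetical sketch of a corpus-cleaning pipeline of the kind described in the
# model card (sentence splitting, language filtering, malformed-sentence filtering,
# global deduplication). NOT the actual BNE preprocessing code.
import hashlib
import re

def split_sentences(document: str) -> list[str]:
    # Naive sentence splitter on ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]

def looks_spanish(sentence: str) -> bool:
    # Placeholder check: a real pipeline would use a language-identification model.
    spanish_markers = (" el ", " la ", " de ", " que ", " y ")
    padded = f" {sentence.lower()} "
    return any(marker in padded for marker in spanish_markers)

def well_formed(sentence: str) -> bool:
    # Drop very short fragments and lines that are mostly non-alphabetic.
    alpha = sum(ch.isalpha() for ch in sentence)
    return len(sentence.split()) >= 3 and alpha / max(len(sentence), 1) > 0.5

def clean_corpus(documents: list[str]) -> list[str]:
    seen_hashes: set[str] = set()   # for global deduplication across documents
    cleaned_docs: list[str] = []
    for doc in documents:
        kept = [s for s in split_sentences(doc) if looks_spanish(s) and well_formed(s)]
        if not kept:
            continue
        cleaned = " ".join(kept)    # document boundaries are kept: one entry per document
        digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        if digest in seen_hashes:   # skip exact duplicates of a whole cleaned document
            continue
        seen_hashes.add(digest)
        cleaned_docs.append(cleaned)
    return cleaned_docs

if __name__ == "__main__":
    docs = [
        "La Biblioteca Nacional de España conserva el patrimonio bibliográfico. ???",
        "La Biblioteca Nacional de España conserva el patrimonio bibliográfico. ???",
    ]
    print(clean_corpus(docs))  # duplicates collapse to a single cleaned document
```

A production pipeline would replace the marker-word heuristic with a proper language-identification model and would typically handle near-duplicates as well as exact ones; the sketch only shows where each of the named steps fits.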