mapama247 committed
Commit 45201c6
Parent: 32d2a9f

Update README.md

Files changed (1):
1. README.md (+32 -20)
README.md CHANGED
@@ -81,9 +81,9 @@ pipeline_tag: text-generation
 - [Training procedure](#training-procedure)
 - [Additional information](#additional-information)
   - [Author](#author)
-  - [Contact information](#contact-information)
+  - [Contact](#contact)
   - [Copyright](#copyright)
-  - [Licensing information](#licensing-information)
+  - [License](#license)
   - [Funding](#funding)
   - [Disclaimer](#disclaimer)

@@ -91,12 +91,15 @@ pipeline_tag: text-generation

 ## Model description

-The **Ǎguila-7B** is a transformer-based causal language model for Catalan, Spanish, and English. It is based on the [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b) model and has been trained on a 26B token trilingual corpus collected from publicly available corpora and crawlers.
+The **Ǎguila-7B** is a transformer-based causal language model for Catalan, Spanish, and English.
+It is based on the [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b) model and has been trained on a 26B token
+trilingual corpus collected from publicly available corpora and crawlers.


 ## Intended uses and limitations

-The **Ǎguila-7B** model is ready-to-use only for causal language modeling to perform text-generation tasks. However, it is intended to be fine-tuned on a generative downstream task.
+The **Ǎguila-7B** model is ready-to-use only for causal language modeling to perform text-generation tasks.
+However, it is intended to be fine-tuned on a generative downstream task.

 ## How to use

@@ -131,7 +134,9 @@ print(f"Result: {generation['generated_text']}")
 ```

 ## Limitations and bias
-At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
+At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model.
+However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques
+on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.


 ## Language adaptation
@@ -142,7 +147,7 @@ We adapted the original Falcon-7B model to Spanish and Catalan by swapping the t

 ### Training data

-The training corpus consists 26B tokens of several corpora gathered from web crawlings and public corpora.
+The training corpus consists of 26B tokens from several corpora gathered from web crawling and public domain data.

 | Dataset | Language | Tokens (per-epoch) | Epochs |
 |---------------------|----------|--------------------|--------------|
@@ -170,21 +175,26 @@ The dataset has the following language distribution:

 ## Training procedure

-The training corpus has been tokenized using a byte version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2) used in the original [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model with a vocabulary size of 50,257 tokens. Once the model has been successfully initialized, we continued its pre-training in the three target languages: Catalan, Spanish, and English. We kept a small amount of English in order to avoid catastrophic forgetting. The training lasted a total of 320 hours with 8 NVIDIA H100 GPUs of 80GB of RAM.
+The training corpus has been tokenized using a byte version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2) used
+in the original [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model with a vocabulary size of 50,257 tokens.
+Once the model has been successfully initialized, we continued its pre-training in the three target languages: Catalan, Spanish, and English.
+We kept a small amount of English data in order to avoid catastrophic forgetting.
+The training lasted a total of 320 hours on 8 NVIDIA H100 GPUs with 80GB RAM.


 ### Training hyperparameters

-The following hyperparameters were used during training:
-- learning_rate: 5e-05
-- train_batch_size: 1
-- eval_batch_size: 1
 - seed: 42
 - distributed_type: multi-GPU
 - num_devices: 8
+- train_batch_size: 1
+- eval_batch_size: 1
 - total_train_batch_size: 8
 - total_eval_batch_size: 8
-- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+- optimizer: Adam
+  - betas=(0.9,0.999)
+  - epsilon=1e-08
+- learning_rate: 5e-05
 - lr_scheduler_type: linear
 - num_epochs: 1.0

@@ -199,29 +209,31 @@ The following hyperparameters were used during training:
 ## Additional information

 ### Author
-Language Technologies Unir at the Barcelona Supercomputing Center (langtech@bsc.es).
+The Language Technologies Unit from Barcelona Supercomputing Center.

-### Contact information
-For further information, send an email to aina@bsc.es.
+### Contact
+For further information, please send an email to <langtech@bsc.es>.

 ### Copyright
 Copyright (c) 2023 Langtech Unit at Barcelona Supercomputing Center.

-### Licensing information
+### License
 [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

 ### Funding
-This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina). This work was also partially funded by the [Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA)](https://portal.mineco.gob.es/en-us/digitalizacionIA/Pages/sedia.aspx) within the framework of the Plan-TL.
+This work was partially funded by:
+- The [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en)) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).
+- The [Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA)](https://portal.mineco.gob.es/en-us/digitalizacionIA/Pages/sedia.aspx) within the framework of the [Plan-TL](https://plantl.mineco.gob.es/Paginas/index.aspx).

 ### Disclaimer

 <details>
 <summary>Click to expand</summary>

-The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions.
+The model published in this repository is intended for a generalist purpose and is available to third parties. This model may have bias and/or any other undesirable distortions.

-When third parties, deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models) or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
+When third parties deploy or provide systems and/or services to other parties using this model (or using systems based on it) or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

-In no event shall the owner and creator of the models (BSC – Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties of these models.
+In no event shall the owner and creator of the model (BSC – Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties.

 </details>
 
 
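The "Language adaptation" context line in the fourth hunk is truncated at "by swapping the t...", but it describes replacing the original Falcon-7B tokenizer. The sketch below shows the generic "swap the tokenizer and resize the embeddings" step that such an adaptation typically involves; it is not the authors' exact procedure, and the tokenizer path is a hypothetical placeholder.

```python
# Generic tokenizer-swap sketch for vocabulary adaptation (not the authors' exact recipe).
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    trust_remote_code=True,
)

# Hypothetical path to a byte-level BPE tokenizer trained on Catalan/Spanish/English text.
new_tokenizer = AutoTokenizer.from_pretrained("path/to/trilingual-bpe-tokenizer")

# Resize the (tied) embedding matrices to the new vocabulary size; newly added rows are
# randomly initialized and must be learned during the continued pre-training phase.
base_model.resize_token_embeddings(len(new_tokenizer))

base_model.save_pretrained("falcon-7b-trilingual-init")
new_tokenizer.save_pretrained("falcon-7b-trilingual-init")
```

In practice, adapted models often initialize the new embedding rows from overlapping or semantically similar tokens rather than at random, but the diff does not say which strategy was used here.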
 
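To make the tokenization description in the "Training procedure" hunk concrete (a byte-level BPE vocabulary of 50,257 tokens), the following sketch inspects the tokenizer and encodes one sample per target language. The repository id is again an assumption.

```python
# Tokenizer inspection sketch; the repository id is an assumption.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("projecte-aina/aguila-7b")

print(tokenizer.vocab_size)  # per the model card, expected to be 50,257

samples = [
    "El pa amb tomàquet és un esmorzar típic.",             # Catalan
    "El modelo se entrenó con 26.000 millones de tokens.",  # Spanish
    "A small amount of English avoids catastrophic forgetting.",
]
for text in samples:
    ids = tokenizer(text)["input_ids"]
    print(len(ids), tokenizer.convert_ids_to_tokens(ids)[:8])
```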
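The hyperparameter list in the fifth hunk maps almost one-to-one onto Hugging Face `TrainingArguments`. The sketch below shows that mapping for reference only; the output path, the precision flag, and the rest of the training setup (datasets, collator, distributed launch) are assumptions and are not part of the diff.

```python
# Illustrative mapping of the listed hyperparameters onto transformers' TrainingArguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./aguila-7b-continued-pretraining",  # placeholder path
    seed=42,
    per_device_train_batch_size=1,   # train_batch_size: 1 per GPU
    per_device_eval_batch_size=1,    # eval_batch_size: 1 per GPU
    # With 8 devices and no gradient accumulation, the effective batch size is 1 x 8 = 8,
    # matching total_train_batch_size and total_eval_batch_size in the list above.
    learning_rate=5e-05,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    lr_scheduler_type="linear",
    num_train_epochs=1.0,
    bf16=True,  # assumption: mixed precision on H100 GPUs; not stated in the list
)
```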
 
 