mapama247 committed
Commit
3826b49
1 Parent(s): c6991a0

Update README.md

Files changed (1)
  1. README.md +5 -4
README.md CHANGED
@@ -175,12 +175,13 @@ The dataset has the following language distribution:
 |Es|41.38%|
 |Ca|41.79%|
 
+Note: We kept a small amount of English data in order to avoid catastrophic forgetting.
+
 ## Training procedure
 
 The training corpus has been tokenized using a byte version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2) used
 in the original [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model with a vocabulary size of 50,257 tokens.
-Once the model has been successfully initialized, we continued its pre-training in the three target languages: Catalan, Spanish, and English.
-We kept a small amount of English data in order to avoid catastrophic forgetting.
+After training a new tokenizer and adapting falcon-7b's embedding layer, we continued its pre-training in three target languages: Catalan, Spanish, and English.
 The training lasted a total of 320 hours on 8 NVIDIA H100 GPUs with 80GB RAM.
 
 
@@ -217,7 +218,7 @@ The Language Technologies Unit from Barcelona Supercomputing Center.
 For further information, please send an email to <langtech@bsc.es>.
 
 ### Copyright
-Copyright (c) 2023 Langtech Unit at Barcelona Supercomputing Center.
+Copyright (c) 2023 by Language Technologies Unit at Barcelona Supercomputing Center.
 
 ### License
 [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
@@ -225,7 +226,7 @@ Copyright (c) 2023 Langtech Unit at Barcelona Supercomputing Center.
 ### Funding
 This work was partially funded by:
 - The [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).
-- The [Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA)](https://portal.mineco.gob.es/en-us/digitalizacionIA/Pages/sedia.aspx) within the framework of the [Plan-TL](https://plantl.mineco.gob.es/Paginas/index.aspx).
+- The [Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA)](https://portal.mineco.gob.es/en-us/digitalizacionIA/Pages/sedia.aspx) within the framework of the [Plan de Impulso de las Tecnologías del Lenguaje](https://plantl.mineco.gob.es/Paginas/index.aspx).
 
 ### Disclaimer
 
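
The line added in this commit about training a new tokenizer and adapting falcon-7b's embedding layer describes the key step before continued pre-training. Below is a minimal sketch of what that adaptation can look like with Hugging Face `transformers`; the tokenizer path, output directory, and initialization details are illustrative assumptions, not this repository's actual training code.

```python
# Minimal sketch (not this repository's training code): load a newly trained
# byte-level BPE tokenizer and resize falcon-7b's embedding layer to match it,
# before continuing pre-training on Catalan/Spanish/English data.
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "tiiuae/falcon-7b"   # base checkpoint named in the added README line
NEW_TOKENIZER = "./bpe-50257"     # hypothetical path to the new 50,257-token BPE tokenizer

tokenizer = AutoTokenizer.from_pretrained(NEW_TOKENIZER)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# resize_token_embeddings keeps the first min(old, new) embedding rows and
# initializes any extra ones. Since the new vocabulary does not line up with
# the old one token-for-token, continued pre-training is what makes the
# resized embeddings match the new tokenizer.
model.resize_token_embeddings(len(tokenizer))

model.save_pretrained("./falcon-7b-adapted")
tokenizer.save_pretrained("./falcon-7b-adapted")
```

A common refinement, not shown here, is to initialize each new token's embedding from the old embeddings of the sub-pieces it maps to, which tends to make the continued pre-training converge faster.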