gonzalez-agirre committed on
Commit 5110c34
1 Parent(s): cd6011a

Update README.md

Files changed (1)
  1. README.md +24 -12
README.md CHANGED
@@ -146,7 +146,7 @@ In order to fully take advantage of the English Pre-Training of the original Fal
 
 ### Training data
 
-Once the model has been successfully initialized, we continue its pre-training in the two target languages: Catalan and Spanish. We also kept a small amount of English in order to avoid catastrophic forgetting. The composition of our 26B token dataset used to train this model is the following:
+The training corpus consists of 26B tokens gathered from web crawls and public corpora.
 
 | Dataset | Language | Tokens (per-epoch) | Epochs |
 |---------------------|----------|--------------------|--------------|
@@ -164,7 +164,7 @@ Once the model has been successfully initialized, we continue its pre-training i
 | Wikipedia | ca | 228.01M | 3.570361212 |
 | Vilaweb | ca | 50.34M | 2.142216727 |
 
-The resulting dataset has the following language distribution:
+The dataset has the following language distribution:
 
 |Language|%|
 |---|---|
@@ -172,16 +172,11 @@ The resulting dataset has the following language distribution:
 |Es|41.38%|
 |Ca|41.79%|
 
+## Training procedure
 
+The training corpus has been tokenized using a byte-level version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2), as used in the original [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model, with a vocabulary size of 50,262 tokens. Once the model had been successfully initialized, we continued its pre-training in the three target languages: Catalan, Spanish, and English. We kept a small amount of English in order to avoid catastrophic forgetting. Training lasted a total of 96 hours on 8 NVIDIA H100 GPUs with 80GB of RAM each.
 
 
-
-## Training and evaluation data
-
-More information needed
-
-## Training procedure
-
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
@@ -203,9 +198,9 @@ The following hyperparameters were used during training:
 ![Validation Loss](https://huggingface.co/BSC-LT/falcon_7b_CPT_open_data_26B_tokens_balanced_es_ca/resolve/main/images/validation_loss_condor.png)
 ![Accuracy](https://huggingface.co/BSC-LT/falcon_7b_CPT_open_data_26B_tokens_balanced_es_ca/resolve/main/images/accuracy_condor.png)
 
-## Eval results
+### Validation results
 
-It achieves the following results on the evaluation set:
+It achieves the following results on the validation set:
 - Loss: 2.1504
 - Accuracy: 0.5258
 
@@ -214,4 +209,21 @@ It achieves the following results on the evaluation set:
 - Transformers 4.30.2
 - Pytorch 2.0.0
 - Datasets 2.13.1
-- Tokenizers 0.13.3
+- Tokenizers 0.13.3
+
+## Additional information
+
+### Author
+Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)
+
+### Contact information
+For further information, send an email to aina@bsc.es
+
+### Copyright
+Copyright (c) 2023 Langtech Unit at Barcelona Supercomputing Center
+
+### Licensing information
+[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
+
+### Funding
+This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en)) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina). This work was also partially funded by the [Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA)](https://portal.mineco.gob.es/en-us/digitalizacionIA/Pages/sedia.aspx) within the framework of the Plan-TL.
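As a usage sketch for the checkpoint described above: the snippet below loads the model and lets you compare the tokenizer against the 50,262-token vocabulary stated in the training procedure. The repository id is inferred from the image URLs in this card; `trust_remote_code=True`, `device_map="auto"`, the `bfloat16` dtype, and the prompt are assumptions for running a Falcon-style checkpoint with Transformers 4.30.x, not settings documented by the authors.

```python
# Hypothetical loading sketch; repo id inferred from the image URLs in this card,
# runtime options (dtype, device_map, trust_remote_code) are assumptions.
# Pinned versions from the "Framework versions" list:
#   pip install transformers==4.30.2 torch==2.0.0 datasets==2.13.1 tokenizers==0.13.3 accelerate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "BSC-LT/falcon_7b_CPT_open_data_26B_tokens_balanced_es_ca"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
print("tokenizer size:", len(tokenizer))  # compare with the 50,262-token vocabulary stated above

model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 weights keep the 7B model within one 80GB GPU
    device_map="auto",           # requires the `accelerate` package
    trust_remote_code=True,      # assumption: Falcon modelling code shipped with the repo
)

prompt = "El mercat del barri"  # short Catalan prompt, purely illustrative
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```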
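For context on the validation loss above: assuming it is the mean token-level cross-entropy in nats, the usual convention for causal language modelling evaluation, it can be converted to perplexity. The figure below is that arithmetic only, not a metric reported by the authors.

```python
import math

# Validation loss reported in this card, assumed to be mean cross-entropy per token in nats.
val_loss = 2.1504

# Perplexity is the exponential of the mean cross-entropy.
perplexity = math.exp(val_loss)
print(f"validation perplexity ~ {perplexity:.2f}")  # ~ 8.59
```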
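A back-of-the-envelope check of the compute budget stated in the training procedure (26B tokens, 96 hours, 8 NVIDIA H100 80GB GPUs): the GPU-hours and throughput figures below are simple arithmetic over those reported numbers, not additional measurements.

```python
# Derived from the figures reported in the training procedure above.
tokens = 26e9   # total training tokens
hours = 96      # wall-clock training time
gpus = 8        # NVIDIA H100 80GB

gpu_hours = hours * gpus                          # 768 GPU-hours
tokens_per_second = tokens / (hours * 3600)       # ~75,000 tokens/s across the 8 GPUs
tokens_per_gpu_second = tokens_per_second / gpus  # ~9,400 tokens/s per GPU

print(f"{gpu_hours:.0f} GPU-hours, "
      f"{tokens_per_second:,.0f} tokens/s total, "
      f"{tokens_per_gpu_second:,.0f} tokens/s per GPU")
```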