asier-gutierrez
committed on
Commit 82fd9d5
1 Parent(s): eea2436
Update README.md
README.md CHANGED
@@ -20,6 +20,12 @@ widget:
 
 # RoBERTa base trained with data from National Library of Spain (BNE)
 
+## Introduction
+This work presents the Spanish RoBERTa-base model. The model has been pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work, compiled from the web crawls performed by the National Library of Spain from 2009 to 2019.
+
+## Evaluation
+For evaluation details visit our [GitHub repository](https://github.com/PlanTL-SANIDAD/lm-spanish).
+
 ## Citing
 Check out our paper for all the details: https://arxiv.org/abs/2107.07253
 
@@ -34,4 +40,7 @@ Check out our paper for all the details: https://arxiv.org/abs/2107.07253
 }
 ```
 
-
+## Corpora
+| Corpora | Number of documents | Size (GB) |
+|---------|---------------------|-----------|
+| BNE | 201,080,084 | 570 |