RoBERTa-ca Model Card

RoBERTa-ca is a new foundational Catalan language model built on the RoBERTa architecture. It is obtained through vocabulary adaptation from mRoBERTa: all weights are initialized from mRoBERTa, while the embedding matrix receives a specialized treatment that carefully handles the differences between the two tokenizers. The model is then continually pretrained on a Catalan-only corpus consisting of 95GB of high-quality Catalan data.
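A minimal usage sketch is shown below. It assumes the model is published on the Hugging Face Hub under a repository id such as BSC-LT/RoBERTa-ca (the exact id is not stated in this card) and that the standard fill-mask pipeline applies, as for other RoBERTa-style encoders.

```python
# Hedged usage sketch: the repository id "BSC-LT/RoBERTa-ca" is an assumption,
# not confirmed by this card; replace it with the actual Hub id.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="BSC-LT/RoBERTa-ca")

# Use whatever mask token the tokenizer defines (typically <mask> for RoBERTa-style models).
masked_sentence = f"La capital de Catalunya és {fill_mask.tokenizer.mask_token}."
for prediction in fill_mask(masked_sentence):
    print(prediction["token_str"], round(prediction["score"], 3))
```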

Technical Description

Technical details of the RoBERTa-ca model.

| Description | Value |
|---|---|
| Model Parameters | 125M |
| Tokenizer Type | SPM |
| Vocabulary Size | 50,304 |
| Precision | bfloat16 |
| Context Length | 512 |
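As a quick sanity check, the figures above can be read back from the released configuration and tokenizer; the repository id below is an assumption, not taken from this card.

```python
# Sanity-check sketch (assumes a Hub repository id of "BSC-LT/RoBERTa-ca").
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("BSC-LT/RoBERTa-ca")
tokenizer = AutoTokenizer.from_pretrained("BSC-LT/RoBERTa-ca")

print(len(tokenizer))                  # vocabulary size, expected around 50,304
print(config.max_position_embeddings)  # context length; RoBERTa configs often report 514 for 512 usable positions
```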

Training Hyperparameters

| Hyperparameter | Value |
|---|---|
| Pretraining Objective | Masked Language Modeling |
| Learning Rate | 3E-05 |
| Learning Rate Scheduler | Cosine |
| Warmup | 2425 |
| Optimizer | AdamW |
| Optimizer Hyperparameters | AdamW (β1 = 0.9, β2 = 0.98, ε = 1e-06) |
| Optimizer Decay | 1E-02 |
| Global Batch Size | 1024 |
| Dropout | 1E-01 |
| Attention Dropout | 1E-01 |
| Activation Function | GeLU |
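The sketch below illustrates, under stated assumptions, how these values map onto a standard PyTorch/transformers training setup. The total number of training steps is not reported in this card, so the value used here is a placeholder, and the repository id is again an assumption.

```python
# Illustrative optimizer/scheduler setup matching the hyperparameter table above.
# Assumptions: Hub id "BSC-LT/RoBERTa-ca"; total_steps is a placeholder value.
import torch
from transformers import AutoModelForMaskedLM, get_cosine_schedule_with_warmup

model = AutoModelForMaskedLM.from_pretrained("BSC-LT/RoBERTa-ca")

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-5,            # learning rate
    betas=(0.9, 0.98),  # AdamW β1, β2
    eps=1e-6,           # AdamW ε
    weight_decay=1e-2,  # optimizer decay
)

total_steps = 100_000   # placeholder: not reported in this card
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2425,           # warmup steps
    num_training_steps=total_steps,
)
```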

Evaluation: CLUB Benchmark

Model performance in Catalan is assessed with CLUB, the Catalan Language Understanding Benchmark. CLUB consists of six tasks: Named Entity Recognition (NER), Part-of-Speech Tagging (POS), Semantic Textual Similarity (STS), Text Classification (TC), Textual Entailment (TE), and Question Answering (QA).

The following base foundational models have been considered for the comparison:

| Foundational Model | Number of Parameters | Vocab Size | Description |
|---|---|---|---|
| BERTa | 126M | 52K | Catalan-specific language model pretrained with Catalan-only data. |
| BERTinho | 109M | 30K | Monolingual BERT model for Galician. |
| mBERT | 178M | 120K | Multilingual BERT model pretrained on the 104 languages with the largest Wikipedias. |
| mRoBERTa | 283M | 256K | RoBERTa base model pretrained with 35 European languages and a larger vocabulary size. |
| roberta-base-bne | 125M | 50K | RoBERTa base model pretrained with 570GB of data from web crawls performed by the National Library of Spain from 2009 to 2019. |
| RoBERTa-ca | 125M | 50K | Catalan-specific language model obtained by vocabulary adaptation from mRoBERTa. |
| xlm-roberta-base | 279M | 250K | Foundational RoBERTa model pretrained with CommonCrawl data covering 100 languages. |
| xlm-roberta-large | 561M | 250K | Foundational RoBERTa model pretrained with CommonCrawl data covering 100 languages. |
| Task | roberta-base-bne (125M) | BERTa (126M) | mBERT (178M) | xlm-roberta-base (279M) | xlm-roberta-large (561M) | RoBERTa-ca (125M) | mRoBERTa (283M) |
|---|---|---|---|---|---|---|---|
| NER (F1) | 87.59 | 89.47 | 85.89 | 87.50 | 89.47 | 89.70 | 88.33 |
| POS (F1) | 98.64 | 98.89 | 98.78 | 98.91 | 99.03 | 99.00 | 98.98 |
| STS (Pearson) | 74.27 | 81.39 | 77.05 | 75.11 | 83.49 | 82.99 | 79.52 |
| TC (Acc.) | 73.86 | 73.16 | 72.00 | 73.05 | 74.10 | 72.81 | 72.41 |
| TE (Acc.) | 72.27 | 80.11 | 75.86 | 78.27 | 86.63 | 82.14 | 82.38 |
| ViquiQuAD (F1) | 82.56 | 86.74 | 87.42 | 86.81 | 90.35 | 87.31 | 87.86 |
| XQuAD (F1) | 60.56 | 67.38 | 67.72 | 68.56 | 76.08 | 70.53 | 69.40 |
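The CLUB tasks are evaluated by fine-tuning each encoder on task-specific training data. As an illustration, the sketch below shows how one such task (NER, framed as token classification) could be set up with transformers; the label inventory and the repository id are illustrative assumptions, not taken from this card.

```python
# Illustrative fine-tuning setup for NER as token classification.
# The label inventory and the Hub id "BSC-LT/RoBERTa-ca" are assumptions.
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

tokenizer = AutoTokenizer.from_pretrained("BSC-LT/RoBERTa-ca")
model = AutoModelForTokenClassification.from_pretrained(
    "BSC-LT/RoBERTa-ca",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# The resulting model can then be fine-tuned on the CLUB NER data with the
# standard transformers Trainer and evaluated with entity-level F1.
```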

Additional information

Author

The Language Technologies Lab from Barcelona Supercomputing Center.

Contact

For further information, please send an email to langtech@bsc.es.

Copyright

Copyright(c) 2025 by Language Technologies Lab, Barcelona Supercomputing Center.

Funding

This work has been promoted and financed by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project ILENIA with reference 2022/TL22/00215337.

Acknowledgements

This project has benefited from data contributions by numerous teams and institutions.

In Catalonia, many institutions have been involved in the project. Our thanks to Òmnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, Racó Català, Vilaweb, ACN, Nació Digital, El món and Aquí Berguedà.

At national level, we are especially grateful to our ILENIA project partners: CENID, HiTZ and CiTIUS for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, Fundación Dialnet, Fundación Elcano and the ‘Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI)’ of the University of Las Palmas de Gran Canaria.

At the international level, we thank the Welsh government, DFKI, Occiglot project, especially Malte Ostendorff, and The Common Crawl Foundation, especially Pedro Ortiz, for their collaboration.

Their valuable efforts have been instrumental in the development of this work.

Disclaimer

Be aware that the model may contain biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence.

The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.

License

Apache License, Version 2.0
