RoBERTa-ca-CaWikiTC

Overview

  • Model type: Language Model
  • Architecture: RoBERTa-base
  • Language: Catalan
  • License: Apache 2.0
  • Task: Zero-Shot Text Classification
  • Data: CaWikiTC

Model description

The roberta-base-ca-v2-cawikitc (RoBERTa-ca-CaWikiTC) is a zero-shot text classification model for Catalan, created by fine-tuning RoBERTa-base-ca-v2 on a classification dataset, CaWikiTC, reformulated as entailment. It was developed as part of the experimental research presented in the paper "Entailment-based Task Transfer for Catalan Text Classification in Small Data Regimes".

Intended uses and limitations

This model can be used for zero-shot text classification in Catalan. It was trained with a single fixed hypothesis template, "Aquest article tracta sobre {}.", and with Wikipedia-based articles as premises, so it may not generalize well to other templates or text domains.

How to use

from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="ibaucells/RoBERTa-ca-CaWikiTC")

sentence = "'Tierra firme' de Marqués-Marcet inaugura el Festival de cinema de Guadalajara amb Catalunya com a convidada d'honor. El director del film afirma sentir-se orgullós de formar part d'aquesta nova generació de cineastes catalans amb moltes dones directores."
candidate_labels = ["societat", "política", "cultura", "economia"]
template = "Aquest article tracta sobre {}."

output = classifier(sentence, candidate_labels, hypothesis_template=template, multi_label=False)

print(output)
print(f'Predicted class: {output["labels"][0]}')
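
To make the role of the template concrete, the following minimal sketch shows how the hypotheses are formed from the candidate labels. It only illustrates the string formatting step; in the pipeline, each hypothesis is paired with the input text as the premise and scored by the entailment model. The variable name hypotheses is introduced here for illustration.

```python
# Template and labels copied from the usage example above.
template = "Aquest article tracta sobre {}."
candidate_labels = ["societat", "política", "cultura", "economia"]

# One hypothesis per candidate label; each is scored against the input text (the premise).
hypotheses = [template.format(label) for label in candidate_labels]

for hypothesis in hypotheses:
    print(hypothesis)
```

Because the model saw only this template during fine-tuning, passing the same string as hypothesis_template is probably the safest choice.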

Limitations and bias

No measures have been taken to estimate the bias and toxicity embedded in the model.

Training

Training data

This model was fine-tuned for the Natural Language Inference (NLI) task on an automatically built, Wikipedia-based text classification dataset, CaWikiTC, reformulated as entailment. In the reformulation process, we generated two NLI examples for each text classification instance (text and label): an entailment example and a non-entailment example. In both cases, we used the text as the premise and a shared template to create the hypothesis ("Aquest article tracta sobre {}."). The template was completed with the correct label for the entailment example and with a label randomly selected from the remaining options for the non-entailment example.
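
The reformulation can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used to build the training set: the function name to_nli_examples, the dictionary keys, the label strings "entailment" and "not_entailment", and the toy text and label set are all hypothetical. The template is the one quoted under "Intended uses and limitations".

```python
import random

# Hypothesis template as quoted in this card.
TEMPLATE = "Aquest article tracta sobre {}."


def to_nli_examples(text, label, all_labels, rng):
    """Turn one (text, label) classification instance into two NLI examples."""
    # Entailment example: the hypothesis is filled with the correct label.
    entailment = {
        "premise": text,
        "hypothesis": TEMPLATE.format(label),
        "label": "entailment",
    }
    # Non-entailment example: the hypothesis is filled with a label drawn
    # at random from the remaining options.
    wrong_label = rng.choice([l for l in all_labels if l != label])
    non_entailment = {
        "premise": text,
        "hypothesis": TEMPLATE.format(wrong_label),
        "label": "not_entailment",
    }
    return [entailment, non_entailment]


# Toy inputs for illustration only; these are not CaWikiTC data.
rng = random.Random(0)
toy_labels = ["societat", "política", "cultura", "economia"]
pairs = to_nli_examples("Text d'exemple.", "cultura", toy_labels, rng)
for pair in pairs:
    print(pair["label"], "|", pair["hypothesis"])
```

With this scheme the resulting NLI dataset has twice as many examples as the classification dataset and is balanced between the two classes.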

Training procedure

The pre-trained Catalan model RoBERTa-base-ca-v2 was fine-tuned with the training data using a learning rate of 3e-5, a batch size of 16, seed 26 and a maximum of 10 epochs. The development set (converted into entailment) was used to select the best checkpoint according to the highest weighted F1 score in the classification task, which was obtained in the first epoch.
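
For reference, the reported hyperparameters and the checkpoint selection rule can be written down as a small sketch. The class FineTuneConfig and the function select_best_epoch are hypothetical names, not part of any training script released with the model, and the development scores in the usage lines are made-up placeholders, not reported results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    # Values as reported in this card.
    base_model: str = "RoBERTa-base-ca-v2"  # name as written in the card, not a verified hub ID
    learning_rate: float = 3e-5
    batch_size: int = 16
    seed: int = 26
    max_epochs: int = 10


def select_best_epoch(dev_f1_by_epoch):
    """Return the 1-based epoch with the highest dev weighted F1 (earliest epoch wins ties)."""
    best_index = max(range(len(dev_f1_by_epoch)), key=lambda i: dev_f1_by_epoch[i])
    return best_index + 1


config = FineTuneConfig()

# Placeholder scores for illustration only; NOT the scores obtained in training.
placeholder_dev_f1 = [0.80, 0.78, 0.79]
best_epoch = select_best_epoch(placeholder_dev_f1)
print(config, best_epoch)
```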

Evaluation

Evaluation results

This model was evaluated on the TeCla zero-shot text classification task (without any task-specific fine-tuning) and obtained weighted F1 scores of 75.0 on the coarse-grained task (4 classes) and 49.1 on the fine-grained task (53 classes).
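
As a reminder of what the metric measures, weighted F1 is the average of the per-class F1 scores, each weighted by the number of true instances of that class. The sketch below is a plain Python illustration with toy labels; it is not the evaluation script behind the numbers above, and it returns a value between 0 and 1, whereas the scores above appear to be expressed on a 0-100 scale.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n_true - tp
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is positive because n_true > 0.
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += (n_true / total) * f1
    return score


# Toy labels for illustration only.
y_true = ["cultura", "cultura", "economia", "economia"]
y_pred = ["cultura", "economia", "economia", "economia"]
print(round(weighted_f1(y_true, y_pred), 4))
```

Classes that appear only among the predictions have no true instances, so they contribute to the false positives of other classes but carry zero weight themselves.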

Additional information

Contact

For further information, send an email to irene.baucells@bsc.es.

License

This work is distributed under the Apache License, Version 2.0.

Funding

This work was funded by the Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya within the framework of Projecte AINA.

Citation

Disclaimer

The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models), or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the models be liable for any results arising from the use made by third parties of these models.
