---
pipeline_tag: zero-shot-classification
license: apache-2.0
language:
- ca
tags:
- zero-shot
- text-classification
widget:
- text: Albert Serra estrenarà a Catalunya la videoinstal·lació 'Personalien' dins del Temporada Alta. El festival programa vuit pel·lícules de cinema per aquesta edició.
  candidate_labels: societat, política, cultura, economia
  multi_class: true
  hypothesis_template: Aquest article tracta sobre {}.
---

# RoBERTa-ca-CaWikiTC

## Overview

<details>
<summary>Click to expand</summary>

- **Model type:** Language Model
- **Architecture:** RoBERTa-base
- **Language:** Catalan
- **License:** Apache 2.0
- **Task:** Zero-Shot Text Classification
- **Data:** CaWikiTC
</details>

## Model description

RoBERTa-ca-CaWikiTC is a Zero-Shot Text Classification model in Catalan, created by fine-tuning [RoBERTa-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2) on a classification dataset, [CaWikiTC](https://huggingface.co/ibaucells/CaWikiTC), reformulated as entailment. The model was developed as part of the experimental research presented in the paper ["Entailment-based Task Transfer for Catalan Text Classification in Small Data Regimes"]().

## Intended uses and limitations

This model can be used for zero-shot text classification in Catalan. It was trained with a fixed hypothesis template, "Aquest article tracta sobre {}.", and with Wikipedia-based articles as premises, so it may not generalize well to all use cases.

## How to use

```python
from transformers import pipeline

# Load the zero-shot classification pipeline with this model
classifier = pipeline("zero-shot-classification", model="ibaucells/RoBERTa-ca-CaWikiTC")

sentence = "Albert Serra estrenarà a Catalunya la videoinstal·lació 'Personalien' dins del Temporada Alta. El festival programa vuit pel·lícules de cinema per aquesta edició."
candidate_labels = ["societat", "política", "cultura", "economia"]
# Hypothesis template used during training; each candidate label fills the {} placeholder
template = "Aquest article tracta sobre {}."

output = classifier(sentence, candidate_labels, hypothesis_template=template, multi_label=False)

print(output)
print(f'Predicted class: {output["labels"][0]}')  # labels are sorted by score, highest first
```

## Limitations and bias

No measures have been taken to estimate the bias and toxicity embedded in the model.

## Training

### Training data

This model was fine-tuned for the Natural Language Inference (NLI) task on an automatically created, Wikipedia-based text classification dataset, [CaWikiTC](https://huggingface.co/ibaucells/CaWikiTC), reformulated as entailment. In the reformulation process, we generated two NLI examples for each text classification instance (text and label): an entailment example and a non-entailment example. In both cases, we employed the text as the premise and used a shared template to create the hypothesis ("Aquest article tracta sobre {}."), completed with the correct label for the entailment example and with a randomly selected label from the remaining options for the non-entailment example.

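As a concrete illustration, the reformulation can be sketched as follows (a minimal sketch in Python; the function name and dictionary keys are illustrative, not the exact preprocessing code used to build CaWikiTC):

```python
import random

TEMPLATE = "Aquest article tracta sobre {}."

def to_nli_pair(text, label, all_labels, rng=random):
    """Turn one (text, label) classification instance into two NLI examples."""
    # Entailment example: hypothesis filled with the correct label.
    entailed = {
        "premise": text,
        "hypothesis": TEMPLATE.format(label),
        "label": "entailment",
    }
    # Non-entailment example: hypothesis filled with a randomly selected
    # label from the remaining options.
    wrong = rng.choice([l for l in all_labels if l != label])
    non_entailed = {
        "premise": text,
        "hypothesis": TEMPLATE.format(wrong),
        "label": "not_entailment",
    }
    return [entailed, non_entailed]
```
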
### Training procedure

The pre-trained Catalan model [RoBERTa-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2) was fine-tuned on the training data with a learning rate of 3e-5, a batch size of 16, seed 26 and a maximum of 10 epochs. The development set (converted into entailment) was used to select the best checkpoint according to the highest weighted F1 score in the classification task, which was obtained in the first epoch.

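For reference, these hyperparameters correspond roughly to the following `transformers` Trainer setup (a sketch under stated assumptions: the actual training script, tokenization and checkpoint-selection logic are not published here, and `train_dataset`/`eval_dataset` stand in for the tokenized NLI splits):

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments, set_seed)

set_seed(26)  # seed reported above

tokenizer = AutoTokenizer.from_pretrained("projecte-aina/roberta-base-ca-v2")
model = AutoModelForSequenceClassification.from_pretrained(
    "projecte-aina/roberta-base-ca-v2",
    num_labels=2,  # entailment vs. non-entailment
)

args = TrainingArguments(
    output_dir="roberta-ca-cawikitc",
    learning_rate=3e-5,              # reported learning rate
    per_device_train_batch_size=16,  # reported batch size
    num_train_epochs=10,             # maximum of 10 epochs
    evaluation_strategy="epoch",     # evaluate each epoch on the dev set
    save_strategy="epoch",
)

# Checkpoint selection by weighted F1 on the classification task was done
# separately; the best checkpoint came from the first epoch.
# trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
#                   train_dataset=train_dataset, eval_dataset=eval_dataset)
# trainer.train()
```
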
## Evaluation

### Evaluation results

This model was evaluated on the TeCla zero-shot text classification task (without task-specific fine-tuning) and obtained weighted F1 scores of 75.0 on the coarse-grained task (4 classes) and 49.1 on the fine-grained task (53 classes).

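A zero-shot evaluation of this kind can be reproduced along the following lines (a sketch, assuming the TeCla texts and gold labels have already been loaded; the helper name is illustrative):

```python
from sklearn.metrics import f1_score
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="ibaucells/RoBERTa-ca-CaWikiTC")
TEMPLATE = "Aquest article tracta sobre {}."

def zero_shot_weighted_f1(texts, gold_labels, label_names):
    """Predict the top label for each text and compute the weighted F1 score."""
    predictions = [
        classifier(text, label_names, hypothesis_template=TEMPLATE)["labels"][0]
        for text in texts
    ]
    return f1_score(gold_labels, predictions, average="weighted")
```
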
## Additional information

### Contact

For further information, send an email to <irene.baucells@bsc.es>.

### License

This work is distributed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

### Funding

This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en)) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).

### Citation

### Disclaimer

<details>
<summary>Click to expand</summary>
The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models), or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including those regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the models be liable for any results arising from the use made by third parties of these models.
</details>