---
pipeline_tag: zero-shot-classification
license: apache-2.0
language:
- ca
tags:
- zero-shot
- text-classification
widget:
  - text: "'Tierra firme' de Marqués-Marcet inaugura el Festival de cinema de Guadalajara amb Catalunya com a convidada d'honor. El director del film afirma sentir-se orgullós de formar part d'aquesta nova generació de cineastes catalans amb moltes dones directores."
    candidate_labels: societat, política, cultura, economia
    multi_class: true
    hypothesis_template: Aquest article tracta sobre {}.
---

# RoBERTa-ca-CaWikiTC

## Overview

<details>
<summary>Click to expand</summary>
  
- **Model type:** Language Model
- **Architecture:** RoBERTa-base
- **Language:** Catalan
- **License:** Apache 2.0
- **Task:** Zero-Shot Text Classification
- **Data:** CaWikiTC
</details>

## Model description

**roberta-base-ca-v2-cawikitc** (RoBERTa-ca-CaWikiTC) is a zero-shot text classification model for Catalan, created by fine-tuning [RoBERTa-base-ca-v2](https://huggingface.co/projecte-aina/roberta-large-ca-v2) on CaWikiTC, a text classification dataset reformulated as entailment. The model was developed as part of the experimental research presented in the paper ["Entailment-based Task Transfer for Catalan Text Classification in Small Data Regimes"]().

## Intended uses and limitations

This model can be used for zero-shot text classification in Catalan. It was trained with a single fixed hypothesis template, "Aquest article tracta sobre {}.", and with Wikipedia-based articles as premises, so it may not generalize well to other templates or to text from other domains.

## How to use

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="ibaucells/RoBERTa-ca-CaWikiTC")

sentence = "'Tierra firme' de Marqués-Marcet inaugura el Festival de cinema de Guadalajara amb Catalunya com a convidada d'honor. El director del film afirma sentir-se orgullós de formar part d'aquesta nova generació de cineastes catalans amb moltes dones directores."
candidate_labels = ["societat", "política", "cultura", "economia"]
template = "Aquest article tracta sobre {}."

output = classifier(sentence, candidate_labels, hypothesis_template=template, multi_label=False)

print(output)
print(f'Predicted class: {output["labels"][0]}')
```
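
As a rough illustration of where the values in `output["scores"]` come from: with `multi_label=False`, the zero-shot pipeline is generally understood to take the entailment logit for each candidate label and normalize them with a softmax, so the scores sum to 1. The sketch below reproduces only that normalization step. The function name and the logit values are illustrative assumptions, not output from this model.

```python
import math

def single_label_scores(entailment_logits):
    """Softmax over the entailment logits of all candidate labels."""
    peak = max(entailment_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in entailment_logits]
    total = sum(exps)
    return [e / total for e in exps]

candidate_labels = ["societat", "política", "cultura", "economia"]
# Made-up logits for illustration only; a real run would take them from the model.
illustrative_logits = [0.2, -1.0, 2.5, -0.5]

scores = single_label_scores(illustrative_logits)
ranked = sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)
print(ranked[0][0])
```

Because the scores compete with each other in this mode, a high score means only that a label is the best of the candidates offered, not that it is correct in absolute terms.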

## Limitations and bias

No measures have been taken to estimate the bias and toxicity embedded in the model.

## Training

### Training data

This model was fine-tuned for the Natural Language Inference (NLI) task on an automatically generated, Wikipedia-based text classification dataset, [CaWikiTC](https://huggingface.co/ibaucells/CaWikiTC), reformulated as entailment. In the reformulation process, we generated two NLI examples for each text classification instance (text and label): an entailment example and a non-entailment example. In both cases, the text served as the premise and a shared template ("Aquest article tracta sobre {}.") was used to create the hypothesis. The template was completed with the correct label for the entailment example and with a label randomly selected from the remaining options for the non-entailment example.
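
The reformulation described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing script: the function name, the output field names, the label strings `entailment` / `not_entailment`, and the sample text are all assumptions, and the template is the one quoted under "Intended uses and limitations".

```python
import random

# Template as quoted in the "Intended uses and limitations" section.
TEMPLATE = "Aquest article tracta sobre {}."

def to_nli_examples(text, label, all_labels, rng):
    """Build one entailment and one non-entailment example from (text, label)."""
    other_labels = [candidate for candidate in all_labels if candidate != label]
    wrong_label = rng.choice(other_labels)  # random label from the remaining options
    return [
        {"premise": text, "hypothesis": TEMPLATE.format(label), "label": "entailment"},
        {"premise": text, "hypothesis": TEMPLATE.format(wrong_label), "label": "not_entailment"},
    ]

rng = random.Random(0)  # fixed seed only to make the sketch reproducible
labels = ["societat", "política", "cultura", "economia"]
examples = to_nli_examples(
    "El festival de cinema obre amb una pel·lícula catalana.",  # hypothetical text
    "cultura",
    labels,
    rng,
)
for example in examples:
    print(example)
```

Each classification instance therefore contributes exactly two NLI examples, which keeps the entailment and non-entailment classes balanced.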

### Training procedure

The pre-trained Catalan model [RoBERTa-base-ca-v2](https://huggingface.co/projecte-aina/roberta-large-ca-v2) was fine-tuned on the training data with a learning rate of 3e-5, a batch size of 16, a seed of 26 and a maximum of 10 epochs. The development set (converted into entailment) was used to select the best checkpoint, defined as the one with the highest weighted F1 score in the classification task; this checkpoint was reached in the first epoch.

## Evaluation

### Evaluation results

This model was evaluated on the TeCla text classification task in a zero-shot setting (without any fine-tuning on TeCla). It obtained weighted F1 scores of 75.0 on the coarse-grained task (4 classes) and 49.1 on the fine-grained task (53 classes).
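
For readers unfamiliar with the metric, the sketch below computes a weighted F1 score, assuming the usual definition: the per-class F1 averaged with weights proportional to each class's support in the gold labels (the same convention as `average="weighted"` in scikit-learn). The card does not state which implementation was used for the reported figures, and the toy labels here are invented for illustration.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for gold, pred in pairs if gold == label and pred == label)
        fp = sum(1 for gold, pred in pairs if gold != label and pred == label)
        fn = count - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * count / total  # weight each class by its share of gold labels
    return score

# Toy data for illustration only.
gold = ["cultura", "cultura", "economia", "política"]
predicted = ["cultura", "economia", "economia", "política"]
print(round(weighted_f1(gold, predicted), 4))
```

Weighting by support means frequent classes dominate the score, which matters when comparing the 4-class and 53-class settings.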

## Additional information

### Contact

For further information, send an email to <irene.baucells@bsc.es>.

### License

This work is distributed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

### Funding

This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en)) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).

### Citation

### Disclaimer

<details>
<summary>Click to expand</summary>
The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models), or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the models be liable for any results arising from the use made by third parties of these models.
</details>