---
pipeline_tag: zero-shot-classification
license: apache-2.0
language:
- ca
tags:
- zero-shot
- text-classification
widget:
- text: Albert Serra estrenarà a Catalunya la videoinstal·lació 'Personalien' dins del Temporada Alta. El festival programa vuit pel·lícules de cinema per aquesta edició.
  candidate_labels: societat, política, cultura, economia
  multi_class: true
  hypothesis_template: Aquest article tracta sobre {}.
---

# RoBERTa-ca-CaWikiTC

## Overview

<details>
<summary>Click to expand</summary>

- **Model type:** Language Model
- **Architecture:** RoBERTa-base
- **Language:** Catalan
- **License:** Apache 2.0
- **Task:** Zero-Shot Text Classification
- **Data:** CaWikiTC
</details>

## Model description

RoBERTa-ca-CaWikiTC is a Zero-Shot Text Classification model in Catalan, created by fine-tuning [RoBERTa-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2) on a classification dataset, [CaWikiTC](https://huggingface.co/ibaucells/CaWikiTC), reformulated as entailment. The model was developed as part of the experimental research presented in the paper ["Entailment-based Task Transfer for Catalan Text Classification in Small Data Regimes"]().

## Intended uses and limitations

This model can be used for zero-shot text classification in Catalan. It was trained with a fixed hypothesis template, "Aquest article tracta sobre {}.", and with Wikipedia-based articles as premises, so it may not generalize well to all use cases.

## How to use

```python
from transformers import pipeline

# Load the zero-shot classification pipeline with this model
classifier = pipeline("zero-shot-classification", model="ibaucells/RoBERTa-ca-CaWikiTC")

sentence = "Albert Serra estrenarà a Catalunya la videoinstal·lació 'Personalien' dins del Temporada Alta. El festival programa vuit pel·lícules de cinema per aquesta edició."
candidate_labels = ["societat", "política", "cultura", "economia"]
# Hypothesis template used during training; each candidate label fills the {} placeholder
template = "Aquest article tracta sobre {}."

output = classifier(sentence, candidate_labels, hypothesis_template=template, multi_label=False)

print(output)
print(f'Predicted class: {output["labels"][0]}')  # labels are sorted by score, highest first
```

## Limitations and bias

No measures have been taken to estimate the bias and toxicity embedded in the model.

## Training

### Training data

This model was fine-tuned for the Natural Language Inference (NLI) task on an automatically created, Wikipedia-based text classification dataset, [CaWikiTC](https://huggingface.co/ibaucells/CaWikiTC), reformulated as entailment. In the reformulation process, we generated two NLI examples for each text classification instance (text and label): an entailment example and a non-entailment example. In both cases, we employed the text as the premise and used a shared template to create the hypothesis ("Aquest article tracta sobre {}."), completed with the correct label for the entailment example and with a randomly selected label from the remaining options for the non-entailment example.

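As a concrete illustration, the reformulation can be sketched as follows (a minimal sketch in Python; the function name and dictionary keys are illustrative, not the exact preprocessing code used to build CaWikiTC):

```python
import random

TEMPLATE = "Aquest article tracta sobre {}."

def to_nli_pair(text, label, all_labels, rng=random):
    """Turn one (text, label) classification instance into two NLI examples."""
    # Entailment example: hypothesis filled with the correct label.
    entailed = {
        "premise": text,
        "hypothesis": TEMPLATE.format(label),
        "label": "entailment",
    }
    # Non-entailment example: hypothesis filled with a randomly selected
    # label from the remaining options.
    wrong = rng.choice([l for l in all_labels if l != label])
    non_entailed = {
        "premise": text,
        "hypothesis": TEMPLATE.format(wrong),
        "label": "not_entailment",
    }
    return [entailed, non_entailed]
```
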
### Training procedure

The pre-trained Catalan model [RoBERTa-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2) was fine-tuned on the training data with a learning rate of 3e-5, a batch size of 16, seed 26 and a maximum of 10 epochs. The development set (converted into entailment) was used to select the best checkpoint according to the highest weighted F1 score in the classification task, which was obtained in the first epoch.

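For reference, these hyperparameters correspond roughly to the following `transformers` Trainer setup (a sketch under stated assumptions: the actual training script, tokenization and checkpoint-selection logic are not published here, and `train_dataset`/`eval_dataset` stand in for the tokenized NLI splits):

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments, set_seed)

set_seed(26)  # seed reported above

tokenizer = AutoTokenizer.from_pretrained("projecte-aina/roberta-base-ca-v2")
model = AutoModelForSequenceClassification.from_pretrained(
    "projecte-aina/roberta-base-ca-v2",
    num_labels=2,  # entailment vs. non-entailment
)

args = TrainingArguments(
    output_dir="roberta-ca-cawikitc",
    learning_rate=3e-5,              # reported learning rate
    per_device_train_batch_size=16,  # reported batch size
    num_train_epochs=10,             # maximum of 10 epochs
    evaluation_strategy="epoch",     # evaluate each epoch on the dev set
    save_strategy="epoch",
)

# Checkpoint selection by weighted F1 on the classification task was done
# separately; the best checkpoint came from the first epoch.
# trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
#                   train_dataset=train_dataset, eval_dataset=eval_dataset)
# trainer.train()
```
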
## Evaluation

### Evaluation results

This model was evaluated on the TeCla zero-shot text classification task (without task-specific fine-tuning) and obtained weighted F1 scores of 75.0 on the coarse-grained task (4 classes) and 49.1 on the fine-grained task (53 classes).

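A zero-shot evaluation of this kind can be reproduced along the following lines (a sketch, assuming the TeCla texts and gold labels have already been loaded; the helper name is illustrative):

```python
from sklearn.metrics import f1_score
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="ibaucells/RoBERTa-ca-CaWikiTC")
TEMPLATE = "Aquest article tracta sobre {}."

def zero_shot_weighted_f1(texts, gold_labels, label_names):
    """Predict the top label for each text and compute the weighted F1 score."""
    predictions = [
        classifier(text, label_names, hypothesis_template=TEMPLATE)["labels"][0]
        for text in texts
    ]
    return f1_score(gold_labels, predictions, average="weighted")
```
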
## Additional information

### Contact

For further information, send an email to <irene.baucells@bsc.es>.

### License

This work is distributed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

### Funding

This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en)) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).

### Citation

### Disclaimer

<details>
<summary>Click to expand</summary>
The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models), or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including those regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the models be liable for any results arising from the use made by third parties of these models.
</details>