giotvr committed on
Commit
06d618f
1 Parent(s): d4f13bb

Updates README


Signed-off-by: Giovani <giovanitavares@outlook.com>

Files changed (1)
  1. README.md +96 -94
README.md CHANGED
@@ -11,11 +11,9 @@ metrics:
11
 
12
  <!-- Provide a quick summary of what the model is/does. -->
13
 
14
- This is a XLM-RoBERTa-base fine-tuned model on 5K (premise, hypothesis) sentence pairs from
15
- the ASSIN (Avaliação de Similaridade Semântica e Inferência textual) corpus. Both the original corpus
16
- and XLM-RoBERTa-base model can be found here. The original reference papers are:
17
- Unsupervised Cross-Lingual Representation Learning At Scale, ASSIN: Avaliação de Similaridade Semântica e
18
- Inferência Textual, respectivelly. This model is suitable for Portuguese (from Brazil or Portugal).
19
 
20
  ## Model Details
21
 
@@ -23,13 +21,12 @@ Inferência Textual, respectivelly. This model is suitable for Portuguese (from
23
 
24
  <!-- Provide a longer summary of what this model is. -->
25
 
26
-
27
-
28
  - **Developed by:** Giovani Tavares and Felipe Ribas Serras
 
29
  - **Shared by [optional]:** [More Information Needed]
30
  - **Model type:** Transformer-based text classifier
31
  - **Language(s) (NLP):** Portuguese
32
- - **License:** [More Information Needed]
33
  - **Finetuned from model [optional]:** [XLM-RoBERTa-base](https://huggingface.co/xlm-roberta-base)
34
 
35
  ### Model Sources [optional]
@@ -46,113 +43,143 @@ Inferência Textual, respectivelly. This model is suitable for Portuguese (from
46
 
47
  ### Direct Use
48
 
49
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
50
-
51
- [More Information Needed]
52
 
53
- ### Downstream Use [optional]
54
 
55
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
56
 
57
- [More Information Needed]
 
58
 
59
- ### Out-of-Scope Use
60
 
61
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
62
 
63
- [More Information Needed]
64
-
65
- ## Bias, Risks, and Limitations
66
 
67
  <!-- This section is meant to convey both technical and sociotechnical limitations. -->
68
 
69
- [More Information Needed]
70
-
71
  ### Recommendations
72
 
73
  <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
74
 
75
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
76
 
77
- ## How to Get Started with the Model
78
 
79
  Use the code below to get started with the model.
80
 
81
- [More Information Needed]
82
 
83
  ## Fine-Tuning Details
84
 
85
  ### Fine-Tuning Data
86
 
87
  <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
88
 
89
- This is a fine tuned version of [XLM-RoBERTa-base](https://huggingface.co/xlm-roberta-base) using the [ASSIN (Avaliação de Similaridade Semântica e Inferência textual)](https://huggingface.co/datasets/assin)
90
- [More Information Needed] dataset. [ASSIN](https://huggingface.co/datasets/assin) is a corpus annotated with hypothesis/premise Portuguese sentence pairs suitable for detecting textual entailment, paraphrase or neutral
91
  relationship between the members of such pairs. Such corpus has three subsets: *ptbr* (Brazilian Portuguese), *ptpt* (Portuguese Portuguese) and *full* (the union of the latter with the former). The *full* subset has
92
- $10k$ sentence pairs equally distributed between *ptbr* and *ptpt* subsets.
93
- [More Information Needed]
94
 
95
  ### Fine-Tuning Procedure
96
 
97
  <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
98
- The model fine-tuning procedure can be summarized in three major subsequent tasks:
99
  <ol type="i">
100
- <li>**[Data Processing](#data-processing):**</li> [ASSIN](https://huggingface.co/datasets/assin)'s *validation* and *train* splits were loaded from the **Hugging Face Hub** and processed afterwards;
101
  <li>**Hyperparameter Tuning:**</li> [XLM-RoBERTa-base](https://huggingface.co/xlm-roberta-base)'s hyperparameters were chosen with the help of the [Weights & Biases] API to track the results and upload the fine-tuned models;
102
  <li>**Final Model Loading and Testing:**</li>
103
  using the *cross-tests* approach described in the [this section](#evaluation), the models' performance were measured using different datasets and metrics.
104
  </ol>
105
- #### Data Processing [optional]
106
- ##### Class Label Column Renaming
107
- The **Hugging Face**'s ```transformers``` module's ```DataCollator``` used by its ```Trainer``` requires that the ```class label``` column of the collated dataset to be called ```label```. [ASSIN](https://huggingface.co/datasets/assin)'s class label column for each hypothesis/premise pair is called ```entailment_judgement```. Therefore, as the first step of the data preprocessing pipeline the column ```entailment_judgement``` was renamed to ```label``` so that the **Hugging Face**'s ```transformers``` module's ```Trainer``` could be used.
108
 
109
- #### Training Hyperparameters
110
 
111
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
112
 
113
- #### Speeds, Sizes, Times [optional]
 
114
 
115
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
116
 
117
- [More Information Needed]
118
 
119
- ## Evaluation
120
 
121
- <!-- This section describes the evaluation protocols and provides the results. -->
122
 
123
- ### Testing Data, Factors & Metrics
 
 
 
124
 
125
- #### Testing Data
126
 
127
- <!-- This should link to a Data Card if possible. -->
 
 
128
 
129
- [More Information Needed]
130
 
131
- #### Factors
132
 
133
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
134
 
135
- [More Information Needed]
136
 
137
- #### Metrics
138
 
139
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
 
 
 
 
140
 
141
- [More Information Needed]
142
 
143
- ### Results
144
 
145
- [More Information Needed]
146
 
147
- #### Summary
148
 
 
149
 
 
150
 
151
- ## Model Examination [optional]
152
 
153
- <!-- Relevant interpretability work for the model goes here -->
 
154
 
155
- [More Information Needed]
156
 
157
  ## Environmental Impact
158
 
@@ -166,50 +193,25 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
166
  - **Compute Region:** [More Information Needed]
167
  - **Carbon Emitted:** [More Information Needed]
168
 
169
- ## Technical Specifications [optional]
170
-
171
- ### Model Architecture and Objective
172
-
173
- [More Information Needed]
174
-
175
- ### Compute Infrastructure
176
-
177
- [More Information Needed]
178
-
179
- #### Hardware
180
-
181
- [More Information Needed]
182
-
183
- #### Software
184
-
185
- [More Information Needed]
186
-
187
- ## Citation [optional]
188
 
189
  <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
190
 
191
  **BibTeX:**
192
 
193
- [More Information Needed]
194
-
195
- **APA:**
196
-
197
- [More Information Needed]
198
-
199
- ## Glossary [optional]
200
-
201
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
202
-
203
- [More Information Needed]
204
-
205
- ## More Information [optional]
206
-
207
- [More Information Needed]
208
 
209
- ## Model Card Authors [optional]
210
 
211
- [More Information Needed]
212
 
213
- ## Model Card Contact
214
 
215
- [More Information Needed]
 
11
 
12
  <!-- Provide a quick summary of what the model is/does. -->
13
 
14
+ This is a **[XLM-RoBERTa-base](https://huggingface.co/xlm-roberta-base) model fine-tuned** on 5K (premise, hypothesis) sentence pairs from
15
+ the **ASSIN (Avaliação de Similaridade Semântica e Inferência textual)** corpus. The original reference papers are
16
+ [Unsupervised Cross-Lingual Representation Learning at Scale](https://arxiv.org/pdf/1911.02116) and [ASSIN: Avaliação de Similaridade Semântica e Inferência Textual](https://huggingface.co/datasets/assin), respectively. This model is suitable for Portuguese (from Brazil or Portugal).
 
 
17
 
18
  ## Model Details
19
 
 
21
 
22
  <!-- Provide a longer summary of what this model is. -->
23
 
 
 
24
  - **Developed by:** Giovani Tavares and Felipe Ribas Serras
25
+ - **Advised by:** Felipe Ribas Serras, Renata Wassermann and Marcelo Finger
26
  - **Shared by [optional]:** [More Information Needed]
27
  - **Model type:** Transformer-based text classifier
28
  - **Language(s) (NLP):** Portuguese
29
+ - **License:** MIT
30
  - **Finetuned from model [optional]:** [XLM-RoBERTa-base](https://huggingface.co/xlm-roberta-base)
31
 
32
  ### Model Sources [optional]
 
43
 
44
  ### Direct Use
45
 
46
+ This fine-tuned version of [XLM-RoBERTa-base](https://huggingface.co/xlm-roberta-base) performs Natural
47
+ Language Inference (NLI), which is a text classification task.
 
48
 
49
+ <div id="assin_function">
50
 
51
+ **Definition 1.** Given a pair of sentences $(premise, hypothesis)$, let $\hat{f}^{(xlmr\_base)}$ be the fine-tuned model's inference function:
52
 
53
+ $$
54
+ \hat{f}^{(xlmr\_base)}(premise, hypothesis) =
55
+ \begin{cases}
56
+ ENTAILMENT, & \text{if $premise$ entails $hypothesis$ but $hypothesis$ does not entail $premise$}\\
57
+ PARAPHRASE, & \text{if $premise$ entails $hypothesis$ and $hypothesis$ entails $premise$}\\
58
+ NONE, & \text{otherwise}
59
+ \end{cases}
60
+ $$
61
+ </div>
62
 
 
63
 
64
+ The $(premise, hypothesis)$ entailment definition used is the same as the one found in Salvatore's paper [1].
65
 
66
+ Therefore, **this fine-tuned version of [XLM-RoBERTa-base](https://huggingface.co/xlm-roberta-base) classifies pairs of sentences into one of the classes $ENTAILMENT$, $PARAPHRASE$ or $NONE$**, following [Definition 1](#assin_function).
67
+
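+ As a quick illustration of [Definition 1](#assin_function), the sketch below runs the model on a single sentence pair with the `transformers` library. It is a minimal example, not an official usage snippet: `MODEL_ID` is a placeholder for this repository's identifier and the example sentences are made up.
+
+ ```python
+ # Minimal sketch: classifying one (premise, hypothesis) pair with this model.
+ # MODEL_ID is a placeholder; replace it with this repository's Hub id.
+ import torch
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ MODEL_ID = "<this-model-id>"
+
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+ model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
+
+ premise = "Os livros foram entregues ontem."
+ hypothesis = "Os livros chegaram."
+
+ inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
+ with torch.no_grad():
+     logits = model(**inputs).logits
+
+ # id2label maps class indices to ENTAILMENT, PARAPHRASE or NONE in the fine-tuned config.
+ print(model.config.id2label[logits.argmax(dim=-1).item()])
+ ```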
68
+ ## Bias, Risks, and Limitations
69
 
70
  <!-- This section is meant to convey both technical and sociotechnical limitations. -->
71
 
 
 
72
  ### Recommendations
73
 
74
  <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
75
 
76
+ This model should be used for scientific purposes only. It has not been tested for production environments.
77
 
78
+ <!-- ## How to Get Started with the Model
79
 
80
  Use the code below to get started with the model.
81
 
82
+ [More Information Needed] -->
83
 
84
  ## Fine-Tuning Details
85
 
86
  ### Fine-Tuning Data
87
 
88
  <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
89
+ ---
90
 
91
+ - **Train Dataset:** [ASSIN](https://huggingface.co/datasets/assin)
92
+
93
+ - **Evaluation Dataset used for Hyperparameter Tuning:** [ASSIN](https://huggingface.co/datasets/assin)'s validation split
94
+
95
+ - **Test Datasets:**
96
+ - [ASSIN](https://huggingface.co/datasets/assin)'s test splits
97
+ - [ASSIN2](https://huggingface.co/datasets/assin2)'s test split
98
+
99
+
100
+ ---
101
+ This is a fine-tuned version of [XLM-RoBERTa-base](https://huggingface.co/xlm-roberta-base) trained on the [ASSIN (Avaliação de Similaridade Semântica e Inferência textual)](https://huggingface.co/datasets/assin) dataset. [ASSIN](https://huggingface.co/datasets/assin) is a corpus of hypothesis/premise Portuguese sentence pairs annotated for detecting an entailment, paraphrase or neutral
102
  relationship between the members of each pair. The corpus has three subsets: *ptbr* (Brazilian Portuguese), *ptpt* (European Portuguese) and *full* (the union of the two). The *full* subset has
103
+ 10k sentence pairs equally distributed between the *ptbr* and *ptpt* subsets.
 
104
 
105
  ### Fine-Tuning Procedure
106
 
107
  <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
108
+ The model's fine-tuning procedure can be summarized in three consecutive steps:
109
  <ol type="i">
110
+ <li>**Data Processing:** [ASSIN](https://huggingface.co/datasets/assin)'s *validation* and *train* splits were loaded from the **Hugging Face Hub** and processed;</li>
111
  <li>**Hyperparameter Tuning:** [XLM-RoBERTa-base](https://huggingface.co/xlm-roberta-base)'s hyperparameters were chosen with the help of the [Weights & Biases](https://wandb.ai) API, which was used to track the results and upload the fine-tuned models;</li>
112
  <li>**Final Model Loading and Testing:**
113
  using the *cross-tests* approach described in [this section](#evaluation), the models' performance was measured using different datasets and metrics.</li>
114
  </ol>
 
 
 
115
 
116
+ More information on the fine-tuning procedure can be found in [@tcc_paper].
117
 
 
118
 
119
+ <!-- ##### Column Renaming
120
+ The ```DataCollator``` used by **Hugging Face**'s ```transformers``` ```Trainer``` requires that the class-label column of the collated dataset be named ```label```. [ASSIN](https://huggingface.co/datasets/assin)'s class-label column for each hypothesis/premise pair is called ```entailment_judgement```. Therefore, as the first step of the data preprocessing pipeline, the column ```entailment_judgement``` was renamed to ```label``` so that **Hugging Face**'s ```transformers``` ```Trainer``` could be used. -->
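+ For reference, a minimal sketch of the data-processing step (i) above, assuming the `datasets` library and ASSIN's *full* configuration (the exact preprocessing pipeline used may differ):
+
+ ```python
+ # Sketch: load ASSIN and rename its class-label column to the name
+ # expected by the transformers Trainer's default data collator.
+ from datasets import load_dataset
+
+ assin = load_dataset("assin", "full")  # splits: train / validation / test
+ assin = assin.rename_column("entailment_judgement", "label")
+
+ print(assin["train"].features["label"])
+ ```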
121
 
122
+ #### Hyperparameter Tuning
123
 
124
+ The model's training hyperparameters were chosen according to the following definition:
125
 
126
+ <div id="hyperparameter_tuning">
127
 
128
+ **Definition 2.** Let $Hyperparms = \{i: i \text{ is a hyperparameter of } \hat{f}^{(xlmr\_base)}\}$, where $\hat{f}^{(xlmr\_base)}$ is the model's inference function defined in [Definition 1](#assin_function):
129
 
130
+ $$
131
+ Hyperparms = \argmax_{hyp}(eval\_acc(\hat{f}^{(xlmr\_base)}_{hyp}, assin\_validation))
132
+ $$
133
+ </div>
134
 
135
+ The following hyperparameter values were tested in order to maximize the evaluation accuracy:
136
 
137
+ - **Number of Training Epochs:** $(1,2,3)$
138
+ - **Per Device Train Batch Size:** $(16,32)$
139
+ - **Learning Rate:** $(1e-6, 2e-6,3e-6)$
140
 
 
141
 
142
+ The hyperparameter tuning experiments were run and tracked using the [Weights & Biases API](https://docs.wandb.ai/ref/python/public-api/api) and can be found at this [link](https://wandb.ai/gio_projs/assin_xlm_roberta_v5?workspace=user-giogvn).
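+ The snippet below is an illustrative Weights & Biases grid-sweep configuration covering the search space above; it is an assumption about how such a sweep could be set up (the metric name `eval_accuracy` and the `train` function are hypothetical), not the exact setup used:
+
+ ```python
+ # Illustrative W&B sweep over the hyperparameter grid listed above.
+ import wandb
+
+ sweep_config = {
+     "method": "grid",
+     "metric": {"name": "eval_accuracy", "goal": "maximize"},
+     "parameters": {
+         "num_train_epochs": {"values": [1, 2, 3]},
+         "per_device_train_batch_size": {"values": [16, 32]},
+         "learning_rate": {"values": [1e-6, 2e-6, 3e-6]},
+     },
+ }
+
+ sweep_id = wandb.sweep(sweep_config, project="assin_xlm_roberta_v5")
+ # wandb.agent(sweep_id, function=train)  # `train` would be a user-defined training function
+ ```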
143
 
 
144
 
145
+ #### Training Hyperparameters
146
 
147
+ The [hyperparameter tuning](#hyperparameter-tuning) performed yielded the following values:
148
 
149
+ - **Number of Training Epochs:** $3$
150
+ - **Per Device Train Batch Size:** $16$
151
+ - **Learning Rate:** $3e-6$
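+
+ For illustration, these values map onto `transformers`' `TrainingArguments` as sketched below; the remaining arguments (such as `output_dir`) are assumptions, not the exact configuration used:
+
+ ```python
+ # Sketch: the selected hyperparameters expressed as TrainingArguments.
+ from transformers import TrainingArguments
+
+ training_args = TrainingArguments(
+     output_dir="xlmr-base-assin",  # assumed name, for illustration only
+     num_train_epochs=3,
+     per_device_train_batch_size=16,
+     learning_rate=3e-6,
+ )
+ ```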
152
+
153
+ ## Evaluation
154
 
155
+ ### ASSIN
156
 
157
+ Testing this model on [ASSIN](https://huggingface.co/datasets/assin)'s test split is straightforward. The following code snippet shows how to do it:
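+
+ (The snippet below is a sketch: `MODEL_ID` is a placeholder for this repository's id, and the label names are assumed to follow [Definition 1](#assin_function).)
+
+ ```python
+ # Sketch: accuracy of this model on ASSIN's full test split.
+ from datasets import load_dataset
+ from transformers import pipeline
+
+ MODEL_ID = "<this-model-id>"  # placeholder
+ nli = pipeline("text-classification", model=MODEL_ID)
+
+ test = load_dataset("assin", "full", split="test")
+ predictions = nli([{"text": ex["premise"], "text_pair": ex["hypothesis"]} for ex in test])
+
+ label_feature = test.features["entailment_judgement"]
+ gold = [label_feature.int2str(i) for i in test["entailment_judgement"]]
+
+ accuracy = sum(
+     p["label"].upper() == g.upper() for p, g in zip(predictions, gold)
+ ) / len(gold)
+ print(f"accuracy: {accuracy:.2f}")
+ ```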
158
 
159
+ ### ASSIN2
160
 
161
+ Given a pair of sentences $(premise, hypothesis)$, $\hat{f}^{(xlmr\_base)}(premise, hypothesis)$ can be equal to $PARAPHRASE, ENTAILMENT$ or $NONE$ as defined in [Definition 1](#assin_function).
162
 
163
+ [ASSIN2](https://huggingface.co/datasets/assin2)'s test split's class-label column has only two possible values: $ENTAILMENT$ and $NONE$. Therefore, in order to test this model on [ASSIN2](https://huggingface.co/datasets/assin2)'s test split, some mapping must be done to make [ASSIN2](https://huggingface.co/datasets/assin2)'s class labels compatible with the model's inference function.
164
 
165
+ More information on how such mapping is performed can be found in [Modelos para Inferência em Linguagem Natural que entendem a Língua Portuguesa](https://linux.ime.usp.br/~giovani/).
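+
+ One plausible version of such a mapping is sketched below; it is an assumption for illustration (the exact mapping used is described in the reference above): since a $PARAPHRASE$ is an entailment in both directions, it is collapsed into $ENTAILMENT$ before comparing against ASSIN2's two-class labels.
+
+ ```python
+ # Sketch: collapse the three-class predictions into ASSIN2's two classes.
+ def to_assin2_label(prediction: str) -> str:
+     return "ENTAILMENT" if prediction in ("ENTAILMENT", "PARAPHRASE") else "NONE"
+
+ print(to_assin2_label("PARAPHRASE"))  # ENTAILMENT
+ ```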
166
 
167
+ ### Metrics
168
 
169
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
170
+ The model's performance metrics for each test dataset are presented separately. Accuracy, F1 score, precision and recall were the metrics used in every evaluation performed. These metrics are reported below; more information on them can be found in [2].
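+
+ As a rough illustration of how such metrics can be computed (the averaging strategy here is an assumption; the exact setup is documented in [2]):
+
+ ```python
+ # Sketch: computing accuracy, precision, recall and F1 with scikit-learn on toy labels.
+ from sklearn.metrics import accuracy_score, precision_recall_fscore_support
+
+ y_true = ["ENTAILMENT", "NONE", "PARAPHRASE", "NONE"]  # toy gold labels
+ y_pred = ["ENTAILMENT", "NONE", "ENTAILMENT", "NONE"]  # toy predictions
+
+ accuracy = accuracy_score(y_true, y_pred)
+ precision, recall, f1, _ = precision_recall_fscore_support(
+     y_true, y_pred, average="weighted", zero_division=0
+ )
+ print(accuracy, precision, recall, f1)
+ ```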
171
 
172
+ ### Results
173
+
174
+ | test set | accuracy | f1 score | precision | recall |
175
+ |----------|----------|----------|-----------|--------|
176
+ | assin |0.89 |0.89 |0.89 |0.89 |
177
+ | assin2 |0.70 |0.69 |0.73 |0.70 |
178
+
179
+ ## Model Examination
180
+
181
+ <!-- Relevant interpretability work for the model goes here -->
182
+ Some interpretability work was done in order to understand the model's behavior. It can be found in the paper describing the procedure used to create this fine-tuned model [@tcc_paper].
183
 
184
  ## Environmental Impact
185
 
 
193
  - **Compute Region:** [More Information Needed]
194
  - **Carbon Emitted:** [More Information Needed]
195
 
196
+ ## Citation
197
 
198
  <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
199
 
200
  **BibTeX:**
201
 
202
+ ```bibtex
203
+ @article{tcc_paper,
204
+ author = {Giovani Tavares and Felipe Ribas Serras and Renata Wassermann and Marcelo Finger},
205
+ title = {Modelos Transformer para Inferência de Linguagem Natural em Português},
206
+ pages = {x--y},
207
+ year = {2023}
208
+ }
209
+ ```
210
 
211
+ ## References
212
 
213
+ [1][Salvatore, F. S. (2020). Analyzing Natural Language Inference from a Rigorous Point of View (pp. 1-2).](https://www.teses.usp.br/teses/disponiveis/45/45134/tde-05012021-151600/publico/tese_de_doutorado_felipe_salvatore.pdf)
214
 
215
+ [2][Andrade, G. T. (2023) Modelos para Inferência em Linguagem Natural que entendem a Língua Portuguesa (train_assin_xlmr_base_results PAGES GO HERE)](https://linux.ime.usp.br/~giovani/)
216
 
217
+ [3][Andrade, G. T. (2023) Modelos para Inferência em Linguagem Natural que entendem a Língua Portuguesa (train_assin_xlmr_base_conclusions PAGES GO HERE)](https://linux.ime.usp.br/~giovani/)