mmarimon committed on
Commit
fd1e924
1 Parent(s): 0faf786

Update README.md

Files changed (1)
  1. README.md +43 -25
README.md CHANGED
@@ -46,26 +46,34 @@ model-index:
 # Catalan BERTa-v2 (roberta-base-ca-v2) finetuned for Semantic Textual Similarity
 
 ## Table of Contents
-- [Model Description](#model-description)
-- [Intended Uses and Limitations](#intended-uses-and-limitations)
-- [How to Use](#how-to-use)
+<details>
+<summary>Click to expand</summary>
+
+- [Model description](#model-description)
+- [Intended uses and limitations](#intended-uses-and-limitations)
+- [How to use](#how-to-use)
+- [Limitations and bias](#limitations-and-bias)
 - [Training](#training)
-  - [Training Data](#training-data)
-  - [Training Procedure](#training-procedure)
+  - [Training data](#training-data)
+  - [Training procedure](#training-procedure)
 - [Evaluation](#evaluation)
-  - [Variable and Metrics](#variable-and-metrics)
-  - [Evaluation Results](#evaluation-results)
-- [Licensing Information](#licensing-information)
-- [Citation Information](#citation-information)
-- [Funding](#funding)
-- [Contributions](#contributions)
-- [Disclaimer](#disclaimer)
+  - [Variable and metrics](#variable-and-metrics)
+  - [Evaluation results](#evaluation-results)
+- [Additional information](#additional-information)
+  - [Author](#author)
+  - [Contact information](#contact-information)
+  - [Copyright](#copyright)
+  - [Licensing information](#licensing-information)
+  - [Funding](#funding)
+  - [Citation information](#citation-information)
+  - [Disclaimer](#disclaimer)
+</details>
 
 ## Model description
 
 The **roberta-base-ca-v2-cased-sts** is a Semantic Textual Similarity (STS) model for the Catalan language, fine-tuned from the [roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2) model, a [RoBERTa](https://arxiv.org/abs/1907.11692) base model pre-trained on a medium-sized corpus collected from publicly available corpora and crawlers (see the roberta-base-ca-v2 model card for more details).
 
-## Intended Uses and Limitations
+## Intended uses and limitations
 
 The **roberta-base-ca-v2-cased-sts** model can be used to assess the similarity between two snippets of text. The model is limited by its training dataset and may not generalize well to all use cases.
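For illustration, a minimal scoring sketch with 🤗 Transformers (not the card's official *How to use* snippet): it assumes the checkpoint `projecte-aina/roberta-base-ca-v2-cased-sts` loads as a single-logit regression model whose raw output is the similarity score, and the sentence pair is invented.

```python
# Minimal sketch, not the card's official snippet: score two Catalan sentences.
# Assumes a single-logit regression head whose raw output is the similarity
# score (unlike the widget scores, which are normalized).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "projecte-aina/roberta-base-ca-v2-cased-sts"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Illustrative pair: "It's sunny today." / "Today the weather is very good."
inputs = tokenizer("Avui fa sol.", "Avui fa molt bon temps.", return_tensors="pt")
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()
print(f"Similarity score: {score:.2f}")
```

Unlike the widget, this returns the raw regression output, which should be comparable to the dataset's original annotation values.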
 
@@ -106,17 +114,21 @@ Expected output:
 
 <sup>1</sup> _**Avoid using the widget** scores, since they are normalized and do not reflect the original annotation values._
 
+## Limitations and bias
+At the time of submission, no measures have been taken to estimate the bias embedded in the model. However, we are well aware that our models may be biased, since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future and, if completed, this model card will be updated.
+
 ## Training
 
 ### Training data
 We used the STS dataset in Catalan called [STS-ca](https://huggingface.co/datasets/projecte-aina/sts-ca) for training and evaluation.
 
-### Training Procedure
+### Training procedure
 The model was trained with a batch size of 16 and a learning rate of 5e-5 for 5 epochs. We then selected the best checkpoint according to the downstream task metric on the corresponding development set, and finally evaluated it on the test set.
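As a rough sketch of that recipe (the authoritative fine-tuning scripts live in the CLUB repository linked under Evaluation), the stated hyperparameters map onto 🤗 `TrainingArguments` roughly as follows; the metric name `combined_score` is illustrative, not taken from the official scripts.

```python
# Sketch of the fine-tuning configuration described above; names such as
# "combined_score" are illustrative, not taken from the official scripts.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="roberta-base-ca-v2-cased-sts",
    per_device_train_batch_size=16,      # batch size of 16
    learning_rate=5e-5,                  # learning rate of 5e-5
    num_train_epochs=5,                  # 5 epochs
    evaluation_strategy="epoch",         # score the dev set after each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,         # keep the best checkpoint,
    metric_for_best_model="combined_score",  # selected by the dev-set metric
)
```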
 
 ## Evaluation
 
-### Variable and Metrics
+### Variable and metrics
 
 This model was fine-tuned by maximizing the average of the Pearson and Spearman correlations.
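Concretely, the selection metric is the mean of the two correlation coefficients between predicted and gold similarity scores; a small sketch (the helper name is illustrative):

```python
# Sketch of the selection metric: the average of the Pearson and Spearman
# correlations between predicted and gold similarity scores.
from scipy.stats import pearsonr, spearmanr

def combined_score(predictions, references):
    pearson, _ = pearsonr(predictions, references)
    spearman, _ = spearmanr(predictions, references)
    return (pearson + spearman) / 2.0
```

Averaging the two balances sensitivity to the linear fit (Pearson) with agreement in ranking (Spearman).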
 
@@ -132,11 +144,24 @@ We evaluated the _roberta-base-ca-v2-cased-sts_ on the STS-ca test set against s
 
 For more details, check the fine-tuning and evaluation scripts in the official [GitHub repository](https://github.com/projecte-aina/club).
 
-## Licensing Information
+## Additional information
+
+### Author
+Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)
+
+### Contact information
+For further information, send an email to aina@bsc.es.
+
+### Copyright
+Copyright (c) 2022 Text Mining Unit at Barcelona Supercomputing Center
 
+### Licensing information
 [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
 
-## Citation Information
+### Funding
+This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en)) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).
+
+### Citation information
 If you use any of these resources (datasets or models) in your work, please cite our latest paper:
 ```bibtex
 @inproceedings{armengol-estape-etal-2021-multilingual,
@@ -160,14 +185,7 @@ If you use any of these resources (datasets or models) in your work, please cite
 }
 ```
 
-### Funding
-This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).
-
-## Contributions
-
-[N/A]
-
-## Disclaimer
+### Disclaimer
 
 <details>
 <summary>Click to expand</summary>
 