mmarimon committed
Commit: 08af285
Parent: 87f4981

Update README.md

Files changed (1):
  README.md +40 -29
README.md CHANGED
@@ -35,20 +35,24 @@ widget:
  <details>
  <summary>Click to expand</summary>

- - [Model Description](#model-description)
- - [Intended Uses and Limitations](#intended-uses-and-limitations)
- - [How to Use](#how-to-use)
+ - [Model description](#model-description)
+ - [Intended uses and limitations](#intended-use)
+ - [How to use](#how-to-use)
+ - [Limitations and bias](#limitations-and-bias)
  - [Training](#training)
- - [Training Data](#training-data)
- - [Training Procedure](#training-procedure)
+ - [Training data](#training-data)
+ - [Training procedure](#training-procedure)
  - [Evaluation](#evaluation)
- - [CLUB Benchmark](#club-benchmark)
- - [Evaluation Results](#evaluation-results)
- - [Licensing Information](#licensing-information)
- - [Citation Information](#citation-information)
- - [Funding](#funding)
- - [Contributions](#contributions)
- - [Disclaimer](#disclaimer)
+ - [CLUB benchmark](#club-benchmark)
+ - [Evaluation results](#evaluation-results)
+ - [Additional information](#additional-information)
+ - [Author](#author)
+ - [Contact information](#contact-information)
+ - [Copyright](#copyright)
+ - [Licensing information](#licensing-information)
+ - [Funding](#funding)
+ - [Citing information](#citing-information)
+ - [Disclaimer](#disclaimer)

  </details>

@@ -58,12 +62,12 @@ The **roberta-large-ca-v2** is a transformer-based masked language model for the
  It is based on the [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) large model
  and has been trained on a medium-size corpus collected from publicly available corpora and crawlers.

- ## Intended Uses and Limitations
+ ## Intended uses and limitations

  **roberta-large-ca-v2** model is ready-to-use only for masked language modeling to perform the Fill Mask task (try the inference API or read the next section).
  However, it is intended to be fine-tuned on non-generative downstream tasks such as Question Answering, Text Classification, or Named Entity Recognition.

- ## How to Use
+ ## How to use

  Here is how to use this model:

@@ -80,6 +84,9 @@ res_hf = pipeline(text)
  pprint([r['token_str'] for r in res_hf])
  ```

+ ## Limitations and bias
+ At the time of submission, no measures have been taken to estimate the bias embedded in the model. However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
+
  ## Training

  ### Training data
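
The hunk above shows only the tail of the usage snippet (the `pprint` call and the closing code fence). For reference, a minimal fill-mask sketch along the same lines is given below; the Hub id `projecte-aina/roberta-large-ca-v2` and the Catalan prompt are illustrative assumptions rather than content taken from the diff.

```python
# Minimal fill-mask sketch. The Hub id and the example prompt below are
# assumptions for illustration; adjust them to the actual published model.
from pprint import pprint
from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

model_id = "projecte-aina/roberta-large-ca-v2"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

fill_mask = FillMaskPipeline(model=model, tokenizer=tokenizer)
text = "Em dic <mask>."  # "My name is <mask>." (illustrative prompt)

res_hf = fill_mask(text)
pprint([r["token_str"] for r in res_hf])
```

Each entry of `res_hf` is a dict containing the predicted token string, its score, and the completed sequence.
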
@@ -104,7 +111,7 @@ The training corpus consists of several corpora gathered from web crawling and p
  | Vilaweb | 0.06 |
  | Tweets | 0.02 |

- ### Training Procedure
+ ### Training procedure

  The training corpus has been tokenized using a byte version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2)
  used in the original [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model with a vocabulary size of 50,262 tokens.
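
The byte-level BPE tokenizer described above should ship with the published checkpoint and can be inspected directly. A small sketch, again assuming the Hub id `projecte-aina/roberta-large-ca-v2`:

```python
# Quick inspection of the published tokenizer (assumed Hub id). The reported
# vocabulary size should match the ~50k byte-level BPE vocabulary described above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("projecte-aina/roberta-large-ca-v2")
print(tokenizer.vocab_size)
print(tokenizer.tokenize("Barcelona és la capital de Catalunya."))
```
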
@@ -115,7 +122,7 @@ The training lasted a total of 96 hours with 32 NVIDIA V100 GPUs of 16GB DDRAM.

  ## Evaluation

- ### CLUB Benchmark
+ ### CLUB benchmark

  The BERTa-large model has been fine-tuned on the downstream tasks of the Catalan Language Understanding Evaluation benchmark (CLUB),
  that has been created along with the model.
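
The CLUB scores reported in the results table below were obtained by fine-tuning the pretrained checkpoint on each downstream task. A generic sequence-classification fine-tuning sketch with the `transformers` Trainer follows; the toy data, label count, and hyperparameters are placeholders, not the settings used for CLUB.

```python
# Generic fine-tuning sketch for sequence classification (e.g. a TeCla-style
# task). The tiny in-memory dataset and the hyperparameters are placeholders,
# not the CLUB settings.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "projecte-aina/roberta-large-ca-v2"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Toy data purely for illustration; replace with a real Catalan dataset.
train_ds = Dataset.from_dict({
    "text": ["El Barça guanya la lliga.", "El Parlament aprova els pressupostos."],
    "label": [0, 1],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = train_ds.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1, learning_rate=2e-5),
    train_dataset=train_ds,
    tokenizer=tokenizer,
)
trainer.train()
```
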
@@ -168,7 +175,7 @@ Here are the train/dev/test splits of the datasets:
  | QA (ViquiQuAD) | 14,239 | 11,255 | 1,492 | 1,429 |
  | QA (CatalanQA) | 21,427 | 17,135 | 2,157 | 2,135 |

- ### Evaluation Results
+ ### Evaluation results

  | Task | NER (F1) | POS (F1) | STS-ca (Comb) | TeCla (Acc.) | TEca (Acc.) | VilaQuAD (F1/EM)| ViquiQuAD (F1/EM) | CatalanQA (F1/EM) | XQuAD-ca <sup>1</sup> (F1/EM) |
  | ------------|:-------------:| -----:|:------|:------|:-------|:------|:----|:----|:----|
@@ -180,11 +187,24 @@ Here are the train/dev/test splits of the datasets:

  <sup>1</sup> : Trained on CatalanQA, tested on XQuAD-ca.

- ## Licensing Information
+ ## Additional information
+
+ ### Author
+ Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)
+
+ ### Contact information
+ For further information, send an email to aina@bsc.es

+ ### Copyright
+ Copyright (c) 2022 Text Mining Unit at Barcelona Supercomputing Center
+
+ ### Licensing information
  [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

- ## Citation Information
+ ### Funding
+ This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).
+
+ ### Citation information

  If you use any of these resources (datasets or models) in your work, please cite our latest paper:
  ```bibtex
@@ -209,16 +229,7 @@ If you use any of these resources (datasets or models) in your work, please cite
  }
  ```

- ## Funding
-
- This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).
-
- ## Contributions
-
- [N/A]
-
-
- ## Disclaimer
+ ### Disclaimer

  <details>
  <summary>Click to expand</summary>