crodri committed on
Commit
f4cd32c
1 Parent(s): a999afe

Update README.md

  1. README.md +91 -208
README.md CHANGED
@@ -1,248 +1,131 @@
1
  ---
2
-
3
- language:
4
-
5
- - ca
6
-
7
  license: apache-2.0
8
-
9
- tags:
10
-
11
- - "catalan"
12
-
13
- - "named entity recognition"
14
-
15
- - "ner"
16
-
17
- - "CaText"
18
-
19
- - "Catalan Textual Corpus"
20
-
21
  datasets:
22
-
23
- - "crodri/ceil"
24
-
25
  metrics:
26
-
27
- - f1
28
-
29
- model-index:
30
- - name: multiner
31
- results:
32
- - task:
33
- type: token-classification
34
- dataset:
35
- type: crodri/ceil
36
- name: ancora-ca-ner
37
- metrics:
38
- - type: f1
39
- value: 0.836
40
- - type: precision
41
- value: 0.82069
42
- - type: recall
43
- value: 0.8523
44
 
45
  widget:
46
 
47
- - text: "El raper nord-americà Travis Scott ha gravat el videoclip de la seva cançó "Circus Maximus" amb els Castellers de Vilafranca. Segons ha publicat la 'Revista Castells' i ha confirmat l'Agència Catalana de Notícies (ACN), el rodatge es va fer el 2 de juliol a la Tarraco Arena Plaça (TAP) de Tarragona."
48
 
49
  - text: "L'ANC vol que l'11 de setembre al Passeig de Gracia sigui una fita enguany."
50
 
51
  - text: "El Martí llegeix el Cavall Fort."
52
 
 
53
 
54
- # Model Card for Model ID
55
-
56
- <!-- Provide a quick summary of what the model is/does. -->
57
-
58
- This modelcard aims to be a base template for new models. It has been generated using [this raw template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md?plain=1).
59
-
60
- ## Model Details
61
-
62
- ### Model Description
63
-
64
- <!-- Provide a longer summary of what this model is. -->
65
-
66
-
67
-
68
- - **Developed by:** [More Information Needed]
69
- - **Shared by [optional]:** [More Information Needed]
70
- - **Model type:** [More Information Needed]
71
- - **Language(s) (NLP):** [More Information Needed]
72
- - **License:** [More Information Needed]
73
- - **Finetuned from model [optional]:** [More Information Needed]
74
-
75
- ### Model Sources [optional]
76
-
77
- <!-- Provide the basic links for the model. -->
78
-
79
- - **Repository:** [More Information Needed]
80
- - **Paper [optional]:** [More Information Needed]
81
- - **Demo [optional]:** [More Information Needed]
82
-
83
- ## Uses
84
-
85
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
86
-
87
- ### Direct Use
88
-
89
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
90
-
91
- [More Information Needed]
92
-
93
- ### Downstream Use [optional]
94
-
95
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
96
-
97
- [More Information Needed]
98
-
99
- ### Out-of-Scope Use
100
-
101
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
102
-
103
- [More Information Needed]
104
-
105
- ## Bias, Risks, and Limitations
106
-
107
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
108
-
109
- [More Information Needed]
110
-
111
- ### Recommendations
112
-
113
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
114
-
115
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
116
-
117
- ## How to Get Started with the Model
118
-
119
- Use the code below to get started with the model.
120
-
121
- [More Information Needed]
122
-
123
- ## Training Details
124
-
125
- ### Training Data
126
-
127
- <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
128
-
129
- [More Information Needed]
130
-
131
- ### Training Procedure
132
-
133
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
134
-
135
- #### Preprocessing [optional]
136
-
137
- [More Information Needed]
138
-
139
-
140
- #### Training Hyperparameters
141
-
142
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
143
-
144
- #### Speeds, Sizes, Times [optional]
145
-
146
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
147
-
148
- [More Information Needed]
149
-
150
- ## Evaluation
151
-
152
- <!-- This section describes the evaluation protocols and provides the results. -->
153
-
154
- ### Testing Data, Factors & Metrics
155
-
156
- #### Testing Data
157
-
158
- <!-- This should link to a Data Card if possible. -->
159
-
160
- [More Information Needed]
161
-
162
- #### Factors
163
-
164
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
165
-
166
- [More Information Needed]
167
-
168
- #### Metrics
169
-
170
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
171
-
172
- [More Information Needed]
173
-
174
- ### Results
175
-
176
- [More Information Needed]
177
-
178
- #### Summary
179
-
180
-
181
-
182
- ## Model Examination [optional]
183
-
184
- <!-- Relevant interpretability work for the model goes here -->
185
-
186
- [More Information Needed]
187
-
188
- ## Environmental Impact
189
-
190
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
191
-
192
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
193
 
194
- - **Hardware Type:** [More Information Needed]
195
- - **Hours used:** [More Information Needed]
196
- - **Cloud Provider:** [More Information Needed]
197
- - **Compute Region:** [More Information Needed]
198
- - **Carbon Emitted:** [More Information Needed]
199
 
200
- ## Technical Specifications [optional]
201
 
202
- ### Model Architecture and Objective
203
 
204
- [More Information Needed]
205
 
206
- ### Compute Infrastructure
207
 
208
- [More Information Needed]
209
 
210
- #### Hardware
211
 
212
- [More Information Needed]
213
 
214
- #### Software
215
 
216
- [More Information Needed]
 
217
 
218
- ## Citation [optional]
 
219
 
220
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 
221
 
222
- **BibTeX:**
 
 
 
 
 
223
 
224
- [More Information Needed]
225
 
226
- **APA:**
227
 
228
- [More Information Needed]
 
229
 
230
- ## Glossary [optional]
 
231
 
232
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
 
233
 
234
- [More Information Needed]
 
235
 
236
- ## More Information [optional]
 
237
 
238
- [More Information Needed]
239
 
240
- ## Model Card Authors [optional]
241
 
242
- [More Information Needed]
243
 
244
- ## Model Card Contact
 
245
 
246
- [More Information Needed]
247
 
 
248
 
 
 
1
  ---
 
 
 
 
 
2
  license: apache-2.0
3
  datasets:
4
+ - crodri/ceil
5
+ language:
6
+ - ca
7
  metrics:
8
+ - type: f1
9
+ value: 0.836
10
+ - type: precision
11
+ value: 0.82069
12
+ - type: recall
13
+ value: 0.8523
14
+ pipeline_tag: token-classification
15
 
16
  widget:
17
 
18
+ - text: "El raper nord-americà Travis Scott ha gravat el videoclip de la seva cançó 'Circus Maximus' amb els Castellers de Vilafranca. Segons ha publicat la 'Revista Castells' i ha confirmat l'Agència Catalana de Notícies (ACN), el rodatge es va fer el 2 de juliol a la Tarraco Arena Plaça (TAP) de Tarragona."
19
 
20
  - text: "L'ANC vol que l'11 de setembre al Passeig de Gracia sigui una fita enguany."
21
 
22
  - text: "El Martí llegeix el Cavall Fort."
23
 
24
+ ---
25
 
26
+ # Catalan BERTa (RoBERTa-large) fine-tuned for Named Entity Recognition
27
 
28
+ ## Table of Contents
29
+ <details>
30
+ <summary>Click to expand</summary>
 
 
31
 
32
+ - [Model description](#model-description)
33
+ - [Intended uses and limitations](#intended-uses-and-limitations)
34
+ - [How to use](#how-to-use)
35
+ - [Training](#training)
36
+ - [Training data](#training-data)
37
+ - [Training procedure](#training-procedure)
38
+ - [Evaluation](#evaluation)
39
+ - [Variable and metrics](#variable-and-metrics)
40
+ - [Evaluation results](#evaluation-results)
41
+ - [Additional information](#additional-information)
42
+ - [Author](#author)
43
+ - [Contact information](#contact-information)
44
+ - [Copyright](#copyright)
45
+ - [Licensing information](#licensing-information)
46
+ - [Funding](#funding)
47
+ - [Citation information](#citation-information)
48
+ - [Disclaimer](#disclaimer)
49
+ </details>
50
 
 
51
 
52
+ ## Model description
53
 
54
+ **multiner** is a Named Entity Recognition (NER) model for Catalan, fine-tuned from the BERTa model, a [RoBERTa](https://arxiv.org/abs/1907.11692) base model pre-trained on a medium-size corpus collected from publicly available corpora and crawlers (see the BERTa model card for more details).
55
 
56
+ ## Intended uses and limitations
57
 
 
58
 
59
+ ## How to use
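A minimal sketch of how a token-classification model like this one is typically loaded with the `transformers` pipeline. Several things here are assumptions rather than facts from this card: the repository id in `MODEL_ID` is a placeholder (the card does not state it), the helper names `load_ner_pipeline`, `format_entities`, and `tag` are hypothetical, and the entity labels returned depend on the model's own configuration.

```python
from typing import Any, Dict, List

# Placeholder: this card does not state the Hub repository id of the model.
MODEL_ID = "your-namespace/multiner"


def load_ner_pipeline(model_id: str = MODEL_ID):
    """Build a token-classification pipeline (downloads the model on first use)."""
    # Imported lazily so the formatting helper below works without transformers.
    from transformers import pipeline

    # aggregation_strategy="simple" merges sub-word tokens into whole entity spans.
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )


def format_entities(entities: List[Dict[str, Any]]) -> List[str]:
    """Turn aggregated pipeline output into 'LABEL: text (score)' strings."""
    return [
        f"{ent['entity_group']}: {ent['word']} ({float(ent['score']):.2f})"
        for ent in entities
    ]


def tag(text: str, model_id: str = MODEL_ID) -> List[str]:
    """Run NER on one sentence and return readable entity strings."""
    ner = load_ner_pipeline(model_id)
    return format_entities(ner(text))
```

Once `MODEL_ID` is replaced with the real repository id, calling `tag("El Martí llegeix el Cavall Fort.")` should return one string per detected entity. The lazy import keeps `format_entities` usable on saved pipeline output without installing `transformers`.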
60
 
 
61
 
62
+ ## Limitations and bias
63
+ At the time of submission, no measures have been taken to estimate the bias embedded in the model. However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
64
 
65
+ ## Training
66
+ We used the Catalan NER dataset [Ancora-ca-ner](https://huggingface.co/datasets/projecte-aina/ancora-ca-ner) for training and evaluation.
67
 
68
+ ## Evaluation
69
+ We evaluated _roberta-base-ca-cased-ner_ on the Ancora-ca-ner test set against standard multilingual and monolingual baselines:
70
 
71
+ | Model | Ancora-ca-ner (F1)|
72
+ | ------------|:-------------|
73
+ | roberta-base-ca-cased-ner | **88.13** |
74
+ | mBERT | 86.38 |
75
+ | XLM-RoBERTa | 87.66 |
76
+ | WikiBERT-ca | 77.66 |
77
 
78
+ For more details, check the fine-tuning and evaluation scripts in the official [GitHub repository](https://github.com/projecte-aina/club).
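As a quick sanity check on the metrics listed in this card's metadata (precision 0.82069, recall 0.8523, F1 0.836), the sketch below recomputes F1 as the harmonic mean of precision and recall. The function name `f1_score` is only illustrative; the three numbers are the ones from the metadata block, not from the table above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the card's metadata.
precision = 0.82069
recall = 0.8523

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.836"
```

The recomputed value agrees with the F1 reported in the metadata to three decimal places.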
79
 
80
+ ## Additional information
81
 
82
+ ### Author
83
+ Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)
84
 
85
+ ### Contact information
86
+ For further information, send an email to aina@bsc.es
87
 
88
+ ### Copyright
89
+ Copyright (c) 2021 Text Mining Unit at Barcelona Supercomputing Center
90
 
91
+ ### Licensing information
92
+ [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
93
 
94
+ ### Funding
95
+ This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en)) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).
96
 
97
+ ### Citation information
98
 
99
+ If you use any of these resources (datasets or models) in your work, please cite our latest paper:
100
+ ```bibtex
101
+ @inproceedings{armengol-estape-etal-2021-multilingual,
102
+ title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
103
+ author = "Armengol-Estap{\'e}, Jordi and
104
+ Carrino, Casimiro Pio and
105
+ Rodriguez-Penagos, Carlos and
106
+ de Gibert Bonet, Ona and
107
+ Armentano-Oller, Carme and
108
+ Gonzalez-Agirre, Aitor and
109
+ Melero, Maite and
110
+ Villegas, Marta",
111
+ booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
112
+ month = aug,
113
+ year = "2021",
114
+ address = "Online",
115
+ publisher = "Association for Computational Linguistics",
116
+ url = "https://aclanthology.org/2021.findings-acl.437",
117
+ doi = "10.18653/v1/2021.findings-acl.437",
118
+ pages = "4933--4946",
119
+ }
120
+ ```
121
 
122
+ ### Disclaimer
123
 
124
+ <details>
125
+ <summary>Click to expand</summary>
126
 
127
+ The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions.
128
 
129
+ When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models) or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
130
 
131
+ In no event shall the owner and creator of the models (BSC – Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties of these models.