ccasimiro committed on
Commit
943cfe8
1 Parent(s): 99f9861

Update README.md

Files changed (1)
  1. README.md +66 -9
README.md CHANGED
@@ -30,15 +30,62 @@ widget:

  ---

- # Catalan BERTa (RoBERTa-base) finetuned for Question Answering.

- The **roberta-base-ca-cased-qa** is a Question Answering (QA) model for the Catalan language fine-tuned from the [BERTa](https://huggingface.co/PlanTL-GOB-ES/roberta-base-ca) model, a [RoBERTa](https://arxiv.org/abs/1907.11692) base model pre-trained on a medium-size corpus collected from publicly available corpora and crawlers (check the BERTa model card for more details).

- ## Datasets
- We used the QA dataset in Catalan called [ViquiQuAD](https://huggingface.co/datasets/projecte-aina/viquiquad) for training and evaluation, and the [XQuAD-ca](https://huggingface.co/datasets/projecte-aina/xquad-ca) test set for evaluation.

- ## Evaluation and results
- We evaluated the _roberta-base-ca-cased-qa_ on the ViquiQuAD and XQuAD-ca test sets against standard multilingual and monolingual baselines:

  | Model | ViquiQuAD (F1/EM) | XQuAD-ca (F1/EM) |
@@ -50,10 +97,12 @@ We evaluated the _roberta-base-ca-cased-qa_ on the ViquiQuAD and XQuAD-ca test s

  For more details, check the fine-tuning and evaluation scripts in the official [GitHub repository](https://github.com/projecte-aina/club).

- ## Citing
- If you use any of these resources (datasets or models) in your work, please cite our latest paper:

  ```bibtex
  @inproceedings{armengol-estape-etal-2021-multilingual,
  title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
@@ -74,4 +123,12 @@ If you use any of these resources (datasets or models) in your work, please cite
  doi = "10.18653/v1/2021.findings-acl.437",
  pages = "4933--4946",
  }
- ```

  ---

+ # Catalan BERTa (roberta-base-ca) fine-tuned for Question Answering

+ ## Table of Contents
+ - [Model Description](#model-description)
+ - [Intended Uses and Limitations](#intended-uses-and-limitations)
+ - [How to Use](#how-to-use)
+ - [Training](#training)
+   - [Training Data](#training-data)
+   - [Training Procedure](#training-procedure)
+ - [Evaluation](#evaluation)
+   - [Variables and Metrics](#variables-and-metrics)
+   - [Evaluation Results](#evaluation-results)
+ - [Licensing Information](#licensing-information)
+ - [Citation Information](#citation-information)
+ - [Funding](#funding)
+ - [Contributions](#contributions)
+ 
+ ## Model Description
+ 
+ The **roberta-base-ca-cased-qa** is a Question Answering (QA) model for the Catalan language, fine-tuned from the [roberta-base-ca](https://huggingface.co/PlanTL-GOB-ES/roberta-base-ca) model, a [RoBERTa](https://arxiv.org/abs/1907.11692) base model pre-trained on a medium-sized corpus collected from publicly available corpora and crawlers.
+ 
+ ## Intended Uses and Limitations
+ 
+ The **roberta-base-ca-cased-qa** model can be used for extractive question answering. The model is limited by its training dataset and may not generalize well to all use cases.
+ 
+ ## How to Use
+ 
+ Here is how to use this model:
+ 
+ ```python
+ from transformers import pipeline
+ 
+ # Load the fine-tuned model through the question-answering pipeline
+ nlp = pipeline("question-answering", model="projecte-aina/roberta-base-ca-cased-qa")
+ 
+ # "When did Super3 start?", asked against a short Catalan passage about Super3
+ question = "Quan va començar el Super3?"
+ context = "El Super3 o Club Super3 és un univers infantil català creat a partir d'un programa emès per Televisió de Catalunya des del 1991. Està format per un canal de televisió, la revista Súpers!, la Festa dels Súpers i un club que té un milió i mig de socis."
+ 
+ # Pass the inputs as keyword arguments so they are unambiguous
+ qa_results = nlp(question=question, context=context)
+ print(qa_results)
+ ```
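+ 
+ The pipeline returns a dict with the extracted answer and its character offsets in the context (keys `score`, `start`, `end`, and `answer`).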
+ 
+ ## Training
+ 
+ ### Training Data
+ 
+ We used the Catalan QA dataset [CatalanQA](https://huggingface.co/datasets/projecte-aina/catalanqa) for training and evaluation, and the [XQuAD-ca](https://huggingface.co/datasets/projecte-aina/xquad-ca) test set for additional evaluation.
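+ 
+ Both datasets are hosted on the Hugging Face Hub. As a minimal sketch (assuming the default dataset configurations), they can be loaded with the `datasets` library:
+ 
+ ```python
+ from datasets import load_dataset
+ 
+ # CatalanQA: used for fine-tuning and in-domain evaluation
+ catalanqa = load_dataset("projecte-aina/catalanqa")
+ 
+ # XQuAD-ca: used as an additional test set
+ xquad_ca = load_dataset("projecte-aina/xquad-ca")
+ 
+ # Inspect the available splits
+ print(catalanqa)
+ print(xquad_ca)
+ ```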
+ 
+ ### Training Procedure
+ 
+ The model was trained with a batch size of 16 and a learning rate of 5e-5 for 5 epochs. We then selected the best checkpoint using the downstream task metric on the corresponding development set, and finally evaluated it on the test set.
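+ 
+ A minimal sketch of that configuration with the standard `transformers` Trainer API is shown below; the hyperparameters mirror the ones stated above, while the output directory and checkpoint-selection arguments are assumptions (see the official CLUB scripts for the actual setup):
+ 
+ ```python
+ from transformers import TrainingArguments
+ 
+ # Hypothetical TrainingArguments mirroring the procedure described above
+ training_args = TrainingArguments(
+     output_dir="roberta-base-ca-cased-qa",  # assumed output path
+     per_device_train_batch_size=16,         # batch size of 16
+     learning_rate=5e-5,                     # learning rate of 5e-5
+     num_train_epochs=5,                     # 5 epochs
+     evaluation_strategy="epoch",            # score the dev set each epoch
+     save_strategy="epoch",                  # keep one checkpoint per epoch
+     load_best_model_at_end=True,            # select the best checkpoint...
+     metric_for_best_model="f1",             # ...by the downstream F1 metric
+ )
+ ```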
+ 
+ ## Evaluation
+ 
+ ### Variables and Metrics
+ 
+ This model was fine-tuned by maximizing the F1 score.
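+ 
+ F1 and exact match (EM) are the standard metrics for SQuAD-style extractive QA. A small sketch of how they can be computed with the `evaluate` library, using a made-up id and answer strings:
+ 
+ ```python
+ import evaluate
+ 
+ # SQuAD-style F1 / exact-match computation
+ squad_metric = evaluate.load("squad")
+ 
+ # Toy prediction/reference pair; the id and texts are illustrative only
+ predictions = [{"id": "0", "prediction_text": "1991"}]
+ references = [{"id": "0", "answers": {"text": ["1991"], "answer_start": [120]}}]
+ 
+ print(squad_metric.compute(predictions=predictions, references=references))
+ # {'exact_match': 100.0, 'f1': 100.0}
+ ```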
+ 
+ ### Evaluation Results
+ 
+ We evaluated the _roberta-base-ca-cased-qa_ on the CatalanQA and XQuAD-ca test sets against standard multilingual and monolingual baselines:

  | Model | CatalanQA (F1/EM) | XQuAD-ca (F1/EM) |

  For more details, check the fine-tuning and evaluation scripts in the official [GitHub repository](https://github.com/projecte-aina/club).

+ ## Licensing Information
+ 
+ [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
+ 
+ ## Citation Information
+ 
+ If you use any of these resources (datasets or models) in your work, please cite our latest paper:
  ```bibtex
  @inproceedings{armengol-estape-etal-2021-multilingual,
  title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
  doi = "10.18653/v1/2021.findings-acl.437",
  pages = "4933--4946",
  }
+ ```
+ 
+ ## Funding
+ 
+ This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en)) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).
+ 
+ ## Contributions
+ 
+ [N/A]