# Catalan BERTa-v2 (roberta-base-ca-v2) finetuned for Text Classification

## Table of Contents
- [Model Description](#model-description)
- [Intended Uses and Limitations](#intended-uses-and-limitations)
- [How to Use](#how-to-use)
- [Training](#training)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
- [Evaluation](#evaluation)
  - [Variable and Metrics](#variable-and-metrics)
  - [Evaluation Results](#evaluation-results)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Funding](#funding)
- [Contributions](#contributions)

## Model Description

The **roberta-base-ca-v2-cased-tc** is a Text Classification (TC) model for the Catalan language, fine-tuned from the [roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2) model, a [RoBERTa](https://arxiv.org/abs/1907.11692) base model pre-trained on a medium-size corpus collected from publicly available corpora and crawlers (check the roberta-base-ca-v2 model card for more details).

## Intended Uses and Limitations

The **roberta-base-ca-v2-cased-tc** model can be used to classify texts. The model is limited by its training dataset and may not generalize well for all use cases.

## How to Use

Here is how to use this model:

```python
from transformers import pipeline
from pprint import pprint

# Load the fine-tuned model into a text-classification pipeline
nlp = pipeline("text-classification", model="projecte-aina/roberta-base-ca-v2-cased-tc")
example = "Retards a quatre línies de Rodalies per una avaria entre Sants i plaça de Catalunya."

# Classify the example sentence and print the predicted label and score
tc_results = nlp(example)
pprint(tc_results)
```
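
The pipeline returns a list of label/score dictionaries. A minimal sketch of reading the top prediction — the label shown here is illustrative only, not necessarily a real TeCla class name:

```python
# Illustrative output shape of a text-classification pipeline call
results = [{"label": "Transports", "score": 0.99}]  # hypothetical example, not a real prediction

# Pick the highest-scoring prediction
top = max(results, key=lambda r: r["score"])
print(f"{top['label']}: {top['score']:.2f}")  # prints "Transports: 0.99"
```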

## Training

### Training Data

We used the TC dataset in Catalan called [TeCla](https://huggingface.co/datasets/projecte-aina/tecla) for training and evaluation.

### Training Procedure

The model was trained with a batch size of 16 and a learning rate of 5e-5 for 5 epochs. We then selected the best checkpoint using the downstream task metric on the corresponding development set, and evaluated it on the test set.

## Evaluation

### Variable and Metrics

This model was fine-tuned maximizing accuracy.
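
Accuracy here is simply the fraction of examples whose predicted class matches the gold label; a minimal sketch (the class names below are made up for illustration, not the actual TeCla label set):

```python
# Accuracy: fraction of predictions that exactly match the gold labels.
def accuracy(preds, labels):
    assert len(preds) == len(labels), "prediction/label lists must align"
    return sum(p == g for p, g in zip(preds, labels)) / len(labels)

# Toy example with hypothetical labels (not the real TeCla classes).
gold = ["Societat", "Política", "Economia", "Societat"]
pred = ["Societat", "Economia", "Economia", "Societat"]
print(accuracy(pred, gold))  # 0.75
```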

## Evaluation Results

We evaluated the _roberta-base-ca-v2-cased-tc_ on the TeCla test set against standard multilingual and monolingual baselines:

| Model | TeCla (Accuracy) |

For more details, check the fine-tuning and evaluation scripts in the official [GitHub repository](https://github.com/projecte-aina/club).
 
## Licensing Information

[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Citation Information

If you use any of these resources (datasets or models) in your work, please cite our latest paper:

```bibtex
@inproceedings{armengol-estape-etal-2021-multilingual,
```

## Funding

This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/en/inici/index.html) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).

## Contributions

[N/A]