bsc-temu committed on
Commit eb8ceb7
1 Parent(s): 78ad185

remove repeated text readme

Files changed (1)
  1. README.md +1 -69
README.md CHANGED
@@ -111,75 +111,7 @@ It contains the following tasks and their related datasets:
 
  3. Text Classification (TC)
 
- **[TeCla](---
- language: "ca"
- tags:
- - masked-lm
- - BERTa
- - catalan
- license: apache-2.0
- ---
-
- # BERTa: RoBERTa-based Catalan language model
-
- ## BibTeX citation
-
- If you use any of these resources (datasets or models) in your work, please cite our latest paper:
-
- ```bibtex
- @inproceedings{armengol-estape-etal-2021-multilingual,
- title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
- author = "Armengol-Estap{\'e}, Jordi and
- Carrino, Casimiro Pio and
- Rodriguez-Penagos, Carlos and
- de Gibert Bonet, Ona and
- Armentano-Oller, Carme and
- Gonzalez-Agirre, Aitor and
- Melero, Maite and
- Villegas, Marta",
- booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
- month = aug,
- year = "2021",
- address = "Online",
- publisher = "Association for Computational Linguistics",
- url = "https://aclanthology.org/2021.findings-acl.437",
- doi = "10.18653/v1/2021.findings-acl.437",
- pages = "4933--4946",
- }
- ```
-
-
- ## Model description
-
- BERTa is a transformer-based masked language model for the Catalan language.
- It is based on the [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) base model
- and has been trained on a medium-size corpus collected from publicly available corpora and crawlers.
-
- ## Training corpora and preprocessing
-
- The training corpus consists of several corpora gathered from web crawling and public corpora.
-
- The publicly available corpora are:
-
- 1. the Catalan part of the [DOGC](http://opus.nlpl.eu/DOGC-v2.php) corpus, a set of documents from the Official Gazette of the Catalan Government
-
- 2. the [Catalan Open Subtitles](http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/mono/OpenSubtitles.raw.ca.gz), a collection of translated movie subtitles
-
- 3. the non-shuffled version of the Catalan part of the [OSCAR](https://traces1.inria.fr/oscar/) corpus \\\\cite{suarez2019asynchronous},
- a collection of monolingual corpora, filtered from [Common Crawl](https://commoncrawl.org/about/)
-
- 4. The [CaWac](http://nlp.ffzg.hr/resources/corpora/cawac/) corpus, a web corpus of Catalan built from the .cat top-level-domain in late 2013
- the non-deduplicated version
-
- 5. the [Catalan Wikipedia articles](https://ftp.acc.umu.se/mirror/wikimedia.org/dumps/cawiki/20200801/) downloaded on 18-08-2020.
-
- The crawled corpora are:
-
- 6. The Catalan General Crawling, obtained by crawling the 500 most popular .cat and .ad domains
- 7. the Catalan Government Crawling, obtained by crawling the .gencat domain and subdomains, belonging to the Catalan Government
-
- 8. the ACN corpus with 220k news items from March 2015 until October 2020, crawled from the [Catalan News Agency](https://www.acn.cat/)
- https://doi.org/10.5281/zenodo.4627197)**: consisting of 137k news pieces from the Catalan News Agency ([ACN](https://www.acn.cat/)) corpus
+ **[TeCla](https://doi.org/10.5281/zenodo.4627197)**: consisting of 137k news pieces from the Catalan News Agency ([ACN](https://www.acn.cat/)) corpus
 
  4. Semantic Textual Similarity (STS)
 