TajaKuzman committed on
Commit 83a7051 · 1 Parent(s): 0e3e5b2

Update README.md

  1. README.md +247 -0
README.md CHANGED
@@ -1,3 +1,250 @@
  ---
  license: cc-by-sa-4.0
+
+ language:
+ - multilingual
+ - af
+ - am
+ - ar
+ - as
+ - az
+ - be
+ - bg
+ - bn
+ - br
+ - bs
+ - ca
+ - cs
+ - cy
+ - da
+ - de
+ - el
+ - en
+ - eo
+ - es
+ - et
+ - eu
+ - fa
+ - fi
+ - fr
+ - fy
+ - ga
+ - gd
+ - gl
+ - gu
+ - ha
+ - he
+ - hi
+ - hr
+ - hu
+ - hy
+ - id
+ - is
+ - it
+ - ja
+ - jv
+ - ka
+ - kk
+ - km
+ - kn
+ - ko
+ - ku
+ - ky
+ - la
+ - lo
+ - lt
+ - lv
+ - mg
+ - mk
+ - ml
+ - mn
+ - mr
+ - ms
+ - my
+ - ne
+ - nl
+ - no
+ - om
+ - or
+ - pa
+ - pl
+ - ps
+ - pt
+ - ro
+ - ru
+ - sa
+ - sd
+ - si
+ - sk
+ - sl
+ - so
+ - sq
+ - sr
+ - su
+ - sv
+ - sw
+ - ta
+ - te
+ - th
+ - tl
+ - tr
+ - ug
+ - uk
+ - ur
+ - uz
+ - vi
+ - xh
+ - yi
+ - zh
+
+ tags:
+ - text-classification
+ - genre
+ - text-genre
+
+ widget:
+ - text: "On our site, you can find a great genre identification model which you can use for thousands of different tasks. For free!"
+
  ---
+
+ # Multilingual text genre classifier xlm-roberta-base-multilingual-text-genres
+
+ Text classification model based on [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) and fine-tuned on a combination of three datasets of texts annotated with genre categories: the Slovene GINCO<sup>1</sup> dataset, the English CORE<sup>2</sup> dataset and the English FTD<sup>3</sup> dataset. The model can be used for automatic genre identification on texts in any language supported by `xlm-roberta-base`.
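
As a minimal sketch, and assuming the checkpoint also loads through the generic `transformers` pipeline API (this card itself only documents `simpletransformers`), inference might look like the following. The helper names `load_classifier` and `classify` are illustrative, and the label strings returned depend on the model's configuration, which is not described here.

```python
from typing import Callable, List

MODEL_ID = "TajaKuzman/xlm-roberta-base-multilingual-text-genres"


def load_classifier():
    """Build a text-classification pipeline (downloads the model on first use)."""
    # Imported lazily so that classify() can be used without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=MODEL_ID)


def classify(texts: List[str], classifier: Callable) -> List[str]:
    """Return the top predicted label for each input text."""
    # Truncate long inputs to 512 tokens, the sequence length used for fine-tuning.
    results = classifier(texts, truncation=True, max_length=512)
    return [result["label"] for result in results]


# Example (assumption: network access and the `transformers` package are available):
# classifier = load_classifier()
# print(classify(["Available for free on our site!"], classifier))
```

Passing the classifier in as an argument keeps the helper independent of how the model is loaded, so the same function works with a locally cached pipeline.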
+
+ ## Model description
+
+ ### Fine-tuning hyperparameters
+
+ Fine-tuning was performed with `simpletransformers`. A brief hyperparameter optimization was performed beforehand; the presumed optimal hyperparameters are:
+
+ ```python
+ model_args = {
+     "num_train_epochs": 15,
+     "learning_rate": 1e-5,
+     "max_seq_length": 512,
+ }
+ ```
+
+ ## Intended use and limitations
+
+ ## Usage
+
+ ### Use examples
+
+ ```python
+ from simpletransformers.classification import ClassificationModel
+
+ model_args = {
+     "num_train_epochs": 15,
+     "learning_rate": 1e-5,
+     "max_seq_length": 512,
+ }
+
+ # Load the fine-tuned model from the Hugging Face Hub
+ # (set use_cuda=False on machines without a GPU).
+ model = ClassificationModel(
+     "xlmroberta",
+     "TajaKuzman/xlm-roberta-base-multilingual-text-genres",
+     use_cuda=True,
+     args=model_args,
+ )
+
+ # predict() returns the predicted class indices and the raw model outputs (logits).
+ predictions, logit_output = model.predict([
+     "How to create a good text classification model? First step is to prepare good data. Make sure not to skip the exploratory data analysis. Pre-process the text if necessary for the task. The next step is to perform hyperparameter search to find the optimum hyperparameters. After fine-tuning the model, you should look into the predictions and analyze the model's performance. You might want to perform the post-processing of data as well and keep only reliable predictions.",
+     "On our site, you can find a great genre identification model which you can use for thousands of different tasks. With our model, you can fastly and reliably obtain high-quality genre predictions and explore which genres exist in your corpora. Available for free!",
+ ])
+ predictions
+ # Output:
+ # array([1, 0])
+ ```
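
The `logit_output` returned by `predict` holds raw scores rather than probabilities. The sketch below shows one possible way to turn such scores into confidences with a softmax and to keep only predictions above a threshold. The function names, the example logits and the 0.9 threshold are all hypothetical and chosen purely for illustration; they do not come from the fine-tuning experiments.

```python
import math
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confident_predictions(
    logit_rows: Sequence[Sequence[float]], threshold: float = 0.9
) -> List[Optional[int]]:
    """Return the predicted class index per row, or None when confidence is below threshold."""
    results: List[Optional[int]] = []
    for row in logit_rows:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append(best if probs[best] >= threshold else None)
    return results


# Hypothetical logits for two texts and three classes (not real model output).
example_logits = [[0.2, 4.1, -1.0], [1.0, 0.9, 0.8]]
print(confident_predictions(example_logits, threshold=0.9))  # prints [1, None]
```

The second row is rejected because its three scores are nearly equal, so no class reaches the threshold. In practice the same function could be called as `confident_predictions(logit_output)`.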
+
+ ## Performance
+
+ ## Citation
+
+ If you use the model, please cite the GitHub repository where the fine-tuning experiments are explained:
+
+ ```
+ @misc{Kuzman2022,
+   author = {Kuzman, Taja},
+   title = {{Comparison of genre datasets: CORE, GINCO and FTD}},
+   year = {2022},
+   publisher = {GitHub},
+   journal = {GitHub repository},
+   howpublished = {\url{https://github.com/TajaKuzman/Genre-Datasets-Comparison}}
+ }
+ ```
+
+ and the following paper, on which the original model is based:
+
+ ```
+ @article{DBLP:journals/corr/abs-1911-02116,
+   author = {Alexis Conneau and
+             Kartikay Khandelwal and
+             Naman Goyal and
+             Vishrav Chaudhary and
+             Guillaume Wenzek and
+             Francisco Guzm{\'{a}}n and
+             Edouard Grave and
+             Myle Ott and
+             Luke Zettlemoyer and
+             Veselin Stoyanov},
+   title = {Unsupervised Cross-lingual Representation Learning at Scale},
+   journal = {CoRR},
+   volume = {abs/1911.02116},
+   year = {2019},
+   url = {http://arxiv.org/abs/1911.02116},
+   eprinttype = {arXiv},
+   eprint = {1911.02116},
+   timestamp = {Mon, 11 Nov 2019 18:38:09 +0100},
+   biburl = {https://dblp.org/rec/journals/corr/abs-1911-02116.bib},
+   bibsource = {dblp computer science bibliography, https://dblp.org}
+ }
+ ```
+
+ To cite the datasets that were used for fine-tuning:
+
+ CORE dataset:
+
+ ```
+ @article{egbert2015developing,
+   title={Developing a bottom-up, user-based method of web register classification},
+   author={Egbert, Jesse and Biber, Douglas and Davies, Mark},
+   journal={Journal of the Association for Information Science and Technology},
+   volume={66},
+   number={9},
+   pages={1817--1831},
+   year={2015},
+   publisher={Wiley Online Library}
+ }
+ ```
+
+ GINCO dataset:
+
+ ```
+ @InProceedings{kuzman-rupnik-ljubei:2022:LREC,
+   author = {Kuzman, Taja and Rupnik, Peter and Ljube{\v{s}}i{\'c}, Nikola},
+   title = {{The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild}},
+   booktitle = {Proceedings of the Language Resources and Evaluation Conference},
+   month = {},
+   year = {2022},
+   address = {Marseille, France},
+   publisher = {European Language Resources Association},
+   pages = {1584--1594},
+   url = {https://aclanthology.org/2022.lrec-1.170}
+ }
+ ```
+
+ FTD dataset:
+
+ ```
+ @article{sharoff2018functional,
+   title={Functional text dimensions for the annotation of web corpora},
+   author={Sharoff, Serge},
+   journal={Corpora},
+   volume={13},
+   number={1},
+   pages={65--95},
+   year={2018},
+   publisher={Edinburgh University Press}
+ }
+ ```
+
+ The datasets are available at:
+
+ 1. http://hdl.handle.net/11356/1467 (GINCO)
+ 2. https://github.com/TurkuNLP/CORE-corpus (CORE)
+ 3. https://github.com/ssharoff/genre-keras (FTD)