---
license: apache-2.0
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
tags:
- text-classification
- register
- web-register
- genre
---
# Web register classification (multilingual model)

A multilingual web register classifier, fine-tuned from [XLM-RoBERTa-large](https://huggingface.co/FacebookAI/xlm-roberta-large). 
The model is trained on the multilingual CORE corpora, which cover five languages (English, Finnish, French, Swedish, Turkish), to classify documents according to the [CORE taxonomy](https://turkunlp.org/register-annotation-docs/). 
It can predict labels for the 100 languages covered by XLM-RoBERTa-large. The model achieves state-of-the-art performance in classifying web registers for the five training languages (micro-averaged F1 of 0.72 to 0.81 over all labels) and transfers zero-shot to other languages with micro-averaged F1 of 0.53 to 0.81 (see Evaluation below).
It is designed to support the development of open language models and to help linguists analyze register variation.

## Model Details

### Model Description

- **Developed by:** TurkuNLP
- **Funded by:** The Research Council of Finland, Emil Aaltonen Foundation, University of Turku
- **Shared by:** TurkuNLP
- **Model type:** Language model
- **Language(s) (NLP):** English, Finnish, French, Swedish, Turkish
- **License:** apache-2.0
- **Finetuned from model:** FacebookAI/xlm-roberta-large

### Model Sources

- **Repository:** https://github.com/TurkuNLP/pytorch-registerlabeling
- **Paper:** Coming soon!

## Register labels and their abbreviations

Below is a list of the register labels predicted by the model. Note that some labels are hierarchical; when a sublabel is predicted, its parent label is also predicted. 
For a more detailed description of the label scheme, see the [register annotation documentation](https://turkunlp.org/register-annotation-docs/).

The main labels are uppercase. To keep only the main labels in the predictions, filter the model's output down to the labels whose names are uppercase.

- **MT:** Machine translated or generated
- **LY:** Lyrical
- **SP:** Spoken
    - **it:** Interview
- **ID:** Interactive discussion
- **NA:** Narrative
    - **ne:** News report
    - **sr:** Sports report
    - **nb:** Narrative blog
- **HI:** How-to or instructions
    - **re:** Recipe
- **IN:** Informational description
    - **en:** Encyclopedia article
    - **ra:** Research article
    - **dtp:** Description of a thing or person
    - **fi:** Frequently asked questions
    - **lt:** Legal terms and conditions
- **OP:** Opinion
    - **rv:** Review
    - **ob:** Opinion blog
    - **rs:** Denominational religious blog or sermon
    - **av:** Advice
- **IP:** Informational persuasion
    - **ds:** Description with intent to sell
    - **ed:** News & opinion blog or editorial
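
The main-label filtering described above can be sketched as follows. This is a minimal illustration, not part of the released code: the helper name `keep_main_labels` is hypothetical, and it assumes the predicted labels are plain strings as returned by `model.config.id2label`.

```python
from typing import List


def keep_main_labels(predicted_labels: List[str]) -> List[str]:
    """Keep only main (uppercase) register labels, dropping sublabels.

    Hypothetical helper: relies on the convention that main labels
    such as "NA" are uppercase and sublabels such as "ne" are lowercase.
    """
    return [label for label in predicted_labels if label.isupper()]


# Example: a news report is predicted as Narrative (NA) plus its sublabel (ne)
print(keep_main_labels(["NA", "ne", "IN", "dtp"]))
```

Because a predicted sublabel is accompanied by its parent label, dropping the lowercase labels leaves the main-level prediction for the document.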

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_id = "TurkuNLP/multilingual-web-register-classification"

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Text to be categorized
text = "A text to be categorized"

# Tokenize text
inputs = tokenizer([text], return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)

with torch.no_grad():
    outputs = model(**inputs)

# Apply sigmoid to the logits to get probabilities
probabilities = torch.sigmoid(outputs.logits).squeeze()

# Determine a threshold for predicting labels
threshold = 0.5
predicted_label_indices = (probabilities > threshold).nonzero(as_tuple=True)[0]

# Extract readable labels using id2label
id2label = model.config.id2label
predicted_labels = [id2label[idx.item()] for idx in predicted_label_indices]

print("Predicted labels:", predicted_labels)

```

## Training Details

### Training Data

The model was trained using the Multilingual CORE Corpora, which will be published soon.

### Training Procedure

#### Training Hyperparameters

- **Batch size:** 8
- **Epochs:** 21
- **Learning rate:** 0.00005
- **Precision:** bfloat16 (non-mixed precision)
- **TF32:** Enabled
- **Seed:** 42
- **Max sequence length:** 512 tokens

#### Inference time

Average inference time (across 1000 iterations), using a single NVIDIA A100 GPU and a batch size of one, is **17 ms** per example. With larger batches, inference can be considerably faster.
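
One way to run inference in larger batches is sketched below. The helper names `chunked` and `predict_batched` and the default batch size of 16 are assumptions made for illustration; the tokenizer settings mirror the getting-started snippet, and `model`, `tokenizer` and `device` are expected to be loaded as shown there. No speed-up figure is implied beyond what is stated above.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def predict_batched(texts, model, tokenizer, device, batch_size=16, threshold=0.5):
    """Return one list of predicted labels per input text (hypothetical helper)."""
    import torch  # imported here so `chunked` can be used without torch installed

    id2label = model.config.id2label
    results = []
    for batch in chunked(texts, batch_size):
        # Pad to the longest text in the batch, truncate to 512 tokens
        inputs = tokenizer(
            batch, return_tensors="pt", padding=True, truncation=True, max_length=512
        ).to(device)
        with torch.no_grad():
            logits = model(**inputs).logits
        probabilities = torch.sigmoid(logits)  # shape: (batch, num_labels)
        for row in probabilities:
            indices = (row > threshold).nonzero(as_tuple=True)[0]
            results.append([id2label[i.item()] for i in indices])
    return results
```

Calling `predict_batched(["first text", "second text"], model, tokenizer, device)` would return two label lists, one per text, in input order.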

## Evaluation

Micro-averaged F1 scores and optimized prediction thresholds for the five training languages (test set):

| Language | F1 (All labels) | F1 (Main labels) | Threshold |
| -------- | --------------- | ---------------- | ----------|
| English  | 0.72            | 0.75             | 0.40      |
| Finnish  | 0.79            | 0.82             | 0.45      | 
| French   | 0.75            | 0.78             | 0.45      | 
| Swedish  | 0.81            | 0.82             | 0.45      | 
| Turkish  | 0.77            | 0.78             | 0.45      | 
 
Micro-averaged F1 scores and optimized prediction thresholds for additional languages (zero-shot):


| Language   | F1 (All labels) | F1 (Main labels) | Threshold |
| ---------- | --------------- | ---------------- | ----------|
| Arabic     | 0.63            | 0.66             | 0.40      |
| Catalan    | 0.62            | 0.63             | 0.50      | 
| Spanish    | 0.62            | 0.67             | 0.65      | 
| Persian    | 0.71            | 0.70             | 0.35      | 
| Hindi      | 0.77            | 0.78             | 0.40      | 
| Indonesian | 0.60            | 0.61             | 0.30      | 
| Japanese   | 0.53            | 0.64             | 0.35      | 
| Norwegian  | 0.65            | 0.70             | 0.65      |
| Portuguese | 0.67            | 0.68             | 0.40      | 
| Urdu       | 0.81            | 0.83             | 0.35      | 
| Chinese    | 0.67            | 0.70             | 0.40      | 
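
The optimized thresholds in the two tables above could be applied at prediction time along these lines. This is a sketch under stated assumptions: the dictionary keys are ISO 639-1 codes chosen here for illustration, the function names are hypothetical, and the fallback of 0.5 for languages without a reported threshold is simply the value used in the getting-started snippet.

```python
from typing import Dict, List, Sequence

# Threshold values copied from the evaluation tables; language codes are assumed.
OPTIMIZED_THRESHOLDS: Dict[str, float] = {
    "en": 0.40, "fi": 0.45, "fr": 0.45, "sv": 0.45, "tr": 0.45,  # training languages
    "ar": 0.40, "ca": 0.50, "es": 0.65, "fa": 0.35, "hi": 0.40,  # zero-shot languages
    "id": 0.30, "ja": 0.35, "no": 0.65, "pt": 0.40, "ur": 0.35, "zh": 0.40,
}
DEFAULT_THRESHOLD = 0.5  # assumed fallback for languages not evaluated above


def threshold_for(language_code: str) -> float:
    """Look up the optimized threshold for a language, or fall back to the default."""
    return OPTIMIZED_THRESHOLDS.get(language_code, DEFAULT_THRESHOLD)


def labels_above_threshold(
    probabilities: Sequence[float], id2label: Dict[int, str], language_code: str
) -> List[str]:
    """Return the labels whose probability exceeds the language's threshold."""
    threshold = threshold_for(language_code)
    return [id2label[i] for i, p in enumerate(probabilities) if p > threshold]
```

With the getting-started snippet, `probabilities.tolist()` and `model.config.id2label` can be passed in directly, for example `labels_above_threshold(probabilities.tolist(), model.config.id2label, "fi")`.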

## Technical Specifications

### Compute Infrastructure

- Mahti supercomputer (CSC - IT Center for Science, Finland)
- 1 x NVIDIA A100-SXM4-40GB

#### Software

- torch 2.2.1 
- transformers 4.39.3

## Citation

The citation for this work will be available soon. In the meantime, please cite the earlier related work below:

```bibtex
@article{Laippala.etal2022,
  title = {Register Identification from the Unrestricted Open {{Web}} Using the {{Corpus}} of {{Online Registers}} of {{English}}},
  author = {Laippala, Veronika and R{\"o}nnqvist, Samuel and Oinonen, Miika and Kyr{\"o}l{\"a}inen, Aki-Juhani and Salmela, Anna and Biber, Douglas and Egbert, Jesse and Pyysalo, Sampo},
  year = {2022},
  journal = {Language Resources and Evaluation},
  issn = {1574-0218},
  doi = {10.1007/s10579-022-09624-1},
  url = {https://doi.org/10.1007/s10579-022-09624-1},
}

@article{Skantsi_Laippala_2023,
  title = {Analyzing the unrestricted web: The {Finnish} corpus of online registers},
  doi = {10.1017/S0332586523000021},
  journal = {Nordic Journal of Linguistics},
  author = {Skantsi, Valtteri and Laippala, Veronika},
  year = {2023},
  pages = {1--31}
}
```

## Model Card Contact

Erik Henriksson, Hugging Face username: erikhenriksson