---
license: cc-by-sa-4.0
datasets:
- cjvt/cc_gigafida
language:
- sl
tags:
- word case classification
---

# sloberta-word-case-classification-multilabel

SloBERTa model fine-tuned on the Gigafida corpus for word case classification.  

The input to the model is expected to be **fully lowercased text**.
The model classifies whether each input word should stay lowercased, have its first letter uppercased, or be all-uppercased. In addition, each predicted label encodes one of a fixed set of reasons for the case decision.
See usage example below for more details.  

## Usage example
Imagine we have the following Slovenian text, in which the asterisked words have incorrect casing.
```
Linus *torvalds* je *Finski* programer, Poznan kot izumitelj operacijskega sistema Linux.
(EN: Linus Torvalds is a Finnish programmer, known as the inventor of the Linux operating system)
```

The model expects an all-lowercased input, so we pass it the following text:
```
linus torvalds je finski programer, poznan kot izumitelj operacijskega sistema linux.
```

The model might return the following predictions (note: these predictions were chosen for demonstration purposes and are not necessarily reproducible):  
```
linus -> UPPER_ENTITY, UPPER_BEGIN
torvalds -> UPPER_ENTITY
je -> LOWER_OTHER
finski -> LOWER_ADJ_SKI
programer -> LOWER_OTHER
, -> LOWER_OTHER
poznan -> LOWER_HYPERCORRECTION
kot -> LOWER_OTHER
izumitelj -> LOWER_OTHER
operacijskega -> LOWER_OTHER
sistema -> LOWER_OTHER
linux -> UPPER_ENTITY
```

Then we would compare the coarse part of the predictions (i.e., LOWER/UPPER/UPPER_ALLUC) with the initial casing and observe the following:
- `Torvalds` is originally lowercased, but the model corrects it to uppercase (because it is part of an entity),
- `finski` is originally uppercased, but the model corrects it to lowercase (because it is an adjective with the suffix -ski),
- `poznan` is originally uppercased, but the model corrects it to lowercase (the model assumes the user made the mistake through hypercorrection, i.e., they naïvely uppercased the word after a character that resembles sentence-final punctuation).

The other predictions agree with the word case in the initial text, so they are assumed to be correct.
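The comparison above can be sketched in a few lines of Python. The helper functions and the hard-coded predictions below are illustrative only (they mirror the example, not the model's actual API):

```python
def coarse_case(labels):
    """Reduce a set of fine-grained labels to LOWER / UPPER / UPPER_ALLUC."""
    if any(l.startswith("UPPER_ALLUC") for l in labels):
        return "UPPER_ALLUC"
    if any(l.startswith("UPPER") for l in labels):
        return "UPPER"
    return "LOWER"

def apply_case(word, case):
    """Rewrite a word according to the coarse case decision."""
    return {"LOWER": word.lower(),
            "UPPER": word.lower().capitalize(),
            "UPPER_ALLUC": word.upper()}[case]

# The original (possibly miscased) words and the predictions from the example above.
original = ["Linus", "torvalds", "je", "Finski", "programer", ",", "Poznan",
            "kot", "izumitelj", "operacijskega", "sistema", "linux"]
predicted = [["UPPER_ENTITY", "UPPER_BEGIN"], ["UPPER_ENTITY"], ["LOWER_OTHER"],
             ["LOWER_ADJ_SKI"], ["LOWER_OTHER"], ["LOWER_OTHER"],
             ["LOWER_HYPERCORRECTION"], ["LOWER_OTHER"], ["LOWER_OTHER"],
             ["LOWER_OTHER"], ["LOWER_OTHER"], ["UPPER_ENTITY"]]

corrected = [apply_case(w, coarse_case(ls)) for w, ls in zip(original, predicted)]
changed = [w for w, c in zip(original, corrected) if w != c]
print(" ".join(corrected))
print(changed)  # the four words whose casing was corrected
```

Running this yields the corrected sentence, with `torvalds`, `Finski`, `Poznan`, and `linux` flagged as the words whose casing changed.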


## More details
More concretely, the model is a 12-class multi-label classifier with the following class indices and interpretations:  
```
0: "LOWER_OTHER",  # lowercased for an uncaptured reason
1: "LOWER_HYPERCORRECTION",  # lowercased due to hypercorrection (e.g., the user automatically uppercased a word after a "." that does not end a sentence - the word should instead be lowercased)
2: "LOWER_ADJ_SKI",  # lowercased because the word is an adjective ending in the suffix -ski
3: "LOWER_ENTITY_PART",  # lowercased word that is part of an entity (e.g., "Novo **mesto**")
4: "UPPER_OTHER",  # uppercased for an uncaptured reason
5: "UPPER_BEGIN",  # uppercased because the word begins a sentence
6: "UPPER_ENTITY",  # uppercased word that is part of an entity
7: "UPPER_DIRECT_SPEECH",  # uppercased word due to direct speech
8: "UPPER_ADJ_OTHER",  # uppercased adjective for an uncaptured reason (usually a possessive adjective)
9: "UPPER_ALLUC_OTHER",  # all-uppercased for an uncaptured reason
10: "UPPER_ALLUC_BEGIN",  # all-uppercased because the word begins a sentence
11: "UPPER_ALLUC_ENTITY"  # all-uppercased because the word is part of an entity
```

As the model is trained for multi-label classification, a word can be assigned every label whose probability exceeds a threshold T. Naïvely, T=0.5 can be used, but slightly better results are obtained with per-label thresholds optimized on a small validation set. These are stored in the file `label_thresholds.json` and listed below, along with the validation-set F1 achieved at the best threshold.  

```
LOWER_OTHER: T=0.4500 -> F1 =  0.9965
LOWER_HYPERCORRECTION: T=0.5800 -> F1 =  0.8555
LOWER_ADJ_SKI: T=0.4810 -> F1 =  0.9863
LOWER_ENTITY_PART: T=0.4330 -> F1 =  0.8024
UPPER_OTHER: T=0.4460 -> F1 =  0.7538
UPPER_BEGIN: T=0.4690 -> F1 =  0.9905
UPPER_ENTITY: T=0.5030 -> F1 =  0.9670
UPPER_DIRECT_SPEECH: T=0.4170 -> F1 =  0.9852
UPPER_ADJ_OTHER: T=0.5080 -> F1 =  0.9431
UPPER_ALLUC_OTHER: T=0.4850 -> F1 =  0.8463
UPPER_ALLUC_BEGIN: T=0.5170 -> F1 =  0.9798
UPPER_ALLUC_ENTITY: T=0.4490 -> F1 =  0.9391
```
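A minimal sketch of the thresholded multi-label decoding, assuming raw per-class logits for a single word (the logit values below are made up for illustration and do not come from the model):

```python
import math

# Class indices and labels as defined above.
ID2LABEL = {0: "LOWER_OTHER", 1: "LOWER_HYPERCORRECTION", 2: "LOWER_ADJ_SKI",
            3: "LOWER_ENTITY_PART", 4: "UPPER_OTHER", 5: "UPPER_BEGIN",
            6: "UPPER_ENTITY", 7: "UPPER_DIRECT_SPEECH", 8: "UPPER_ADJ_OTHER",
            9: "UPPER_ALLUC_OTHER", 10: "UPPER_ALLUC_BEGIN", 11: "UPPER_ALLUC_ENTITY"}

# Per-label thresholds from the table above, in class-index order
# (in practice these would be read from label_thresholds.json).
THRESHOLDS = [0.450, 0.580, 0.481, 0.433, 0.446, 0.469,
              0.503, 0.417, 0.508, 0.485, 0.517, 0.449]

def decode(logits):
    """Apply an independent sigmoid per class and keep labels above their threshold."""
    probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    return [ID2LABEL[i] for i, p in enumerate(probs) if p > THRESHOLDS[i]]

# Made-up logits for one word: a high score for UPPER_ENTITY (index 6) only.
logits = [-4.0] * 6 + [3.0] + [-4.0] * 5
print(decode(logits))  # -> ['UPPER_ENTITY']
```

Because each class is thresholded independently, a word with several high logits receives several labels (as `linus` does in the usage example above).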