---
language:
- en
- fr
- it
- pt
tags:
- formal or informal classification
licenses:
- cc-by-nc-sa
license: openrail++
base_model:
- FacebookAI/xlm-roberta-base
---

## Model Overview

This is the model presented in the paper ["Detecting Text Formality: A Study of Text Classification Approaches"](https://aclanthology.org/2023.ranlp-1.31/).

It is an XLM-RoBERTa-based classifier trained on [XFORMAL](https://aclanthology.org/2021.naacl-main.256.bib), a multilingual formality classification dataset.


## Results

All languages

|              | precision | recall   | f1-score | support |
|--------------|-----------|----------|----------|---------|
| 0 (formal)   | 0.744912  | 0.927790 | 0.826354 | 108019  |
| 1 (informal) | 0.889088  | 0.645630 | 0.748048 | 96845   |
| accuracy     |           |          | 0.794405 | 204864  |
| macro avg    | 0.817000  | 0.786710 | 0.787201 | 204864  |
| weighted avg | 0.813068  | 0.794405 | 0.789337 | 204864  |


EN

|              | precision | recall   | f1-score | support |
|--------------|-----------|----------|----------|---------|
| 0 (formal)   | 0.800053  | 0.962981 | 0.873988 | 22151   |
| 1 (informal) | 0.945106  | 0.725899 | 0.821124 | 19449   |
| accuracy     |           |          | 0.852139 | 41600   |
| macro avg    | 0.872579  | 0.844440 | 0.847556 | 41600   |
| weighted avg | 0.867869  | 0.852139 | 0.849273 | 41600   |

FR

|              | precision | recall   | f1-score | support |
|--------------|-----------|----------|----------|---------|
| 0 (formal)   | 0.746709  | 0.925738 | 0.826641 | 21505   |
| 1 (informal) | 0.887305  | 0.650592 | 0.750731 | 19327   |
| accuracy     |           |          | 0.795504 | 40832   |
| macro avg    | 0.817007  | 0.788165 | 0.788686 | 40832   |
| weighted avg | 0.813257  | 0.795504 | 0.790711 | 40832   |

IT

|              | precision | recall   | f1-score | support |
|--------------|-----------|----------|----------|---------|
| 0 (formal)   | 0.721282  | 0.914669 | 0.806545 | 21528   |
| 1 (informal) | 0.864887  | 0.607135 | 0.713445 | 19368   |
| accuracy     |           |          | 0.769024 | 40896   |
| macro avg    | 0.793084  | 0.760902 | 0.759995 | 40896   |
| weighted avg | 0.789292  | 0.769024 | 0.762454 | 40896   |

PT

|              | precision | recall   | f1-score | support |
|--------------|-----------|----------|----------|---------|
| 0 (formal)   | 0.717546  | 0.908167 | 0.801681 | 21637   |
| 1 (informal) | 0.853628  | 0.599700 | 0.704481 | 19323   |
| accuracy     |           |          | 0.762646 | 40960   |
| macro avg    | 0.785587  | 0.753933 | 0.753081 | 40960   |
| weighted avg | 0.781743  | 0.762646 | 0.755826 | 40960   |
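
The tables above follow the layout of scikit-learn's `classification_report`. As a rough sketch of how such a report can be reproduced (the gold labels and predictions below are placeholders; collecting them from an evaluation split such as the XFORMAL test set is up to you):

```python
# Minimal sketch: build a per-class precision/recall/F1 report with scikit-learn.
# Assumption: gold_labels and predicted_labels are lists of 0 (formal) / 1 (informal)
# gathered from your own evaluation split; the values below are placeholders only.
from sklearn.metrics import classification_report

gold_labels = [0, 0, 1, 1, 0, 1]       # placeholder gold annotations
predicted_labels = [0, 1, 1, 1, 0, 1]  # placeholder model predictions

print(
    classification_report(
        gold_labels,
        predicted_labels,
        target_names=["formal", "informal"],
        digits=6,
    )
)
```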


## How to use
```python
from transformers import XLMRobertaTokenizerFast, XLMRobertaForSequenceClassification

# load tokenizer and model weights
tokenizer = XLMRobertaTokenizerFast.from_pretrained('s-nlp/xlmr_formality_classifier')
model = XLMRobertaForSequenceClassification.from_pretrained('s-nlp/xlmr_formality_classifier')

id2formality = {0: "formal", 1: "informal"}
texts = [
    "I like you. I love you",
    "Hey, what's up?",
    "Siema, co porabiasz?",
    "I feel deep regret and sadness about the situation in international politics.",
]

# prepare the input
encoding = tokenizer(
    texts,
    add_special_tokens=True,
    return_token_type_ids=True,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)

# inference
output = model(**encoding)

formality_scores = [
    {id2formality[idx]: score for idx, score in enumerate(text_scores.tolist())}
    for text_scores in output.logits.softmax(dim=1)
]
formality_scores
```

```
[{'formal': 0.993225634098053, 'informal': 0.006774314679205418},
 {'formal': 0.8807966113090515, 'informal': 0.1192033663392067},
 {'formal': 0.936184287071228, 'informal': 0.06381577253341675},
 {'formal': 0.9986615180969238, 'informal': 0.0013385231141000986}]
```
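
For longer lists of texts, it can help to batch the inputs, pad dynamically, and disable gradient tracking during inference. A minimal sketch building on the snippet above (the helper name `classify_formality` is hypothetical, not part of `transformers`):

```python
import torch


def classify_formality(texts, tokenizer, model, batch_size=16):
    """Return the most likely formality label for each text (sketch)."""
    id2formality = {0: "formal", 1: "informal"}
    labels = []
    model.eval()
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        encoding = tokenizer(
            batch,
            truncation=True,
            padding=True,  # pad only to the longest text in the batch
            return_tensors="pt",
        )
        with torch.no_grad():  # no gradients needed for inference
            logits = model(**encoding).logits
        labels.extend(id2formality[i] for i in logits.argmax(dim=-1).tolist())
    return labels


print(classify_formality(["How do you do?", "yo what's up"], tokenizer, model))
```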

## Citation

```
@inproceedings{dementieva-etal-2023-detecting,
    title = "Detecting Text Formality: A Study of Text Classification Approaches",
    author = "Dementieva, Daryna  and
      Babakov, Nikolay  and
      Panchenko, Alexander",
    editor = "Mitkov, Ruslan  and
      Angelova, Galia",
    booktitle = "Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing",
    month = sep,
    year = "2023",
    address = "Varna, Bulgaria",
    publisher = "INCOMA Ltd., Shoumen, Bulgaria",
    url = "https://aclanthology.org/2023.ranlp-1.31",
    pages = "274--284",
    abstract = "Formality is one of the important characteristics of text documents. The automatic detection of the formality level of a text is potentially beneficial for various natural language processing tasks. Before, two large-scale datasets were introduced for multiple languages featuring formality annotation{---}GYAFC and X-FORMAL. However, they were primarily used for the training of style transfer models. At the same time, the detection of text formality on its own may also be a useful application. This work proposes the first to our knowledge systematic study of formality detection methods based on statistical, neural-based, and Transformer-based machine learning methods and delivers the best-performing models for public usage. We conducted three types of experiments {--} monolingual, multilingual, and cross-lingual. The study shows the overcome of Char BiLSTM model over Transformer-based ones for the monolingual and multilingual formality classification task, while Transformer-based classifiers are more stable to cross-lingual knowledge transfer.",
}
```


## Licensing Information

This model is licensed under the OpenRAIL++ License, which supports the development of various technologies—both industrial and academic—that serve the public good.