---
license: apache-2.0
language:
- bs
- hr
- sr
- sl
- sk
- cs
- en
tags:
- sentiment-analysis
- text-regression
- text-classification
- sentiment-regression
- sentiment-classification
- parliament
inference: false
datasets:
- classla/ParlaSent
---


# Multilingual parliament sentiment regression model XLM-R-ParlaSent

This model is based on [xlm-r-parla](https://huggingface.co/classla/xlm-r-parla), an XLM-R-large model additionally pre-trained on parliamentary proceedings. It was fine-tuned on the [ParlaSent dataset](http://hdl.handle.net/11356/1868), a manually annotated selection of sentences from the parliamentary proceedings of Bosnia and Herzegovina, Croatia, Czechia, Serbia, Slovakia, Slovenia, and the United Kingdom.

Both the additionally pre-trained model and [the training dataset](http://hdl.handle.net/11356/1868) are results of the [ParlaMint project](https://www.clarin.eu/parlamint). Details on the models and the dataset are given in our [paper](https://arxiv.org/abs/2309.09783):

```latex
@article{
 Mochtak_Rupnik_Ljubešić_2023,
 title={The ParlaSent multilingual training dataset for sentiment identification in parliamentary proceedings},
 rights={All rights reserved},
 url={http://arxiv.org/abs/2309.09783},
 abstractNote={Sentiments inherently drive politics. How we receive and process information plays an essential role in political decision-making, shaping our judgment with strategic consequences both on the level of legislators and the masses. If sentiment plays such an important role in politics, how can we study and measure it systematically? The paper presents a new dataset of sentiment-annotated sentences, which are used in a series of experiments focused on training a robust sentiment classifier for parliamentary proceedings. The paper also introduces the first domain-specific LLM for political science applications additionally pre-trained on 1.72 billion domain-specific words from proceedings of 27 European parliaments. We present experiments demonstrating how the additional pre-training of LLM on parliamentary data can significantly improve the model downstream performance on the domain-specific tasks, in our case, sentiment detection in parliamentary proceedings. We further show that multilingual models perform very well on unseen languages and that additional data from other languages significantly improves the target parliament’s results. The paper makes an important contribution to multiple domains of social sciences and bridges them with computer science and computational linguistics. Lastly, it sets up a more robust approach to sentiment analysis of political texts in general, which allows scholars to study political sentiment from a comparative perspective using standardized tools and techniques.},
 note={arXiv:2309.09783 [cs]},
 number={arXiv:2309.09783},
 publisher={arXiv},
 author={Mochtak, Michal and Rupnik, Peter and Ljubešić, Nikola},
 year={2023},
 month={Sep},
 language={en}
}
```
## Annotation schema

The discrete labels present in the [ParlaSent dataset](http://hdl.handle.net/11356/1868) were mapped to numeric values as follows:

```
  "Negative": 0.0,
  "M_Negative": 1.0,
  "N_Neutral": 2.0,
  "P_Neutral": 3.0,
  "M_Positive": 4.0,
  "Positive": 5.0,
```
The model was then fine-tuned on numeric labels and set up as a regressor.
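
As a minimal sketch of this mapping, the snippet below encodes discrete labels as regression targets. The dictionary mirrors the schema above; the names `LABEL_TO_SCORE` and `encode_labels` are illustrative assumptions and are not taken from the dataset or the training code.

```python
# Label-to-score mapping copied from the annotation schema above.
LABEL_TO_SCORE = {
    "Negative": 0.0,
    "M_Negative": 1.0,
    "N_Neutral": 2.0,
    "P_Neutral": 3.0,
    "M_Positive": 4.0,
    "Positive": 5.0,
}


def encode_labels(labels):
    """Convert discrete ParlaSent labels to numeric regression targets.

    Hypothetical helper for illustration; raises KeyError on unknown labels.
    """
    return [LABEL_TO_SCORE[label] for label in labels]


print(encode_labels(["Negative", "P_Neutral", "Positive"]))
# [0.0, 3.0, 5.0]
```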

## Finetuning procedure

The fine-tuning procedure is described in the paper cited above. The hyperparameters used, presumed to be optimal, are:
```
  num_train_epochs=4,
  train_batch_size=32,
  learning_rate=8e-6,
  regression=True
```

## Results

Results reported were obtained from 5 fine-tuning runs.

test dataset | R^2 | MAE
--- | --- | ---
BCS | 0.6146 ± 0.0104 | 0.7050 ± 0.0089
EN | 0.6722 ± 0.0100 | 0.6755 ± 0.0076
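
For readers unfamiliar with the two metrics, the following sketch computes R^2 (coefficient of determination) and MAE (mean absolute error) from their standard definitions. It is not the evaluation code used for the table above, and the toy values are made up for illustration only.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy gold scores and predictions on the 0-5 scale (illustrative values).
y_true = [0.0, 2.0, 4.0, 5.0]
y_pred = [0.5, 2.0, 3.5, 5.0]

print(mean_absolute_error(y_true, y_pred))  # 0.25
print(round(r_squared(y_true, y_pred), 4))  # 0.9661
```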

## Usage Example

With `simpletransformers==0.64.3`.
```python
from simpletransformers.classification import ClassificationModel, ClassificationArgs
import torch

# Configure the model as a regressor (a single continuous output).
model_args = ClassificationArgs(regression=True)
model = ClassificationModel(
    model_type="xlmroberta",
    model_name="classla/xlm-r-parlasent",
    use_cuda=torch.cuda.is_available(),
    num_labels=1,
    args=model_args,
)
model.predict([
    "I fully disagree with this argument.",
    "The ministers are entering the chamber.",
    "Things can always be improved in the future.",
    "These are great news.",
])
```

Output:
```python
(
  array([0.11633301, 3.63671875, 4.203125, 5.30859375]),
  array([0.11633301, 3.63671875, 4.203125, 5.30859375])
)
```
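
Because the model is a regressor, its predictions are continuous and can fall outside the 0 to 5 range of the annotation schema (the last value above is 5.30859375). If discrete labels are needed, one possible approach is to clip each score to the schema range and round it to the nearest label. This is an assumption made for illustration, not a procedure prescribed by the paper; `nearest_label` is a hypothetical helper.

```python
# Labels ordered by their numeric value in the annotation schema (0 to 5).
SCORE_TO_LABEL = [
    "Negative", "M_Negative", "N_Neutral",
    "P_Neutral", "M_Positive", "Positive",
]


def nearest_label(score):
    """Clip a regression score to [0, 5] and map it to the closest label."""
    clipped = min(max(score, 0.0), 5.0)
    # Round half up; int() truncation is safe because clipped >= 0.
    return SCORE_TO_LABEL[int(clipped + 0.5)]


predictions = [0.11633301, 3.63671875, 4.203125, 5.30859375]
print([nearest_label(p) for p in predictions])
# ['Negative', 'M_Positive', 'M_Positive', 'Positive']
```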
## Large scale use

[Bojan](https://huggingface.co/Bojan) tested the example above on a large dataset. He reports execution time can be improved by a factor of five with the use of `transformers` as follows:

```python
import torch
from transformers import (
    AutoConfig,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TextClassificationPipeline,
)

MODEL = "classla/xlm-r-parlasent"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
config = AutoConfig.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

# Use the first GPU when available; fall back to CPU (-1) otherwise.
# function_to_apply="none" returns the raw regression output.
pipe = TextClassificationPipeline(
    model=model, tokenizer=tokenizer, return_all_scores=True,
    task='sentiment_analysis', device=0 if torch.cuda.is_available() else -1,
    function_to_apply="none",
)
pipe([
    "I fully disagree with this argument.", 
    "The ministers are entering the chamber.",
    "Things can always be improved in the future.",
    "These are great news."
])
```
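
The pipeline output is nested rather than a flat array. The sketch below flattens it into one float per input, assuming that with `return_all_scores=True` and a single output label the pipeline returns one list per input containing a single `{"label": ..., "score": ...}` dictionary. The helper name `extract_scores` and the mock values are hypothetical; check the actual structure returned by your `transformers` version before relying on it.

```python
def extract_scores(pipeline_output):
    """Return one regression score per input sentence.

    Assumes each element is a list holding a single
    {"label": ..., "score": ...} dictionary.
    """
    return [entries[0]["score"] for entries in pipeline_output]


# Mock output with the assumed structure (values are placeholders).
mock_output = [
    [{"label": "LABEL_0", "score": 0.12}],
    [{"label": "LABEL_0", "score": 3.64}],
]
print(extract_scores(mock_output))
# [0.12, 3.64]
```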