MattiaSangermano committed on
Commit 2ccfb5d · 1 Parent(s): c6cb1ce

Updated README.md

Files changed (1)
  1. README.md +121 -0
README.md CHANGED
@@ -1,3 +1,124 @@
  ---
+ language:
+ - it
+ tags:
+ - twitter
+ - political-leaning
+ - politics
+ datasets:
+ - politic-it
+ widget:
+ - text: >-
+     È necessario garantire salari dignitosi e condizioni di lavoro adeguate per
+     tutelare i diritti dei lavoratori
+   example_title: Left-wing example
+ - text: >-
+     L'immigrazione deve essere gestita con rigore per preservare l'identità
+     nazionale!
+   example_title: Right-wing example
+ model-index:
+ - name: bert-political-leaning-it
+   results:
+   - task:
+       type: text-classification
+       name: Text Classification
+     dataset:
+       type: social media
+       name: politic-it
+     metrics:
+     - type: f1 macro
+       value: 61.3
+     - type: accuracy
+       value: 69.4
  license: apache-2.0
+ metrics:
+ - f1
+ - accuracy
+ pipeline_tag: text-classification
  ---
+
+ # MattiaSangermano/bert-political-leaning-it
+
+ This model classifies the political leaning of an Italian sentence into four categories: `moderate_left`, `left`, `right`, `moderate_right`. It is a fine-tuned version of [neuraly/bert-base-italian-cased-sentiment](https://huggingface.co/neuraly/bert-base-italian-cased-sentiment).
+
+ - **Developed by:** [Mattia Sangermano](https://www.linkedin.com/in/mattia-sangermano/) and [Fabio Murgese](https://www.linkedin.com/in/fabio-murgese/)
+ - **Model type:** BERT
+ - **Language(s) (NLP):** it
+ - **License:** Apache 2.0
+
+ ### How to Get Started with the Model
+
+ You can use this model directly with a pipeline for text classification:
+
+ ``` python
+ from transformers import pipeline
+
+ # Load the fine-tuned checkpoint from the Hugging Face Hub.
+ classifier = pipeline("text-classification", model="MattiaSangermano/bert-political-leaning-it")
+ # Returns a list with one dict holding the predicted label and its score.
+ prediction = classifier("Sovranità nazionale e identità forte")
+ print(prediction)
+ ```
+
+ Here is how to use this model to classify a text in PyTorch:
+
+ ``` python
+ import torch
+ from transformers import AutoTokenizer, BertForSequenceClassification
+
+ tokenizer = AutoTokenizer.from_pretrained("MattiaSangermano/bert-political-leaning-it")
+ model = BertForSequenceClassification.from_pretrained("MattiaSangermano/bert-political-leaning-it")
+ tokens = tokenizer("Uguaglianza e giustizia sociale", return_tensors="pt")
+ # Inference only, so gradient tracking is disabled.
+ with torch.no_grad():
+     logits = model(**tokens).logits
+ # Map the index of the highest logit to its label name.
+ prediction = model.config.id2label[torch.argmax(logits, dim=-1).item()]
+ print(prediction)
+ ```
+
+ and in TensorFlow:
+
+ ``` python
+ import tensorflow as tf
+ from transformers import AutoTokenizer, TFBertForSequenceClassification
+
+ tokenizer = AutoTokenizer.from_pretrained("MattiaSangermano/bert-political-leaning-it")
+ model = TFBertForSequenceClassification.from_pretrained("MattiaSangermano/bert-political-leaning-it")
+ tokens = tokenizer("Ambiente sano, futuro sicuro", padding=True, truncation=True, return_tensors="tf")
+ logits = model(tokens).logits
+ # Map the index of the highest logit to its label name.
+ prediction = model.config.id2label[int(tf.argmax(logits, axis=1)[0].numpy())]
+ print(prediction)
+ ```
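
Both snippets above stop at the argmax of the logits. If a confidence value per class is also wanted, the logits can be passed through a softmax. The sketch below uses only the standard library and made-up numbers: the `logits` values and the label order in `id2label` are hypothetical, and the real mapping should be read from `model.config.id2label`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical values: the real label order comes from model.config.id2label.
id2label = {0: "left", 1: "moderate_left", 2: "moderate_right", 3: "right"}
logits = [1.2, 0.3, -0.5, 2.1]

probs = softmax(logits)
# Index of the most probable class.
best = max(range(len(probs)), key=probs.__getitem__)
print(id2label[best], round(probs[best], 3))
```

The probabilities sum to one, so they can be compared across classes, but they should not be read as calibrated confidence without further checks.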
+
+
+ ### Out-of-Scope Use
+
+ Political leaning is a personal and complex aspect of an individual's identity, and attempting to classify it can be considered unethical and raises significant concerns. The model should therefore not be used to identify or classify the political orientation of individual users, nor for any other unethical purpose.
+
+ ## Bias, Risks, and Limitations
+ During the construction of the dataset, deliberate efforts were made to exclude the names of politicians and political parties. As a result, these names may carry little or no signal for the model.
+
+
+ ## Dataset
+
+ We trained the model on the [PoliticIT](https://codalab.lisn.upsaclay.fr/competitions/8507#learn_the_details) competition dataset. The dataset was collected during 2020 and 2022 from the Twitter accounts of Italian politicians. These users were selected because their political affiliation can be inferred from the party to which they belong. The goal of the competition task was to classify clusters of tweets, where a cluster is composed of texts written by different users who share the same self-assigned gender and political ideology.
+
+ ### Preprocessing
+
+ According to the PoliticIT maintainers, tweets that mention news sites or contain certain linguistic cues, such as the pipe symbol that news sites commonly use to categorise their news, were discarded from the dataset. Moreover, Twitter mentions were anonymised by replacing them with the token `@user`. Therefore, the traits of an author cannot be guessed trivially by reading a politician's name and searching for information about them on the Internet. Overall, the dataset consists of 103,840 tweets.
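
The filtering and anonymisation steps described above could look roughly like the following sketch. It is not the maintainers' code: the regular expression for mentions, the helper names `anonymise_mentions` and `keep_tweet`, and the use of the pipe symbol as the only news-site cue are assumptions made for illustration.

```python
import re

# Assumption: a Twitter mention is "@" followed by word characters.
MENTION_RE = re.compile(r"@\w+")


def anonymise_mentions(text: str) -> str:
    """Replace every Twitter mention with the placeholder token @user."""
    return MENTION_RE.sub("@user", text)


def keep_tweet(text: str) -> bool:
    """Hypothetical filter: drop tweets that contain the pipe symbol."""
    return "|" not in text


tweets = [
    "Grazie a @mario_rossi per il sostegno",
    "Economia | Le ultime notizie di oggi",
]
# Filter first, then anonymise the tweets that are kept.
cleaned = [anonymise_mentions(t) for t in tweets if keep_tweet(t)]
print(cleaned)
```

Applying the same replacement to inputs at inference time keeps them closer to the text the model saw during training.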
+
+ ### Training Procedure
+
+ The dataset was split into training and validation sets with a stratified 80-20 split. Although the main task of the original competition was to classify clusters of tweets, this model was trained to predict only the political leaning of individual tweets.
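
A stratified 80-20 split keeps the class proportions the same in both parts. The sketch below shows the idea with the standard library only. The function `stratified_split`, the seed, and the toy class counts are assumptions for illustration, since the splitting code actually used is not described here.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_ratio=0.2, seed=0):
    """Return (train_idx, val_idx) keeping class proportions in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for validation.
        n_val = round(len(indices) * val_ratio)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy labels using the four classes of this model (counts are made up).
labels = ["left"] * 50 + ["moderate_left"] * 30 + ["right"] * 15 + ["moderate_right"] * 5
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
```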
+
+ #### Training Hyperparameters
+
+ - *Optimizer*: **Adam** with learning rate of **4e-5**, epsilon of **1e-7**
+ - *Loss*: **Categorical Cross Entropy** using **balanced** class weights
+ - *Max epochs*: **10**
+ - *Batch size*: **64**
+ - *Early Stopping*: monitoring validation loss with patience = **3**
+ - *Training regime*: fp16 mixed precision
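
The list above mentions balanced class weights without giving a formula. A common definition, and the one assumed in this sketch, weights each class by `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn uses for `class_weight="balanced"`. The class counts below are made up and the helper `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


# Made-up class counts, only to show the effect of the weighting.
labels = ["left"] * 50 + ["moderate_left"] * 30 + ["right"] * 15 + ["moderate_right"] * 5
weights = balanced_class_weights(labels)
for name, w in weights.items():
    print(f"{name}: {w:.3f}")
```

Rare classes receive larger weights, so their errors contribute more to the loss than they would under uniform weighting.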
+
+ ## Evaluation
+ - test **f1-macro**: 61.3
+ - test **accuracy**: 69.4
+
+
+ | Avg Type | Precision | Recall | F1-score | Accuracy |
+ | -------- | --------- | ------ | -------- | -------- |
+ | Macro    | 0.67      | 0.61   | 0.61     | -        |
+ | Weighted | 0.74      | 0.69   | 0.77     | 0.69     |
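
The two rows of the table differ only in how per-class scores are averaged: macro averaging gives every class the same weight, while weighted averaging weights each class by its number of examples. The sketch below illustrates this for F1 with hypothetical per-class precision, recall, and support values. They are not this model's results.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-class (precision, recall, support); not the model's results.
per_class = {
    "left": (0.80, 0.90, 50),
    "moderate_left": (0.70, 0.60, 30),
    "right": (0.60, 0.50, 15),
    "moderate_right": (0.50, 0.20, 5),
}

scores = {c: f1(p, r) for c, (p, r, _) in per_class.items()}
total = sum(n for _, _, n in per_class.values())

# Macro: plain mean over classes. Weighted: mean weighted by support.
macro_f1 = sum(scores.values()) / len(scores)
weighted_f1 = sum(scores[c] * n for c, (_, _, n) in per_class.items()) / total
print(f"macro F1 = {macro_f1:.3f}, weighted F1 = {weighted_f1:.3f}")
```

When the frequent classes are also the easier ones, as in this toy example, the weighted score ends up above the macro score.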