---
title: Politweet
emoji: 📉
colorFrom: pink
colorTo: green
sdk: gradio
sdk_version: 3.0.26
app_file: app.py
pinned: false
license: mit
---

# Politweet
In this summer project at Softhouse, we developed a tool for analyzing Twitter posts by Swedish party leaders. 
The UI is a webpage that lets a user see a graphical representation of three predicted features of the party 
leaders' tweets: topics, sentiments (positive, negative or neutral), and targets. 

### Data Gathering
The tweets were gathered using the Twitter scraping tool [Twint](https://github.com/twintproject/twint), 
which describes itself as "an advanced Twitter scraping & OSINT tool written in Python that doesn't use 
Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most 
API limitations".
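
As an illustration, a minimal Twint scrape of one account could look like the sketch below. The handle, date 
range, and output file are hypothetical; the project's actual gathering scripts are not reproduced here.

```python
import twint

# Configure a scrape of one party leader's timeline (example values).
c = twint.Config()
c.Username = "annieloof"           # hypothetical Twitter handle
c.Since = "2021-06-28"             # start of the analyzed period
c.Until = "2022-08-10"             # end of the analyzed period
c.Store_csv = True                 # write the results to a CSV file
c.Output = "tweets_annieloof.csv"  # hypothetical output file name

twint.run.Search(c)
```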

### Predicting: Topics, Sentiments & Targets
The classification given by GPT-3 for every tweet contains:
- ```main_topic``` - a general topic which the tweet is about
- ```sub_topic``` - a more descriptive topic 
- ```sentiment``` - is the tweet positive, negative or neutral?
- ```target``` - who is the tweet targeting?

The predicted features were extracted using AI, more specifically the autoregressive language model GPT-3 
by [OpenAI](https://openai.com/api/) in combination with our prompt engineering. The final prompt is the 
result of experimentation, balanced against keeping the prompt short: since OpenAI charges users per 
"token" (a unit closely related to word count), lengthy prompts quickly become uneconomical when 
classifying several thousand tweets.
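
For orientation, a classification call with the `openai` Python package of that era could be sketched as 
below. The prompt text, engine name, and output parsing are illustrative assumptions, not the project's 
actual prompt.

```python
import os
import openai

openai.api_key = os.environ["OPENAI_AUTHTOKEN"]

def classify_tweet(tweet: str) -> dict:
    # Illustrative prompt; the project's real prompt is not reproduced here.
    prompt = (
        "Classify the tweet below.\n"
        f"Tweet: {tweet}\n"
        "Answer as: main_topic | sub_topic | sentiment | target\n"
    )
    response = openai.Completion.create(
        engine="text-davinci-002",  # a GPT-3 engine available in 2022
        prompt=prompt,
        max_tokens=50,
        temperature=0.0,
    )
    # Assumes the model answers with exactly four "|"-separated fields.
    main_topic, sub_topic, sentiment, target = (
        part.strip() for part in response.choices[0].text.split("|")
    )
    return {"main_topic": main_topic, "sub_topic": sub_topic,
            "sentiment": sentiment, "target": target}
```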

### Merging Topics & Targets
Since the output from GPT-3 varies a lot, e.g. topics can be similar but not identical, a method for 
clustering similar topics and targets was needed in order to be able to represent statistics of these 
classifications. Thus, the NN-algorithm was implemented using the cosine similarity as metric, after 
transforming topics and targets to dense vector representations with 
[Sentence Transformers](https://github.com/UKPLab/sentence-transformers). The similarities between the 
classification from GPT-3 and words from a predefined set of classes are then calculated, and the 
classification is changed to the predefined class that yielded the highest cosine similarity. It is worth 
noting that each predefined class has several "synonyms" or categorical words, and that the highest cosine 
similarity can be found between the classification and a word from that list. 

Example - GPT-3 classifies the topics as ```main_topic = "sport"``` and ```sub_topic = "soccer"```, which 
are combined into ```old_topic = "sport and soccer"```. This string gets its highest similarity against the 
word/synonym "soccer", which belongs to the predefined class ```"Civil society and sport"```, so the result 
is ```new_topic = "Civil society and sport"```.
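
The matching step above could be sketched as follows with the `sentence-transformers` API. The model name 
and the class/synonym table are illustrative assumptions, not the project's actual configuration.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical subset of the predefined classes and their synonym lists.
CLASSES = {
    "Civil society and sport": ["sport", "soccer", "association", "culture"],
    "Economy": ["economy", "taxes", "inflation", "budget"],
}

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

def merge_topic(old_topic: str) -> str:
    """Map a free-form GPT-3 topic to the predefined class whose
    synonym has the highest cosine similarity to it."""
    topic_emb = model.encode(old_topic, convert_to_tensor=True)
    best_class, best_sim = None, -1.0
    for cls, synonyms in CLASSES.items():
        syn_embs = model.encode(synonyms, convert_to_tensor=True)
        sim = util.cos_sim(topic_emb, syn_embs).max().item()
        if sim > best_sim:
            best_class, best_sim = cls, sim
    return best_class

print(merge_topic("sport and soccer"))  # -> "Civil society and sport"
```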


### Website User Manual 
1. Enter the time period of the tweets that you want to look into - The dates need to be in the format 
"YYYY-MM-DD", and between 2021-06-28 and 2022-08-10. It is preferable to choose a bigger span rather than a 
smaller one. Keep in mind that the number of tweets posted by the party leaders varies a lot 
(Annie Lööf: 1814 vs Jimmie Åkesson: 185 tweets in total). 
2. Select the party leader(s) you want to look into - At least one party leader has to be selected.
3. Select the classifications you want to see statistics of: topic, sentiment and/or target.
4. Apply - Always press this button after you check a new box or change the dates, to update the website.
5. Run - The pie charts and bar graphs should appear for your selected party leaders. Under the plots, a new 
panel appears which lets users see how a prediction was made, i.e. the classification from GPT-3 -> the 
merged classification.
6. To see examples of how the topic/sentiment/target was predicted, select a type of classification and 
check the box "show stats". To download the CSV file containing all tweets and classifications for the 
checked party leaders, check "Export file". After making these selections, press Apply and then Run.

### Data Frame Structure
Each row in the database has the following structure:
```id,tweet,date,user_id,username,urls,nlikes,nreplies,nretweets,class_tuple,main_topic,sub_topic,sentiment,target,merged_tuple,merged_topic,merged_target,cos_sim_topic,synonym_topic,cos_sim_target,synonym_target```. 
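
For instance, an exported CSV can be inspected with pandas as sketched below; the file name is hypothetical, 
but the column names follow the structure above.

```python
import pandas as pd

# Load an exported CSV (hypothetical file name) and inspect a few columns.
df = pd.read_csv("exported_tweets.csv")
print(df[["username", "date", "main_topic", "sentiment", "merged_topic"]].head())

# Share of each merged topic per party leader.
print(df.groupby("username")["merged_topic"].value_counts(normalize=True))
```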


  <!--- 
### For Developers
To develop locally:

1. Create a new branch https://www.howtogeek.com/714112/how-to-create-a-new-branch-in-github/
   a. name it yourname_dev
   
2. Locally, you can clone the branch with 'git clone https url'. More info: https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository

3. You will be asked to enter your username and password. This password is NOT the one you log in with, but a temporary Personal Access Token fetched from https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token

4. To avoid typing the long string every time, follow https://medium.com/ci-cd-devops/how-to-cache-your-personal-access-token-pat-in-linux-environment-97791424eb83 to store it on your machine. 

5. You can now develop locally and push to the remote server in the same branch. 


To get all the dependencies:

1. Create a virtual environment: https://docs.python.org/3/library/venv.html
2. Activate your virtual environment
3. Go to the root path of the project and type in the terminal:
      $ pip install -r requirements.txt
4. In some cases twint does not install properly on Ubuntu. This was solved by typing "sudo apt-get install build-essential" in the terminal after setting everything up.
5. In order to use OpenAI you need an authorization token, which is provided by adding a '.env' file in the root path of the project.
6. Type the following in that file:
      OPENAI_AUTHTOKEN = your OpenAI token
7. The Python file TextClassifier should now be able to use OpenAI, given that your token quota is not used up.
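
A minimal sketch of reading that token, assuming the python-dotenv package is used (the actual loading code in TextClassifier is not shown here):

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads variables from the .env file in the project root
openai_token = os.getenv("OPENAI_AUTHTOKEN")
```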






![Flowcharts(4)](https://user-images.githubusercontent.com/44498515/178435693-bd86c2b6-23ac-4b69-94ae-366553468502.png)

To run, type in the terminal:
1. '$ python app.py'
2. or '$ make'  --->