Kerstin committed on
Commit
4eaea05
1 Parent(s): fc9c81c

Update README.md

Add line breaks and correct typos.

Files changed (1)
  1. README.md +35 -11
README.md CHANGED
@@ -11,11 +11,14 @@ license: mit
11
  ---
12
 
13
  # Politweet
14
- In this summer project at Softhouse, we have developed a tool for analyzing Twitter posts by Swedish party leaders. The UI of this tool is in the form of a webpage which lets a user see a graphical representation of the predicted features: topics, sentiments (positive, negative or neutral), and targets in the party leaders' tweets.
 
 
15
 
16
  ### Data Gathering
17
- The tweets were gathered using the Twitter scraping tool [Twint](https://github.com/twintproject/twint).
18
- "An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.".
 
19
 
20
  ### Predicting: Topics, Sentiments & Targets
21
  The classifications that are given by GPT-3 for every tweet contain:
@@ -24,24 +27,45 @@ The classifications that are given by GPT-3 for every tweet contain:
24
  - ```sentiment``` - is the tweet positive, negative or neutral?
25
  - ```target``` - who is the tweet targeting?
26
 
27
- The predicted features were extracted using AI, and more specifically by utilizing the autoregressive language model GPT-3 by [OpenAI](https://openai.com/api/) in combination with our prompt engineering. The final prompt is a result of experimentation while also trying to limit the length of the prompt. Since OpenAI charge users based on "tokens", which is closely related to number of words, it would be economically unsuitable to use lengthy prompts when classifying several thousands of tweets.
 
 
 
 
28
 
29
  ### Merging Topics & Targets
30
- Since the output from GPT-3 varies a lot, e.g. topics can be similar but not identical, a method for clustering similar topics and targets was needed in order to be able to represent statistics of these classifications. Thus, the NN-algorithm was implemented using the cosine similarity as metric, after transforming topics and targets to dense vector representaitons with [Sentence Transformers](https://github.com/UKPLab/sentence-transformers). The similarities between the classification from GPT-3 and words from a predefined set of classes are then calculated, and the classification is changed to the predefined class that yielded the highest cosine similarity. It is worth noting that each predefined class has several "synonyms" or categorical words, and that the highest cosine similarity can be found between the classification and a word from that list.
31
-
32
- Example - The GPT-3 classified topics are: ```main_topic = "sport"``` and ```sub_topic = "soccer"``` -> ```old_topic = "sport and soccer"```, and gets the highest similarity when compared to the word/synonym "soccer" which is in the subset to the predefined class ```"Civil society and sport"``` -> ```new_topic = "Civil society and sport"```
 
 
 
 
 
 
 
 
 
 
 
33
 
34
 
35
  ### Website User Manual
36
- 1. Enter the time period of the tweets that you want to look into - The dates need to be on the format "YYYY-MM-DD", and between 2021-06-28 and 2022-08-10. It is preferable to choose a bigger span rather than a smaller one. Keep in mind that tweets the number of tweets posted by the party leaders will vary a lot (Annie Lööf: 1814 vs Jimmie Åkesson: 185 tweets in total).
 
 
 
37
  2. Select the party leader(s) you want to look into - At least one party leader has to be selected.
38
  3. Select the classifications you want to see statistics of: topic, sentiment and/or target.
39
  4. Apply - always press this button after you check a new box or change the dates to update the website.
40
  5. Run
41
- The piecharts and bar graphs should appear for your selected party leaders. Under the plots, a new panel will appear which lets users see how a prediction was made, i.e. classification from GPT-3 -> the
42
- 6. To see examples of how the topic/sentiment/target was predicted, the user can select a type of classification and check the box "show stats". To download the CSV file containing all tweets and classifications for the checked party leaders, the user can check "Export file". After the selections, Apply and Run.
 
 
 
43
 
44
- ### Dataframe Structure
45
  Each row in the database has the following structure:
46
  ```id,tweet,date,user_id,username,urls,nlikes,nreplies,nretweets,class_tuple,main_topic,sub_topic,sentiment,target,merged_tuple,merged_topic,merged_target,cos_sim_topic,synonym_topic,cos_sim_target,synonym_target```.
47
 
11
  ---
12
 
13
  # Politweet
14
+ In this summer project at Softhouse, we have developed a tool for analyzing Twitter posts by Swedish party leaders.
15
+ The UI of this tool is in the form of a webpage which lets a user see a graphical representation of the predicted
16
+ features: topics, sentiments (positive, negative or neutral), and targets in the party leaders' tweets.
17
 
18
  ### Data Gathering
19
+ The tweets were gathered using the Twitter scraping tool [Twint](https://github.com/twintproject/twint).
20
+ "An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to
21
+ scrape a user's followers, following, Tweets and more while evading most API limitations.".
22
 
23
  ### Predicting: Topics, Sentiments & Targets
24
  The classifications that are given by GPT-3 for every tweet contain:
27
  - ```sentiment``` - is the tweet positive, negative or neutral?
28
  - ```target``` - who is the tweet targeting?
29
 
30
+ The predicted features were extracted using AI, and more specifically by utilizing the autoregressive
31
+ language model GPT-3 by [OpenAI](https://openai.com/api/) in combination with our prompt engineering.
32
+ The final prompt is a result of experimentation while also trying to limit the length of the prompt.
33
+ Since OpenAI charges users based on "tokens", a unit closely related to the number of words, it would be
34
+ uneconomical to use lengthy prompts when classifying several thousand tweets.
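
To make the cost argument concrete, here is a rough sketch of how prompt length scales the total bill when every tweet is sent as its own request. The function name, the token counts and the price are hypothetical placeholders, not figures from the project or from OpenAI's price list.

```python
def estimate_cost_usd(prompt_tokens, tweet_tokens, completion_tokens,
                      n_tweets, usd_per_1k_tokens):
    """Rough cost of classifying n_tweets, assuming one request per tweet."""
    tokens_per_request = prompt_tokens + tweet_tokens + completion_tokens
    return n_tweets * tokens_per_request * usd_per_1k_tokens / 1000


# Hypothetical numbers: 5000 tweets and an illustrative price per 1000 tokens.
short_prompt = estimate_cost_usd(200, 60, 40, 5000, 0.02)
long_prompt = estimate_cost_usd(800, 60, 40, 5000, 0.02)
print(f"short prompt: {short_prompt:.2f} USD, long prompt: {long_prompt:.2f} USD")
```

Because the fixed prompt is repeated in every request, its length is multiplied by the number of tweets, which is why trimming it matters more than trimming any single tweet.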
35
 
36
  ### Merging Topics & Targets
37
+ Since the output from GPT-3 varies a lot, e.g. topics can be similar but not identical, a method for
38
+ clustering similar topics and targets was needed in order to be able to represent statistics of these
39
+ classifications. Thus, a nearest-neighbour (NN) algorithm was implemented with cosine similarity as the metric, after
40
+ transforming topics and targets to dense vector representations with
41
+ [Sentence Transformers](https://github.com/UKPLab/sentence-transformers). The similarities between the
42
+ classification from GPT-3 and words from a predefined set of classes are then calculated, and the
43
+ classification is changed to the predefined class that yielded the highest cosine similarity. It is worth
44
+ noting that each predefined class has several "synonyms" or categorical words, and that the highest cosine
45
+ similarity can be found between the classification and a word from that list.
46
+
47
+ Example - The GPT-3 classified topics are: ```main_topic = "sport"``` and ```sub_topic = "soccer"``` ->
48
+ ```old_topic = "sport and soccer"```, and gets the highest similarity when compared to the word/synonym
49
+ "soccer", which is in the synonym list of the predefined class ```"Civil society and sport"``` ->
50
+ ```new_topic = "Civil society and sport"```
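
The merging step described above can be sketched as a nearest-neighbour lookup over the synonym lists. This is a minimal illustration, not the project's code: the class dictionary is a made-up excerpt, and `toy_embed` (character-trigram counts) is only a stand-in so the example runs without extra dependencies. The README states that the real embeddings come from Sentence Transformers, which would replace `toy_embed` here.

```python
import math
from collections import Counter

# Hypothetical excerpt of predefined classes and their synonym lists.
PREDEFINED_CLASSES = {
    "Civil society and sport": ["civil society", "sport", "soccer"],
    "Economy": ["economy", "taxes", "budget"],
}


def toy_embed(text):
    """Stand-in embedding (assumption): sparse vector of character trigrams."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_label(old_label, classes=PREDEFINED_CLASSES, embed=toy_embed):
    """Return (class, synonym, similarity) for the most similar synonym."""
    query = embed(old_label)
    candidates = (
        (cls, synonym, cosine_similarity(query, embed(synonym)))
        for cls, synonyms in classes.items()
        for synonym in synonyms
    )
    return max(candidates, key=lambda item: item[2])


new_topic, synonym_topic, cos_sim_topic = merge_label("sport and soccer")
print(new_topic, synonym_topic, round(cos_sim_topic, 3))
```

The returned triple mirrors the `merged_topic`, `synonym_topic` and `cos_sim_topic` columns: the class is chosen through whichever of its synonyms scores highest, not through the class name itself.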
51
 
52
 
53
  ### Website User Manual
54
+ 1. Enter the time period of the tweets that you want to look into - The dates need to be in the format
55
+ "YYYY-MM-DD", and between 2021-06-28 and 2022-08-10. It is preferable to choose a bigger span rather than a
56
+ smaller one. Keep in mind that the number of tweets posted by the party leaders varies a lot
57
+ (Annie Lööf: 1814 vs Jimmie Åkesson: 185 tweets in total).
58
  2. Select the party leader(s) you want to look into - At least one party leader has to be selected.
59
  3. Select the classifications you want to see statistics of: topic, sentiment and/or target.
60
  4. Apply - always press this button after you check a new box or change the dates to update the website.
61
  5. Run
62
+ The pie charts and bar graphs should appear for your selected party leaders. Under the plots, a new panel
63
+ will appear which lets users see how a prediction was made, i.e. classification from GPT-3 -> the
64
+ 6. To see examples of how the topic/sentiment/target was predicted, the user can select a type of
65
+ classification and check the box "show stats". To download the CSV file containing all tweets and
66
+ classifications for the checked party leaders, the user can check "Export file". After the selections, Apply and Run.
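
The date rule from step 1 can be checked with a few lines of standard-library Python. This is a sketch of one possible validation, not the website's actual code; the function and constant names are assumptions, while the date format and the allowed range are taken from the manual above.

```python
from datetime import date, datetime

# Range stated in the user manual.
FIRST_DAY = date(2021, 6, 28)
LAST_DAY = date(2022, 8, 10)


def parse_period(start_text, end_text):
    """Parse two YYYY-MM-DD strings and check they fall in the covered range.

    Raises ValueError if a string cannot be parsed or the period is invalid.
    """
    start = datetime.strptime(start_text, "%Y-%m-%d").date()
    end = datetime.strptime(end_text, "%Y-%m-%d").date()
    if not (FIRST_DAY <= start <= end <= LAST_DAY):
        raise ValueError(
            f"period must lie between {FIRST_DAY} and {LAST_DAY}, "
            "with the start date not after the end date"
        )
    return start, end


start, end = parse_period("2021-07-01", "2022-01-31")
print(start, end)
```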
67
 
68
+ ### Data Frame Structure
69
  Each row in the database has the following structure:
70
  ```id,tweet,date,user_id,username,urls,nlikes,nreplies,nretweets,class_tuple,main_topic,sub_topic,sentiment,target,merged_tuple,merged_topic,merged_target,cos_sim_topic,synonym_topic,cos_sim_target,synonym_target```.
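
As an illustration of working with this structure, the sketch below reads an exported file into dictionaries keyed by column name and tallies the sentiments. It assumes the export starts with a header row containing the column names listed above. The helper name `load_rows` and the single placeholder row are made up for the example and are not project data.

```python
import csv
import io
from collections import Counter

COLUMNS = (
    "id,tweet,date,user_id,username,urls,nlikes,nreplies,nretweets,"
    "class_tuple,main_topic,sub_topic,sentiment,target,merged_tuple,"
    "merged_topic,merged_target,cos_sim_topic,synonym_topic,"
    "cos_sim_target,synonym_target"
).split(",")


def load_rows(csv_file):
    """Read rows into dicts; assumes the first line is a header row."""
    reader = csv.DictReader(csv_file)
    missing = [name for name in COLUMNS if name not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Made-up placeholder row, written to an in-memory file to exercise load_rows.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
placeholder = {name: "" for name in COLUMNS}
placeholder.update(id="1", tweet="placeholder text, with a comma", sentiment="neutral")
writer.writerow(placeholder)
buffer.seek(0)

rows = load_rows(buffer)
sentiment_counts = Counter(row["sentiment"] for row in rows)
print(sentiment_counts)
```

Using the `csv` module instead of splitting on commas matters here because tweet text can itself contain commas.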
71