
Politweet

In this summer project at Softhouse, we developed a tool for analyzing Twitter posts by Swedish party leaders. The tool's UI is a webpage that shows graphical representations of the features predicted for the party leaders' tweets: topics, sentiments (positive, negative or neutral), and targets.

Data Gathering

The tweets were gathered using the Twitter scraping tool Twint, which is described as "An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations."

Predicting: Topics, Sentiments & Targets

The classifications that are given by GPT-3 for every tweet contain:

  • main_topic - a general topic which the tweet is about
  • sub_topic - a more descriptive topic
  • sentiment - is the tweet positive, negative or neutral?
  • target - who is the tweet targeting?
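To make the four fields concrete, here is a minimal sketch of turning one raw model answer into a structured record. The answer format "(main_topic, sub_topic, sentiment, target)" and the names `Classification` and `parse_classification` are assumptions made for illustration; the actual prompt and output format are not shown in this README.

```python
from dataclasses import dataclass

# The three sentiment values named in the README.
SENTIMENTS = {"positive", "negative", "neutral"}


@dataclass
class Classification:
    main_topic: str
    sub_topic: str
    sentiment: str
    target: str


def parse_classification(raw: str) -> Classification:
    """Parse an answer assumed to look like '(main, sub, sentiment, target)'."""
    # Drop surrounding whitespace and parentheses, then split on commas.
    parts = [part.strip() for part in raw.strip().strip("()").split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {raw!r}")
    main_topic, sub_topic, sentiment, target = parts
    sentiment = sentiment.lower()
    if sentiment not in SENTIMENTS:
        raise ValueError(f"unknown sentiment: {sentiment!r}")
    return Classification(main_topic, sub_topic, sentiment, target)


result = parse_classification("(sport, soccer, positive, Swedish national team)")
```

Rejecting answers that do not have exactly four fields is one way to catch malformed model output early. A field that itself contains a comma would need a more careful parser.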

The predicted features were extracted using AI, more specifically the autoregressive language model GPT-3 by OpenAI in combination with our own prompt engineering. The final prompt is the result of experimentation, balanced against the need to keep the prompt short. Since OpenAI charges users per "token", and the token count is closely related to the number of words, lengthy prompts would be uneconomical when classifying several thousand tweets.
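The effect of prompt length on cost can be sketched with some rough arithmetic. The example below assumes the common rule of thumb of about 4 tokens per 3 English words (Swedish text will likely need more tokens per word), and the price per 1,000 tokens is a hypothetical parameter, not a quoted OpenAI price. The function names are illustrative only.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 tokens per 3 words, rounded up."""
    n_words = len(text.split())
    return (4 * n_words + 2) // 3  # integer form of ceil(4 * n_words / 3)


def estimate_cost(prompt, tweets, price_per_1k_tokens, completion_tokens=20):
    """Return (total_tokens, cost) for classifying every tweet with the same prompt.

    price_per_1k_tokens and completion_tokens are assumptions supplied by the caller.
    """
    prompt_tokens = estimate_tokens(prompt)
    total_tokens = sum(
        prompt_tokens + estimate_tokens(tweet) + completion_tokens
        for tweet in tweets
    )
    return total_tokens, total_tokens / 1000 * price_per_1k_tokens
```

Because the prompt is sent again for every tweet, each extra word in the prompt is paid for once per tweet, which is why a shorter prompt matters at this scale.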

Merging Topics & Targets

Since the output from GPT-3 varies a lot, e.g. topics can be similar but not identical, we needed a method for clustering similar topics and targets in order to present statistics for these classifications. We therefore implemented a nearest-neighbour (NN) algorithm with cosine similarity as the metric, after transforming topics and targets into dense vector representations with Sentence Transformers. The similarity between the GPT-3 classification and each word in a predefined set of classes is calculated, and the classification is replaced by the predefined class that yields the highest cosine similarity. Note that each predefined class has a list of "synonyms", or categorical words, and that the highest cosine similarity may be found between the classification and any word in that list.

Example - GPT-3 classifies the topics as main_topic = "sport" and sub_topic = "soccer", which are combined into old_topic = "sport and soccer". This string gets its highest similarity when compared to the synonym "soccer", which belongs to the predefined class "Civil society and sport" -> new_topic = "Civil society and sport"
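The merging step can be sketched as a nearest-neighbour search over all synonyms of all predefined classes. To stay self-contained, the sketch below swaps in a toy character-trigram embedding (`toy_embed`) instead of Sentence Transformers. In the real pipeline, the `embed` argument would presumably wrap a Sentence Transformers model's `encode` call, and the similarity would be computed on its dense vectors. The `PREDEFINED` dictionary and all function names are hypothetical.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: character-trigram counts (NOT Sentence Transformers)."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_label(label, classes, embed=toy_embed):
    """Return (predefined_class, best_synonym, similarity) of the nearest synonym."""
    query = embed(label)
    candidates = (
        (cls, synonym, cosine_similarity(query, embed(synonym)))
        for cls, synonyms in classes.items()
        for synonym in synonyms
    )
    return max(candidates, key=lambda item: item[2])


# Hypothetical predefined classes with their synonym lists.
PREDEFINED = {
    "Civil society and sport": ["sport", "soccer", "associations"],
    "Economy": ["taxes", "budget", "inflation"],
}

new_topic, synonym, score = merge_label("sport and soccer", PREDEFINED)
```

With these toy inputs, "sport and soccer" is closest to the synonym "soccer" and is therefore merged into "Civil society and sport", mirroring the example above. The returned synonym and score correspond in spirit to the synonym_topic and cos_sim_topic columns.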

Website User Manual

  1. Enter the time period of the tweets that you want to look into - The dates need to be in the format "YYYY-MM-DD", and between 2021-06-28 and 2022-08-10. A longer span is preferable to a shorter one. Keep in mind that the number of tweets posted by the party leaders varies a lot (Annie Lööf: 1814 vs Jimmie Åkesson: 185 tweets in total).
  2. Select the party leader(s) you want to look into - At least one party leader has to be selected.
  3. Select the classifications you want to see statistics of: topic, sentiment and/or target.
  4. Apply - always press this button after you check a new box or change the dates to update the website.
  5. Run - The pie charts and bar graphs should appear for your selected party leaders. Under the plots, a new panel will appear which lets users see how a prediction was made, i.e. classification from GPT-3 -> the
  6. To see examples of how the topic/sentiment/target was predicted, select a type of classification and check the box "show stats". To download the CSV file containing all tweets and classifications for the checked party leaders, check "Export file". After making these selections, press Apply and then Run.
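The date rules in step 1 can be expressed as a small validation routine. This is only a sketch of the stated constraints, not the website's actual code; `parse_period` and the constant names are hypothetical.

```python
from datetime import date, datetime

# Bounds taken from the user manual above.
FIRST_DAY = date(2021, 6, 28)
LAST_DAY = date(2022, 8, 10)


def parse_period(start: str, end: str):
    """Parse two 'YYYY-MM-DD' strings and check they lie within the data range."""
    # strptime raises ValueError if a string does not match the format.
    start_date = datetime.strptime(start, "%Y-%m-%d").date()
    end_date = datetime.strptime(end, "%Y-%m-%d").date()
    if not (FIRST_DAY <= start_date <= end_date <= LAST_DAY):
        raise ValueError(
            f"dates must satisfy {FIRST_DAY} <= start <= end <= {LAST_DAY}"
        )
    return start_date, end_date


period = parse_period("2021-07-01", "2021-12-31")
```

A date outside the range, or a start date after the end date, raises a `ValueError` instead of silently producing empty plots.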

Dataframe Structure

Each row in the database has the following columns: id,tweet,date,user_id,username,urls,nlikes,nreplies,nretweets,class_tuple,main_topic,sub_topic,sentiment,target,merged_tuple,merged_topic,merged_target,cos_sim_topic,synonym_topic,cos_sim_target,synonym_target.
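As an illustration of working with these columns, the sketch below counts sentiments per party leader from a CSV file. It assumes the exported file has a header row with the column names listed above. The rows in `sample` are placeholders for illustration, not real tweets, and only include the columns the function reads.

```python
import csv
import io
from collections import Counter, defaultdict


def sentiment_counts(csv_file):
    """Count sentiment labels per username from a CSV with a header row."""
    counts = defaultdict(Counter)
    for row in csv.DictReader(csv_file):
        counts[row["username"]][row["sentiment"]] += 1
    return counts


# Placeholder rows for illustration only; most columns are omitted.
sample = io.StringIO(
    "id,username,sentiment\n"
    "1,leader_a,positive\n"
    "2,leader_a,negative\n"
    "3,leader_b,positive\n"
)
counts = sentiment_counts(sample)
```

With a real export, the same function could be called on a file opened with `open(path, newline="", encoding="utf-8")`, where `path` points to the downloaded CSV.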