title: Politweet
# Politweet
In this summer project at Softhouse, we developed a tool for analyzing Twitter posts by Swedish party leaders. The tool's UI is a web page that shows graphical representations of the features predicted for the party leaders' tweets: topics, sentiments (positive, negative or neutral), and targets.
### Data Gathering
The tweets were gathered using the Twitter scraping tool [Twint](https://github.com/twintproject/twint), which its authors describe as:
"An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations."
### Predicting: Topics, Sentiments & Targets
The classifications that are given by GPT-3 for every tweet contain:
- ```main_topic``` - a general topic which the tweet is about
- ```sub_topic``` - a more descriptive topic
- ```sentiment``` - is the tweet positive, negative or neutral?
- ```target``` - who is the tweet targeting?
The predicted features were extracted with the autoregressive language model GPT-3 by [OpenAI](https://openai.com/api/), combined with our own prompt engineering. The final prompt is the result of experimentation, balanced against the need to keep the prompt short. Since OpenAI charges users per "token", and the number of tokens is closely related to the number of words, lengthy prompts would be too costly when classifying several thousand tweets.
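To illustrate why prompt length matters, the sketch below builds a short classification prompt, estimates its cost, and parses a four-field answer. The prompt wording, the comma-separated answer format, all function names, the rule of thumb of roughly 4 characters per token, and the price parameter are assumptions made for illustration. They are not the project's actual prompt or OpenAI's pricing.

```python
import math

# Hypothetical prompt template; the project's real prompt is not shown here.
PROMPT_TEMPLATE = (
    "Classify the tweet. Answer as: main_topic, sub_topic, sentiment, target.\n"
    "Tweet: {tweet}\n"
    "Answer:"
)

# The four classification fields listed above.
FIELDS = ("main_topic", "sub_topic", "sentiment", "target")

# Rough rule of thumb for English text; an approximation, not a tokenizer.
CHARS_PER_TOKEN = 4


def build_prompt(tweet: str) -> str:
    """Insert one tweet into the (hypothetical) prompt template."""
    return PROMPT_TEMPLATE.format(tweet=tweet.strip())


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_cost(tweets: list[str], price_per_1k_tokens: float) -> float:
    """Estimate the total prompt cost for a batch of tweets."""
    total_tokens = sum(estimate_tokens(build_prompt(t)) for t in tweets)
    return total_tokens / 1000 * price_per_1k_tokens


def parse_answer(answer: str) -> dict[str, str]:
    """Split an assumed comma-separated model answer into the four fields."""
    parts = [part.strip() for part in answer.split(",")]
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))
```

Because the template text is sent with every tweet, each word saved in the template is saved thousands of times over a full run. Swedish text typically uses more tokens per word than English, so the estimate here is likely on the low side.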
### Merging Topics & Targets
Since the output from GPT-3 varies a lot, e.g. topics can be similar but not identical, a method for clustering similar topics and targets was needed in order to present statistics over these classifications. We therefore implemented a nearest-neighbor (NN) algorithm with cosine similarity as the metric, after transforming topics and targets into dense vector representations with [Sentence Transformers](https://github.com/UKPLab/sentence-transformers). The similarity between each classification from GPT-3 and every word in a predefined set of classes is calculated, and the classification is replaced by the predefined class that yields the highest cosine similarity. Note that each predefined class has a list of several "synonyms" or categorical words, and the highest cosine similarity may be found between the classification and any word from that list.
Example: GPT-3 classifies the topics as ```main_topic = "sport"``` and ```sub_topic = "soccer"```, which are combined into ```old_topic = "sport and soccer"```. This gets the highest similarity when compared to the word/synonym "soccer", which belongs to the synonym list of the predefined class ```"Civil society and sport"```, so ```new_topic = "Civil society and sport"```.
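The merging step can be sketched as a nearest-neighbor search over the synonym lists. To keep the example self-contained, a toy word-count embedding stands in for the Sentence Transformers vectors, so the similarity scores differ from what the real model would produce. The function names, the second class, and all synonyms except "soccer" are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Optional


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: lower-cased word counts (NOT Sentence Transformers)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge_label(
    label: str,
    classes: dict[str, list[str]],
    embed: Callable[[str], Counter] = toy_embed,
) -> Optional[str]:
    """Return the predefined class owning the synonym most similar to label."""
    label_vec = embed(label)
    best_class, best_score = None, -1.0
    for class_name, synonyms in classes.items():
        for synonym in synonyms:
            score = cosine_similarity(label_vec, embed(synonym))
            if score > best_score:
                best_class, best_score = class_name, score
    return best_class


# Only "Civil society and sport" and "soccer" come from the example above;
# the second class and the remaining synonyms are hypothetical.
CLASSES = {
    "Civil society and sport": ["sport", "soccer", "associations"],
    "Health care": ["hospital", "health care", "doctors"],
}

main_topic, sub_topic = "sport", "soccer"
old_topic = f"{main_topic} and {sub_topic}"
new_topic = merge_label(old_topic, CLASSES)
print(old_topic, "->", new_topic)
# prints: sport and soccer -> Civil society and sport
```

With real sentence embeddings, `embed` would return dense vectors and the cosine function would operate on those instead. The search over classes and synonyms stays the same.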
### Website User Manual
1. Enter the time period of the tweets that you want to look into. The dates must be in the format "YYYY-MM-DD" and fall between 2021-06-28 and 2022-08-10. A longer span is preferable to a shorter one. Keep in mind that the number of tweets posted by each party leader varies a lot (Annie Lööf: 1814 vs Jimmie Åkesson: 185 tweets in total).
2. Select the party leader(s) you want to look into - At least one party leader has to be selected.
3. Select the classifications you want to see statistics of: topic, sentiment and/or target.
4. Apply - always press this button after you check a new box or change the dates to update the website.
5. Run
The pie charts and bar graphs should appear for your selected party leaders. Under the plots, a new panel will appear which lets users see how a prediction was made, i.e. classification from GPT-3 -> the
6. To see examples of how the topic/sentiment/target was predicted, select a type of classification and check the box "show stats". To download the CSV file containing all tweets and classifications for the selected party leaders, check "Export file". After making the selections, press Apply and then Run.
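The input rules from steps 1 and 2 can be sketched as a small validation routine. The function names are hypothetical and this is not the website's actual code. Only the date format, the date range, and the rule that at least one party leader must be selected come from the manual above.

```python
import re
from datetime import date

# Date range stated in the user manual above.
FIRST_DATE = date(2021, 6, 28)
LAST_DATE = date(2022, 8, 10)
DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def parse_manual_date(text: str) -> date:
    """Parse a strict YYYY-MM-DD string and check it lies in the covered range."""
    if not DATE_PATTERN.fullmatch(text):
        raise ValueError(f"expected YYYY-MM-DD, got {text!r}")
    parsed = date.fromisoformat(text)  # raises ValueError for e.g. 2022-02-30
    if not FIRST_DATE <= parsed <= LAST_DATE:
        raise ValueError(f"{text} is outside {FIRST_DATE} to {LAST_DATE}")
    return parsed


def validate_selection(start: str, end: str, leaders: list[str]) -> tuple[date, date]:
    """Validate the date span and the party leader selection."""
    start_date = parse_manual_date(start)
    end_date = parse_manual_date(end)
    if start_date > end_date:
        raise ValueError("start date must not be after end date")
    if not leaders:
        raise ValueError("select at least one party leader")
    return start_date, end_date
```

For example, `validate_selection("2021-06-28", "2022-08-10", ["Annie Lööf"])` returns the two parsed dates, while an empty leader list or a date such as "2022-08-11" raises a `ValueError`.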
### Dataframe Structure
Each row in the database has the following structure:
### For Developers
To develop locally:
1. Create a new branch: https://www.howtogeek.com/714112/how-to-create-a-new-branch-in-github/
    a. Name it yourname_dev
2. Clone the branch locally with 'git clone https url'. More info: https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository
3. You will be asked to enter your username and password. This password is NOT the one you log in with; it is a temporary personal access token, created as described at https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token
4. To avoid typing the long string every time, follow https://medium.com/ci-cd-devops/how-to-cache-your-personal-access-token-pat-in-linux-environment-97791424eb83 to store it on your machine.
5. You can now develop locally and push to the remote server on the same branch.
To install all dependencies:
1. Create a virtual environment: https://docs.python.org/3/library/venv.html
2. Activate your virtual environment.
3. Go to the project's root path and run in the terminal:
$ pip install -r requirements.txt
4. In some cases, installing twint fails on Ubuntu. After everything else had been set up, it worked after running "sudo apt-get install build-essential" in the terminal.
5. Using openai requires an authorization token. Set it up by creating a '.env' file in the project's root path.
6. Write the following in that file:
OPENAI_AUTHTOKEN=your OpenAI token
7. TextClassifier should now be able to use openai, provided that your token has usage credit available.
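As an illustration of steps 5 to 7, the sketch below reads `OPENAI_AUTHTOKEN` from the environment or from the '.env' file using only the standard library. It is an assumption about how the token could be loaded, not the project's actual code, which may use a dedicated package for this. The function names are hypothetical.

```python
import os
from pathlib import Path


def read_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_openai_token(path: str = ".env") -> str:
    """Return the token from the environment, falling back to the .env file."""
    token = os.environ.get("OPENAI_AUTHTOKEN")
    if not token and Path(path).is_file():
        token = read_env_file(path).get("OPENAI_AUTHTOKEN")
    if not token:
        raise RuntimeError("OPENAI_AUTHTOKEN is not set; add it to the .env file")
    return token
```

Keeping the token in '.env' rather than in the source code means it can stay out of version control, as long as the file is not committed.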
To run, type one of the following in the terminal:
1. '$ python app.py'
2. or '$ make'