title: Politweet
emoji: 📉
colorFrom: pink
colorTo: green
sdk: gradio
sdk_version: 3.0.26
app_file: app.py
pinned: false
license: mit
# Politweet
In this summer project at Softhouse, we developed a tool for analyzing Twitter posts by Swedish party leaders.
The tool's UI is a webpage that shows a graphical representation of the features predicted for the party
leaders' tweets: topics, sentiments (positive, negative or neutral), and targets.
### Data Gathering
The tweets were gathered using the Twitter scraping tool [Twint](https://github.com/twintproject/twint).
The project describes itself as: "An advanced Twitter scraping & OSINT tool written in Python that doesn't
use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most
API limitations."
### Predicting: Topics, Sentiments & Targets
The classification that GPT-3 gives for every tweet contains:
- ```main_topic``` - a general topic which the tweet is about
- ```sub_topic``` - a more descriptive topic
- ```sentiment``` - is the tweet positive, negative or neutral?
- ```target``` - who is the tweet targeting?

The predicted features were extracted with the autoregressive language model GPT-3 by
[OpenAI](https://openai.com/api/), in combination with our own prompt engineering.
The final prompt is the result of experimentation, balanced against keeping the prompt short.
Since OpenAI charges users per "token", and the number of tokens grows with the number of words, lengthy
prompts would be uneconomical when classifying several thousand tweets.
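
To make the cost argument concrete, the following is a minimal sketch of how prompt length scales the bill. It is not the project's code: the function names, the price argument and the words-to-tokens ratio are assumptions for illustration. The ratio of roughly 4/3 tokens per word is a common rule of thumb for English text; Swedish tweets will likely need more tokens per word, and the tokens of the model's answer are ignored here.

```python
def estimate_tokens(text: str, tokens_per_word: float = 4 / 3) -> int:
    """Rough token estimate from a whitespace word count (a heuristic, not a tokenizer)."""
    return round(len(text.split()) * tokens_per_word)


def estimate_cost(prompt: str, tweets: list[str], price_per_1k_tokens: float) -> float:
    """Estimated input cost when the same prompt is sent together with every tweet.

    price_per_1k_tokens is a hypothetical value supplied by the caller.
    """
    total_tokens = sum(estimate_tokens(prompt) + estimate_tokens(tweet) for tweet in tweets)
    return total_tokens / 1000 * price_per_1k_tokens


# Made-up inputs: a 30-word prompt and 1000 tweets of 15 words each.
short_prompt = " ".join(["word"] * 30)
tweets = [" ".join(["word"] * 15)] * 1000
cost = estimate_cost(short_prompt, tweets, price_per_1k_tokens=0.02)
```

Because the prompt is repeated for every tweet, each extra word in the prompt is paid for once per classified tweet, which is why a short prompt matters at this scale.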
### Merging Topics & Targets
The output from GPT-3 varies a lot, e.g. topics can be similar but not identical, so a method for
clustering similar topics and targets was needed in order to present statistics of these
classifications. We therefore implemented a nearest-neighbour (NN) algorithm with cosine similarity as the
metric, after transforming topics and targets to dense vector representations with
[Sentence Transformers](https://github.com/UKPLab/sentence-transformers). The similarities between the
classification from GPT-3 and the words from a predefined set of classes are calculated, and the
classification is replaced by the predefined class that yields the highest cosine similarity. Note that
each predefined class has several "synonyms" or categorical words, and that the highest cosine
similarity can be found between the classification and any word from that list.

Example - The topics classified by GPT-3 are ```main_topic = "sport"``` and ```sub_topic = "soccer"``` ->
```old_topic = "sport and soccer"```. This gets the highest similarity when compared to the word/synonym
"soccer", which belongs to the predefined class ```"Civil society and sport"``` ->
```new_topic = "Civil society and sport"```
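
The merging step can be sketched as below. This is an illustration under stated assumptions, not the project's implementation: `toy_embed` is a hypothetical word-count stand-in for the Sentence Transformers embedding so that the example runs without downloading a model, `merge_classification` is an invented name, and the synonym lists (including the whole "Economy" class) are made up apart from "soccer" in "Civil society and sport".

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for a sentence embedding: lower-cased word counts."""
    return dict(Counter(text.lower().split()))


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge_classification(
    old: str,
    predefined: Dict[str, List[str]],
    embed: Callable[[str], Vector] = toy_embed,
) -> Tuple[str, str, float]:
    """Return (class, synonym, similarity) for the synonym nearest to `old`."""
    old_vec = embed(old)
    best: Tuple[str, str, float] = ("", "", -1.0)
    for class_name, synonyms in predefined.items():
        for synonym in synonyms:
            similarity = cosine_similarity(old_vec, embed(synonym))
            if similarity > best[2]:
                best = (class_name, synonym, similarity)
    return best


# Made-up predefined classes and synonyms, for illustration only.
PREDEFINED_TOPICS = {
    "Civil society and sport": ["soccer", "associations", "culture"],
    "Economy": ["taxes", "budget", "inflation"],
}

new_topic, synonym_topic, cos_sim_topic = merge_classification(
    "sport and soccer", PREDEFINED_TOPICS
)
```

With these toy inputs, `new_topic` is "Civil society and sport" and `synonym_topic` is "soccer", mirroring the example above. With real sentence embeddings, the `embed` argument would return dense vectors and the similarity function would operate on those instead of word counts.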
### Website User Manual
1. Enter the time period of the tweets that you want to look into - The dates need to be in the format
"YYYY-MM-DD", and between 2021-06-28 and 2022-08-10. It is preferable to choose a longer span rather than a
shorter one. Keep in mind that the number of tweets posted by the party leaders varies a lot
(Annie Lööf: 1814 vs Jimmie Åkesson: 185 tweets in total).
2. Select the party leader(s) you want to look into - At least one party leader has to be selected.
3. Select the classifications you want to see statistics of: topic, sentiment and/or target.
4. Apply - always press this button after you check a new box or change the dates to update the website.
5. Run
The pie charts and bar graphs should appear for your selected party leaders. Under the plots, a new panel
will appear which lets users see how a prediction was made, i.e. classification from GPT-3 -> the
6. To see examples of how the topic/sentiment/target was predicted, select a type of
classification and check the box "show stats". To download the CSV file containing all tweets and
classifications for the checked party leaders, check "Export file". After making the selections, press Apply and then Run.
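
The date rules in step 1 can be expressed as a small check. This is only a sketch of the constraint described above, not the website's code; `parse_period` is a hypothetical name, and treating both end dates as inclusive is an assumption.

```python
from datetime import date, datetime

# Range of available tweets as stated in step 1 (assumed to be inclusive).
FIRST_DATE = date(2021, 6, 28)
LAST_DATE = date(2022, 8, 10)


def parse_period(start: str, end: str) -> tuple[date, date]:
    """Parse two dates written as YYYY-MM-DD and check them against the available range."""
    # strptime raises ValueError if the text does not match the format.
    start_date = datetime.strptime(start, "%Y-%m-%d").date()
    end_date = datetime.strptime(end, "%Y-%m-%d").date()
    if start_date > end_date:
        raise ValueError("The start date must not be after the end date.")
    if start_date < FIRST_DATE or end_date > LAST_DATE:
        raise ValueError(f"Dates must be between {FIRST_DATE} and {LAST_DATE}.")
    return start_date, end_date


period = parse_period("2021-07-01", "2022-01-31")
```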
### Data Frame Structure
Each row in the database has the following structure:
```id,tweet,date,user_id,username,urls,nlikes,nreplies,nretweets,class_tuple,main_topic,sub_topic,sentiment,target,merged_tuple,merged_topic,merged_target,cos_sim_topic,synonym_topic,cos_sim_target,synonym_target```.
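
As an illustration of working with this structure, the sketch below counts sentiments per user from a CSV with the columns listed above. It assumes that the exported file starts with a header row using exactly these column names; `sentiment_counts` is a hypothetical helper, and the rows are made-up placeholder data written to an in-memory buffer so that the example is self-contained.

```python
import csv
import io
from collections import Counter
from typing import Dict, TextIO

COLUMNS = (
    "id,tweet,date,user_id,username,urls,nlikes,nreplies,nretweets,class_tuple,"
    "main_topic,sub_topic,sentiment,target,merged_tuple,merged_topic,merged_target,"
    "cos_sim_topic,synonym_topic,cos_sim_target,synonym_target"
).split(",")


def sentiment_counts(csv_file: TextIO) -> Dict[str, Counter]:
    """Count how often each sentiment occurs per username."""
    counts: Dict[str, Counter] = {}
    for row in csv.DictReader(csv_file):
        counts.setdefault(row["username"], Counter())[row["sentiment"]] += 1
    return counts


# Made-up rows; columns that are not set are written as empty strings.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="")
writer.writeheader()
writer.writerow({"id": "1", "username": "leader_a", "sentiment": "positive"})
writer.writerow({"id": "2", "username": "leader_a", "sentiment": "negative"})
writer.writerow({"id": "3", "username": "leader_a", "sentiment": "positive"})
writer.writerow({"id": "4", "username": "leader_b", "sentiment": "neutral"})
buffer.seek(0)

counts = sentiment_counts(buffer)
```

For a real export, the buffer would be replaced by the downloaded file opened with `open(path, newline="", encoding="utf-8")`.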
<!---
### For Developers
To develop locally:
1. Create a new branch: https://www.howtogeek.com/714112/how-to-create-a-new-branch-in-github/
a. Name it yourname_dev
2. Clone the branch locally with 'git clone https url'. More info: https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository
3. You will be asked to enter your username and password. This password is NOT the one you log in with, but a temporary personal access token, created as described at https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token
4. To avoid having to type the long string every time, follow https://medium.com/ci-cd-devops/how-to-cache-your-personal-access-token-pat-in-linux-environment-97791424eb83 to store it on your machine.
5. You can now develop locally and push to the remote server on the same branch.
To get all the dependencies:
1. Create a virtual environment: https://docs.python.org/3/library/venv.html
2. Activate your virtual environment
3. Go to the root path of the project and type in the terminal:
$ pip install -r requirements.txt
4. In some cases twint does not install properly on Ubuntu. After setting everything up, this was solved by typing "sudo apt-get install build-essential" in the terminal
5. In order to use OpenAI you need an authorization token. Provide it by creating an 'env.' file in the root path of the project.
6. Type the following in that file:
OPENAI_AUTHTOKEN = your open-ai token
7. The Python file TextClassifier should now be able to use OpenAI, given that your token is not used up.
![Flowcharts(4)](https://user-images.githubusercontent.com/44498515/178435693-bd86c2b6-23ac-4b69-94ae-366553468502.png)
To run, type one of the following in the terminal:
1. '$ python app.py'
2. or '$ make' --->