
olofbengtsson

Translated how to get dependencies

e6bef31 about 2 years ago


	---
	title: Politweet
	emoji: 📉
	colorFrom: pink
	colorTo: green
	sdk: gradio
	sdk_version: 3.0.26
	app_file: app.py
	pinned: false
	license: mit
	---

	# Politweet
	In this summer project at Softhouse, we have developed a tool for analyzing Twitter posts by Swedish party leaders.
	The UI of this tool is in the form of a webpage which lets a user see a graphical representation of the predicted
	features: topics, sentiments (positive, negative or neutral), and targets in the party leaders' tweets.

	### Data Gathering
	The tweets were gathered using the Twitter scraping tool [Twint](https://github.com/twintproject/twint),
	which its authors describe as "an advanced Twitter scraping & OSINT tool written in Python that doesn't use
	Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API
	limitations."

	### Predicting: Topics, Sentiments & Targets
	The classifications that are given by GPT-3 for every tweet contain:
	- ```main_topic``` - a general topic which the tweet is about
	- ```sub_topic``` - a more descriptive topic
	- ```sentiment``` - is the tweet positive, negative or neutral?
	- ```target``` - who is the tweet targeting?

	The predicted features were extracted with the autoregressive language model GPT-3 by
	[OpenAI](https://openai.com/api/), in combination with our own prompt engineering.
	The final prompt is the result of experimentation, balanced against the need to keep the prompt short.
	Since OpenAI charges users per "token", a unit closely related to the number of words, lengthy prompts
	would be uneconomical when classifying several thousand tweets.
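
The cost argument above can be illustrated with a rough estimate. This is a minimal sketch, not the project's own calculation: the function name, the tokens-per-word ratio (about 4/3 for English text, and only a rule of thumb) and the price per 1000 tokens are all assumptions chosen for illustration, and the tokens in the model's answer are ignored.

```python
def estimate_cost(prompt_words, tweet_words, n_tweets,
                  tokens_per_word=4 / 3, price_per_1k_tokens=0.02):
    """Rough cost of classifying n_tweets, one request per tweet.

    tokens_per_word and price_per_1k_tokens are placeholder assumptions,
    not figures taken from this project or from a current price list.
    """
    tokens_per_request = (prompt_words + tweet_words) * tokens_per_word
    total_tokens = tokens_per_request * n_tweets
    return total_tokens / 1000 * price_per_1k_tokens


# Same tweets, two hypothetical prompt lengths.
short_prompt_cost = estimate_cost(prompt_words=50, tweet_words=40, n_tweets=5000)
long_prompt_cost = estimate_cost(prompt_words=500, tweet_words=40, n_tweets=5000)
print(f"short prompt: {short_prompt_cost:.2f}, long prompt: {long_prompt_cost:.2f}")
```

Because the prompt is sent again with every tweet, its length is multiplied by the number of tweets, which is why a shorter prompt matters at this scale.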

	### Merging Topics & Targets
	Since the output from GPT-3 varies a lot, e.g. topics can be similar but not identical, a method for
	clustering similar topics and targets was needed in order to present statistics of these
	classifications. Thus, a nearest-neighbor (NN) algorithm was implemented with cosine similarity as the
	metric, after transforming topics and targets to dense vector representations with
	[Sentence Transformers](https://github.com/UKPLab/sentence-transformers). The similarities between the
	classification from GPT-3 and the words from a predefined set of classes are then calculated, and the
	classification is changed to the predefined class that yielded the highest cosine similarity. Note that
	each predefined class has several "synonyms" or categorical words, so the highest cosine similarity may be
	found between the classification and a word from that list.

	Example - GPT-3 classifies the topics as ```main_topic = "sport"``` and ```sub_topic = "soccer"``` ->
	```old_topic = "sport and soccer"```, which gets the highest similarity when compared to the word/synonym
	"soccer", which belongs to the predefined class ```"Civil society and sport"``` ->
	```new_topic = "Civil society and sport"```
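
The merging step can be sketched as a nearest-neighbor search over synonyms. This is a minimal illustration under stated assumptions, not the project's implementation: the class and synonym lists are made up, and `toy_embed` is a bag-of-words stand-in for the dense vectors that a Sentence Transformers model would produce in the real tool.

```python
import math
from collections import Counter

# Hypothetical predefined classes with their synonyms (illustrative only).
PREDEFINED_CLASSES = {
    "Civil society and sport": ["soccer", "volunteering"],
    "Economy": ["taxes", "budget"],
}


def toy_embed(text):
    """Stand-in embedding: word counts instead of a dense sentence vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge_topic(old_topic, classes=PREDEFINED_CLASSES, embed=toy_embed):
    """Return (class, synonym, similarity) for the most similar synonym."""
    query = embed(old_topic)
    best_class, best_synonym, best_sim = None, None, -1.0
    for class_name, synonyms in classes.items():
        for synonym in synonyms:
            sim = cosine_similarity(query, embed(synonym))
            if sim > best_sim:
                best_class, best_synonym, best_sim = class_name, synonym, sim
    return best_class, best_synonym, best_sim


new_topic, synonym_topic, cos_sim_topic = merge_topic("sport and soccer")
print(new_topic, synonym_topic, round(cos_sim_topic, 3))
```

Swapping `toy_embed` for a real sentence embedding changes the similarity values but not the selection logic: the classification is assigned to the class that owns the best-matching synonym.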


	### Website User Manual
	1. Enter the time period of the tweets that you want to look into - The dates need to be in the format
	"YYYY-MM-DD", and between 2021-06-28 and 2022-08-10. It is preferable to choose a bigger span rather than a
	smaller one. Keep in mind that the number of tweets posted by the party leaders varies a lot
	(Annie Lööf: 1814 vs Jimmie Åkesson: 185 tweets in total).
	2. Select the party leader(s) you want to look into - At least one party leader has to be selected.
	3. Select the classifications you want to see statistics of: topic, sentiment and/or target.
	4. Apply - Always press this button after you check a new box or change the dates to update the website.
	5. Run - The pie charts and bar graphs should appear for your selected party leaders. Under the plots, a new
	panel will appear which lets users see how a prediction was made, i.e. classification from GPT-3 -> the
	merged classification.
	6. To see examples of how the topic/sentiment/target was predicted, select a type of
	classification and check the box "show stats". To download the CSV file containing all tweets and
	classifications for the checked party leaders, check "Export file". After making the selections, press
	Apply and then Run.
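
The date rules in step 1 can be expressed as a small check. This is a sketch of the stated constraints only; the function and constant names are hypothetical and the web app's actual validation may differ.

```python
from datetime import date, datetime

# Range of available tweets, as stated in the manual above.
FIRST_DAY = date(2021, 6, 28)
LAST_DAY = date(2022, 8, 10)


def parse_period(start_text, end_text):
    """Parse two "YYYY-MM-DD" strings and check them against the data range."""
    start = datetime.strptime(start_text, "%Y-%m-%d").date()
    end = datetime.strptime(end_text, "%Y-%m-%d").date()
    if start > end:
        raise ValueError("start date must not be after end date")
    if start < FIRST_DAY or end > LAST_DAY:
        raise ValueError(f"dates must be between {FIRST_DAY} and {LAST_DAY}")
    return start, end


period = parse_period("2021-07-01", "2022-01-31")
print(period)
```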

	### Data Frame Structure
	Each row in the database has the following structure:
	```id,tweet,date,user_id,username,urls,nlikes,nreplies,nretweets,class_tuple,main_topic,sub_topic,sentiment,target,merged_tuple,merged_topic,merged_target,cos_sim_topic,synonym_topic,cos_sim_target,synonym_target```.
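
As an illustration of working with this structure, the sketch below reads rows with these columns and counts sentiments per user. It uses only the standard library; the function name is hypothetical and the sample rows are made-up placeholders, not real tweets or classifications.

```python
import csv
import io
from collections import Counter

COLUMNS = (
    "id,tweet,date,user_id,username,urls,nlikes,nreplies,nretweets,"
    "class_tuple,main_topic,sub_topic,sentiment,target,merged_tuple,"
    "merged_topic,merged_target,cos_sim_topic,synonym_topic,"
    "cos_sim_target,synonym_target"
).split(",")


def sentiment_counts(csv_file):
    """Count (username, sentiment) pairs in a CSV with the columns above."""
    reader = csv.DictReader(csv_file)
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    counts = Counter()
    for row in reader:
        counts[(row["username"], row["sentiment"])] += 1
    return counts


def make_row(**values):
    """Build a placeholder row; columns not given are left empty."""
    return [values.get(column, "") for column in COLUMNS]


# In-memory stand-in for an exported file, filled with placeholder rows.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(COLUMNS)
writer.writerow(make_row(id="1", username="leader_a", sentiment="positive"))
writer.writerow(make_row(id="2", username="leader_a", sentiment="negative"))
writer.writerow(make_row(id="3", username="leader_b", sentiment="positive"))
buffer.seek(0)

counts = sentiment_counts(buffer)
print(counts)
```

For a file downloaded through "Export file", the same function could be called on a file object opened with `open(path, newline="", encoding="utf-8")`, assuming the export uses these column names as its header row.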


	<!---
	### For Developers
	To develop locally:

	1. Create a new branch https://www.howtogeek.com/714112/how-to-create-a-new-branch-in-github/
	a. name it yourname_dev

	2. Locally, you can clone the branch with 'git clone https url'. More info: https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository

	3. You will be asked to enter your username and password. This password is NOT the one you log in with, but a temporary Personal Access Token obtained from https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token

	4. To avoid having to type in the long string every time, follow https://medium.com/ci-cd-devops/how-to-cache-your-personal-access-token-pat-in-linux-environment-97791424eb83 to store it on your machine.

	5. You can now develop locally and push to the remote server in the same branch.


	To get all the dependencies:

	1. Create a virtual environment: https://docs.python.org/3/library/venv.html
	2. Activate your virtual environment
	3. Go to the root path of the project and type in the terminal:
	$ pip install -r requirements.txt
	4. In some cases twint does not install properly on Ubuntu. After setting everything up, this was solved by typing "sudo apt-get install build-essential" in the terminal.
	5. In order to use OpenAI you need an authorization token. This is provided by creating an 'env.' file in the root path of the project.
	6. Type the following in that file:
	OPENAI_AUTHTOKEN = your open-ai token
	7. The python file TextClassifier should now be able to use OpenAI, given that your token is not used up.
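
A token stored in the "KEY = value" form shown in step 6 could be read as sketched below. How TextClassifier actually loads the token is not shown here (it may rely on a library such as python-dotenv), so the function name and parsing rules are assumptions for illustration, and the sample value is a placeholder.

```python
def read_token(env_text, key="OPENAI_AUTHTOKEN"):
    """Return the value assigned to key in "KEY = value" text, or None."""
    for line in env_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            return value.strip().strip("'\"")
    return None


sample_text = "OPENAI_AUTHTOKEN = placeholder-token\n"
token = read_token(sample_text)
print(token is not None)
```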






	![Flowcharts(4)](https://user-images.githubusercontent.com/44498515/178435693-bd86c2b6-23ac-4b69-94ae-366553468502.png)

	To run, type in the terminal:
	1. '$ python app.py'
	2. or '$ make' --->