metadata

title: Pinpoint
emoji: 🔎
colorFrom: black
colorTo: red
sdk: gradio
sdk_version: 3.0.20
app_file: app.py
pinned: false
license: gpl-3.0

⚠️ This repository is based on PhD research that seeks to identify radicalisation on online platforms. As a result, text, themes, and content relating to far-right extremism are present in this repository. Please continue with care. ⚠️

http://www.samaritans.org - Call 116 123, https://www.act.campaign.gov.uk, https://www.actearly.uk, Prevent advice line 0800 011 3764

📍 Pinpoint is a suite of functionality for building and using a binary classifier for the identification of extremist content. 💻

Pinpoint Violent-Far-Right Extremism Classification

This worksheet can be used to identify whether a given piece of text contains violent-far-right extremism. It uses Pinpoint, a binary classifier framework. The classifier breaks feature extraction down into three main categories: Radical Language (textual features), Psychological Signals (psychological features), and Behavioural Features. A summary of these can be seen below:

Radical Language

Two feature groups are used in the radical language corpus category:

Word Vector Embedding scores - TF-IDF scores are calculated for each n-gram (uni-grams, bi-grams, and tri-grams) in the Stormfront corpus, and the top-scoring n-grams are used to train a word2vec model (using the word2vec implementation in the gensim package). The feature is the average of the vectors of each word, concatenated with the maximum and average for each word (in turn, each post is represented by a 200-dimension vector).
Capital Letter and Violent word frequency - Past literature has highlighted that capital word frequency is useful for identifying “yelling behaviour”, and that violent, radicalised, and terrorist dictionaries can be useful for identifying violent behaviour. A series of violent word dictionaries has been used for this purpose.
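
The two radical language feature groups can be sketched roughly as follows. This is a minimal illustration and not Pinpoint's actual code: the function names, the tokenisation, and the example dictionary are hypothetical, and the pooling step assumes one possible reading of the description above (an element-wise mean concatenated with an element-wise maximum, so that 100-dimension word vectors would give a 200-dimension post vector).

```python
import re


def capital_word_frequency(text):
    """Fraction of words written entirely in capitals (two letters or more)."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    shouted = [w for w in words if len(w) > 1 and w.isupper()]
    return len(shouted) / len(words)


def dictionary_frequency(text, dictionary):
    """Fraction of words in the text that appear in a given word dictionary."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(1 for w in words if w in dictionary) / len(words)


def pool_word_vectors(vectors):
    """Assumed pooling: element-wise mean concatenated with element-wise max."""
    dims = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]
    peak = [max(v[i] for v in vectors) for i in range(dims)]
    return mean + peak


# Hypothetical placeholder dictionary, not one of the dictionaries used by Pinpoint.
example_dictionary = {"attack", "destroy"}

caps = capital_word_frequency("STOP doing THAT now")
hits = dictionary_frequency("They will attack and destroy", example_dictionary)
post_vector = pool_word_vectors([[1.0, 4.0], [3.0, 2.0]])
```

In practice the word vectors would come from the trained word2vec model rather than being written out by hand.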

Psychological Signals

Research in behavioural residue and digital foot-printing suggests that individuals leave indicators of their personality online based on their day-to-day choices - including word choice. There is a wide variety of past research that highlights how terrorists and extremists can have differing personalities to non-extremists. This category captures these word choices using two feature groups:

LIWC Dictionary Scores - LIWC dictionary scores for each Parler text post and Stormfront forum post were calculated in the data aggregation phase. The Parler text post LIWC scores will be used as features when training the binary classifier in the next section. The specific scores were chosen based on previous research into identifying terrorism in social media text. These are: Clout, Analytic, Tone, Authentic, Anger, Sadness, Anxiety, Power, Reward, Risk, Achievement, Affiliation, I Pronoun, and P Pronoun.
Minkowski distance - The Minkowski distance is calculated between a post's LIWC dictionary scores (see above) and the average LIWC scores for the Stormfront forum posts.
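
As a rough illustration of the distance feature, the Minkowski distance of order p between two score vectors is the p-th root of the sum of the absolute differences raised to the power p. The sketch below is not Pinpoint's implementation: the function name and the example scores are hypothetical, and the order p used by Pinpoint is not stated here, so p=2 (Euclidean distance) is only an assumed default.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two equal-length score vectors."""
    if len(a) != len(b):
        raise ValueError("score vectors must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


# Hypothetical LIWC scores for one post and for the corpus average.
post_scores = [0.0, 0.0]
corpus_average = [3.0, 4.0]

euclidean = minkowski_distance(post_scores, corpus_average, p=2)
manhattan = minkowski_distance(post_scores, corpus_average, p=1)
```

With p=1 this reduces to the Manhattan distance and with p=2 to the Euclidean distance, so the choice of p controls how strongly large differences in a single LIWC score dominate the result.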

Behavioural Features

Behavioural features capture how a specific individual acts and portrays themselves online - including interactions such as who they follow, what they post about, and how often they post. This feature category normally includes information such as follower frequency (the ratio of the number of accounts the user is following versus the number of accounts following them) and post frequency (the ratio of posts posted since the account was created). However, due to the limitations of this data-set, these features have not been implemented. Instead, the following feature, which provides behavioural information, has been implemented:

Centrality - To capture how users interact with other users, a mention interaction graph was used. Edges are created between two users when one user's post mentions the other. Hashtag nodes are also created when a user mentions a hashtag. After all users' posts have been added to the graph, the degree of influence each user has over the network is calculated using the degree centrality method.
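
The mention graph and degree centrality can be sketched as below. This is an assumption-laden illustration, not Pinpoint's code: the function names, the regular expression used to find mentions and hashtags, and the example posts are all hypothetical. Degree centrality is taken here in its usual normalised form, a node's number of neighbours divided by the number of other nodes in the graph.

```python
import re
from collections import defaultdict


def build_mention_graph(posts):
    """Build an undirected graph from (author, text) pairs.

    Nodes are "@user" and "#hashtag" strings; an edge links an author to
    every user or hashtag mentioned in one of their posts.
    """
    graph = defaultdict(set)
    for author, text in posts:
        user_node = "@" + author
        graph[user_node]  # make sure authors with no mentions still get a node
        for mention in re.findall(r"[@#]\w+", text):
            if mention != user_node:
                graph[user_node].add(mention)
                graph[mention].add(user_node)
    return graph


def degree_centrality(graph):
    """Number of neighbours of each node divided by (node count - 1)."""
    n = len(graph)
    if n <= 1:
        return {node: 0.0 for node in graph}
    return {node: len(neighbours) / (n - 1) for node, neighbours in graph.items()}


# Hypothetical posts used only to exercise the sketch.
example_posts = [
    ("alice", "hello @bob #news"),
    ("bob", "@alice hi"),
    ("carol", "#news"),
]
centrality = degree_centrality(build_mention_graph(example_posts))
```

A graph library would normally handle this, but the plain dictionary version shows how little is needed: repeated mentions between the same pair collapse into one edge because neighbours are stored in sets.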

Because this worksheet needs to classify a single piece of text immediately, only Radical Language features are used. The model has the following statistics:

- Accuracy: 0.7368404073424881
- Recall: 0.6270593997684567
- Precision: 0.7848464582288358
- F-Measure: 0.6971362094997648
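
For readers who want to relate these figures to each other, the short check below assumes that the F-Measure is the balanced F1 score, that is, the harmonic mean of precision and recall. That assumption is not stated above, and the function name is hypothetical.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the list above.
reported_precision = 0.7848464582288358
reported_recall = 0.6270593997684567

f1 = f_measure(reported_precision, reported_recall)
print(round(f1, 4))
```

Under that assumption the computed value agrees with the reported F-Measure of roughly 0.697.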