{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "e3b3364f" }, "source": [ "# YouTube Hate Speech ML Project\n", "\n" ], "id": "e3b3364f" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "12f895df" }, "outputs": [], "source": [ "# First thing is to import libraries.\n", "# I am familiar with pandas and numpy, but let's research the other libraries." ], "id": "12f895df" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7cfcb724" }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.tree import DecisionTreeClassifier" ], "id": "7cfcb724" }, { "cell_type": "markdown", "metadata": { "id": "6b1fd69c" }, "source": [ "### Feature Extraction\n", "- `from sklearn.feature_extraction.text import CountVectorizer` converts text documents into a matrix\n", " of token counts.\n", "- A count vectorizer builds a vocabulary of \"features\" (the words in the documents) and counts how often each one occurs in each document.\n", "- `model_selection` provides tools for model selection and evaluation.\n", "- `train_test_split` will split our data into training and testing sets.\n", "- We use the `DecisionTreeClassifier` to train our model. The decision tree learns rules from the labeled training data and uses them to classify new text." 
], "id": "6b1fd69c" }, { "cell_type": "markdown", "metadata": { "id": "ad8a3dad" }, "source": [ "- Stopwords are words that occur frequently in a language and are removed during text preprocessing.\n", "- `import pr` prints objects in a human-readable format and is used for working with nltk.\n", "- `from nltk.stem.snowball import SnowballStemmer` imports the Snowball stemming algorithm, which reduces words to their stem.\n", "- An example of stemming is reducing the word 'running' to 'run'.\n" ], "id": "ad8a3dad" }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "60ab1d26", "outputId": "9509eed2-199d-4559-988a-fd1780e6ea3c" }, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n" ] } ], "source": [ "import re\n", "import string\n", "import nltk\n", "nltk.download('stopwords')\n", "from nltk.stem.snowball import SnowballStemmer" ], "id": "60ab1d26" }, { "cell_type": "markdown", "metadata": { "id": "538a8bf3" }, "source": [ "### Imports" ], "id": "538a8bf3" }, { "cell_type": "markdown", "metadata": { "id": "189098f2" }, "source": [ "- `import re` imports the regular expressions module, which lets us search and edit strings by pattern.\n", "- `import string` imports the `string` module, which provides constants such as `string.punctuation` that are useful for text processing tasks.\n", "\n", "### `import nltk` is SUPER important. It provides the following:\n", "\n", "1. Tokenization: Breaking text into words or sentences.\n", "2. Part-of-Speech Tagging: Assigning grammatical categories (e.g., noun, verb, adjective) to words in a sentence.\n", "3. Named Entity Recognition (NER): Identifying and classifying named entities such as persons, organizations, and locations in text.\n", "4. Parsing: Analyzing the syntactic structure of sentences.\n", "5. 
WordNet Integration: Accessing WordNet, a lexical database for English, for synonyms, antonyms, and other lexical relationships.\n", "6. Text Corpora: Access to various text corpora for training and testing NLP models.\n", "7. Text Classification and more.\n", "\n", "`nltk.corpus` is a module of the Natural Language Toolkit (NLTK). It contains large sets of text for linguistic analysis and development." ], "id": "189098f2" }, { "cell_type": "markdown", "metadata": { "id": "b375415f" }, "source": [ "The `.words` method returns the words in a corpus; here we use it to load the English stopword list." ], "id": "b375415f" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bae03f72", "scrolled": true }, "outputs": [], "source": [ "stemmer = nltk.SnowballStemmer(\"english\")\n", "from nltk.corpus import stopwords\n", "import string\n", "stopword = set(stopwords.words(\"english\"))" ], "id": "bae03f72" }, { "cell_type": "markdown", "metadata": { "id": "b13b94ce" }, "source": [ "### The dataset we will use can be found here:\n", "https://www.youtube.com/redirect?event=video_description&redir_token=QUFFLUhqbmp2NDVmdmoyNGpSS2hrdGNwdndvRl9WZnpqQXxBQ3Jtc0traEYtV0J4Ym5iYlJJa05tOXJyc1RSOEVzcnhOTUVxbU9YQnV2TWNZWVZ4WWZwRThCOTRUR2hFam9mbDZ5cW1Pa0VfRXhTcmhVaTBvX3pWeUN0THhwYVQycWZVbE1vcmxnakphdFF3SGxCMXhDNF9FUQ&q=https%3A%2F%2Fdrive.google.com%2Fdrive%2Ffolders%2F1uQiyJ_mDlOCcecMw7C-JYUs9bGnVJ_j8%3Fusp%3Dsharing&v=jbexvUovHxw" ], "id": "b13b94ce" }, { "cell_type": "markdown", "metadata": { "id": "bf758de5" }, "source": [ "- We import our dataset.\n", "- We drop any rows with `NaN` values.\n", "- We preview the first rows with `.head()`." 
], "id": "bf758de5" }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "6de55c38", "outputId": "5fad1bd7-e378-4a3b-9ef3-029e9e0762e6" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Unnamed: 0 count hate_speech offensive_language neither class \\\n", "0 0 3 0 0 3 2 \n", "1 1 3 0 3 0 1 \n", "2 2 3 0 3 0 1 \n", "3 3 3 0 2 1 1 \n", "4 4 6 0 6 0 1 \n", "\n", " tweet \n", "0 !!! RT @mayasolovely: As a woman you shouldn't... \n", "1 !!!!! RT @mleew17: boy dats cold...tyga dwn ba... \n", "2 !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby... \n", "3 !!!!!!!!! RT @C_G_Anderson: @viva_based she lo... \n", "4 !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you... " ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0counthate_speechoffensive_languageneitherclasstweet
0030032!!! RT @mayasolovely: As a woman you shouldn't...
1130301!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2230301!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3330211!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4460601!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "df1", "summary": "{\n \"name\": \"df1\",\n \"rows\": 24783,\n \"fields\": [\n {\n \"column\": \"Unnamed: 0\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 7299,\n \"min\": 0,\n \"max\": 25296,\n \"num_unique_values\": 24783,\n \"samples\": [\n 2326,\n 16283,\n 19362\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"count\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 3,\n \"max\": 9,\n \"num_unique_values\": 5,\n \"samples\": [\n 6,\n 7,\n 9\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"hate_speech\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 7,\n \"num_unique_values\": 8,\n \"samples\": [\n 1,\n 6,\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"offensive_language\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 9,\n \"num_unique_values\": 10,\n \"samples\": [\n 8,\n 3,\n 7\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"neither\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 9,\n \"num_unique_values\": 10,\n \"samples\": [\n 8,\n 0,\n 4\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"class\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 2,\n \"num_unique_values\": 3,\n \"samples\": [\n 2,\n 1,\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"tweet\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 24783,\n \"samples\": [\n \"934 8616\\ni got a missed call from yo bitch\",\n \"RT @KINGTUNCHI_: Fucking with a bad bitch you gone need some money lil homie!\",\n \"RT @eanahS__: @1inkkofrosess lol my credit ain't no where near good 
, but I know the right man for the job .. that ho nice though!\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 38 } ], "source": [ "df1 = pd.read_csv(\"twitter_data.csv\")\n", "df1 = df1.dropna()\n", "df1.head()" ], "id": "6de55c38" }, { "cell_type": "markdown", "metadata": { "id": "4c288f90" }, "source": [ "#### `.tolist()` converts an array-like object, here the pandas column `Index`, into a Python list." ], "id": "4c288f90" }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "dd22b72e", "outputId": "2fc7e2a7-d2ea-4e8d-ee2f-5a84eeaef1a7" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "['Unnamed: 0', 'count', 'hate_speech', 'offensive_language', 'neither', 'class', 'tweet']\n" ] } ], "source": [ "print(df1.columns.tolist())\n" ], "id": "dd22b72e" }, { "cell_type": "markdown", "metadata": { "id": "f584d46b" }, "source": [ "- The `.map()` method of a pandas Series replaces each value using a function or a dictionary lookup and returns the result.\n", "- We use `.map` to turn the class values 0, 1, and 2 into \"Hate Speech Detected\", \"Offensive language detected\", and \"No hate and offensive speech\", respectively." 
], "id": "f584d46b" }, { "cell_type": "markdown", "source": [ "### Preprocess the Labels" ], "metadata": { "id": "MSIgr88pMz8x" }, "id": "MSIgr88pMz8x" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "117eadd5" }, "outputs": [], "source": [ "df1['labels'] = df1['class'].map({0:\"Hate Speech Detected\", 1:\"Offensive language detected\", 2:\"No hate and offensive speech\"})\n", "\n", "# Merging the labels\n", "def unify_labels(row):\n", "    if row['labels'] in ['Hate Speech Detected', 'Offensive language detected']:\n", "        return 'Offensive or Hate Speech'\n", "    else:\n", "        return 'Not Hate'\n", "\n", "# Apply this function to the dataset with three labels\n", "df1['labels'] = df1.apply(unify_labels, axis=1)" ], "id": "117eadd5" }, { "cell_type": "code", 
"execution_count": null, "metadata": { "id": "8fdf617f", "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "outputId": "38e9261c-f906-476f-9179-71367b4b1c6b" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Unnamed: 0 count hate_speech offensive_language neither class \\\n", "0 0 3 0 0 3 2 \n", "1 1 3 0 3 0 1 \n", "2 2 3 0 3 0 1 \n", "3 3 3 0 2 1 1 \n", "4 4 6 0 6 0 1 \n", "\n", " tweet labels \n", "0 !!! RT @mayasolovely: As a woman you shouldn't... Not Hate \n", "1 !!!!! RT @mleew17: boy dats cold...tyga dwn ba... Offensive or Hate Speech \n", "2 !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby... Offensive or Hate Speech \n", "3 !!!!!!!!! RT @C_G_Anderson: @viva_based she lo... Offensive or Hate Speech \n", "4 !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you... Offensive or Hate Speech " ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0counthate_speechoffensive_languageneitherclasstweetlabels
0030032!!! RT @mayasolovely: As a woman you shouldn't...Not Hate
1130301!!!!! RT @mleew17: boy dats cold...tyga dwn ba...Offensive or Hate Speech
2230301!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...Offensive or Hate Speech
3330211!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...Offensive or Hate Speech
4460601!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...Offensive or Hate Speech
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "df1", "summary": "{\n \"name\": \"df1\",\n \"rows\": 24783,\n \"fields\": [\n {\n \"column\": \"Unnamed: 0\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 7299,\n \"min\": 0,\n \"max\": 25296,\n \"num_unique_values\": 24783,\n \"samples\": [\n 2326,\n 16283,\n 19362\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"count\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 3,\n \"max\": 9,\n \"num_unique_values\": 5,\n \"samples\": [\n 6,\n 7,\n 9\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"hate_speech\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 7,\n \"num_unique_values\": 8,\n \"samples\": [\n 1,\n 6,\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"offensive_language\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 9,\n \"num_unique_values\": 10,\n \"samples\": [\n 8,\n 3,\n 7\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"neither\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 9,\n \"num_unique_values\": 10,\n \"samples\": [\n 8,\n 0,\n 4\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"class\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 2,\n \"num_unique_values\": 3,\n \"samples\": [\n 2,\n 1,\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"tweet\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 24783,\n \"samples\": [\n \"934 8616\\ni got a missed call from yo bitch\",\n \"RT @KINGTUNCHI_: Fucking with a bad bitch you gone need some money lil homie!\",\n \"RT @eanahS__: @1inkkofrosess lol my credit ain't no where near good 
, but I know the right man for the job .. that ho nice though!\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"labels\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"Offensive or Hate Speech\",\n \"Not Hate\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 41 } ], "source": [ "df1['labels'].info\n", "df1.head()" ], "id": "8fdf617f" }, { "cell_type": "markdown", "source": [ "### Import the second dataset" ], "metadata": { "id": "9DgbrPGdSk5O" }, "id": "9DgbrPGdSk5O" }, { "cell_type": "code", "source": [ "!pip install datasets\n", "\n", "from datasets import load_dataset\n", "\n", "df2 = load_dataset(\"LennardZuendorf/Dynamically-Generated-Hate-Speech-Dataset\")" ], "metadata": { "id": "2VaMzXeZQQIz", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "431ad557-0e4d-42c9-ecfe-91ffa3598e17" }, "id": "2VaMzXeZQQIz", "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Requirement already satisfied: datasets in /usr/local/lib/python3.10/dist-packages (2.18.0)\n", "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from datasets) (3.13.1)\n", "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from datasets) (1.25.2)\n", "Requirement already satisfied: pyarrow>=12.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (14.0.2)\n", "Requirement already satisfied: pyarrow-hotfix in /usr/local/lib/python3.10/dist-packages (from datasets) (0.6)\n", "Requirement already satisfied: dill<0.3.9,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.3.8)\n", "Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets) (1.5.3)\n", "Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (2.31.0)\n", 
"Requirement already satisfied: tqdm>=4.62.1 in /usr/local/lib/python3.10/dist-packages (from datasets) (4.66.2)\n", "Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets) (3.4.1)\n", "Requirement already satisfied: multiprocess in /usr/local/lib/python3.10/dist-packages (from datasets) (0.70.16)\n", "Requirement already satisfied: fsspec[http]<=2024.2.0,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (2023.6.0)\n", "Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets) (3.9.3)\n", "Requirement already satisfied: huggingface-hub>=0.19.4 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.20.3)\n", "Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from datasets) (24.0)\n", "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from datasets) (6.0.1)\n", "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.3.1)\n", "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (23.2.0)\n", "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.4.1)\n", "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (6.0.5)\n", "Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.9.4)\n", "Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (4.0.3)\n", "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.19.4->datasets) (4.10.0)\n", "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from 
requests>=2.19.0->datasets) (3.3.2)\n", "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (3.6)\n", "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (2.0.7)\n", "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (2024.2.2)\n", "Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2.8.2)\n", "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2023.4)\n", "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas->datasets) (1.16.0)\n" ] } ] }, { "cell_type": "code", "source": [ "# Convert it to a DataFrame\n", "df2 = df2['train'].to_pandas() # or dataset['split'].to_pandas()\n", "\n", "# Rename columns\n", "df2.rename(columns={'text': 'tweet', 'label': 'labels'}, inplace=True)\n", "\n", "# Replace label values\n", "label_mapping = {'hate': 'Offensive or Hate Speech', 'nothate': 'Not Hate'}\n", "df2['labels'] = df2['labels'].replace(label_mapping)" ], "metadata": { "id": "ImSWI82fQ4yb" }, "id": "ImSWI82fQ4yb", "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "df2['labels'].info\n", "df2.head()" ], "metadata": { "id": "GZDrcM1ESa5e", "colab": { "base_uri": "https://localhost:8080/", "height": 310 }, "outputId": "1a9516b2-72d9-4a57-fbcd-ad2fdf189dfe" }, "id": "GZDrcM1ESa5e", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Unnamed: 0 acl.id X1 \\\n", "0 1 acl1 1 \n", "1 2 acl10 2 \n", "2 3 acl100 3 \n", "3 4 acl1000 4 \n", "4 5 acl10000 5 \n", "\n", " tweet \\\n", "0 dalits are lowlives \n", "1 gay people are a burden to our society \n", "2 Arabs are not welcome \n", "3 I'm not 
saying we should actually eliminate he... \n", "4 bananas are for black people \n", "\n", " labels type target level split round.base \\\n", "0 Offensive or Hate Speech notgiven notgiven original train 1 \n", "1 Offensive or Hate Speech notgiven notgiven original train 1 \n", "2 Offensive or Hate Speech notgiven notgiven original train 1 \n", "3 Offensive or Hate Speech notgiven notgiven original train 1 \n", "4 Offensive or Hate Speech notgiven notgiven original test 1 \n", "\n", " annotator round acl.id.matched \n", "0 GNZuCtwed3 1 None \n", "1 GNZuCtwed3 1 None \n", "2 vDe7GN0NrL 1 None \n", "3 oemYWm1Tjg 1 None \n", "4 QiOKkCi7F8 1 None " ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0acl.idX1tweetlabelstypetargetlevelsplitround.baseannotatorroundacl.id.matched
01acl11dalits are lowlivesOffensive or Hate Speechnotgivennotgivenoriginaltrain1GNZuCtwed31None
12acl102gay people are a burden to our societyOffensive or Hate Speechnotgivennotgivenoriginaltrain1GNZuCtwed31None
23acl1003Arabs are not welcomeOffensive or Hate Speechnotgivennotgivenoriginaltrain1vDe7GN0NrL1None
34acl10004I'm not saying we should actually eliminate he...Offensive or Hate Speechnotgivennotgivenoriginaltrain1oemYWm1Tjg1None
45acl100005bananas are for black peopleOffensive or Hate Speechnotgivennotgivenoriginaltest1QiOKkCi7F81None
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "df2", "summary": "{\n \"name\": \"df2\",\n \"rows\": 82399,\n \"fields\": [\n {\n \"column\": \"Unnamed: 0\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 12249,\n \"min\": 1,\n \"max\": 331000,\n \"num_unique_values\": 41323,\n \"samples\": [\n 20688,\n 24611,\n 24417\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"acl.id\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 41255,\n \"samples\": [\n \"acl32247\",\n \"acl13850\",\n \"acl3767\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"X1\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 11903,\n \"min\": 1,\n \"max\": 41255,\n \"num_unique_values\": 41255,\n \"samples\": [\n 24722,\n 4281,\n 30746\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"tweet\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 41134,\n \"samples\": [\n \"Interestingly, we have a fuckton of Qs in our community and they are one of the biggest bulliers of LGBTs! Awful!\",\n \"My main problems are depression and family issues. Soceity is fucked when you have people like my neighbour living next door. 
HELP ME.\",\n \"The world needs fewer muslims and more white supremacists\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"labels\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"Not Hate\",\n \"Offensive or Hate Speech\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"type\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"notgiven\",\n \"none\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"target\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 409,\n \"samples\": [\n \"bla.wom, asi.south, mixed.race\",\n \"dis, african\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"level\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"perturbation\",\n \"original\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"split\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"train\",\n \"test\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"round.base\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 1,\n \"max\": 4,\n \"num_unique_values\": 4,\n \"samples\": [\n 2,\n 4\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"annotator\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 20,\n \"samples\": [\n \"GNZuCtwed3\",\n \"dqrONtdjbt\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"round\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"1\",\n \"2b\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"acl.id.matched\",\n \"properties\": {\n 
\"dtype\": \"category\",\n \"num_unique_values\": 30098,\n \"samples\": [\n \"acl29061\",\n \"acl15075\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 44 } ] }, { "cell_type": "markdown", "source": [ "### Import the third dataset" ], "metadata": { "id": "u6MpXiwCgju8" }, "id": "u6MpXiwCgju8" }, { "cell_type": "code", "source": [ "# https://huggingface.co/datasets/ucberkeley-dlab/measuring-hate-speech\n", "\n", "df3 = load_dataset('ucberkeley-dlab/measuring-hate-speech', 'default')\n", "\n", "# Convert it to a DataFrame\n", "df3 = df3['train'].to_pandas()\n", "\n", "def classify_or_mark_for_test(row):\n", " if row['hate_speech_score'] >= -0.25:\n", " return 'Offensive or Hate Speech'\n", " elif row['hate_speech_score'] < -0.25:\n", " return 'Not Hate'\n", " else:\n", " return None\n", "\n", "# Apply the modified function\n", "df3['labels'] = df3.apply(classify_or_mark_for_test, axis=1)\n", "\n", "# Rename columns\n", "df3.rename(columns={'text': 'tweet'}, inplace=True)" ], "metadata": { "id": "AaO7r3wpgXlt" }, "id": "AaO7r3wpgXlt", "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "df3['labels'].info\n", "df3.head()" ], "metadata": { "id": "RUM4jbgWjenb", "colab": { "base_uri": "https://localhost:8080/", "height": 360 }, "outputId": "9aeb7bf5-3239-41ec-acbd-2039d4e75bc2" }, "id": "RUM4jbgWjenb", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " comment_id annotator_id platform sentiment respect insult humiliate \\\n", "0 47777 10873 3 0.0 0.0 0.0 0.0 \n", "1 39773 2790 2 0.0 0.0 0.0 0.0 \n", "2 47101 3379 3 4.0 4.0 4.0 4.0 \n", "3 43625 7365 3 2.0 3.0 2.0 1.0 \n", "4 12538 488 0 4.0 4.0 4.0 4.0 \n", "\n", " status dehumanize violence ... annotator_religion_jewish \\\n", "0 2.0 0.0 0.0 ... False \n", "1 2.0 0.0 0.0 ... False \n", "2 4.0 4.0 0.0 ... False \n", "3 2.0 0.0 0.0 ... False \n", "4 4.0 4.0 4.0 ... 
False \n", "\n", " annotator_religion_mormon annotator_religion_muslim \\\n", "0 False False \n", "1 False False \n", "2 False False \n", "3 False False \n", "4 False False \n", "\n", " annotator_religion_nothing annotator_religion_other \\\n", "0 False False \n", "1 False False \n", "2 True False \n", "3 False False \n", "4 False False \n", "\n", " annotator_sexuality_bisexual annotator_sexuality_gay \\\n", "0 False False \n", "1 False False \n", "2 False False \n", "3 False False \n", "4 False False \n", "\n", " annotator_sexuality_straight annotator_sexuality_other \\\n", "0 True False \n", "1 True False \n", "2 True False \n", "3 True False \n", "4 True False \n", "\n", " labels \n", "0 Not Hate \n", "1 Not Hate \n", "2 Offensive or Hate Speech \n", "3 Offensive or Hate Speech \n", "4 Offensive or Hate Speech \n", "\n", "[5 rows x 132 columns]" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
comment_idannotator_idplatformsentimentrespectinsulthumiliatestatusdehumanizeviolence...annotator_religion_jewishannotator_religion_mormonannotator_religion_muslimannotator_religion_nothingannotator_religion_otherannotator_sexuality_bisexualannotator_sexuality_gayannotator_sexuality_straightannotator_sexuality_otherlabels
0477771087330.00.00.00.02.00.00.0...FalseFalseFalseFalseFalseFalseFalseTrueFalseNot Hate
139773279020.00.00.00.02.00.00.0...FalseFalseFalseFalseFalseFalseFalseTrueFalseNot Hate
247101337934.04.04.04.04.04.00.0...FalseFalseFalseTrueFalseFalseFalseTrueFalseOffensive or Hate Speech
343625736532.03.02.01.02.00.00.0...FalseFalseFalseFalseFalseFalseFalseTrueFalseOffensive or Hate Speech
41253848804.04.04.04.04.04.04.0...FalseFalseFalseFalseFalseFalseFalseTrueFalseOffensive or Hate Speech
\n", "

5 rows × 132 columns

\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "df3" } }, "metadata": {}, "execution_count": 46 } ] }, { "cell_type": "markdown", "metadata": { "id": "a420ba1c" }, "source": [ "### Format each dataset to two columns: tweets and labels" ], "id": "a420ba1c" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5db5746b", "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "outputId": "6af1b33a-f061-4612-d16d-76d022cdf443" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " tweet labels\n", "0 !!! RT @mayasolovely: As a woman you shouldn't... Not Hate\n", "1 !!!!! RT @mleew17: boy dats cold...tyga dwn ba... Offensive or Hate Speech\n", "2 !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby... Offensive or Hate Speech\n", "3 !!!!!!!!! RT @C_G_Anderson: @viva_based she lo... Offensive or Hate Speech\n", "4 !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you... Offensive or Hate Speech" ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "df1", "summary": "{\n \"name\": \"df1\",\n \"rows\": 24783,\n \"fields\": [\n {\n \"column\": \"tweet\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 24783,\n \"samples\": [\n \"934 8616\\ni got a missed call from yo bitch\",\n \"RT @KINGTUNCHI_: Fucking with a bad bitch you gone need some money lil homie!\",\n \"RT @eanahS__: @1inkkofrosess lol my credit ain't no where near good , but I know the right man for the job .. that ho nice though!\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"labels\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"Offensive or Hate Speech\",\n \"Not Hate\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 47 } ], "source": [ "df1 = df1[['tweet', 'labels']]\n", "df1 = df1[['tweet', 'labels']].fillna(0)\n", "df1.head()" ], "id": "5db5746b" }, { "cell_type": "code", "source": [ "df2 = df2[['tweet', 'labels']]\n", "\n", "# Manually set the label for rows with indexes 5 and 17 to \"Offensive or Hate Speech\"\n", "df2.loc[[5, 17], 'labels'] = 'Offensive or Hate Speech'\n", "df2.head()" ], "metadata": { "id": "VmXNfJ3jTGFV", "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "outputId": "51c4e813-2fad-44c8-fdc9-c9e1a2c43f0a" }, "id": "VmXNfJ3jTGFV", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " tweet labels\n", "0 dalits are lowlives Offensive or Hate Speech\n", "1 gay people are a burden to our society Offensive or Hate Speech\n", "2 Arabs are not welcome Offensive or Hate Speech\n", "3 I'm not saying we should actually eliminate he... Offensive or Hate Speech\n", "4 bananas are for black people Offensive or Hate Speech" ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "df2", "summary": "{\n \"name\": \"df2\",\n \"rows\": 82399,\n \"fields\": [\n {\n \"column\": \"tweet\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 41134,\n \"samples\": [\n \"Interestingly, we have a fuckton of Qs in our community and they are one of the biggest bulliers of LGBTs! Awful!\",\n \"My main problems are depression and family issues. Soceity is fucked when you have people like my neighbour living next door. HELP ME.\",\n \"The world needs fewer muslims and more white supremacists\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"labels\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"Not Hate\",\n \"Offensive or Hate Speech\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 48 } ] }, { "cell_type": "code", "source": [ "df3 = df3[['tweet', 'labels']]\n", "df3.head()" ], "metadata": { "id": "tFHq-la7lW85", "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "outputId": "42396c47-5555-4004-e1ba-6185048835c2" }, "id": "tFHq-la7lW85", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " tweet labels\n", "0 Yes indeed. She sort of reminds me of the elde... Not Hate\n", "1 The trans women reading this tweet right now i... Not Hate\n", "2 Question: These 4 broads who criticize America... Offensive or Hate Speech\n", "3 It is about time for all illegals to go back t... Offensive or Hate Speech\n", "4 For starters bend over the one in pink and kic... Offensive or Hate Speech" ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "df3" } }, "metadata": {}, "execution_count": 49 } ] }, { "cell_type": "markdown", "metadata": { "id": "0e4048ff" }, "source": [ "### Now begins the process of cleaning the text" ], "id": "0e4048ff" }, { "cell_type": "markdown", "metadata": { "id": "236e992d" }, "source": [ "- `clean(text)` is the function we define to perform the text cleaning.\n", "- `text = re.sub('\\[.*?\\]', '', text)` uses the `re` module to find every piece of text enclosed in square brackets (the pattern `'\\[.*?\\]'`) and replace it with an empty string, which removes it." ], "id": "236e992d" }, { "cell_type": "markdown", "metadata": { "id": "48f9812a" }, "source": [ "#### `re.sub` works like this:\n", "\n", "`import re`\n", "\n", "`text = \"Hello, world! This is a test string.\"`\n", "\n", "Replace all occurrences of 'world' with 'planet':\n", "\n", "`new_text = re.sub(r'world', 'planet', text)`\n", "\n", "`print(new_text)`" ], "id": "48f9812a" }, { "cell_type": "markdown", "metadata": { "id": "0a4a9cdf" }, "source": [ "- We use the same `re.sub` function to replace matches of `'https?://\\S+|www\\.\\S+'` (URLs) and `'<.*?>+'` (HTML tags) with `''`, the empty string, which deletes them.\n", "- Breaking down `'[%s]' % re.escape(string.punctuation)`: `string.punctuation` contains all of the ASCII punctuation characters, and `re.escape` escapes each of them so that the regex treats them as literal characters rather than special symbols." 
], "id": "0a4a9cdf" }, { "cell_type": "markdown", "metadata": { "id": "4f3f70a2" }, "source": [ "- We put the `%s` in brackets `[]` because brackets define a character class, so the pattern matches any single character contained in that class.\n", "- `%s` is a string-formatting placeholder; the `%` operator substitutes the escaped punctuation string into it.\n", "- `text = re.sub('\\n', '', text)` removes all of the newline characters from the text by replacing them with an empty string." 
], "id": "4f3f70a2" }, { "cell_type": "markdown", "metadata": { "id": "5961fea1" }, "source": [ "### We will break down `text = re.sub('\\w*\\d\\w*', \"\", text)`\n", "- `\\w` matches any word character; for ASCII text this is equivalent to the class `[a-zA-Z0-9_]`.\n", "- `\\d` matches the decimal digits 0-9.\n", "- `*` matches zero or more occurrences of the preceding element, here `\\w`.\n", "- Put together, the pattern matches any word that contains at least one digit, and the call removes it." ], "id": "5961fea1" }, { "cell_type": "markdown", "metadata": { "id": "d8fb9332" }, "source": [ "### `text = [word for word in text.split() if word not in stopword]` operates as such:\n", "\n", "1. `text.split()` splits the input text into a list of words.\n", "2. The list comprehension `[word for word in text.split() if word not in stopword]` creates a new list containing only the words that are not stopwords.\n", "3. `\" \".join(text)` joins the remaining words back into one string, and the `return text` statement returns the cleaned text.\n", "4. We then apply our function to the \"tweet\" column.\n", "\n", "This was not a part of the tutorial, but I applied `df[\"tweet\"] = df[\"tweet\"].dropna()` to eliminate NaN values from the dataset. 
Note that assigning the result back to the column realigns it on the index, so the NaN rows stay in place; `df = df.dropna(subset=[\"tweet\"])` is the form that actually drops them.\n" ], "id": "d8fb9332" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "604ec08e" }, "outputs": [], "source": [ "def clean(text):\n", " text = str(text).lower()\n", " text = re.sub(r'\\[.*?\\]', '', text)\n", " text = re.sub(r'https?://\\S+|www\\.\\S+', '', text)\n", " text = re.sub('<.*?>+', '', text)\n", " text = re.sub('[%s]' % re.escape(string.punctuation), '', text)\n", " text = re.sub('\\n', '', text)\n", " text = re.sub(r'\\w*\\d\\w*', \"\", text)\n", " text = [word for word in text.split() if word not in stopword]\n", " text = \" \".join(text)\n", " return text\n", "# Apply the cleaning function to the 'tweet' column of df1 (df2 and df3 are not cleaned in this cell)\n", "df1['tweet'] = df1['tweet'].apply(clean)" ], "id": "604ec08e" }, { "cell_type": "code", "source": [ "# Print the head of df1\n", "print(\"df1 head:\")\n", "print(df1.head())\n", "\n", "# Add a separator for better readability\n", "print(\"\\n\" + \"-\"*78 + \"\\n\")\n", "\n", 
"# Print the head of df2\n", "print(\"df2 head:\")\n", "print(df2.head())\n", "\n", "# Add a separator for better readability\n", "print(\"\\n\" + \"-\"*78 + \"\\n\")\n", "\n", "# Print the head of df3\n", "print(\"df3 head:\")\n", "print(df3.head())" ], "metadata": { "id": "mmh9Nc_GWzA3", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "9d467693-3c2c-4957-f605-9dd1e45e13a2" }, "id": "mmh9Nc_GWzA3", "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "df1 head:\n", " tweet labels\n", "0 rt mayasolovely woman shouldnt complain cleani... Not Hate\n", "1 rt boy dats coldtyga dwn bad cuffin dat hoe place Offensive or Hate Speech\n", "2 rt urkindofbrand dawg rt ever fuck bitch start... Offensive or Hate Speech\n", "3 rt cganderson vivabased look like tranny Offensive or Hate Speech\n", "4 rt shenikaroberts shit hear might true might f... Offensive or Hate Speech\n", "\n", "------------------------------------------------------------------------------\n", "\n", "df2 head:\n", " tweet labels\n", "0 dalits are lowlives Offensive or Hate Speech\n", "1 gay people are a burden to our society Offensive or Hate Speech\n", "2 Arabs are not welcome Offensive or Hate Speech\n", "3 I'm not saying we should actually eliminate he... Offensive or Hate Speech\n", "4 bananas are for black people Offensive or Hate Speech\n", "\n", "------------------------------------------------------------------------------\n", "\n", "df3 head:\n", " tweet labels\n", "0 Yes indeed. She sort of reminds me of the elde... Not Hate\n", "1 The trans women reading this tweet right now i... Not Hate\n", "2 Question: These 4 broads who criticize America... Offensive or Hate Speech\n", "3 It is about time for all illegals to go back t... Offensive or Hate Speech\n", "4 For starters bend over the one in pink and kic... Offensive or Hate Speech\n" ] } ] }, { "cell_type": "markdown", "metadata": { "id": "7393229d" }, "source": [ "### Our data is ready. 
Now to build the classification model." ], "id": "7393229d" }, { "cell_type": "markdown", "metadata": { "id": "7e467fcc" }, "source": [ "- We start by creating NumPy arrays of our tweets and labels with `np.array(combined_df[\"tweet\"])` and `np.array(combined_df[\"labels\"])`." ], "id": "7e467fcc" }, { "cell_type": "markdown", "metadata": { "id": "46b4e571" }, "source": [ "- `CountVectorizer()` converts our text data into a matrix of token counts.\n", "\n", "- `cv.fit_transform(x)` calls the `fit_transform` method of the `CountVectorizer` on x.\n", "\n", "- `fit_transform` first learns the vocabulary of our data (the fit step) and then transforms each tweet into a row of token counts based on that vocabulary." ], "id": "46b4e571" }, { "cell_type": "markdown", "metadata": { "id": "abf75e09" }, "source": [ "- Next, we create the variables \"X_train, X_test, y_train, y_test\" and assign them the four arrays returned by `train_test_split`.\n", "- We input our x and y values into `train_test_split` with a `test_size` of 0.25 and a `random_state` of 42.\n", "- The `test_size` value 0.25 means that 25% of the data will be used for testing, while 75% will be used for training.\n", "- `train_test_split` shuffles the rows before splitting so that the order of the dataset does not bias the train and test sets; the `random_state` parameter seeds that shuffle.\n", "- Setting `random_state` to 42 produces the same split on every execution." 
], "id": "abf75e09" }, { "cell_type": "markdown", "metadata": { "id": "cb9565bd" }, "source": [ "### Decision Tree Classifier\n", "\n", "- The decision tree chooses the best feature to split on (the root).\n", "- Next, it creates a split based on that feature.\n", "- Then, the decision tree repeats the above steps on each branch until every leaf node is pure, meaning all of the samples in it share the same label and there are no more decisions to make.\n", "- We chose a Decision Tree because our data is labeled and what we want the model to do (predict hate speech) is a classification task." 
], "id": "cb9565bd" }, { "cell_type": "markdown", "metadata": { "id": "e156f5cc" }, "source": [ "![image.png](attachment:image.png)" ], "id": "e156f5cc" }, { "cell_type": "markdown", "metadata": { "id": "22e3bf28" }, "source": [ "#### Finally, we fit our model.\n", "\n", "- The `.fit` method takes the input features (X) and corresponding target labels (y) as arguments and uses them to train the model. In other words, fitting the model == training the model.\n", "\n", "- the line of code `clf.fit(X_train,y_train)` is equivalent to `DecisionTreeClassifier().fit(X_train,y_train)`.\n", "\n", "- We now have a trained classification model!\n" ], "id": "22e3bf28" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "d683c415", "colab": { "base_uri": "https://localhost:8080/", "height": 75 }, "outputId": "8393400f-fa3f-4dd6-947c-0e038efeab37" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "DecisionTreeClassifier()" ], "text/html": [ "
DecisionTreeClassifier()
" ] }, "metadata": {}, "execution_count": 52 } ], "source": [ "combined_df = pd.concat([df1, df2, df3], ignore_index=True)\n", "\n", "x = np.array(combined_df[\"tweet\"])\n", "y = np.array(combined_df[\"labels\"])\n", "\n", "cv = CountVectorizer()\n", "x = cv.fit_transform(x)\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(x,y, test_size = 0.25, random_state = 42)\n", "clf = DecisionTreeClassifier()\n", "clf.fit(X_train,y_train)\n", "\n" ], "id": "d683c415" }, { "cell_type": "code", "source": [ "from sklearn.metrics import accuracy_score, classification_report\n", "\n", "y_pred = clf.predict(X_test)\n", "print(f\"Accuracy: {accuracy_score(y_test, y_pred)}\")\n", "print(classification_report(y_test, y_pred))" ], "metadata": { "id": "sdFRCXtGY3yI", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "97939191-203a-4c8b-f0da-ac30fc3945ee" }, "id": "sdFRCXtGY3yI", "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Accuracy: 0.9212820301557222\n", " precision recall f1-score support\n", "\n", " Not Hate 0.91 0.92 0.91 27996\n", "Offensive or Hate Speech 0.93 0.92 0.93 32689\n", "\n", " accuracy 0.92 60685\n", " macro avg 0.92 0.92 0.92 60685\n", " weighted avg 0.92 0.92 0.92 60685\n", "\n" ] } ] }, { "cell_type": "markdown", "metadata": { "id": "b25a0cf6" }, "source": [ "### Testing the model\n", "- test_data will be the words that are input to test if the words are offensive or not." 
], "id": "b25a0cf6" }, { "cell_type": "markdown", "metadata": { "id": "9902038d" }, "source": [ "#### Let's break down the code `df = cv.transform([test_data]).toarray()` :\n" ], "id": "9902038d" }, { "cell_type": "markdown", "metadata": { "id": "45c25d08" }, "source": [ "- `cv.transform` converts text data into a numerical representation suitable for machine learning algorithms to process, using the vocabulary learned earlier by `fit_transform`.\n", "- We are transforming the `test_data` into a vector of token counts using the `transform()` method.\n", "- Next, we use `.toarray()` to convert the sparse matrix returned by `transform()` into a dense NumPy array. This step is optional here: `DecisionTreeClassifier` accepts sparse input, and the model itself was trained on the sparse output of `fit_transform`." ], "id": "45c25d08" }, { "cell_type": "markdown", "metadata": { "id": "14394f22" }, "source": [ "#### Lastly, we will make predictions using the trained model.\n", "\n", "- The `predict` method uses the data from `df` to return an array containing one predicted label for each data point.\n", "- In the case of our hate speech ML model, we use the `predict()` method to predict whether the text entered falls under the label \"Not Hate\" or \"Offensive or Hate Speech\".\n", "- `clf.predict(df)` means that our `DecisionTreeClassifier`, `clf`, uses the `predict` method to take the features from our `df` array and make predictions based on the patterns and relationships the model learned during training." ], "id": "14394f22" }, { "cell_type": "markdown", "metadata": { "id": "7cb51e7d" }, "source": [ "### Below is the code that the video wanted me to use to test this model. I will write a function later so we do not have to re-type the test data string every time.\n", "- If you would like to test out the model, edit the words in the `test_data` string.\n", "- This model is not perfect; test it out and document what works and what does not." 
], "id": "7cb51e7d" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4c8d6a3b", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "f4c6a6e2-19dd-43ed-ec4d-0a2c5bcc4440" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "['Offensive or Hate Speech']\n" ] } ], "source": [ "test_data = \"Arabs welcome\"\n", "# Use a separate name so the combined_df DataFrame is not overwritten\n", "df = cv.transform([test_data]).toarray()\n", "print(clf.predict(df))" ], "id": "4c8d6a3b" }, { "cell_type": "markdown", "metadata": { "id": "2d0fbb0d" }, "source": [ "# Conclusion of Hate Speech Detection project" ], "id": "2d0fbb0d" }, { "cell_type": "markdown", "metadata": { "id": "7891f059" }, "source": [ "- I learned what many tools in sklearn do, such as `train_test_split()`, `fit()` and `CountVectorizer`.\n", "- I learned about the nltk library, which is quite useful for NLP and text processing tasks.\n", "- I learned about decision trees and stopwords.\n", "- I learned `.tolist()` converts NumPy arrays into Python lists.\n", "- I learned how to use the `map()` function.\n", "- I learned much about the `re.sub` function for text cleaning.\n", "- I learned why we pass the values 0.25 and 42 as `test_size` and `random_state` when splitting our data.\n", "- I learned how to split data into a training set and a test set. For this lab, it was done by:\n", "`X_train, X_test, y_train, y_test = train_test_split(x,y, test_size = 0.25, random_state = 42)`\n", "- I learned that `.toarray()` transforms the vectorized data into a dense NumPy array before we make predictions.\n", "- I learned that the `.predict` method predicts values based on the trained classifier. 
In this case, it was `clf`.\n", "\n", "\n", "\n" ], "id": "7891f059" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "fb36a279" }, "outputs": [], "source": [], "id": "fb36a279" } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 5 }