{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### **Sentiment Analysis of r/place 2023 Data**\n", "\n", "This notebook uses Natural Language Processing to track how the Reddit community felt about the r/place social experiment in July 2023. I use the bertweet-sentiment-analysis model to determine whether each comment is positive, neutral, or negative." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we need to import the necessary libraries. The csv library reads the comments stored in the CSV file, the transformers library creates a pipeline for the bertweet-sentiment-analysis model, matplotlib draws the pie chart, and random picks sample comments for spot checks." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Extracting Data**" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "c:\\Python311\\Lib\\site-packages\\tqdm\\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n" ] } ], "source": [ "import csv\n", "import matplotlib.pyplot as plt\n", "from transformers import pipeline\n", "import random" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we extract the comments from the CSV file and append them to a list." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# store the comments in a list\n", "# each row has one comment\n", "comments = []\n", "\n", "# open the file storing reddit comments\n", "# specify utf-8 encoding to prevent unicode decode error\n", "# newline=\"\" is what the csv module recommends for files it reads\n", "filepath = \"data/place_comments.csv\"\n", "with open(filepath, \"r\", encoding=\"utf-8\", newline=\"\") as f:\n", "    csv_reader = csv.reader(f)\n", "    next(csv_reader)  # skip the header row\n", "    for row in csv_reader:\n", "        comments.append(row[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's record how many comments we have before analyzing them." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "8573\n" ] } ], "source": [ "n = len(comments)\n", "print(n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Creating the Pipeline**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create a pipeline using the bertweet-sentiment-analysis model mentioned earlier, loaded from a local copy in saved_model/." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers\n", "pip install xformers.\n" ] } ], "source": [ "# load the locally saved model and tokenizer into a sentiment-analysis pipeline\n", "FILE_PATH = \"saved_model/\"\n", "sentiment_pipeline = pipeline(task=\"sentiment-analysis\",\n", "                              model=FILE_PATH,\n", "                              tokenizer=FILE_PATH)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we sort the comments into positive, neutral, and negative lists. For each comment, we check whether the model's label is positive (`LABEL_2`), neutral (`LABEL_1`), or negative (`LABEL_0`). Comments longer than 128 characters are cut to their first 128 characters before classification. This is a rough, character-based way to keep the input within the model's length limit." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# lists for the positive, neutral, and negative comments\n", "pos_comments = []\n", "neu_comments = []\n", "neg_comments = []\n", "\n", "# for each comment, determine whether the model identifies it as pos/neu/neg\n", "# classify only the first MAX_CHARS characters (not tokens) of each comment,\n", "# because overly long inputs make the pipeline crash\n", "MAX_CHARS = 128\n", "for comment in comments:\n", "    # slicing leaves comments shorter than MAX_CHARS unchanged\n", "    result = sentiment_pipeline(comment[:MAX_CHARS])\n", "    label = result[0][\"label\"]\n", "\n", "    if label == \"LABEL_2\":\n", "        pos_comments.append(comment)\n", "    elif label == \"LABEL_1\":\n", "        neu_comments.append(comment)\n", "    else:\n", "        neg_comments.append(comment)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Visualizing the Results**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we've sorted the comments into positive, neutral, and negative lists, let's graph the results using a pie chart." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# find the labels and the sizes of each comment category\n", "labels = [\"Positive\", \"Neutral\", \"Negative\"]\n", "sizes = [len(pos_comments), len(neu_comments), len(neg_comments)]\n", "\n", "# plot the results\n", "fig, ax = plt.subplots()\n", "ax.pie(sizes, labels=labels, autopct='%1.1f%%')\n", "plt.title(f\"Sentiment Distribution of r/place 2023\\nComments on Selected Discussion Threads\\n(sample size={n})\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of positive comments: 2136\n", "Number of neutral comments: 3972\n", "Number of negative comments: 2465\n" ] } ], "source": [ "print(\"Number of positive comments: \", len(pos_comments))\n", "print(\"Number of neutral comments: \", len(neu_comments))\n", "print(\"Number of negative comments: \", len(neg_comments))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Analyzing the Results**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model reports that an astounding 47.4% of the comments are neutral, while 40.5% of the comments are negative! In contrast, only 12.1% of the comments are positive. This suggests that the general sentiment of the r/place 2023 was generally negative!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can evaluate the model's accuracy based on random comments can be selected from the list." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I never participated in this before and honestly, fuck it next time. 
My square got undone by a \"new user\" and constantly continues to happen.\n", "wheres the heat map?\n", "fuck u/spez\n" ] } ], "source": [ "# print one random comment from each category\n", "print(random.choice(pos_comments))\n", "print(random.choice(neu_comments))\n", "print(random.choice(neg_comments))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Room for Improvement**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As noted in the data visualization notebook, German comments may be difficult for many NLP models to evaluate, because those models were trained on English text. In future versions of this project, I plan to separate the English and German comments and let a German-language NLP model evaluate the German ones." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Credits**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This Jupyter notebook makes great use of the bertweet-sentiment-analysis model found at https://huggingface.co/finiteautomata/bertweet-base-sentiment-analysis. Thank you to Juan Manuel Pérez (github: finiteautomata) for creating this model and publishing it on huggingface.co.\n", "\n", "The matplotlib documentation was used to create the pie chart visualization. Source: https://matplotlib.org/stable/gallery/pie_and_polar_charts/pie_features.html" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.2" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }