{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### **Visualizing Data from r/place**\n", "Here, I will use matplotlib to visualize the most common comments as a preview of whether the sentiment toward r/place 2023 was generally negative, neutral, or positive." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import csv\n", "from collections import Counter\n", "\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we will find the most common comments on r/place 2023." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# store the comment frequencies in a Counter\n", "# key=comment (lowercased), value=frequency\n", "freqs = Counter()\n", "\n", "# open the file storing reddit comments\n", "# specify utf-8 encoding to prevent unicode decode error\n", "# newline=\"\" lets the csv module handle line breaks inside quoted comments\n", "filepath = \"data/place_comments.csv\"\n", "with open(filepath, \"r\", encoding=\"utf-8\", newline=\"\") as f:\n", "    csv_reader = csv.reader(f)\n", "    next(csv_reader, None)  # skip the header row\n", "    for row in csv_reader:\n", "        if not row:  # ignore blank lines\n", "            continue\n", "        freqs[row[0].lower()] += 1" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# sort the words in decreasing frequency\n", "comments = [(comment, freqs[comment]) for comment in freqs]\n", "comments.sort(key=lambda x: -x[1])\n", "\n", "# store the top comments in two axes\n", "# modify the LIMIT to find the top # of comments\n", "LIMIT = 10\n", "y, x = [], []\n", "for i in range(LIMIT-1, -1, -1):\n", " y.append(comments[i][0])\n", " x.append(comments[i][1])\n", "\n", "# plot the results\n", "plt.barh(y, x)\n", "plt.title(\"Most Frequent Comments on r/place 2023 Discussion Threads\")\n", "plt.xlabel(\"Comment Frequency\")\n", "plt.ylabel(\"Comment\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Data Preview Analysis**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**u/spez is unpopular.**
\n", "From the visualization, we can see that \"fuck u/spez\" is the most popular comment, with over 500 instances recorded across the comment sections. The variant \"fuck spez\" is the second most popular comment, indicating widespread distrust of u/spez, the account of Steve Huffman, the CEO of Reddit." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**There are numerous German users/comments on Reddit.**
\n", "Surprisingly, five of the ten most frequent comments are in German. Therefore, when performing sentiment analysis on r/place data, an NLP model can be inaccurate if it cannot understand German comments.\n", "\n", "A workaround for this problem would be to first identify whether each comment is German or English. A German NLP model and an English NLP model could then each predict on the comments in their respective language, and the two sets of predictions could be combined to find the overall sentiment of the website." ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.2" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }