{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### **r/place 2023 Sentiment Analysis Model**\n", "This Jupyter notebook will fine-tune the DistilBERT model to perform sentiment analysis on Reddit comments in July 2023. Feel free to tweak the variables and code here. Credits are included at the end of the notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Install Dependencies**
\n", "This notebook has been tested on Python 3.11.2 and uses Pytorch." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import csv\n", "import datasets\n", "import pandas as pd\n", "import sklearn\n", "import torch" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Load the Data**
\n", "The target CSV file has Reddit comments in Column 0 and a score in Column 1. The scores correspond to the following sentiments: -1 = negative, 0 = neutral, 1 = positive. We will tweak the range from [-1, 1] to [0, 2] to match the model's labels." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# define the data path and store the comments in a list\n", "data_path = \"data/Reddit_Data.csv\"\n", "comments_and_scores = []\n", "\n", "# read the csv and store each comment with its respective score\n", "with open(data_path, \"r\", encoding=\"utf8\") as f:\n", " csv_reader = csv.reader(f)\n", " next(csv_reader)\n", " for row in csv_reader:\n", " comment, score = row\n", " comments_and_scores.append((comment, int(score)+1))\n", "\n", "print(comments_and_scores[0])\n", "print(len(comments_and_scores))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Separate Training and Testing Datasets**
\n", "We need to separate these comments into training and testing datasets." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "train_set, test_set = train_test_split(comments_and_scores,\n", " test_size=0.2,\n", " random_state=24)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(len(train_set))\n", "print(len(test_set))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(train_set[0])\n", "print(test_set[0])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# extract the training comments and scores\n", "train_comments = [group[0] for group in train_set]\n", "train_scores = [group[1] for group in train_set]\n", "\n", "# extract the testing comments and scores\n", "test_comments = [group[0] for group in test_set]\n", "test_scores = [group[1] for group in test_set]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(train_comments[0], train_scores[0])\n", "print(test_comments[0], test_scores[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have the training and testing datasets, we will convert them into Pandas DataFrame objects." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# extract the train set from the list\n", "train_set = {\"text\": train_comments, \"labels\": train_scores}\n", "train_set = pd.DataFrame(train_set)\n", "train_set = datasets.Dataset.from_pandas(train_set)\n", "print(train_set)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# extract the test set from the list\n", "test_set = {\"text\": test_comments, \"labels\": test_scores}\n", "test_set = pd.DataFrame(test_set)\n", "test_set = datasets.Dataset.from_pandas(test_set)\n", "print(test_set)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Tokenize the Data**
\n", "Prior to training the model, we will tokenize the Reddit comments into small pieces to make it easier for the model to identify the comment's sentiment. Note: I disabled the warning for the fast tokenizer request as it will prevent the trainer from running the .train() function later in the notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from transformers import DistilBertTokenizer\n", "tokenizer = DistilBertTokenizer.from_pretrained(\"distilbert-base-uncased\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# prepare the text inputs for the model\n", "def preprocess_function(examples):\n", " return tokenizer(examples[\"text\"], truncation=True, max_length=128)\n", "\n", "tokenized_train = train_set.map(preprocess_function, batched=True)\n", "tokenized_test = test_set.map(preprocess_function, batched=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use data_collector to convert our samples to PyTorch tensors and concatenate them with the correct amount of padding\n", "from transformers import DataCollatorWithPadding\n", "data_collator = DataCollatorWithPadding(tokenizer=tokenizer)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define DistilBERT as our base model:\n", "from transformers import DistilBertForSequenceClassification\n", "model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import precision_recall_fscore_support\n", "def compute_metrics(p):\n", " preds = p.predictions.argmax(axis=1)\n", " return {\n", " 'precision': precision_recall_fscore_support(p.label_ids, preds, average='weighted')[0],\n", " 'recall': precision_recall_fscore_support(p.label_ids, preds, average='weighted')[1],\n", " 'f1': precision_recall_fscore_support(p.label_ids, preds, average='weighted')[2],\n", " }" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define a new Trainer with all the objects we constructed so far\n", "from transformers import TrainingArguments, Trainer\n", "\n", "training_args = TrainingArguments(\n", " output_dir='args/',\n", " evaluation_strategy='epoch',\n", " save_total_limit=2,\n", " learning_rate=2e-5,\n", " per_device_train_batch_size=16,\n", " per_device_eval_batch_size=16,\n", " num_train_epochs=3,\n", " weight_decay=0.01,\n", " logging_dir='logs/',\n", ")\n", "\n", "trainer = Trainer(\n", " model=model,\n", " args=training_args,\n", " train_dataset=tokenized_train,\n", " eval_dataset=tokenized_test,\n", " tokenizer=tokenizer,\n", " data_collator=data_collator,\n", " compute_metrics=compute_metrics,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trainer.train()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model.save_pretrained('saved_model/')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.2" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }