### **r/place 2023 Sentiment Analysis Model**
This Jupyter notebook fine-tunes the DistilBERT model to perform sentiment analysis on Reddit comments from July 2023. Feel free to tweak the variables and code here. Credits are included at the end of the notebook.

**Install Dependencies**<br>
This notebook has been tested on Python 3.11.2 and uses PyTorch. It also relies on the `datasets`, `pandas`, `scikit-learn`, and `transformers` packages.

In [None]:
import csv
import datasets
import pandas as pd
import sklearn
import torch

**Load the Data**<br>
The target CSV file has Reddit comments in Column 0 and a score in Column 1. The scores correspond to the following sentiments: -1 = negative, 0 = neutral, 1 = positive. We will shift the scores from [-1, 1] to [0, 2] by adding 1, so that they match the model's labels, which start at 0.

In [None]:
# define the data path and store the comments in a list
data_path = "data/Reddit_Data.csv"
comments_and_scores = []

# read the csv and store each comment with its respective score
with open(data_path, "r", encoding="utf8") as f:
    csv_reader = csv.reader(f)
    next(csv_reader)
    for row in csv_reader:
        comment, score = row
        comments_and_scores.append((comment, int(score)+1))

print(comments_and_scores[0])
print(len(comments_and_scores))
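
As a quick sanity check on the label shift, the sketch below runs the same parsing logic on a small in-memory CSV. The sample rows, the header names, and the `ID_TO_SENTIMENT` mapping are illustrative assumptions and are not taken from the dataset.

```python
import csv
import io

# Hypothetical sample rows in the same layout as the target CSV
# (comment in column 0, score in column 1).
sample_csv = io.StringIO(
    "comment,score\n"
    "i love this canvas,1\n"
    "the mods ruined it,-1\n"
    "it is a pixel,0\n"
)

# Assumed names for the shifted label ids.
ID_TO_SENTIMENT = {0: "negative", 1: "neutral", 2: "positive"}

reader = csv.reader(sample_csv)
next(reader)  # skip the header row
# Shift each score from [-1, 1] to [0, 2].
shifted = [(comment, int(score) + 1) for comment, score in reader]

for comment, label in shifted:
    print(label, ID_TO_SENTIMENT[label], comment)
```

Subtracting 1 from a label recovers the original score.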

**Separate Training and Testing Datasets**<br>
We need to separate these comments into training and testing datasets. We hold out 20% of the comments for testing and fix the random seed so the split is reproducible.

In [None]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(comments_and_scores,
                                       test_size=0.2,
                                       random_state=24)

In [None]:
print(len(train_set))
print(len(test_set))
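
The two lengths printed above can be predicted ahead of time. The sketch below assumes that scikit-learn rounds the test count up when `test_size` is a fraction and gives the remainder to the training set; `expected_split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def expected_split_sizes(n_samples, test_size):
    """Estimate (n_train, n_test) for a fractional test_size.

    Assumes the test count is rounded up and the training set
    receives whatever is left.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

# Example with made-up sample counts
print(expected_split_sizes(10, 0.2))
print(expected_split_sizes(11, 0.2))
```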

In [None]:
print(train_set[0])
print(test_set[0])

In [None]:
# extract the training comments and scores
train_comments = [group[0] for group in train_set]
train_scores = [group[1] for group in train_set]

# extract the testing comments and scores
test_comments = [group[0] for group in test_set]
test_scores = [group[1] for group in test_set]

In [None]:
print(train_comments[0], train_scores[0])
print(test_comments[0], test_scores[0])

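Now that we have the training and testing datasets, we will convert them into Pandas DataFrame objects and then into Hugging Face `Dataset` objects, which is the format used for tokenization and training below.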

In [None]:
# build the training Dataset from the comment and score lists
train_set = {"text": train_comments, "labels": train_scores}
train_set = pd.DataFrame(train_set)
train_set = datasets.Dataset.from_pandas(train_set)
print(train_set)

In [None]:
# build the testing Dataset from the comment and score lists
test_set = {"text": test_comments, "labels": test_scores}
test_set = pd.DataFrame(test_set)
test_set = datasets.Dataset.from_pandas(test_set)
print(test_set)

**Tokenize the Data**<br>
Before training the model, we tokenize the Reddit comments: each comment is split into subword tokens and mapped to the integer IDs the model takes as input, and comments longer than 128 tokens are truncated. Note: I disabled the fast-tokenizer warning because it prevented the trainer from running the .train() function later in the notebook.

In [None]:
from transformers import DistilBertTokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

In [None]:
# prepare the text inputs for the model
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=128)

tokenized_train = train_set.map(preprocess_function, batched=True)
tokenized_test = test_set.map(preprocess_function, batched=True)
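
To illustrate what `truncation=True` with a `max_length` does, here is a toy sketch. It splits on whitespace instead of using DistilBERT's WordPiece vocabulary, so `toy_encode` is a hypothetical stand-in and not the tokenizer's real behavior; the point is only that the two special tokens count toward the length limit.

```python
def toy_encode(text, max_length):
    """Toy stand-in for a tokenizer call with truncation enabled.

    Splits on whitespace (the real tokenizer uses WordPiece) and keeps
    the result, including the two special tokens, within max_length.
    """
    words = text.lower().split()
    # Reserve two slots for the [CLS] and [SEP] special tokens.
    kept = words[: max_length - 2]
    return ["[CLS]"] + kept + ["[SEP]"]

tokens = toy_encode("This canvas is the best thing on Reddit this year", max_length=6)
print(tokens)
print(len(tokens))
```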

In [None]:
# Use data_collator to convert our samples to PyTorch tensors and batch them with the correct amount of padding
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
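
The idea behind padding per batch can be sketched in plain Python. This is a simplified illustration, not the library's implementation: `pad_batch` is a hypothetical helper, the token ids are example values, and the real collator returns PyTorch tensors rather than lists.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        # Real tokens keep their ids; padding positions get pad_id.
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two tokenized comments of different lengths (illustrative ids)
batch = pad_batch([[101, 2023, 102], [101, 2023, 2003, 2307, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because each batch is padded only to its own longest comment, short comments do not pay the cost of the full 128-token limit.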

In [None]:
# Define DistilBERT as our base model:
from transformers import DistilBertForSequenceClassification
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)
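
With `num_labels=3`, the classification head produces three logits per comment. The sketch below shows one way to turn such logits into probabilities and map the winning label back to the original [-1, 1] score; `predict_score` and the example logits are hypothetical and do not come from the trained model.

```python
import math

def predict_score(logits):
    """Return (original_score, probabilities) for one row of three logits."""
    # Softmax, shifted by the maximum for numerical stability.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # The label is the index of the highest probability (0, 1, or 2).
    label = max(range(len(probs)), key=probs.__getitem__)
    # Undo the +1 shift applied when the data was loaded.
    return label - 1, probs

score, probs = predict_score([0.1, 0.2, 3.0])  # made-up logits
print(score, probs)
```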

In [None]:
from sklearn.metrics import precision_recall_fscore_support
def compute_metrics(p):
    # pick the highest-scoring class for each comment
    preds = p.predictions.argmax(axis=1)
    # compute all three weighted metrics in a single call
    precision, recall, f1, _ = precision_recall_fscore_support(
        p.label_ids, preds, average='weighted'
    )
    return {'precision': precision, 'recall': recall, 'f1': f1}
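
For intuition about `average='weighted'`, the sketch below computes weighted precision by hand: each class's precision is weighted by the share of true examples belonging to that class. `weighted_precision` is a hypothetical helper written for illustration, and the labels are made up.

```python
from collections import Counter

def weighted_precision(y_true, y_pred):
    """Per-class precision, averaged with weights equal to class support."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        predicted = sum(1 for p in y_pred if p == label)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        # A class that is never predicted contributes a precision of 0.
        precision = correct / predicted if predicted else 0.0
        score += precision * count / total
    return score

y_true = [0, 0, 1, 2]
y_pred = [0, 1, 1, 2]
result = weighted_precision(y_true, y_pred)
print(result)
```

Here class 0 has precision 1.0 with weight 2/4, class 1 has precision 0.5 with weight 1/4, and class 2 has precision 1.0 with weight 1/4, giving 0.875.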

In [None]:
# Define a new Trainer with all the objects we constructed so far
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='args/',
    evaluation_strategy='epoch',
    save_total_limit=2,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='logs/',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()
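
A rough way to estimate how long training will run is to count optimizer steps. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; `total_training_steps` is a hypothetical helper and the dataset size is a made-up example, not the real size of the training set.

```python
import math

def total_training_steps(n_train, batch_size, num_epochs):
    """Estimate optimizer steps under the assumptions stated above."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    return steps_per_epoch * num_epochs

# Example: 100 training comments with the batch size and epochs used above
steps = total_training_steps(n_train=100, batch_size=16, num_epochs=3)
print(steps)
```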

In [None]:
model.save_pretrained('saved_model/')