Kewl committed on
Commit dead201
1 Parent(s): af65063

Milestone 4 (#15)

* adding documentation and comments

* adding google sites link

* Update README.md

* adding doc accuracy on test

Files changed (3)
  1. README.md +74 -0
  2. app.py +24 -14
  3. fine-tune-toxic-tweets.ipynb +7 -11
README.md CHANGED
@@ -10,10 +10,84 @@ pinned: false
 license: mit
 ---
 
+## Google Sites Link
+
+https://sites.google.com/nyu.edu/sentiment-analysis-app/home
+
 ## Hugging Face Space Link
 
 https://huggingface.co/spaces/ac8736/sentiment-analysis-app
 
+## Model and Problem
+
+The problem we are tackling is classifying the sentiment of a given text. The goal was to evaluate the toxicity of a text and identify it as toxic, severely toxic, obscene, insult, threat, or identity hate. A DistilBERT model was fine-tuned on the training set from Kaggle's Toxic Tweets competition for multi-label classification over these labels.
+
+## Model Accuracy on a Test Set
+
+The model was evaluated on a held-out test set (20% of the original train.csv file), reaching an accuracy of 93.282%.
+
+```python
+train_texts, test_texts, train_labels, test_labels = train_test_split(train_texts, train_labels, test_size=.2)
+
+predictions = []
+for text in test_texts:
+    batch = tokenizer(text, truncation=True, padding='max_length', return_tensors="pt").to(device)
+    with torch.no_grad():
+        outputs = classifier(**batch)
+        prediction = torch.sigmoid(outputs.logits)
+        prediction = (prediction > 0.5).float()
+        prediction = prediction.cpu().detach().numpy().tolist()[0]
+        predictions.append(prediction)
+
+print(accuracy_score(test_labels, predictions))
+```
+
+## Expected Output
+
+When using a pretrained model from Hugging Face, the expected output is shown below. The label values vary from model to model, but models accessed through the pipeline API generally follow this format.
+
+```json
+{
+    "label": "POS",
+    "score": 0.8624
+}
+```
+
+When using the fine-tuned model, the output is the following: six items are returned, each an object with a label and its corresponding probability score.
+
+```json
+[
+    {
+        "label": "toxic",
+        "score": 0.01677067019045353
+    },
+    {
+        "label": "obscene",
+        "score": 0.001478900434449315
+    },
+    {
+        "label": "insult",
+        "score": 0.0005515297525562346
+    },
+    {
+        "label": "threat",
+        "score": 0.0002597073616925627
+    },
+    {
+        "label": "identity hate",
+        "score": 0.00010280739661538973
+    },
+    {
+        "label": "severely toxic",
+        "score": 0.000017059319361578673
+    }
+]
+```
+
+## Video Demonstrating the App
+
+https://user-images.githubusercontent.com/87680132/235007119-a69ea9de-5331-4878-9ba4-e8fad9b0091b.mp4
+
 ## Instructions on Installing Docker on Mac
 
 1. Go to the Docker Desktop install page and select the appropriate chip for your Mac device. If you are on Windows, there is a different set of instructions to follow.
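
For reference, a minimal sketch of how the multi-label output documented above can be reproduced with the transformers pipeline API. This sketch is not part of the commit, and it assumes a recent transformers release where `top_k=None` makes the text-classification pipeline return a score for every label:

```python
# Hypothetical usage sketch, not code from this commit.
from transformers import pipeline

# top_k=None asks the pipeline for all label scores instead of only the best one
# (assumption: a transformers version that supports this argument).
classifier = pipeline(
    "text-classification",
    model="ac8736/toxic-tweets-fine-tuned-distilbert",
    top_k=None,
)

# Returns one list of {"label": ..., "score": ...} dicts per input text,
# matching the six-item format shown in the README.
print(classifier("You are a nice person!"))
```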
app.py CHANGED
@@ -3,31 +3,38 @@ from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassifica
 import pandas as pd
 import torch
 
+# function to map predictions to labels
 def map_label(prediction):
-    labels = ["toxic", "severe toxic", "obscene", "threat", "insult", "identity hate"]
+    labels = ["toxic", "severe toxic", "obscene", "threat", "insult", "identity hate"] # the labels for the toxic tweets dataset
     output = []
-    for predict, labels in (zip(prediction, labels)):
+    for predict, labels in (zip(prediction, labels)): # zip the predictions and labels together and loop through
         output.append({'label': labels, 'score': predict})
     return output
 
+# sort labels by score in descending order
 def score(item):
     return item['score']
 
+# Streamlit app that allows users to input text through a text area
+# and select a model from a dropdown menu
+# the app then outputs the labels
 st.title("Sentiment Analysis App")
-
 text = st.text_area("Input text to get sentiment.", "You are a nice person!")
-
 model = st.selectbox(
     'Select the model you want to use below.',
-    ("ac8736/toxic-tweets-fine-tuned-distilbert", "distilbert-base-uncased-finetuned-sst-2-english", "cardiffnlp/twitter-roberta-base-sentiment", "finiteautomata/bertweet-base-sentiment-analysis", "ProsusAI/finbert"))
-
+    ("ac8736/toxic-tweets-fine-tuned-distilbert",
+     "distilbert-base-uncased-finetuned-sst-2-english",
+     "cardiffnlp/twitter-roberta-base-sentiment",
+     "finiteautomata/bertweet-base-sentiment-analysis", "ProsusAI/finbert"))
 st.write('You selected:', model)
 
+# button to get the sentiment
 if st.button("Get Sentiment"):
-    if model != "ac8736/toxic-tweets-fine-tuned-distilbert":
+    if model != "ac8736/toxic-tweets-fine-tuned-distilbert": # if the model is not the toxic tweets model
+        # load model using pipeline and get prediction
         classifier = pipeline(model=model)
         prediction = classifier(text)[0]["label"]
-        if model == "distilbert-base-uncased-finetuned-sst-2-english":
+        if model == "distilbert-base-uncased-finetuned-sst-2-english": # if statements to map the prediction to the correct sentiment
            sentiment = prediction
            st.write(f"The sentiment is {sentiment}.")
         elif model == "cardiffnlp/twitter-roberta-base-sentiment":
@@ -39,16 +46,19 @@
     elif model == "ProsusAI/finbert":
         sentiment = prediction.upper()
         st.write(f"The sentiment is {sentiment}.")
-    else:
+    else:
+        # load model using AutoModelForSequenceClassification and get prediction
+        # map the prediction and display the results in a table
         classifier = AutoModelForSequenceClassification.from_pretrained(model)
         tokenizer = AutoTokenizer.from_pretrained(model)
         text_token = tokenizer(text, return_tensors="pt")
         output = classifier(**text_token)
-        prediction = torch.sigmoid(output.logits)*100
-        prediction = prediction.detach().numpy().tolist()[0]
-        labels = map_label(prediction)
-        labels.sort(key=score, reverse=True)
+        prediction = torch.sigmoid(output.logits)*100 # convert logits to percentages
+        prediction = prediction.detach().numpy().tolist()[0] # convert prediction to a list
+        labels = map_label(prediction) # map the labels
+        labels.sort(key=score, reverse=True) # sort the labels by score in descending order
+
         df = pd.DataFrame([(text, labels[0]['label'], f"{round(labels[0]['score'], 3)}%", labels[1]['label'], f"{round(labels[1]['score'], 3)}%")], columns=('tweet/text','label 1', 'score 1', 'label 2', 'score 2'))
-        st.table(df)
+        st.table(df) # display the results in a table
         st.write("Visit https://huggingface.co/ac8736/toxic-tweets-fine-tuned-distilbert for more information about the model and to view all outputs.")
 
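The sigmoid-and-sort logic in app.py above is the core of the multi-label path. Below is a minimal standalone sketch of that path (not part of the commit; the example logits are invented), showing how `torch.sigmoid` turns each logit into an independent probability that the app then scales to a percentage:

```python
# Standalone sketch of app.py's scoring path; the logits below are invented.
import torch

LABELS = ["toxic", "severe toxic", "obscene", "threat", "insult", "identity hate"]

def map_label(prediction):
    # pair each score with its label, one dict per label
    return [{"label": label, "score": score} for label, score in zip(LABELS, prediction)]

logits = torch.tensor([[2.1, -4.0, -1.5, -5.0, -2.2, -6.0]])  # example model output
scores = (torch.sigmoid(logits) * 100).squeeze(0).tolist()    # independent % per label

labels = sorted(map_label(scores), key=lambda item: item["score"], reverse=True)
print(labels[0], labels[1])  # the two highest-scoring labels shown in the app's table
```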
fine-tune-toxic-tweets.ipynb CHANGED
@@ -25,7 +25,6 @@
  "outputs": [],
  "source": [
   "# importing necessary libraries\n",
-  "\n",
   "import torch\n",
   "from torch.utils.data import Dataset\n",
   "from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification\n",
@@ -239,10 +238,10 @@
  }
  ],
  "source": [
-  "# reading in the data and preprocessing the data to create appropriate training data\n",
-  "\n",
+  "# define the model name\n",
   "model_name = \"distilbert-base-uncased\"\n",
   "\n",
+  "# reading in the data and splitting into features and labels\n",
   "df = pd.read_csv(\"train.csv\")\n",
   "train_texts = df[\"comment_text\"].values\n",
   "train_labels = df[df.columns[2:]].values\n",
@@ -259,7 +258,6 @@
  "outputs": [],
  "source": [
   "# splitting up the data into training and validation sets\n",
-  "\n",
   "train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)"
  ]
 },
@@ -322,10 +320,10 @@
  },
  "outputs": [],
  "source": [
-  "# creating a custom dataset for training\n",
-  "\n",
+  "# getting the tokenizer\n",
   "tokenizer = DistilBertTokenizerFast.from_pretrained(model_name, max_length=1024)\n",
   "\n",
+  "# creating a custom dataset for training\n",
   "class ToxicDataset(Dataset):\n",
   "    def __init__(self, texts, labels):\n",
   "        self.texts = texts\n",
@@ -370,13 +368,14 @@
  "source": [
   "# creating a dataloader for training and custom dataset\n",
   "# device is set in order to use GPU for training, adjust code accordingly if GPU is not available\n",
-  "\n",
   "device = torch.device('cuda')\n",
   "\n",
+  "# download model and prepare it for training\n",
   "model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=6, problem_type=\"multi_label_classification\")\n",
   "model.to(device)\n",
   "model.train()\n",
   "\n",
+  "# defining the dataset and dataloader\n",
   "train_dataset = ToxicDataset(train_texts, train_labels)\n",
   "train_dataloader = DataLoader(train_dataset, batch_size=16)"
  ]
@@ -394,7 +393,6 @@
  "outputs": [],
  "source": [
   "# getting the optimizer and setting the number of epochs\n",
-  "\n",
   "optim = AdamW(model.parameters(), lr=5e-5)\n",
   "num_train_epochs = 1"
  ]
@@ -408,7 +406,6 @@
  "outputs": [],
  "source": [
   "# training the model\n",
-  "\n",
   "for epoch in range(num_train_epochs):\n",
   "    for batch in train_dataloader:\n",
   "        optim.zero_grad()\n",
@@ -431,6 +428,7 @@
  },
  "outputs": [],
  "source": [
+  "# setting the model to evaluation mode\n",
   "model.eval()"
  ]
 },
@@ -456,7 +454,6 @@
  ],
  "source": [
   "# testing a prediction on a single example from the training set\n",
-  "\n",
   "X_train = [\"COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK\"]\n",
   "batch = tokenizer(X_train, truncation=True, padding='max_length', return_tensors=\"pt\").to(device)\n",
   "\n",
@@ -475,7 +472,6 @@
  "outputs": [],
  "source": [
   "# saving the model and its tokenizer\n",
-  "\n",
   "model.save_pretrained(\"pretrained_model\")\n",
   "tokenizer.save_pretrained(\"model_tokenizer\")"
  ]
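
For reference, a minimal sketch of one training step under the setup the notebook uses; this is not the notebook's exact code, and the texts and labels are invented. With `problem_type="multi_label_classification"` and float labels, the Hugging Face model computes `BCEWithLogitsLoss` internally, which is what keeps the six labels independent:

```python
# Sketch of a single multi-label training step; inputs below are invented.
import torch
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=6,
    problem_type="multi_label_classification",
)

texts = ["an example comment"]                     # invented example input
labels = torch.tensor([[0., 0., 0., 0., 0., 0.]])  # one float per toxicity label

batch = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
outputs = model(**batch, labels=labels)  # loss is BCEWithLogitsLoss over 6 labels
outputs.loss.backward()
print(outputs.loss.item())
```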