Kewl committed on
Commit dead201
1 Parent(s): af65063

Milestone 4 (#15)

* adding documentation and comments

* adding google sites link

* Update README.md

* adding doc accuracy on test

Files changed (3)
  1. README.md +74 -0
  2. app.py +24 -14
  3. fine-tune-toxic-tweets.ipynb +7 -11
README.md CHANGED
@@ -10,10 +10,84 @@ pinned: false
 license: mit
 ---
 
+## Google Sites Link
+
+https://sites.google.com/nyu.edu/sentiment-analysis-app/home
+
 ## Hugging Face Space Link
 
 https://huggingface.co/spaces/ac8736/sentiment-analysis-app
 
+## Model and Problem
+
+The problem we are tackling is classifying the sentiment of a given text. The goal was to evaluate the toxicity of a text and identify it as toxic, severely toxic, obscene, insult, threat, or identity hate. A DistilBERT model was fine-tuned on the training set from Kaggle's Toxic Tweets competition for multi-label classification over these labels.
+
+## Model Accuracy on a Test Set
+
+The model was evaluated on a held-out test set (20% of the original train.csv file), reaching an accuracy of 93.282%.
+
+```python
+train_texts, test_texts, train_labels, test_labels = train_test_split(train_texts, train_labels, test_size=.2)
+
+predictions = []
+for text in test_texts:
+    batch = tokenizer(text, truncation=True, padding='max_length', return_tensors="pt").to(device)
+    with torch.no_grad():
+        outputs = classifier(**batch)
+        prediction = torch.sigmoid(outputs.logits)
+        prediction = (prediction > 0.5).float()
+        prediction = prediction.cpu().detach().numpy().tolist()[0]
+        predictions.append(prediction)
+
+print(accuracy_score(test_labels, predictions))
+```
+
+## Expected Output
+
+When using a pretrained model from Hugging Face, the expected output is shown below. The label values vary from model to model, but models accessed through the pipeline API generally follow this format.
+
+```json
+{
+    "label": "POS",
+    "score": 0.8624
+}
+```
+
+When using the fine-tuned model, the output is the following: six items are returned, each an object with a label and its corresponding probability score.
+
+```json
+[
+    {
+        "label": "toxic",
+        "score": 0.01677067019045353
+    },
+    {
+        "label": "obscene",
+        "score": 0.001478900434449315
+    },
+    {
+        "label": "insult",
+        "score": 0.0005515297525562346
+    },
+    {
+        "label": "threat",
+        "score": 0.0002597073616925627
+    },
+    {
+        "label": "identity hate",
+        "score": 0.00010280739661538973
+    },
+    {
+        "label": "severely toxic",
+        "score": 0.000017059319361578673
+    }
+]
+```
+
+## Video Demonstrating the App
+
+https://user-images.githubusercontent.com/87680132/235007119-a69ea9de-5331-4878-9ba4-e8fad9b0091b.mp4
+
 ## Instructions on Installing Docker on Mac
 
 1. Go to the Docker Desktop install page and select the appropriate chip for your Mac device. If you are on Windows, there is a different set of instructions to follow.
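
For reference, a minimal sketch of how the multi-label output documented above can be reproduced with the transformers pipeline API. This sketch is not part of the commit, and it assumes a recent transformers release where `top_k=None` makes the text-classification pipeline return a score for every label:

```python
# Hypothetical usage sketch, not code from this commit.
from transformers import pipeline

# top_k=None asks the pipeline for all label scores instead of only the best one
# (assumption: a transformers version that supports this argument).
classifier = pipeline(
    "text-classification",
    model="ac8736/toxic-tweets-fine-tuned-distilbert",
    top_k=None,
)

# Returns one list of {"label": ..., "score": ...} dicts per input text,
# matching the six-item format shown in the README.
print(classifier("You are a nice person!"))
```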
app.py CHANGED
@@ -3,31 +3,38 @@ from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassifica
 import pandas as pd
 import torch
 
+# function to map predictions to labels
 def map_label(prediction):
-    labels = ["toxic", "severe toxic", "obscene", "threat", "insult", "identity hate"]
+    labels = ["toxic", "severe toxic", "obscene", "threat", "insult", "identity hate"] # the labels for the toxic tweets dataset
     output = []
-    for predict, labels in (zip(prediction, labels)):
+    for predict, labels in (zip(prediction, labels)): # zip the predictions and labels together and loop through
         output.append({'label': labels, 'score': predict})
     return output
 
+# sort labels by score in descending order
 def score(item):
     return item['score']
 
+# Streamlit app that allows users to input text through a text area
+# and select a model from a dropdown menu
+# the app then outputs the labels
 st.title("Sentiment Analysis App")
-
 text = st.text_area("Input text to get sentiment.", "You are a nice person!")
-
 model = st.selectbox(
     'Select the model you want to use below.',
-    ("ac8736/toxic-tweets-fine-tuned-distilbert", "distilbert-base-uncased-finetuned-sst-2-english", "cardiffnlp/twitter-roberta-base-sentiment", "finiteautomata/bertweet-base-sentiment-analysis", "ProsusAI/finbert"))
-
+    ("ac8736/toxic-tweets-fine-tuned-distilbert",
+     "distilbert-base-uncased-finetuned-sst-2-english",
+     "cardiffnlp/twitter-roberta-base-sentiment",
+     "finiteautomata/bertweet-base-sentiment-analysis", "ProsusAI/finbert"))
 st.write('You selected:', model)
 
+# button to get the sentiment
 if st.button("Get Sentiment"):
-    if model != "ac8736/toxic-tweets-fine-tuned-distilbert":
+    if model != "ac8736/toxic-tweets-fine-tuned-distilbert": # if the model is not the toxic tweets model
+        # load model using pipeline and get prediction
         classifier = pipeline(model=model)
         prediction = classifier(text)[0]["label"]
-        if model == "distilbert-base-uncased-finetuned-sst-2-english":
+        if model == "distilbert-base-uncased-finetuned-sst-2-english": # if statements to map the prediction to the correct sentiment
            sentiment = prediction
            st.write(f"The sentiment is {sentiment}.")
         elif model == "cardiffnlp/twitter-roberta-base-sentiment":
@@ -39,16 +46,19 @@
     elif model == "ProsusAI/finbert":
         sentiment = prediction.upper()
         st.write(f"The sentiment is {sentiment}.")
-    else:
+    else:
+        # load model using AutoModelForSequenceClassification and get prediction
+        # map the prediction and display the results in a table
         classifier = AutoModelForSequenceClassification.from_pretrained(model)
         tokenizer = AutoTokenizer.from_pretrained(model)
         text_token = tokenizer(text, return_tensors="pt")
         output = classifier(**text_token)
-        prediction = torch.sigmoid(output.logits)*100
-        prediction = prediction.detach().numpy().tolist()[0]
-        labels = map_label(prediction)
-        labels.sort(key=score, reverse=True)
+        prediction = torch.sigmoid(output.logits)*100 # convert logits to percentages
+        prediction = prediction.detach().numpy().tolist()[0] # convert prediction to a list
+        labels = map_label(prediction) # map the labels
+        labels.sort(key=score, reverse=True) # sort the labels by score in descending order
+
         df = pd.DataFrame([(text, labels[0]['label'], f"{round(labels[0]['score'], 3)}%", labels[1]['label'], f"{round(labels[1]['score'], 3)}%")], columns=('tweet/text','label 1', 'score 1', 'label 2', 'score 2'))
-        st.table(df)
+        st.table(df) # display the results in a table
         st.write("Visit https://huggingface.co/ac8736/toxic-tweets-fine-tuned-distilbert for more information about the model and to view all outputs.")
 
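The sigmoid-and-sort logic in app.py above is the core of the multi-label path. Below is a minimal standalone sketch of that path (not part of the commit; the example logits are invented), showing how `torch.sigmoid` turns each logit into an independent probability that the app then scales to a percentage:

```python
# Standalone sketch of app.py's scoring path; the logits below are invented.
import torch

LABELS = ["toxic", "severe toxic", "obscene", "threat", "insult", "identity hate"]

def map_label(prediction):
    # pair each score with its label, one dict per label
    return [{"label": label, "score": score} for label, score in zip(LABELS, prediction)]

logits = torch.tensor([[2.1, -4.0, -1.5, -5.0, -2.2, -6.0]])  # example model output
scores = (torch.sigmoid(logits) * 100).squeeze(0).tolist()    # independent % per label

labels = sorted(map_label(scores), key=lambda item: item["score"], reverse=True)
print(labels[0], labels[1])  # the two highest-scoring labels shown in the app's table
```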
fine-tune-toxic-tweets.ipynb CHANGED
@@ -25,7 +25,6 @@
  "outputs": [],
  "source": [
   "# importing necessary libraries\n",
-  "\n",
   "import torch\n",
   "from torch.utils.data import Dataset\n",
   "from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification\n",
@@ -239,10 +238,10 @@
  }
  ],
  "source": [
-  "# reading in the data and preprocessing the data to create appropriate training data\n",
-  "\n",
+  "# define the model name\n",
   "model_name = \"distilbert-base-uncased\"\n",
   "\n",
+  "# reading in the data and splitting into features and labels\n",
   "df = pd.read_csv(\"train.csv\")\n",
   "train_texts = df[\"comment_text\"].values\n",
   "train_labels = df[df.columns[2:]].values\n",
@@ -259,7 +258,6 @@
  "outputs": [],
  "source": [
   "# splitting up the data into training and validation sets\n",
-  "\n",
   "train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)"
  ]
 },
@@ -322,10 +320,10 @@
  },
  "outputs": [],
  "source": [
-  "# creating a custom dataset for training\n",
-  "\n",
+  "# getting the tokenizer\n",
   "tokenizer = DistilBertTokenizerFast.from_pretrained(model_name, max_length=1024)\n",
   "\n",
+  "# creating a custom dataset for training\n",
   "class ToxicDataset(Dataset):\n",
   "    def __init__(self, texts, labels):\n",
   "        self.texts = texts\n",
@@ -370,13 +368,14 @@
  "source": [
   "# creating a dataloader for training and custom dataset\n",
   "# device is set in order to use GPU for training, adjust code accordingly if GPU is not available\n",
-  "\n",
   "device = torch.device('cuda')\n",
   "\n",
+  "# download model and prepare it for training\n",
   "model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=6, problem_type=\"multi_label_classification\")\n",
   "model.to(device)\n",
   "model.train()\n",
   "\n",
+  "# defining the dataset and dataloader\n",
   "train_dataset = ToxicDataset(train_texts, train_labels)\n",
   "train_dataloader = DataLoader(train_dataset, batch_size=16)"
  ]
@@ -394,7 +393,6 @@
  "outputs": [],
  "source": [
   "# getting the optimizer and setting the number of epochs\n",
-  "\n",
   "optim = AdamW(model.parameters(), lr=5e-5)\n",
   "num_train_epochs = 1"
  ]
@@ -408,7 +406,6 @@
  "outputs": [],
  "source": [
   "# training the model\n",
-  "\n",
   "for epoch in range(num_train_epochs):\n",
   "    for batch in train_dataloader:\n",
   "        optim.zero_grad()\n",
@@ -431,6 +428,7 @@
  },
  "outputs": [],
  "source": [
+  "# setting the model to evaluation mode\n",
   "model.eval()"
  ]
 },
@@ -456,7 +454,6 @@
  ],
  "source": [
   "# testing a prediction on a single example from the training set\n",
-  "\n",
   "X_train = [\"COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK\"]\n",
   "batch = tokenizer(X_train, truncation=True, padding='max_length', return_tensors=\"pt\").to(device)\n",
   "\n",
@@ -475,7 +472,6 @@
  "outputs": [],
  "source": [
   "# saving the model and its tokenizer\n",
-  "\n",
   "model.save_pretrained(\"pretrained_model\")\n",
   "tokenizer.save_pretrained(\"model_tokenizer\")"
  ]
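
For reference, a minimal sketch of one training step under the setup the notebook uses; this is not the notebook's exact code, and the texts and labels are invented. With `problem_type="multi_label_classification"` and float labels, the Hugging Face model computes `BCEWithLogitsLoss` internally, which is what keeps the six labels independent:

```python
# Sketch of a single multi-label training step; inputs below are invented.
import torch
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=6,
    problem_type="multi_label_classification",
)

texts = ["an example comment"]                     # invented example input
labels = torch.tensor([[0., 0., 0., 0., 0., 0.]])  # one float per toxicity label

batch = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
outputs = model(**batch, labels=labels)  # loss is BCEWithLogitsLoss over 6 labels
outputs.loss.backward()
print(outputs.loss.item())
```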