Milestone 4 (#15)
Kewl committed on

* adding documentation and comments
* adding google sites link
* Update README.md
* adding doc accuracy on test
- README.md +74 -0
- app.py +24 -14
- fine-tune-toxic-tweets.ipynb +7 -11
README.md
CHANGED
@@ -10,10 +10,84 @@ pinned: false
 license: mit
 ---
 
+## Google Sites Link
+
+https://sites.google.com/nyu.edu/sentiment-analysis-app/home
+
 ## Hugging Space Link
 
 https://huggingface.co/spaces/ac8736/sentiment-analysis-app
 
+## Model and Problem
+
+The problem we are tackling is classifying the sentiment of a given text: evaluating its toxicity and identifying it as toxic, severely toxic, obscene, insult, threat, or identity hate. DistilBERT was fine-tuned for multi-label classification on these labels using the training set from Kaggle's Toxic Tweets competition.
+
+## Model Accuracy on a Test Set
+
+Model was evaluated on a test set (20% from the original train.csv file) with an accuracy of 93.282%.
+
+```python
+train_texts, test_texts, train_labels, test_labels = train_test_split(train_texts, train_labels, test_size=.2)
+
+predictions = []
+for text in test_texts:
+    batch = tokenizer(text, truncation=True, padding='max_length', return_tensors="pt").to(device)
+    with torch.no_grad():
+        outputs = classifier(**batch)
+        prediction = torch.sigmoid(outputs.logits)
+        prediction = (prediction > 0.5).float()
+        prediction = prediction.cpu().detach().numpy().tolist()[0]
+        predictions.append(prediction)
+
+print(accuracy_score(test_labels, predictions))
+```
+
+## Expected Output
+
+When using a pretrained model from Hugging Face, the expected output is shown below. Depending on the model, the label values can differ, but models queried through the pipeline API generally follow this format.
+
+```json
+{
+  "label": "POS",
+  "score": 0.8624
+}
+```
+
+When using the fine-tuned model, the output is the following: six items are returned, each an object with a label and its corresponding probability score.
+
+```json
+[
+  {
+    "label": "toxic",
+    "score": 0.01677067019045353
+  },
+  {
+    "label": "obscene",
+    "score": 0.001478900434449315
+  },
+  {
+    "label": "insult",
+    "score": 0.0005515297525562346
+  },
+  {
+    "label": "threat",
+    "score": 0.0002597073616925627
+  },
+  {
+    "label": "identity hate",
+    "score": 0.00010280739661538973
+  },
+  {
+    "label": "severely toxic",
+    "score": 0.000017059319361578673
+  }
+]
+```
+
+## Video Demonstrating the App
+
+https://user-images.githubusercontent.com/87680132/235007119-a69ea9de-5331-4878-9ba4-e8fad9b0091b.mp4
+
 ## Instructions on Installing Docker on Mac
 
 1. Go to the Docker Desktop install page and select the appropriate chip for your Mac device. If you are on Windows, there is another set of instructions you have to follow.
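For reference, the JSON formats above can be reproduced with the `transformers` pipeline API. The snippet below is a minimal sketch rather than code from this commit: the model ids are the ones offered in app.py, and passing `top_k=None` to request all six scores from the fine-tuned checkpoint is an assumption about the pipeline call, not how app.py itself loads that model (app.py uses `AutoModelForSequenceClassification`).

```python
# Minimal sketch (not part of this commit): reproducing the outputs shown above.
from transformers import pipeline

# A generic pretrained sentiment model returns a single {label, score} entry per input.
sentiment = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
print(sentiment("You are a nice person!"))  # e.g. [{'label': 'POSITIVE', 'score': ...}]

# The fine-tuned toxicity model has six labels; top_k=None asks for all of them.
toxicity = pipeline("text-classification", model="ac8736/toxic-tweets-fine-tuned-distilbert")
print(toxicity("You are a nice person!", top_k=None))  # six {label, score} objects, as above
```

Scores from the fine-tuned checkpoint are independent sigmoid probabilities, one per label, so the six values are not expected to sum to 1.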
app.py
CHANGED
@@ -3,31 +3,38 @@ from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
 import pandas as pd
 import torch
 
+# function to map labels to predictions
 def map_label(prediction):
-    labels = ["toxic", "severe toxic", "obscene", "threat", "insult", "identity hate"]
+    labels = ["toxic", "severe toxic", "obscene", "threat", "insult", "identity hate"] # the labels for the toxic tweets dataset
     output = []
-    for predict, labels in (zip(prediction, labels)):
+    for predict, labels in (zip(prediction, labels)): # zip the prediction and labels together and loop through
         output.append({'label': labels, 'score': predict})
     return output
 
+# sort labels by score in descending order
 def score(item):
     return item['score']
 
+# streamlit app that allows users to input text through a text area
+# and select a model from a dropdown menu
+# the app then outputs the labels
 st.title("Sentiment Analysis App")
-
 text = st.text_area("Input text to get sentiment.", "You are a nice person!")
-
 model = st.selectbox(
     'Select the model you want to use below.',
-    ("ac8736/toxic-tweets-fine-tuned-distilbert",
+    ("ac8736/toxic-tweets-fine-tuned-distilbert",
+     "distilbert-base-uncased-finetuned-sst-2-english",
+     "cardiffnlp/twitter-roberta-base-sentiment",
+     "finiteautomata/bertweet-base-sentiment-analysis", "ProsusAI/finbert"))
 st.write('You selected:', model)
 
+# button to get the sentiment
 if st.button("Get Sentiment"):
-    if model != "ac8736/toxic-tweets-fine-tuned-distilbert":
+    if model != "ac8736/toxic-tweets-fine-tuned-distilbert": # if the model is not the toxic tweets model
+        # load model using pipeline and get prediction
        classifier = pipeline(model=model)
        prediction = classifier(text)[0]["label"]
-        if model == "distilbert-base-uncased-finetuned-sst-2-english":
+        if model == "distilbert-base-uncased-finetuned-sst-2-english": # if statements to map the prediction to the correct sentiment
            sentiment = prediction
            st.write(f"The sentiment is {sentiment}.")
        elif model == "cardiffnlp/twitter-roberta-base-sentiment":
@@ -39,16 +46,19 @@
        elif model == "ProsusAI/finbert":
            sentiment = prediction.upper()
            st.write(f"The sentiment is {sentiment}.")
    else:
+        # load model using AutoModelForSequenceClassification and get prediction
+        # map the prediction and display the results in a table
        classifier = AutoModelForSequenceClassification.from_pretrained(model)
        tokenizer = AutoTokenizer.from_pretrained(model)
        text_token = tokenizer(text, return_tensors="pt")
        output = classifier(**text_token)
-        prediction = torch.sigmoid(output.logits)*100
-        prediction = prediction.detach().numpy().tolist()[0]
-        labels = map_label(prediction)
-        labels.sort(key=score, reverse=True)
+        prediction = torch.sigmoid(output.logits)*100 # convert logits to a percentage
+        prediction = prediction.detach().numpy().tolist()[0] # convert prediction to a list
+        labels = map_label(prediction) # map the labels
+        labels.sort(key=score, reverse=True) # sort the labels by score in descending order
+
        df = pd.DataFrame([(text, labels[0]['label'], f"{round(labels[0]['score'], 3)}%", labels[1]['label'], f"{round(labels[1]['score'], 3)}%")], columns=('tweet/text','label 1', 'score 1', 'label 2', 'score 2'))
-        st.table(df)
+        st.table(df) # display the results in a table
        st.write("Visit https://huggingface.co/ac8736/toxic-tweets-fine-tuned-distilbert for more information about the model and to view all outputs.")
 
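As a quick sanity check on the logic above, the two helpers in app.py can be exercised on their own. The snippet below is a standalone sketch, adapted from app.py, with a made-up score vector; it is not part of the commit.

```python
# Standalone sketch: how map_label/score (adapted from app.py) turn a score vector
# into the two rows shown in the Streamlit table. The input vector is made up.
def map_label(prediction):
    labels = ["toxic", "severe toxic", "obscene", "threat", "insult", "identity hate"]
    return [{"label": label, "score": value} for value, label in zip(prediction, labels)]

def score(item):
    return item["score"]

prediction = [93.2, 1.7, 45.6, 0.3, 12.4, 0.1]  # hypothetical sigmoid outputs scaled to percentages

labels = map_label(prediction)
labels.sort(key=score, reverse=True)  # highest score first
print(labels[0], labels[1])           # the two label/score pairs displayed in the table
# {'label': 'toxic', 'score': 93.2} {'label': 'obscene', 'score': 45.6}
```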
fine-tune-toxic-tweets.ipynb
CHANGED
@@ -25,7 +25,6 @@
 "outputs": [],
 "source": [
 "# importing necessary libraries\n",
-"\n",
 "import torch \n",
 "from torch.utils.data import Dataset\n",
 "from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification\n",
@@ -239,10 +238,10 @@
 }
 ],
 "source": [
-"#
-"\n",
+"# define the model name\n",
 "model_name = \"distilbert-base-uncased\"\n",
 "\n",
+"# reading in the data and splitting into features and labels\n",
 "df = pd.read_csv(\"train.csv\")\n",
 "train_texts = df[\"comment_text\"].values\n",
 "train_labels = df[df.columns[2:]].values\n",
@@ -259,7 +258,6 @@
 "outputs": [],
 "source": [
 "# splitting up the data into training and validation sets\n",
-"\n",
 "train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)"
 ]
 },
@@ -322,10 +320,10 @@
 },
 "outputs": [],
 "source": [
-"#
-"\n",
+"# getting the tokenizer\n",
 "tokenizer = DistilBertTokenizerFast.from_pretrained(model_name, max_length=1024)\n",
 "\n",
+"# creating a custom dataset for training\n",
 "class ToxicDataset(Dataset):\n",
 "    def __init__(self, texts, labels):\n",
 "        self.texts = texts\n",
@@ -370,13 +368,14 @@
 "source": [
 "# creating a dataloader for training and custom dataset\n",
 "# device is set in order to use GPU for training, adjust code accordingly if GPU is not available\n",
-"\n",
 "device = torch.device('cuda')\n",
 "\n",
+"# download model and prepare it for training\n",
 "model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=6, problem_type=\"multi_label_classification\")\n",
 "model.to(device)\n",
 "model.train()\n",
 "\n",
+"# defining the dataset and dataloader\n",
 "train_dataset = ToxicDataset(train_texts, train_labels)\n",
 "train_dataloader = DataLoader(train_dataset, batch_size=16)"
 ]
@@ -394,7 +393,6 @@
 "outputs": [],
 "source": [
 "# getting the optimizer and setting the number of epochs\n",
-"\n",
 "optim = AdamW(model.parameters(), lr=5e-5)\n",
 "num_train_epochs = 1"
 ]
@@ -408,7 +406,6 @@
 "outputs": [],
 "source": [
 "# training the model\n",
-"\n",
 "for epoch in range(num_train_epochs):\n",
 "    for batch in train_dataloader:\n",
 "        optim.zero_grad()\n",
@@ -431,6 +428,7 @@
 },
 "outputs": [],
 "source": [
+"# setting the model to evaluation mode\n",
 "model.eval()"
 ]
 },
@@ -456,7 +454,6 @@
 ],
 "source": [
 "# testing a prediction on a single example from the training set\n",
-"\n",
 "X_train = [\"COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK\"]\n",
 "batch = tokenizer(X_train, truncation=True, padding='max_length', return_tensors=\"pt\").to(device)\n",
 "\n",
@@ -475,7 +472,6 @@
 "outputs": [],
 "source": [
 "# saving the model and its tokenizer\n",
-"\n",
 "model.save_pretrained(\"pretrained_model\")\n",
 "tokenizer.save_pretrained(\"model_tokenizer\")"
 ]
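The notebook diff only shows the opening lines of the `ToxicDataset` class and of the training loop. For orientation, here is a rough, self-contained sketch of how those pieces typically fit together for multi-label fine-tuning; the tokenization details, the toy data, and the use of `torch.optim.AdamW` are assumptions for illustration, not the notebook's exact code.

```python
# Hedged sketch of the multi-label fine-tuning pieces shown only partially in the diff.
import torch
from torch.optim import AdamW
from torch.utils.data import Dataset, DataLoader
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

class ToxicDataset(Dataset):
    """Wraps raw comment texts and their 6-dim binary label vectors."""
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        enc = tokenizer(self.texts[idx], truncation=True, padding="max_length", return_tensors="pt")
        item = {k: v.squeeze(0) for k, v in enc.items()}
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.float)  # floats for BCE loss
        return item

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=6, problem_type="multi_label_classification"
).to(device)
model.train()

optim = AdamW(model.parameters(), lr=5e-5)
loader = DataLoader(ToxicDataset(["example text"], [[0, 0, 0, 0, 0, 0]]), batch_size=16)

for batch in loader:
    optim.zero_grad()
    batch = {k: v.to(device) for k, v in batch.items()}
    loss = model(**batch).loss  # BCEWithLogitsLoss because of the multi_label problem_type
    loss.backward()
    optim.step()
```

Because the model is loaded with `problem_type="multi_label_classification"`, the labels must be float tensors and the returned loss is binary cross-entropy over the six labels, which matches the sigmoid-then-threshold evaluation shown in the README.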