Narrativa committed on
Commit
da2f568
•
1 Parent(s): 4aeed0a

Create README.md

  1. README.md +84 -0
README.md ADDED
@@ -0,0 +1,84 @@
+ ---
+ language: en
+ datasets:
+ - tweets_hate_speech_detection
+ ---
+
+
+ # ByT5-base fine-tuned for Hate Speech Detection (on Tweets)
+ [ByT5](https://huggingface.co/google/byt5-base) base fine-tuned on the [tweets hate speech detection](https://huggingface.co/datasets/tweets_hate_speech_detection) dataset for the **Sequence Classification** downstream task.
+
+ # Details of ByT5 - Base
+
+ ByT5 is a tokenizer-free version of [Google's T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) and generally follows the architecture of [MT5](https://huggingface.co/google/mt5-base).
+ ByT5 was pre-trained only on [mC4](https://www.tensorflow.org/datasets/catalog/c4#c4multilingual), without any supervised training, using an average span mask of 20 UTF-8 characters. Therefore, the base model has to be fine-tuned before it is usable on a downstream task.
+ ByT5 works especially well on noisy text data, *e.g.*, `google/byt5-base` significantly outperforms [mt5-base](https://huggingface.co/google/mt5-base) on [TweetQA](https://arxiv.org/abs/1907.06292).
+ Paper: [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/pdf/2105.13626.pdf)
+ Authors: *Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel*
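To make "tokenizer-free" concrete: ByT5 consumes the raw UTF-8 bytes of the text instead of subword tokens. The sketch below illustrates that mapping, assuming the convention used by the Hugging Face `ByT5Tokenizer`, where ids 0, 1 and 2 are reserved for padding, end-of-sequence and unknown, so every byte value is shifted by 3. The helper names `encode_bytes` and `decode_bytes` are hypothetical and this is not the library's implementation.

```python
# Illustrative sketch of ByT5-style byte-level encoding (not the library code).
# Assumption: ids 0, 1, 2 are reserved for <pad>, </s>, <unk>, so byte b -> id b + 3.
SPECIAL_OFFSET = 3
EOS_ID = 1


def encode_bytes(text):
    """Map text to ids: one id per UTF-8 byte, followed by end-of-sequence."""
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def decode_bytes(ids):
    """Invert encode_bytes, skipping reserved ids and anything outside the byte range."""
    raw = bytes(
        i - SPECIAL_OFFSET
        for i in ids
        if SPECIAL_OFFSET <= i < 256 + SPECIAL_OFFSET
    )
    return raw.decode("utf-8", errors="ignore")


ids = encode_bytes("héllo #run")
print(ids)
print(decode_bytes(ids))
```

Since any hashtag, emoji, or misspelling still maps to a sequence of bytes, there is no out-of-vocabulary case, which is likely one reason byte-level models cope well with noisy tweets.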
+
+
+ ## Details of the downstream task (Sequence Classification as Text generation) - Dataset 📚
+
+ [tweets_hate_speech_detection](https://huggingface.co/datasets/tweets_hate_speech_detection)
+
+
+ The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to distinguish racist or sexist tweets from other tweets.
+
+ Formally, given a training sample of tweets and labels, where label ‘1’ denotes that the tweet is racist/sexist and label ‘0’ denotes that it is not, the objective is to predict the labels on the given test dataset.
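Treating classification as text generation means each integer label is serialized to a target string for training, and the generated string is parsed back into a label at inference time. The card does not state which target strings were used, so the values in `LABEL_TO_TEXT` below are hypothetical placeholders, as are the helper names. This is a minimal sketch of the idea, not the authors' preprocessing.

```python
# Hypothetical label strings: the model card does not say which target
# texts were used during fine-tuning, so these are placeholders.
LABEL_TO_TEXT = {0: "no-hate-speech", 1: "hate-speech"}
TEXT_TO_LABEL = {text: label for label, text in LABEL_TO_TEXT.items()}


def to_seq2seq_example(tweet, label):
    """Turn a (tweet, label) pair into an (input text, target text) pair."""
    return tweet.strip(), LABEL_TO_TEXT[label]


def parse_prediction(generated_text):
    """Map generated text back to a label; None if it matches no known label."""
    return TEXT_TO_LABEL.get(generated_text.strip().lower())


source, target = to_seq2seq_example(" @user example tweet #run", 0)
print(source, "->", target)
print(parse_prediction("hate-speech"))
```

Returning `None` for unrecognized output makes it explicit that a generative model is not guaranteed to emit one of the expected label strings.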
+
+ - Data Instances:
+
+ Each instance contains a tweet and a label denoting whether the tweet is hate speech
+
+ ```python
+ {'label': 0,  # not hate speech
+  'tweet': ' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run'}
+ ```
+ - Data Fields:
+
+ **label**: 1 - hate speech, 0 - not hate speech
+
+ **tweet**: content of the tweet as a string
+
+ - Data Splits:
+
+ The training split contains **31,962** entries
+
+ ## Test set metrics 🧾
+
+ We held out 5% of the entries as a representative test set.
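The card does not describe how the 5% was drawn. One common way to keep a held-out set representative on imbalanced data is to sample each class separately (stratified sampling). The sketch below assumes that approach; the function name, the seed, and the toy data are all illustrative.

```python
import random
from collections import defaultdict


def stratified_split(examples, test_fraction=0.05, seed=42):
    """Split (tweet, label) pairs so each label keeps roughly its original share."""
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy data with a minority positive class (illustrative, not the real dataset)
data = [(f"tweet {i}", 1 if i % 100 < 7 else 0) for i in range(2000)]
train, test = stratified_split(data)
print(len(train), len(test))
```

Sampling per class keeps the minority class present in the test set in the same proportion as in the full data, which a plain random 5% sample only does approximately.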
+
+ The dataset is highly imbalanced; on this test set the model achieved an **F1 score of 79.8**
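F1 is a more informative metric than accuracy on imbalanced data, because a model that never predicts hate speech can still score a high accuracy. The card does not say which F1 variant was reported; the sketch below assumes F1 on the positive (hate speech) class and uses toy labels, not the actual test set.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels: 10 positives among 100 examples
y_true = [1] * 10 + [0] * 90
always_negative = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
print(accuracy)  # high, although no hate speech is found
print(precision_recall_f1(y_true, always_negative))
```

In this toy case the always-negative baseline looks strong by accuracy while its positive-class F1 is zero, which is the failure mode F1 is meant to expose.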
+
+
+
+ ## Model in Action 🚀
+
+ ```sh
+ git clone https://github.com/huggingface/transformers.git
+ pip install -q transformers
+ ```
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, T5ForConditionalGeneration
+
+ ckpt = "Narrativa/byt5-base-tweet-hate-detection"
+ # Use the GPU when one is available, otherwise fall back to the CPU
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ tokenizer = AutoTokenizer.from_pretrained(ckpt)
+ model = T5ForConditionalGeneration.from_pretrained(ckpt).to(device)
+
+ def classify_tweet(tweet):
+     # Encode the tweet, padding or truncating it to 512 positions
+     inputs = tokenizer([tweet], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
+     input_ids = inputs.input_ids.to(device)
+     attention_mask = inputs.attention_mask.to(device)
+
+     # The predicted label is generated as text
+     output = model.generate(input_ids, attention_mask=attention_mask)
+
+     return tokenizer.decode(output[0], skip_special_tokens=True)
+
+
+ classify_tweet('here goes your tweet...')
+ ```
+
+ > Created by [Narrativa](https://www.narrativa.com/)
+
+ > Made with <span style="color: #e25555;">&hearts;</span> in Spain