Narrativa committed
Commit 7308ae6 • 1 Parent(s): fddd37d

Create README.md

Files changed (1): README.md (+96, -0)

---
language:
- en
- es
datasets:
- opus100
- opusbook
tags:
- translation

---

# mBART-large-50 fine-tuned on opus100 and opusbooks for English to Portuguese translation

[mBART-50](https://huggingface.co/facebook/mbart-large-50/) large fine-tuned on the [opus100](https://huggingface.co/datasets/viewer/?dataset=opus100) and [opusbooks](https://huggingface.co/datasets/viewer/?dataset=opusbooks) datasets for the **NMT** (neural machine translation) downstream task.
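
For reference, a minimal usage sketch for English to Portuguese translation with an mBART-50 checkpoint is shown below. The checkpoint identifier is a placeholder (the base `facebook/mbart-large-50-many-to-many-mmt` model); substitute this fine-tuned model's identifier. `en_XX` and `pt_XX` are the mBART-50 language codes assumed for the direction stated in the title.

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Placeholder checkpoint: replace with the identifier of this fine-tuned model
ckpt = 'facebook/mbart-large-50-many-to-many-mmt'

tokenizer = MBart50TokenizerFast.from_pretrained(ckpt, src_lang='en_XX')
model = MBartForConditionalGeneration.from_pretrained(ckpt)

inputs = tokenizer('The book is on the table.', return_tensors='pt')
# Force the decoder to start with the Portuguese language id token
generated = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id['pt_XX'])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```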

# Details of mBART-50 🧠

mBART-50 is a multilingual sequence-to-sequence model pre-trained using the "Multilingual Denoising Pretraining" objective. It was introduced in the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401).

mBART-50 was introduced to show that multilingual translation models can be created through multilingual fine-tuning: instead of fine-tuning in one direction, a pre-trained model is fine-tuned on many directions simultaneously. mBART-50 extends the original mBART model with 25 additional languages, supporting multilingual machine translation across 50 languages. The pre-training objective is explained below.
**Multilingual Denoising Pretraining**: The model incorporates N languages by concatenating data: `D = {D1, ..., DN}`, where each `Di` is a collection of monolingual documents in language `i`. The source documents are noised using two schemes: first, randomly shuffling the order of the original sentences, and second, a novel in-filling scheme in which spans of text are replaced with a single mask token. The model is then tasked with reconstructing the original text. 35% of each instance's words are masked by randomly sampling span lengths according to a Poisson distribution `(λ = 3.5)`. The decoder input is the original text offset by one position. A language id symbol `LID` is used as the initial token to predict the sentence.

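As an illustration only (not the original implementation), here is a toy sketch of the two noising schemes described above: sentence-order shuffling plus span in-filling with a single mask token, with span lengths drawn from a Poisson distribution (λ = 3.5) until roughly 35% of the words are masked.

```python
import random
import numpy as np

MASK = '<mask>'

def noise(sentences, mask_ratio=0.35, poisson_lambda=3.5):
    """Toy approximation of the Multilingual Denoising Pretraining noise function."""
    # Scheme 1: randomly shuffle the order of the original sentences
    shuffled = random.sample(sentences, len(sentences))
    words = ' '.join(shuffled).split()
    # Scheme 2: replace spans of text with a single mask token, sampling
    # span lengths from Poisson(3.5) until ~35% of the words are masked
    budget = int(mask_ratio * len(words))
    while budget > 0:
        span = int(min(max(np.random.poisson(poisson_lambda), 1), budget))
        start = random.randrange(0, len(words) - span + 1)
        words[start:start + span] = [MASK]
        budget -= span
    return ' '.join(words)

# The model is trained to reconstruct the original text from this noised input
print(noise(['the cat sat on the mat .', 'it was a sunny day .', 'birds sang outside .']))
```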

## Details of the downstream task (Sequence Classification as Text Generation) - Dataset 📚

[tweets_hate_speech_detection](https://huggingface.co/datasets/tweets_hate_speech_detection)

The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to separate racist or sexist tweets from other tweets.

Formally, given a training sample of tweets and labels, where label '1' denotes that the tweet is racist/sexist and label '0' denotes that it is not, the objective is to predict the labels on the given test dataset.

- Data Instances:

The dataset contains a label denoting whether the tweet is hate speech or not.

```python
{'label': 0,  # not hate speech
 'tweet': ' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run'}
```
- Data Fields:

**label**: 1 - hate speech, 0 - not hate speech

**tweet**: content of the tweet as a string

- Data Splits:

The data consists of a single training split with **31962** entries.

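A minimal sketch (not part of the original card) for loading the dataset with the 🤗 `datasets` library and inspecting these fields:

```python
from datasets import load_dataset

# Load the single available split
dataset = load_dataset('tweets_hate_speech_detection', split='train')

print(dataset)           # number of rows and column names
print(dataset.features)  # types of the 'label' and 'tweet' fields
print(dataset[0])        # one example, in the format shown above
```
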
## Test set metrics 🧾

We created a representative test set from 5% of the entries.

The dataset is highly imbalanced, and we obtained an **F1 score of 79.8**.
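
The exact evaluation script is not included in the card; the sketch below shows one way such a setup could be reproduced: hold out 5% of the data as a test set and score predictions with the F1 metric. The split seed and the `evaluate`/`predict_fn` names are assumptions, not the authors' code; stratifying by label would make the split representative, but a plain random split is shown here.

```python
from datasets import load_dataset
from sklearn.metrics import f1_score

# Hold out 5% of the training data as a test set (seed chosen arbitrarily)
dataset = load_dataset('tweets_hate_speech_detection', split='train')
splits = dataset.train_test_split(test_size=0.05, seed=42)
train_ds, test_ds = splits['train'], splits['test']

def evaluate(predict_fn):
    # `predict_fn` is a hypothetical callable mapping a tweet string to a 0/1 label
    preds = [predict_fn(example['tweet']) for example in test_ds]
    return f1_score(test_ds['label'], preds)
```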

## Model in Action 🚀

```sh
git clone https://github.com/huggingface/transformers.git
pip install -q ./transformers
```

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

ckpt = 'Narrativa/byt5-base-tweet-hate-detection'

# Use a GPU if one is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = T5ForConditionalGeneration.from_pretrained(ckpt).to(device)

def classify_tweet(tweet):
    # Tokenize the tweet and move the tensors to the model's device
    inputs = tokenizer([tweet], padding='max_length', truncation=True, max_length=512, return_tensors='pt')
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)
    # The fine-tuned model generates the predicted label as text
    output = model.generate(input_ids, attention_mask=attention_mask)
    return tokenizer.decode(output[0], skip_special_tokens=True)

classify_tweet('here goes your tweet...')
```

Created by: [Narrativa](https://www.narrativa.com/)

About Narrativa: Natural Language Generation (NLG) | Gabriele, our machine learning-based platform, builds and deploys natural language solutions. #NLG #AI