Narrativa/mbart-large-50-finetuned-opus-en-pt-translation

+---
+language:
+- en
+- pt
+datasets:
+- opus100
+- opus_books
+tags:
+- translation
+---
+# mBART-large-50 fine-tuned on opus100 and opusbooks for English to Portuguese translation
+[mBART-50](https://huggingface.co/facebook/mbart-large-50/) large fine-tuned on the [opus100](https://huggingface.co/datasets/viewer/?dataset=opus100) and [opusbooks](https://huggingface.co/datasets/viewer/?dataset=opusbooks) datasets for the **NMT** (neural machine translation) downstream task.
+## Details of mBART-50 🧠
+mBART-50 is a multilingual sequence-to-sequence model pre-trained with the "Multilingual Denoising Pretraining" objective. It was introduced in the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) to show that multilingual translation models can be created through multilingual fine-tuning.
+Instead of fine-tuning on one direction, a pre-trained model is fine-tuned on many directions simultaneously. mBART-50 starts from the original mBART model and extends it with 25 additional languages, so that it supports multilingual machine translation across 50 languages. The pre-training objective is explained below.
+**Multilingual Denoising Pretraining**: The model incorporates N languages by concatenating data:
+`D = {D1, ..., DN}`, where each `Di` is a collection of monolingual documents in language `i`. The source documents are noised using two schemes:
+first, randomly shuffling the order of the original sentences, and second, a novel in-filling scheme
+in which spans of text are replaced with a single mask token. The model is then tasked with reconstructing the original text.
+35% of each instance's words are masked by randomly sampling span lengths from a Poisson distribution `(λ = 3.5)`.
+The decoder input is the original text offset by one position. A language id symbol `LID` is used as the initial token to predict the sentence.
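The two noising schemes described above can be illustrated with a small, self-contained sketch. It is not the mBART-50 training code: it works on whitespace-separated words rather than subword tokens, ignores zero-length spans, and the names `noise_document`, `sample_poisson`, and the `<mask>` placeholder are chosen here for illustration only.

```python
import math
import random

MASK = "<mask>"  # illustrative mask symbol, not taken from the real vocabulary


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def noise_document(sentences, mask_ratio=0.35, lam=3.5, seed=0):
    """Shuffle sentence order, then replace word spans with single mask tokens."""
    rng = random.Random(seed)

    # Scheme 1: randomly permute the order of the sentences.
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    tokens = " ".join(shuffled).split()

    # Scheme 2: text in-filling. Keep masking spans until about
    # mask_ratio of the words have been removed.
    budget = round(mask_ratio * len(tokens))
    masked = 0
    while masked < budget:
        length = min(sample_poisson(lam, rng), budget - masked)
        if length == 0:
            continue  # simplification: zero-length spans are skipped
        # Valid starts are positions whose span covers only unmasked words.
        starts = [i for i in range(len(tokens) - length + 1)
                  if MASK not in tokens[i:i + length]]
        if not starts:
            continue  # no room for a span this long; draw a new length
        start = rng.choice(starts)
        tokens[start:start + length] = [MASK]  # whole span -> one mask token
        masked += length
    return tokens


doc = [
    "the cat sat on the mat .",
    "it was a sunny day .",
    "birds sang in the old oak tree .",
]
print(" ".join(noise_document(doc, seed=1)))
```

Because each span collapses into a single mask token, the model has to predict how many words are missing as well as which ones, which is what distinguishes in-filling from per-token masking.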
+## Details of the downstream task (Sequence Classification as Text generation) - Dataset 📚
+[tweets_hate_speech_detection](https://huggingface.co/datasets/tweets_hate_speech_detection)
+The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to distinguish racist or sexist tweets from other tweets.
+Formally, given a training sample of tweets and labels, where label ‘1’ denotes that the tweet is racist/sexist and label ‘0’ denotes that it is not, the objective is to predict the labels on the given test dataset.
+- Data Instances:
+Each instance contains a label denoting whether the tweet is hate speech or not
+```python
+{'label': 0,  # not hate speech
+ 'tweet': ' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run'}
+```
+- Data Fields:
+**label**: 1 - hate speech, 0 - not hate speech
+**tweet**: content of the tweet as a string
+- Data Splits:
+The dataset provides a training split with **31962** entries
+## Test set metrics 🧾
+We created a representative test set with 5% of the entries.
+The dataset is highly imbalanced, and we obtained an **F1 score of 79.8**
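As a rough illustration of why F1 is reported for an imbalanced dataset, the sketch below computes the F1 score of the positive class from raw counts and compares it with accuracy. The labels are made up for the example, the helper names `binary_f1` and `accuracy` are hypothetical, and the card does not state which F1 variant (binary, macro, or weighted) was used, so treating it as binary F1 is an assumption.

```python
def binary_f1(y_true, y_pred, positive=1):
    """F1 score of the positive class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels: 90 negative tweets followed by 10 positive ones.
y_true = [0] * 90 + [1] * 10

# A classifier that never predicts hate speech looks good on accuracy only.
always_negative = [0] * 100
print(accuracy(y_true, always_negative), binary_f1(y_true, always_negative))

# A classifier with 8 true positives, 4 false positives, 2 false negatives.
y_pred = [1] * 4 + [0] * 86 + [1] * 8 + [0] * 2
print(accuracy(y_true, y_pred), binary_f1(y_true, y_pred))
```

The first classifier reaches high accuracy while finding no hate speech at all, which is why accuracy alone would be misleading here.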
+## Model in Action 🚀
+```sh
+git clone https://github.com/huggingface/transformers.git
+pip install -q ./transformers
+```
+```python
+import torch
+from transformers import AutoTokenizer, T5ForConditionalGeneration
+
+ckpt = 'Narrativa/byt5-base-tweet-hate-detection'
+# Use the GPU when one is available, otherwise fall back to the CPU
+device = 'cuda' if torch.cuda.is_available() else 'cpu'
+
+tokenizer = AutoTokenizer.from_pretrained(ckpt)
+model = T5ForConditionalGeneration.from_pretrained(ckpt).to(device)
+
+def classify_tweet(tweet):
+    # Tokenize the tweet, padding or truncating it to a fixed length of 512
+    inputs = tokenizer([tweet], padding='max_length', truncation=True, max_length=512, return_tensors='pt')
+    input_ids = inputs.input_ids.to(device)
+    attention_mask = inputs.attention_mask.to(device)
+    # The label is generated as text, so decode the output sequence
+    output = model.generate(input_ids, attention_mask=attention_mask)
+    return tokenizer.decode(output[0], skip_special_tokens=True)
+
+classify_tweet('here goes your tweet...')
+```
+Created by: [Narrativa](https://www.narrativa.com/)
+About Narrativa: Natural Language Generation (NLG) | Gabriele, our machine learning-based platform, builds and deploys natural language solutions. #NLG #AI