Narrativa committed
Commit 325f416 • Parent(s): 7308ae6

Update README.md

Files changed (1):
  1. README.md +18 -20
README.md CHANGED
@@ -12,7 +12,7 @@ tags:
 
 # mBART-large-50 fine-tuned on opus100 and opusbooks for English to Portuguese translation
 
- [mBART-50](https://huggingface.co/facebook/mbart-large-50/) large fine-tuned on the [opus100](https://huggingface.co/datasets/viewer/?dataset=opus100) and [opusbooks](https://huggingface.co/datasets/viewer/?dataset=opusbooks) datasets for the **NMT** downstream task.
+ [mBART-50](https://huggingface.co/facebook/mbart-large-50/) large fine-tuned on the [opus100](https://huggingface.co/datasets/viewer/?dataset=opus100) dataset for the **NMT** downstream task.
 
 # Details of mBART-50 🧠
@@ -29,38 +29,36 @@ where spans of text are replaced with a single mask token. The model is then tas
 The decoder input is the original text with one position offset. A language id symbol `LID` is used as the initial token to predict the sentence.
 
- ## Details of the downstream task (Sequence Classification as Text generation) - Dataset 📚
- 
- [tweets_hate_speech_detection](https://huggingface.co/datasets/tweets_hate_speech_detection)
- 
- The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to classify racist or sexist tweets from other tweets.
- 
- Formally, given a training sample of tweets and labels, where label '1' denotes the tweet is racist/sexist and label '0' denotes the tweet is not racist/sexist, the objective is to predict the labels on the given test dataset.
- 
- - Data Instances:
- 
- The dataset contains a label denoting whether the tweet is hate speech or not:
- 
- ```json
- {"label": 0,
-  "tweet": " @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run"}
- ```
- 
- - Data Fields:
- 
- **label**: 1 - it is hate speech, 0 - not hate speech
- 
- **tweet**: content of the tweet as a string
- 
- - Data Splits:
- 
- The training data contains **31962** entries.
- 
- ## Test set metrics 🧾
- 
- We created a representative test set with 5% of the entries.
- 
- The dataset is highly imbalanced, and we got an **F1 score of 79.8**.
+ ## Details of the downstream task (NMT) - Dataset 📚
+ 
+ - **Homepage:** [Link](http://opus.nlpl.eu/opus-100.php)
+ - **Repository:** [GitHub](https://github.com/EdinburghNLP/opus-100-corpus)
+ - **Paper:** [arXiv](https://arxiv.org/abs/2004.11867)
+ 
+ ### Dataset Summary
+ 
+ OPUS-100 is English-centric, meaning that all training pairs include English on either the source or target side. The corpus covers 100 languages (including English). The languages were selected based on the volume of parallel data available in OPUS.
+ 
+ ### Languages
+ 
+ OPUS-100 contains approximately 55M sentence pairs. Of the 99 language pairs, 44 have 1M sentence pairs of training data, 73 have at least 100k, and 95 have at least 10k.
+ 
+ ## Dataset Structure
+ 
+ ### Data Fields
+ 
+ - `src_tag`: `string` text in the source language
+ - `tgt_tag`: `string` translation of the source text in the target language
+ 
+ ### Data Splits
+ 
+ The dataset is split into training, development, and test portions. Data was prepared by randomly sampling up to 1M sentence pairs per language pair for training and up to 2000 each for development and test. To ensure that there was no overlap (at the monolingual sentence level) between the training and development/test data, a filter was applied during sampling to exclude sentences that had already been sampled. Note that this was done cross-lingually, so that, for instance, an English sentence in the Portuguese-English portion of the training data could not occur in the Hindi-English test set.
+ 
+ ## Test set metrics 🧾
+ 
+ We got a **BLEU score of 20.61**.
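
The updated card describes an mBART-50 checkpoint fine-tuned for English-to-Portuguese NMT. As a usage illustration, here is a minimal sketch of loading and querying such a checkpoint with the `transformers` library; the repo id is a hypothetical placeholder (the commit does not state the final model id), and `en_XX`/`pt_XX` are the standard mBART-50 language codes that play the role of the `LID` token mentioned in the card.

```python
# Minimal usage sketch, not taken from the card itself.
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Hypothetical repo id -- substitute the actual fine-tuned checkpoint.
model_id = "Narrativa/mbart-large-50-finetuned-opus-en-pt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_id)
model = MBartForConditionalGeneration.from_pretrained(model_id)

tokenizer.src_lang = "en_XX"  # mBART-50 language code for the English source
inputs = tokenizer("I love reading books on rainy days.", return_tensors="pt")

# Force the Portuguese LID token as the first generated token, matching the
# decoding scheme described in the card.
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["pt_XX"],
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```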
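The Data Fields section above uses the generic `src_tag`/`tgt_tag` names from the OPUS-100 card. Below is a small sketch of loading the English-Portuguese portion with the `datasets` library, assuming the Hub's `opus100` loader, which exposes each pair as a `translation` dict keyed by language code:

```python
# Sketch of loading the en-pt configuration of OPUS-100; assumed workflow,
# not part of the original card.
from datasets import load_dataset

dataset = load_dataset("opus100", "en-pt")

# Train/validation/test splits, as described in the Data Splits section.
print(dataset)

# Each example is a `translation` dict keyed by language code.
print(dataset["train"][0])
# e.g. {'translation': {'en': '...', 'pt': '...'}}
```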
 
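On the reported metric: a corpus-level BLEU score such as the **20.61** above is typically computed with a tool like `sacrebleu`. This is an assumption for illustration; the card does not say which implementation was used.

```python
# Hedged sketch of a corpus-level BLEU computation with sacrebleu.
# The sentences are illustrative, not from the actual test set.
import sacrebleu

hypotheses = ["Eu adoro ler livros em dias chuvosos."]  # model outputs
references = [["Adoro ler livros em dias de chuva."]]   # one reference stream
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```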