Narrativa committed
Commit 83312d5
1 Parent(s): d0eaf48

Update README.md

Files changed (1)
  1. README.md +8 -8
README.md CHANGED
@@ -15,7 +15,7 @@ tags:
 # Details of ByT5 - Base 🧠
 
 ByT5 is a tokenizer-free version of [Google's T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) and generally follows the architecture of [MT5](https://huggingface.co/google/mt5-base).
- ByT5 was only pre-trained on [mC4](https://www.tensorflow.org/datasets/catalog/c4#c4multilingual) excluding any supervised training with an average span-mask of 20 UTF-8 characters. Therefore, this model has to be fine-tuned before it is useable on a downstream task.
+ ByT5 was only pre-trained on [mC4](https://www.tensorflow.org/datasets/catalog/c4#c4multilingual), excluding any supervised training, with an average span-mask of 20 UTF-8 characters. Therefore, this model has to be fine-tuned before it is usable on a downstream task.
 ByT5 works especially well on noisy text data, *e.g.*, `google/byt5-base` significantly outperforms [mt5-base](https://huggingface.co/google/mt5-base) on [TweetQA](https://arxiv.org/abs/1907.06292).
 Paper: [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626)
 Authors: *Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel*
@@ -26,7 +26,7 @@ Authors: *Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang,
 [TweetQA](https://huggingface.co/datasets/tweet_qa)
 
 
- With social media becoming increasingly popular on which lots of news and real-time events are reported, developing automated question answering systems is critical to the effectiveness of many applications that rely on real-time knowledge. While previous question answering (QA) datasets have concentrated on formal text like news and Wikipedia, we present the first large-scale dataset for QA over social media data. To make the tweets are meaningful and contain interesting information, we gather tweets used by journalists to write news articles. We then ask human annotators to write questions and answers upon these tweets. Unlike other QA datasets like SQuAD in which the answers are extractive, we allow the answers to be abstractive. The task requires model to read a short tweet and a question and outputs a text phrase (does not need to be in the tweet) as the answer.
+ With social media becoming increasingly popular, lots of news and real-time events are being covered there. Developing automated question answering systems is critical to the effectiveness of many applications that rely on real-time knowledge. While previous question answering (QA) datasets have focused on formal text such as news and Wikipedia, we present the first large-scale dataset for QA over social media data. To make sure that the tweets are meaningful and contain interesting information, we gather tweets used by journalists to write news articles. We then ask human annotators to write questions and answers about these tweets. Unlike other QA datasets such as SQuAD (in which the answers are extractive), we allow the answers to be abstractive. The task requires the model to read a short tweet and a question and output a text phrase (which does not need to appear in the tweet) as the answer.
 
 - Data Instances:
 
@@ -36,19 +36,19 @@ Sample
 {
  "Question": "who is the tallest host?",
  "Answer": ["sam bee","sam bee"],
- "Tweet": "Don't believe @ConanOBrien's height lies. Sam Bee is the tallest host in late night. #alternativefacts\\\\u2014 Full Frontal (@FullFrontalSamB) January 22, 2017",
+ "Tweet": "Don't believe @ConanOBrien's height lies. Sam Bee is the tallest host in late night. #alternativefacts\\\\\\\\u2014 Full Frontal (@FullFrontalSamB) January 22, 2017",
 "qid": "3554ee17d86b678be34c4dc2c04e334f"
 }
 ```
 - Data Fields:
 
- Question: a question based on information from a tweet
+ *Question*: a question based on information from a tweet
 
- Answer: list of possible answers from the tweet
+ *Answer*: list of possible answers from the tweet
 
- Tweet: source tweet
+ *Tweet*: source tweet
 
- qid: question id
+ *qid*: question id
 
 
 
@@ -77,7 +77,7 @@ def get_answer(question, context):
   return tokenizer.decode(output[0], skip_special_tokens=True)
 
 
- context = "Don't believe @ConanOBrien's height lies. Sam Bee is the tallest host in late night. #alternativefacts\\\\u2014 Full Frontal (@FullFrontalSamB) January 22, 2017"
+ context = "Don't believe @ConanOBrien's height lies. Sam Bee is the tallest host in late night. #alternativefacts\\\\\\\\u2014 Full Frontal (@FullFrontalSamB) January 22, 2017"
 question = "who is the tallest host?"
 
 get_answer(question, context)
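Because ByT5 is tokenizer-free, inputs are plain UTF-8 bytes rather than vocabulary ids. A minimal sketch of byte-level input preparation for `google/byt5-base`, assuming the standard `transformers` ByT5 integration (the example strings are illustrative):

```python
import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("google/byt5-base")

# ByT5 consumes raw UTF-8 bytes, shifted by 3 to reserve ids for the
# special tokens (pad=0, eos=1, unk=2).
input_ids = torch.tensor([list("Life is like a box of chocolates.".encode("utf-8"))]) + 3
labels = torch.tensor([list("La vie est comme une boîte de chocolat.".encode("utf-8"))]) + 3

# The raw checkpoint was only span-denoised on mC4, so it still needs
# fine-tuning (e.g. on TweetQA) before it is useful downstream.
loss = model(input_ids, labels=labels).loss
```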
 
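The *Question*, *Answer*, *Tweet*, and *qid* fields map directly onto the Hugging Face `datasets` schema. A short loading sketch, assuming the `tweet_qa` Hub id from the link above:

```python
from datasets import load_dataset

# "tweet_qa" is the dataset id linked above; the field casing follows
# the sample instance shown in the README.
dataset = load_dataset("tweet_qa", split="train")

example = dataset[0]
print(example["Question"])  # a question about the tweet
print(example["Answer"])    # list of acceptable gold answers
print(example["Tweet"])     # source tweet text
print(example["qid"])       # question id
```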
 
 
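The diff shows only fragments of the inference snippet (the `def get_answer(question, context):` header and its `return` line). A sketch of what the complete function could look like; the checkpoint id and prompt format below are assumptions, not values taken from the README:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Hypothetical fine-tuned checkpoint; substitute the actual repo id.
ckpt = "Narrativa/byt5-base-finetuned-tweet-qa"  # assumption
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = T5ForConditionalGeneration.from_pretrained(ckpt)

def get_answer(question, context):
    # The "question: ... context: ..." prompt layout is an assumption.
    input_text = f"question: {question} context: {context}"
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids
    output = model.generate(input_ids, max_new_tokens=32)
    return tokenizer.decode(output[0], skip_special_tokens=True)

context = ("Don't believe @ConanOBrien's height lies. Sam Bee is the tallest "
           "host in late night. #alternativefacts\u2014 Full Frontal "
           "(@FullFrontalSamB) January 22, 2017")
question = "who is the tallest host?"

print(get_answer(question, context))  # gold answer above: "sam bee"
```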