Narrativa committed
Commit 57d7e21 • 1 Parent(s): 56483b6

Update README.md

Files changed (1):
  1. README.md +19 -22

README.md CHANGED
@@ -22,37 +22,33 @@ Authors: *Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang,
 
 ## Details of the downstream task (Question Answering) - Dataset 📚
 
- [TweetQA](hhttps://huggingface.co/datasets/tweets_hate_speech_detection)
 
 
- The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to classify racist or sexist tweets from other tweets.
-
- Formally, given a training sample of tweets and labels, where label '1' denotes the tweet is racist/sexist and label '0' denotes the tweet is not racist/sexist, your objective is to predict the labels on the given test dataset.
 
 - Data Instances:
 
- The dataset contains a label denoting is the tweet a hate speech or not
 
 ```json
- {'label': 0, # not a hate speech
-  'tweet': ' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run'}
 ```
 - Data Fields:
 
- **label**: 1 - it is a hate speech, 0 - not a hate speech
-
- **tweet**: content of the tweet as a string
 
- - Data Splits:
 
- The data contains training data with **31962** entries
 
- ## Test set metrics 🧾
 
- We created a representative test set with the 5% of the entries.
-
- The dataset is so imbalanced and we got a **F1 score of 79.8**
-
 
 
 ## Model in Action 🚀
@@ -65,21 +61,22 @@ pip install -q ./transformers
 ```python
 from transformers import AutoTokenizer, T5ForConditionalGeneration
 
- ckpt = 'Narrativa/byt5-base-tweet-hate-detection'
 
 tokenizer = AutoTokenizer.from_pretrained(ckpt)
- model = T5ForConditionalGeneration.from_pretrained(ckpt).to("cuda")
 
- def classify_tweet(tweet):
 
-     inputs = tokenizer([tweet], padding='max_length', truncation=True, max_length=512, return_tensors='pt')
     input_ids = inputs.input_ids.to('cuda')
     attention_mask = inputs.attention_mask.to('cuda')
     output = model.generate(input_ids, attention_mask=attention_mask)
     return tokenizer.decode(output[0], skip_special_tokens=True)
 
 
- classify_tweet('here goes your tweet...')
 ```
 
 Created by: [Narrativa](https://www.narrativa.com/)
 
 
 ## Details of the downstream task (Question Answering) - Dataset 📚
 
+ [TweetQA](https://huggingface.co/datasets/tweet_qa)
 
 
+ With social media becoming an increasingly popular medium on which news and real-time events are reported, automated question answering (QA) systems are critical to many applications that rely on real-time knowledge. While previous QA datasets have concentrated on formal text such as news articles and Wikipedia, TweetQA is the first large-scale dataset for QA over social media data. To ensure the tweets are meaningful and contain interesting information, its creators gathered tweets used by journalists to write news articles, then asked human annotators to write questions and answers about these tweets. Unlike datasets such as SQuAD, in which answers are extractive, TweetQA allows answers to be abstractive: the model must read a short tweet and a question and output a text phrase (which need not appear in the tweet) as the answer.
 
 
 - Data Instances:
 
+ Sample
 
 ```json
+ {
+   "Question": "who is the tallest host?",
+   "Answer": ["sam bee", "sam bee"],
+   "Tweet": "Don't believe @ConanOBrien's height lies. Sam Bee is the tallest host in late night. #alternativefacts\u2014 Full Frontal (@FullFrontalSamB) January 22, 2017",
+   "qid": "3554ee17d86b678be34c4dc2c04e334f"
+ }
 ```
 - Data Fields:
 
+ Question: a question based on information from a tweet
 
+ Answer: list of possible answers from the tweet
 
+ Tweet: the source tweet
 
+ qid: question id
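Because TweetQA answers are abstractive and each question carries a list of reference answers, scoring a prediction requires some text normalization. The sketch below is purely illustrative: the `normalize` and `exact_match` helpers are hypothetical, not part of the dataset or of the official evaluation (which, to the best of our knowledge, relies on text-overlap metrics such as BLEU and ROUGE).

```python
import string

def normalize(text):
    # Lowercase, strip punctuation, collapse whitespace (illustrative only).
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    return ' '.join(text.split())

def exact_match(prediction, answers):
    # True if the prediction equals any reference answer after normalization.
    return normalize(prediction) in {normalize(a) for a in answers}

# Using the sample instance above: casing differences should not count as errors.
references = ["sam bee", "sam bee"]
print(exact_match("Sam Bee", references))   # True
print(exact_match("Conan", references))     # False
```

Note that the gold answer "sam bee" is not a verbatim span of the tweet (the tweet says "Sam Bee"), which is exactly why a purely extractive matcher would fail here.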
 
 
 ## Model in Action 🚀
 
 ```python
 from transformers import AutoTokenizer, T5ForConditionalGeneration
 
+ ckpt = 'Narrativa/byt5-base-finetuned-tweet-qa'
 
 tokenizer = AutoTokenizer.from_pretrained(ckpt)
+ model = T5ForConditionalGeneration.from_pretrained(ckpt).to('cuda')
 
+ def get_answer(question, context):
 
+     input_text = 'question: %s context: %s' % (question, context)
+     inputs = tokenizer([input_text], return_tensors='pt')
     input_ids = inputs.input_ids.to('cuda')
     attention_mask = inputs.attention_mask.to('cuda')
     output = model.generate(input_ids, attention_mask=attention_mask)
     return tokenizer.decode(output[0], skip_special_tokens=True)
 
 
+ get_answer('here goes your question', 'And here the context/tweet...')
 ```
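The model only ever sees a flat string, so the prompt template matters. Assuming the `question: … context: …` format from the snippet above, the input construction can be checked in isolation, without a GPU or a model download (`build_input` is a hypothetical helper that mirrors the formatting inside `get_answer`):

```python
def build_input(question, context):
    # Mirrors the input formatting used inside get_answer() above.
    return 'question: %s context: %s' % (question, context)

prompt = build_input('who is the tallest host?',
                     'Sam Bee is the tallest host in late night.')
print(prompt)
# question: who is the tallest host? context: Sam Bee is the tallest host in late night.
```

Fine-tuned text-to-text models are typically sensitive to the exact prompt template seen at training time, so any downstream wrapper should reproduce it verbatim.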
 
  Created by: [Narrativa](https://www.narrativa.com/)