aadelucia committed on
Commit
73134ca
1 Parent(s): 2b44272
Files changed (1)
  1. README.md +23 -2
README.md CHANGED
@@ -93,12 +93,22 @@ more efficient compute- and data-wise to train completely on in-domain data with
 ## Training data
 2.5 billion tweets with 56 billion subwords in 66 languages (as identified in Twitter metadata).
 The tweets are collected from the 1% public Twitter stream between January 2016 and December 2021.
+See the [Bernice pretrain dataset](https://huggingface.co/datasets/jhu-clsp/bernice-pretrain-data) for details.
 
 ## Training procedure
 RoBERTa pre-training (i.e., masked language modeling) with BERT-base architecture.
 
 ## Evaluation results
-TBD
+We evaluated Bernice on three Twitter benchmarks: [TweetEval](https://aclanthology.org/2020.findings-emnlp.148/), the [Unified Multilingual Sentiment Analysis Benchmark (UMSAB)](https://aclanthology.org/2022.lrec-1.27/), and [Multilingual Hate Speech](https://link.springer.com/chapter/10.1007/978-3-030-67670-4_26). Summary results are shown below; see the paper appendix for details.
+
+| | **Bernice** | **BERTweet** | **XLM-R** | **XLM-T** | **TwHIN-BERT-MLM** | **TwHIN-BERT** |
+|-------------|-------------|--------------|-----------|-----------|--------------------|----------------|
+| TweetEval   | 64.80       | **67.90**    | 57.60     | 64.40     | 64.80              | 63.10          |
+| UMSAB       | **70.34**   | -            | 67.71     | 66.74     | 68.10              | 67.53          |
+| Hate Speech | **76.20**   | -            | 74.54     | 73.31     | 73.41              | 74.32          |
+
 
 # How to use
 You can use this model for tweet representation. To use with HuggingFace PyTorch interface:
@@ -132,7 +142,18 @@ with torch.no_grad():
 
 # Limitations and bias
-TBD
+
+**Presence of Hate Speech:** As with all social media data, spam and hate speech are present.
+We cleaned our data by filtering on tweet length, but some spam may remain.
+Hate speech is difficult to detect, especially across languages and cultures, so we leave its removal for future work.
+
+**Low-resource Language Evaluation:** Even with language sampling during training,
+Bernice is not exposed to the same variety of examples in low-resource languages as in high-resource languages like English and Spanish.
+It is unclear whether enough Twitter data exists in languages such as Tibetan and Telugu to ever match performance on high-resource languages.
+Only models that generalize more efficiently can pave the way for better performance across the wide variety of languages in this low-resource category.
+
+See the paper for a more detailed discussion.
+
 
 ## BibTeX entry and citation info
 ```
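
The "How to use" section referenced in the diff (its code is elided between the two hunks, apart from the `with torch.no_grad():` context) loads the model through the HuggingFace PyTorch interface to produce tweet representations. A minimal sketch of that kind of usage — the model ID `jhu-clsp/bernice` and the mean-pooling step are assumptions for illustration, not taken from the commit:

```python
# Sketch: encode tweets into fixed-size vectors with a HuggingFace encoder.
# NOTE: the model ID "jhu-clsp/bernice" is an assumption (inferred from the
# dataset's org in the diff); verify it on the Hub before relying on it.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/bernice")
model = AutoModel.from_pretrained("jhu-clsp/bernice")

tweets = ["I love this!", "Qué día tan bonito"]
inputs = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One common way to get a single vector per tweet: mean-pool the token
# embeddings, using the attention mask so padding tokens are ignored.
mask = inputs["attention_mask"].unsqueeze(-1)            # (batch, seq, 1)
embeddings = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
# embeddings: (batch, hidden_size), one representation per input tweet
```

Mean pooling is only one choice; taking the `[CLS]`/`<s>` token's hidden state is a common alternative for sentence-level representations.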