aadelucia committed
Commit ac77102
1 Parent(s): 3f8c41e
Files changed (1)
README.md +103 -7
README.md CHANGED
@@ -1,23 +1,101 @@
  ---
  license: mit
+ datasets:
+ - jhu-clsp/bernice-pretrain-data
+ language:
+ - en
+ - es
+ - pt
+ - ja
+ - ar
+ - in
+ - ko
+ - tr
+ - fr
+ - tl
+ - ru
+ - und
+ - it
+ - th
+ - de
+ - hi
+ - pl
+ - nl
+ - fa
+ - et
+ - ht
+ - ur
+ - sv
+ - ca
+ - el
+ - fi
+ - cs
+ - iw
+ - da
+ - vi
+ - zh
+ - ta
+ - ro
+ - no
+ - uk
+ - cy
+ - ne
+ - hu
+ - eu
+ - sl
+ - lv
+ - lt
+ - bn
+ - sr
+ - bg
+ - mr
+ - ml
+ - is
+ - te
+ - gu
+ - kn
+ - ps
+ - ckb
+ - si
+ - hy
+ - or
+ - pa
+ - am
+ - sd
+ - my
+ - ka
+ - km
+ - dv
+ - lo
+ - ug
+ - bo
  ---

  # Bernice

  Bernice is a multilingual pre-trained encoder exclusively for Twitter data.
- The model was released with the EMNLP 2022 paper *Bernice: A Multilingual Pre-trained Encoder for Twitter* by Alexandra DeLucia, Shijie Wu, Aaron Mueller, Carlos Aguirre, Mark Dredze, and Philip Resnik.
+ The model was released with the EMNLP 2022 paper
+ [*Bernice: A Multilingual Pre-trained Encoder for Twitter*](https://aclanthology.org/2022.emnlp-main.415/) by
+ Alexandra DeLucia, Shijie Wu, Aaron Mueller, Carlos Aguirre, Mark Dredze, and Philip Resnik.

- This model card will contain more information *soon*. Please reach out to Alexandra DeLucia (aadelucia at jhu.edu) or open an issue if there are questions.
+ Please reach out to Alexandra DeLucia (aadelucia at jhu.edu) or open an issue if there are questions.

  # Model description
- TBD
+ The language of Twitter differs significantly from that of other domains commonly included in large language model training.
+ While tweets are typically multilingual and contain informal language, including emoji and hashtags, most pre-trained
+ language models for Twitter are either monolingual, adapted from other domains rather than trained exclusively on Twitter,
+ or are trained on a limited amount of in-domain Twitter data. We introduce Bernice, the first multilingual RoBERTa language
+ model trained from scratch on 2.5 billion tweets with a custom tweet-focused tokenizer. We evaluate on a variety of monolingual
+ and multilingual Twitter benchmarks, finding that our model consistently exceeds or matches the performance of a variety of models
+ adapted to social media data as well as strong multilingual baselines, despite being trained on less data overall. We posit that it is
+ more efficient compute- and data-wise to train completely on in-domain data with a specialized domain-specific tokenizer.
 
  ## Training data
  2.5 billion tweets with 56 billion subwords in 66 languages (as identified in Twitter metadata).
  The tweets are collected from the 1% public Twitter stream between January 2016 and December 2021.

  ## Training procedure
- RoBERTa pre-training with BERT-base architecture.
+ RoBERTa pre-training (i.e., masked language modeling) with BERT-base architecture.

  ## Evaluation results
  TBD
 
@@ -30,8 +108,8 @@ from transformers import AutoTokenizer, AutoModel
  import re

  # Load model
- model = AutoModel("bernice")
- tokenizer = AutoTokenizer.from_pretrained("bernice", model_max_length=128)
+ model = AutoModel.from_pretrained("jhu-clsp/bernice")
+ tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/bernice", model_max_length=128)

  # Data
  raw_tweets = [
 
@@ -57,4 +135,22 @@ with torch.no_grad():
  TBD

  ## BibTeX entry and citation info
- TBD
+ ```
+ @inproceedings{delucia-etal-2022-bernice,
+     title = "Bernice: A Multilingual Pre-trained Encoder for {T}witter",
+     author = "DeLucia, Alexandra and
+       Wu, Shijie and
+       Mueller, Aaron and
+       Aguirre, Carlos and
+       Resnik, Philip and
+       Dredze, Mark",
+     booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
+     month = dec,
+     year = "2022",
+     address = "Abu Dhabi, United Arab Emirates",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/2022.emnlp-main.415",
+     pages = "6191--6205",
+     abstract = "The language of Twitter differs significantly from that of other domains commonly included in large language model training. While tweets are typically multilingual and contain informal language, including emoji and hashtags, most pre-trained language models for Twitter are either monolingual, adapted from other domains rather than trained exclusively on Twitter, or are trained on a limited amount of in-domain Twitter data. We introduce Bernice, the first multilingual RoBERTa language model trained from scratch on 2.5 billion tweets with a custom tweet-focused tokenizer. We evaluate on a variety of monolingual and multilingual Twitter benchmarks, finding that our model consistently exceeds or matches the performance of a variety of models adapted to social media data as well as strong multilingual baselines, despite being trained on less data overall. We posit that it is more efficient compute- and data-wise to train completely on in-domain data with a specialized domain-specific tokenizer.",
+ }
+ ```
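
Only fragments of the model card's "How to use" snippet appear in the hunks above (the `transformers` imports, the model and tokenizer loading, a `raw_tweets` list, and a `torch.no_grad()` block). A minimal sketch of how those visible pieces might fit together is given below; the example tweets, the `@USER`/`HTTPURL` masking in `preprocess`, and the mean-pooling step are illustrative assumptions, not code taken from the README.

```python
import re

import torch
from transformers import AutoModel, AutoTokenizer

# Load the encoder and its tweet-focused tokenizer
# (128 matches the max length used in the model card's snippet).
model = AutoModel.from_pretrained("jhu-clsp/bernice")
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/bernice", model_max_length=128)

# Toy tweets, for illustration only.
raw_tweets = [
    "i can't believe the new album dropped at midnight 😭🔥 #NowPlaying",
    "@someuser check this out https://example.com",
]

def preprocess(text: str) -> str:
    """Hypothetical cleanup: mask user handles and URLs before encoding."""
    text = re.sub(r"@\w+", "@USER", text)
    text = re.sub(r"https?://\S+", "HTTPURL", text)
    return text

tweets = [preprocess(t) for t in raw_tweets]

# Inspect how the tweet-focused vocabulary segments informal text.
print(tokenizer.tokenize(tweets[0]))

# Encode and take the final hidden states as contextual embeddings.
inputs = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool over non-padding tokens to get one vector per tweet.
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, 768) for a BERT-base-sized encoder
```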