aadelucia committed
Commit ac77102
1 Parent(s): 3f8c41e
Files changed (1)
README.md +103 -7
README.md CHANGED
@@ -1,23 +1,101 @@
  ---
  license: mit
+ datasets:
+ - jhu-clsp/bernice-pretrain-data
+ language:
+ - en
+ - es
+ - pt
+ - ja
+ - ar
+ - in
+ - ko
+ - tr
+ - fr
+ - tl
+ - ru
+ - und
+ - it
+ - th
+ - de
+ - hi
+ - pl
+ - nl
+ - fa
+ - et
+ - ht
+ - ur
+ - sv
+ - ca
+ - el
+ - fi
+ - cs
+ - iw
+ - da
+ - vi
+ - zh
+ - ta
+ - ro
+ - no
+ - uk
+ - cy
+ - ne
+ - hu
+ - eu
+ - sl
+ - lv
+ - lt
+ - bn
+ - sr
+ - bg
+ - mr
+ - ml
+ - is
+ - te
+ - gu
+ - kn
+ - ps
+ - ckb
+ - si
+ - hy
+ - or
+ - pa
+ - am
+ - sd
+ - my
+ - ka
+ - km
+ - dv
+ - lo
+ - ug
+ - bo
  ---

  # Bernice

  Bernice is a multilingual pre-trained encoder exclusively for Twitter data.
- The model was released with the EMNLP 2022 paper *Bernice: A Multilingual Pre-trained Encoder for Twitter* by Alexandra DeLucia, Shijie Wu, Aaron Mueller, Carlos Aguirre, Mark Dredze, and Philip Resnik.
+ The model was released with the EMNLP 2022 paper
+ [*Bernice: A Multilingual Pre-trained Encoder for Twitter*](https://aclanthology.org/2022.emnlp-main.415/) by
+ Alexandra DeLucia, Shijie Wu, Aaron Mueller, Carlos Aguirre, Mark Dredze, and Philip Resnik.

- This model card will contain more information *soon*. Please reach out to Alexandra DeLucia (aadelucia at jhu.edu) or open an issue if there are questions.
+ Please reach out to Alexandra DeLucia (aadelucia at jhu.edu) or open an issue if there are questions.

  # Model description
- TBD
+ The language of Twitter differs significantly from that of other domains commonly included in large language model training.
+ While tweets are typically multilingual and contain informal language, including emoji and hashtags, most pre-trained
+ language models for Twitter are either monolingual, adapted from other domains rather than trained exclusively on Twitter,
+ or are trained on a limited amount of in-domain Twitter data. We introduce Bernice, the first multilingual RoBERTa language
+ model trained from scratch on 2.5 billion tweets with a custom tweet-focused tokenizer. We evaluate on a variety of monolingual
+ and multilingual Twitter benchmarks, finding that our model consistently exceeds or matches the performance of a variety of models
+ adapted to social media data as well as strong multilingual baselines, despite being trained on less data overall. We posit that it is
+ more efficient compute- and data-wise to train completely on in-domain data with a specialized domain-specific tokenizer.
 
  ## Training data
  2.5 billion tweets with 56 billion subwords in 66 languages (as identified in Twitter metadata).
  The tweets are collected from the 1% public Twitter stream between January 2016 and December 2021.

  ## Training procedure
- RoBERTa pre-training with BERT-base architecture.
+ RoBERTa pre-training (i.e., masked language modeling) with BERT-base architecture.

  ## Evaluation results
  TBD
 
@@ -30,8 +108,8 @@ from transformers import AutoTokenizer, AutoModel
  import re

  # Load model
- model = AutoModel("bernice")
- tokenizer = AutoTokenizer.from_pretrained("bernice", model_max_length=128)
+ model = AutoModel.from_pretrained("jhu-clsp/bernice")
+ tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/bernice", model_max_length=128)

  # Data
  raw_tweets = [
 
@@ -57,4 +135,22 @@ with torch.no_grad():
  TBD

  ## BibTeX entry and citation info
- TBD
+ ```
+ @inproceedings{delucia-etal-2022-bernice,
+     title = "Bernice: A Multilingual Pre-trained Encoder for {T}witter",
+     author = "DeLucia, Alexandra and
+       Wu, Shijie and
+       Mueller, Aaron and
+       Aguirre, Carlos and
+       Resnik, Philip and
+       Dredze, Mark",
+     booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
+     month = dec,
+     year = "2022",
+     address = "Abu Dhabi, United Arab Emirates",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/2022.emnlp-main.415",
+     pages = "6191--6205",
+     abstract = "The language of Twitter differs significantly from that of other domains commonly included in large language model training. While tweets are typically multilingual and contain informal language, including emoji and hashtags, most pre-trained language models for Twitter are either monolingual, adapted from other domains rather than trained exclusively on Twitter, or are trained on a limited amount of in-domain Twitter data. We introduce Bernice, the first multilingual RoBERTa language model trained from scratch on 2.5 billion tweets with a custom tweet-focused tokenizer. We evaluate on a variety of monolingual and multilingual Twitter benchmarks, finding that our model consistently exceeds or matches the performance of a variety of models adapted to social media data as well as strong multilingual baselines, despite being trained on less data overall. We posit that it is more efficient compute- and data-wise to train completely on in-domain data with a specialized domain-specific tokenizer.",
+ }
+ ```
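
Only fragments of the model card's "How to use" snippet appear in the hunks above (the `transformers` imports, the model and tokenizer loading, a `raw_tweets` list, and a `torch.no_grad()` block). A minimal sketch of how those visible pieces might fit together is given below; the example tweets, the `@USER`/`HTTPURL` masking in `preprocess`, and the mean-pooling step are illustrative assumptions, not code taken from the README.

```python
import re

import torch
from transformers import AutoModel, AutoTokenizer

# Load the encoder and its tweet-focused tokenizer
# (128 matches the max length used in the model card's snippet).
model = AutoModel.from_pretrained("jhu-clsp/bernice")
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/bernice", model_max_length=128)

# Toy tweets, for illustration only.
raw_tweets = [
    "i can't believe the new album dropped at midnight 😭🔥 #NowPlaying",
    "@someuser check this out https://example.com",
]

def preprocess(text: str) -> str:
    """Hypothetical cleanup: mask user handles and URLs before encoding."""
    text = re.sub(r"@\w+", "@USER", text)
    text = re.sub(r"https?://\S+", "HTTPURL", text)
    return text

tweets = [preprocess(t) for t in raw_tweets]

# Inspect how the tweet-focused vocabulary segments informal text.
print(tokenizer.tokenize(tweets[0]))

# Encode and take the final hidden states as contextual embeddings.
inputs = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool over non-padding tokens to get one vector per tweet.
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, 768) for a BERT-base-sized encoder
```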