

Bernice is a multilingual pre-trained encoder designed exclusively for Twitter data. The model was released with the EMNLP 2022 paper "Bernice: A Multilingual Pre-trained Encoder for Twitter" by Alexandra DeLucia, Shijie Wu, Aaron Mueller, Carlos Aguirre, Mark Dredze, and Philip Resnik.

This model card will contain more information soon. Please reach out to Alexandra DeLucia (aadelucia at jhu.edu) or open an issue if you have questions.

Model description


Training data

The training data consists of 2.5 billion tweets containing 56 billion subwords in 66 languages (as identified in Twitter metadata). The tweets were collected from the 1% public Twitter stream between January 2016 and December 2021.

Training procedure

The model uses the BERT-base architecture and was trained with the RoBERTa pre-training procedure.

Evaluation results


How to use

You can use this model to produce tweet representations. To use it with the Hugging Face PyTorch interface:

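from transformers import AutoTokenizer, AutoModel
import re
import torch

# Load model
# "bernice" is the identifier given in this card; substitute the full Hub
# repository ID if it does not resolve in your environment.
model = AutoModel.from_pretrained("bernice")
tokenizer = AutoTokenizer.from_pretrained("bernice", model_max_length=128)
model.eval()

# Data
raw_tweets = [
  "So, Nintendo and Illimination's upcoming animated #SuperMarioBrosMovie is reportedly titled 'The Super Mario Bros. Movie'. Alrighty. :)",
  "AMLO se vio muy indignado porque propusieron al presidente de Ucrania para el premio nobel de la paz. ¿Qué no hay otros que luchen por la paz? ¿Acaso se quería proponer él?"
]

# Pre-process tweets for tokenizer: mask user handles and URLs
URL_RE = re.compile(r"https?:\/\/[\w\.\/\?\=\d&#%_:/-]+")
HANDLE_RE = re.compile(r"@\w+")
tweets = []
for t in raw_tweets:
  t = HANDLE_RE.sub("@USER", t)
  t = URL_RE.sub("HTTPURL", t)
  tweets.append(t)

# Tokenize the batch, then run the encoder without tracking gradients
inputs = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
  outputs = model(**inputs)

# Token-level embeddings, shape (batch, sequence length, hidden size)
embeddings = outputs.last_hidden_state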
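
Neither example tweet above contains a user handle or a URL, so the effect of the pre-processing step is not visible there. The following is a minimal sketch of that step on its own, using the same two regular expressions. The helper name `normalize_tweet` and the sample tweet are hypothetical and introduced here only for illustration.

```python
import re

# Same patterns as in the usage snippet above.
URL_RE = re.compile(r"https?:\/\/[\w\.\/\?\=\d&#%_:/-]+")
HANDLE_RE = re.compile(r"@\w+")


def normalize_tweet(text: str) -> str:
    """Hypothetical helper: mask user handles and URLs with placeholder tokens."""
    text = HANDLE_RE.sub("@USER", text)  # replace every @handle
    text = URL_RE.sub("HTTPURL", text)   # replace every http(s) link
    return text


# Made-up sample tweet containing one handle and one URL.
sample = "@jack check this out https://example.com/a?b=1 #wow"
print(normalize_tweet(sample))  # -> @USER check this out HTTPURL #wow
```

Hashtags and the rest of the text are left unchanged; only handles and links are replaced by the `@USER` and `HTTPURL` placeholders.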

Limitations and bias


BibTeX entry and citation info

