TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations

This repo contains models, code and pointers to datasets from our paper: TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations. [PDF] [HuggingFace Models]

Overview

TwHIN-BERT is a new multilingual Tweet language model trained on 7 billion Tweets from over 100 distinct languages. It differs from prior pre-trained language models in that it is trained not only with text-based self-supervision (e.g., MLM), but also with a social objective based on the rich social engagements within a Twitter Heterogeneous Information Network (TwHIN).

TwHIN-BERT can be used as a drop-in replacement for BERT in a variety of NLP and recommendation tasks. It outperforms similar models not only on semantic understanding tasks (such as text classification), but also on social recommendation tasks such as predicting user-to-Tweet engagement.

1. Pretrained Models

We initially release two pretrained TwHIN-BERT models (base and large) that are compatible with the HuggingFace BERT models.

Model            | Size            | Download Link (🤗 HuggingFace)
TwHIN-BERT-base  | 280M parameters | Twitter/TwHIN-BERT-base
TwHIN-BERT-large | 550M parameters | Twitter/TwHIN-BERT-large

To use these models in 🤗 Transformers:

from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and encoder weights from the HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('Twitter/twhin-bert-large')
model = AutoModel.from_pretrained('Twitter/twhin-bert-large')

# Tokenize a Tweet into PyTorch tensors and run a forward pass
inputs = tokenizer("I'm using TwHIN-BERT! #TwHIN-BERT #NLP", return_tensors="pt")
outputs = model(**inputs)
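
The forward pass above returns one hidden vector per token. Downstream classification or recommendation tasks often need a single vector per Tweet. The sketch below shows one common way to get it: masked mean pooling over the last hidden state. This pooling choice is an assumption made for illustration, not necessarily the strategy used in the paper, and `mean_pool` and `embed_tweets` are hypothetical helper names rather than part of this repo or of Transformers.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sequence, ignoring padding positions."""
    # (batch, seq_len) -> (batch, seq_len, 1), cast so it can scale the hidden states
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)          # guard against all-padding rows
    return summed / counts


def embed_tweets(texts, model_name="Twitter/twhin-bert-large"):
    """Hypothetical helper: return one mean-pooled vector per input text."""
    # Imported lazily; the first call downloads the weights from the Hub.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # Pad to the longest text in the batch so the inputs form one tensor
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return mean_pool(outputs.last_hidden_state, inputs["attention_mask"])
```

Calling `embed_tweets(["I'm using TwHIN-BERT!", "#NLP is fun"])` would return a tensor with one row per Tweet. The rows can then be compared with cosine similarity or fed to a task-specific classifier.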

Citation

If you use TwHIN-BERT or our datasets in your work, please cite the following:

@article{zhang2022twhin,
  title={TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations},
  author={Zhang, Xinyang and Malkov, Yury and Florez, Omar and Park, Serim and McWilliams, Brian and Han, Jiawei and El-Kishky, Ahmed},
  journal={arXiv preprint arXiv:2209.07562},
  year={2022}
}