Model Card for #Encoder

#Encoder is the tweet encoder from "HICL: Hashtag-Driven In-Context Learning for Social Media Natural Language Understanding". It encodes a tweet into a topic-level embedding, which can be used to estimate topic-level similarity between tweets.

Model Details

#Encoder leverages hashtags to learn inter-post topic relevance (for retrieval) via contrastive learning over 179M tweets. It was pre-trained on pairs of posts, with the contrastive objective guiding it to learn topic relevance by identifying posts that share the same hashtag. We randomly noise the hashtags to avoid trivial representations. Please refer to https://github.com/albertan017/HICL for more details.
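The pairing idea described above can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the function names, the hashtag regex, the lowercasing of hashtags, and the interpretation of "noising" as dropping each hashtag with probability `drop_prob` are all assumptions made here. The HICL repository is the reference for the actual procedure.

```python
import random
import re
from collections import defaultdict

# Assumed hashtag pattern: '#' followed by word characters.
HASHTAG = re.compile(r"#\w+")


def noise_hashtags(tweet, drop_prob, rng):
    """Drop each hashtag from the text with probability drop_prob (assumed noising scheme)."""
    def maybe_drop(match):
        return "" if rng.random() < drop_prob else match.group(0)

    # Collapse the whitespace left behind by removed hashtags.
    return " ".join(HASHTAG.sub(maybe_drop, tweet).split())


def sample_positive_pairs(tweets, drop_prob=0.5, seed=0):
    """Return one (hashtag, tweet_a, tweet_b) pair per hashtag shared by at least two tweets."""
    rng = random.Random(seed)
    by_tag = defaultdict(list)
    for tweet in tweets:
        # Use a set so a tweet repeating a hashtag is only listed once per hashtag.
        for tag in {t.lower() for t in HASHTAG.findall(tweet)}:
            by_tag[tag].append(tweet)

    pairs = []
    for tag in sorted(by_tag):
        group = by_tag[tag]
        if len(group) < 2:
            continue  # no positive partner available for this hashtag
        first, second = rng.sample(group, 2)
        pairs.append((tag,
                      noise_hashtags(first, drop_prob, rng),
                      noise_hashtags(second, drop_prob, rng)))
    return pairs
```

With `drop_prob=1.0` every hashtag is stripped from the paired texts, so the encoder could not match posts by simply comparing hashtag strings. With `drop_prob=0.0` the texts are left unchanged.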

Model Description

Developed by: Hanzhuo Tan, Department of Computing, the Hong Kong Polytechnic University
Model type: RoBERTa
Language(s) (NLP): English
License: n.a
Finetuned from model [optional]: BERTweet

Model Sources [optional]

Repository: https://github.com/albertan017/HICL
Paper [optional]: HICL: Hashtag-Driven In-Context Learning for Social Media Natural Language Understanding

Uses

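import torch
from transformers import AutoModel, AutoTokenizer

# Load the pre-trained #Encoder and its tokenizer from the Hugging Face Hub
hashencoder = AutoModel.from_pretrained("albertan017/hashencoder")
tokenizer = AutoTokenizer.from_pretrained("albertan017/hashencoder")

tweet = "here's a sample tweet for encoding"

# Tokenize the tweet into a batch of size 1
input_ids = torch.tensor([tokenizer.encode(tweet)])

# Run inference without tracking gradients
with torch.no_grad():
    features = hashencoder(input_ids)  # features[0] holds the last hidden states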
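To compare two tweets at the topic level, the token-level outputs need to be reduced to one vector per tweet and then scored. The card does not say which pooling strategy to use, so the sketch below assumes mean pooling followed by cosine similarity. The helper names `mean_pool` and `cosine_similarity` are illustrative and work on plain Python lists.

```python
import math


def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector (assumed pooling)."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine similarity between two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

For the snippet above, `features[0][0].tolist()` gives the token vectors of the single encoded tweet. Pooling two tweets this way and passing the results to `cosine_similarity` yields a score in [-1, 1], where higher values suggest closer topics.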

Bias, Risks, and Limitations

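We do not enforce semantic similarity. The model is trained to capture topic-level relevance through shared hashtags, so two tweets with similar embeddings may share a topic without being semantically equivalent.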

Training Details

Training Data

#Encoder is pre-trained on 15 GB of plain text from 179 million tweets (4 billion tokens). Following the practice used to pre-train BERTweet, the raw data was collected from the archived Twitter stream, which contains 4 TB of sampled tweets from January 2013 to June 2021. Pre-processing ran in the following steps. First, we employed fastText to extract English tweets and kept only tweets with hashtags. Then, low-frequency hashtags appearing in fewer than 100 tweets were filtered out to alleviate sparsity. The result is a large-scale dataset of 179M tweets, each with at least one hashtag, covering 180K hashtags in total.
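The hashtag filtering steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the fastText language-identification step is omitted, hashtags are matched with a basic regex and lowercased, and the names `extract_hashtags` and `filter_tweets` are hypothetical. Only the threshold of 100 tweets per hashtag comes from the description above.

```python
import re
from collections import Counter

# Assumed hashtag pattern: '#' followed by word characters.
HASHTAG = re.compile(r"#\w+")


def extract_hashtags(tweet):
    """Return the set of lowercased hashtags in a tweet (lowercasing is an assumption)."""
    return {tag.lower() for tag in HASHTAG.findall(tweet)}


def filter_tweets(tweets, min_tweets_per_hashtag=100):
    """Keep tweets that contain at least one hashtag used in enough tweets."""
    # Count, for each hashtag, the number of tweets it appears in.
    tag_counts = Counter()
    for tweet in tweets:
        tag_counts.update(extract_hashtags(tweet))

    frequent = {tag for tag, n in tag_counts.items() if n >= min_tweets_per_hashtag}
    kept = [tweet for tweet in tweets if extract_hashtags(tweet) & frequent]
    return kept, frequent
```

Calling `filter_tweets(tweets)` returns the retained tweets together with the set of surviving hashtags. Tweets without hashtags, or with only rare hashtags, are dropped.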

Training Procedure

To leverage hashtag-gathered context in pre-training, we use contrastive learning and train #Encoder to identify pairs of posts that share the same hashtag, so that it learns topic relevance.
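The card does not spell out the loss function. One common way to realise such an objective is an InfoNCE loss with in-batch negatives, where each anchor post must pick out its same-hashtag partner among all positives in the batch. The sketch below assumes that formulation, uses cosine similarity, and takes `temperature=0.05` as an arbitrary placeholder. It is not taken from the HICL code.

```python
import math


def _cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss where positives[i] is the same-hashtag partner of anchors[i].

    All other positives in the batch act as negatives for anchor i.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [_cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

The loss is small when each anchor is closest to its own partner and grows when another post in the batch is closer, which is the behaviour the training description calls for.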
