SocBERT model
Pretrained model on 20GB of English tweets and 72GB of Reddit comments using a masked language modeling (MLM) objective. The tweets come from the Archive and were collected via the Twitter Streaming API. The Reddit comments were randomly sampled from all subreddits from 2015 to 2019. SocBERT-base was pretrained on 819M sequence blocks for 100K steps. SocBERT-final was pretrained on 929M (819M + 110M) sequence blocks for 112K (100K + 12K) steps. We benchmarked SocBERT on 40 text classification tasks with social media data.
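Because the model was pretrained with an MLM objective, it can be exercised directly with a fill-mask pipeline. Below is a minimal sketch using the Hugging Face transformers library; the repo ID `sarkerlab/SocBERT-base` is an assumption for illustration, so substitute the actual Hub path of this model.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Assumed repo ID for illustration; replace with the actual
# Hugging Face Hub path of this model card.
MODEL_ID = "sarkerlab/SocBERT-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

# The fill-mask pipeline exercises the MLM pretraining objective directly.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for prediction in fill_mask(f"i love scrolling {tokenizer.mask_token} all day"):
    print(prediction["token_str"], round(prediction["score"], 4))
```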
The experimental results can be found in our paper:
@inproceedings{socbert:2023,
    title     = {{SocBERT: A Pretrained Model for Social Media Text}},
    author    = {Yuting Guo and Abeed Sarker},
    booktitle = {Proceedings of the Fourth Workshop on Insights from Negative Results in NLP},
    year      = {2023}
}
A base version of the model can be found at SocBERT-base.
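For downstream use such as the 40 text classification benchmarks mentioned above, the pretrained encoder can be fine-tuned with a classification head. The sketch below shows one training step under assumed inputs: the repo ID, the toy texts, and the binary labels are all hypothetical placeholders, not data from the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed repo ID for illustration; replace with the actual Hub path.
MODEL_ID = "sarkerlab/SocBERT-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# num_labels=2 assumes a binary task; adjust per dataset.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

# Toy batch of social media posts (hypothetical examples).
texts = ["this flu shot wrecked me lol", "beautiful sunrise this morning"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# One training step: the forward pass returns the cross-entropy loss
# when labels are supplied.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```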