CryptoBERT

CryptoBERT is a pre-trained NLP model for analysing the language and sentiment of cryptocurrency-related social media posts and messages. It was built by further training cardiffnlp's Twitter-roBERTa-base language model on the cryptocurrency domain, using a corpus of over 3.2M unique cryptocurrency-related social media posts.

Training Corpus

CryptoBERT was trained on 3.2M social media posts about various cryptocurrencies. Only non-duplicate posts longer than 4 words were kept. The following communities were the sources of the corpus:

(1) StockTwits - 1.875M posts about the top 100 cryptos by trading volume. Posts were collected from the 1st of November 2021 to the 16th of June 2022.

(2) Telegram - 664K posts from the top 5 Telegram groups: Binance, Bittrex, Huobi Global, KuCoin, OKEx. Posts were collected from the 16th of November 2020 to the 30th of January 2021. Courtesy of Anton.

(3) Reddit - 172K comments from various crypto investing threads, collected from May 2021 to May 2022.

(4) Twitter - 496K posts with the hashtags XBT, Bitcoin or BTC, covering May 2018. Courtesy of Paul.
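
The filtering rule stated above (keep only non-duplicate posts longer than 4 words) can be sketched as follows. This is a minimal illustration, not the original preprocessing code: the function names `normalize` and `filter_posts` are hypothetical, words are assumed to be whitespace-separated tokens, and duplicates are assumed to be posts that match after lowercasing and collapsing whitespace. The source does not say how either was actually defined.

```python
import re

# "Length above 4 words" is read here as at least 5 whitespace-separated tokens (assumption).
MIN_WORDS = 5


def normalize(post: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal.

    This notion of "duplicate" is an assumption made for illustration.
    """
    return re.sub(r"\s+", " ", post).strip().lower()


def filter_posts(posts):
    """Return posts that are long enough and not seen before, in original order."""
    seen = set()
    kept = []
    for post in posts:
        key = normalize(post)
        if len(key.split()) < MIN_WORDS:
            continue  # too short: 4 words or fewer
        if key in seen:
            continue  # duplicate of an earlier post
        seen.add(key)
        kept.append(post)
    return kept


# Made-up sample posts, for illustration only.
sample = [
    "BTC to the moon",                      # 4 words, dropped
    "I think ETH will recover next week",   # kept
    "I think  ETH will recover next week",  # duplicate after normalization, dropped
    "Buying more BTC on this dip",          # kept
]
print(filter_posts(sample))
```

Keeping the first occurrence of each post and preserving input order makes the result deterministic, which helps when the same filter is applied separately to each source.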