Commit c57bda0 by ElKulako · 1 Parent(s): 6c2f8ea

Update README.md

Files changed (1)
  1. README.md +4 -3
README.md CHANGED
@@ -1,15 +1,16 @@
 # CryptoBERT
-CryptoBERT is a pre-trained NLP model to analyze language and sentiments of cryptocurrency-related social media posts and messages. It is built by further training the [cardiffnlp's Twitter-roBERTa-base](https://huggingface.co/cardiffnlp/twitter-roberta-base) language model on the cryptocurrency domain, using a corpus of over 3.2M unique cryptocurrency-related social media posts.
+CryptoBERT is a pre-trained NLP model to analyse the language and sentiments of cryptocurrency-related social media posts and messages. It is built by further training the [cardiffnlp's Twitter-roBERTa-base](https://huggingface.co/cardiffnlp/twitter-roberta-base) language model on the cryptocurrency domain, using a corpus of over 3.2M unique cryptocurrency-related social media posts.
 
 
 ## Training Corpus
 CryptoBERT was trained on 3.2M social media posts about various cryptocurrencies. Only non-duplicate posts of length above 4 words were considered. The following communities were used as sources for our corpora:
 
-(1) StockTwits - 1.875M posts about top 100 cryptos by trading volume. Posts were collected from 1st of November 2021 to 16th June 2022.
+
+(1) StockTwits - 1.875M posts about the top 100 cryptos by trading volume. Posts were collected from the 1st of November 2021 to the 16th of June 2022.
 
 (2) Telegram - 664K posts from top 5 telegram groups: [Binance](https://t.me/binanceexchange), [Bittrex](https://t.me/BittrexGlobalEnglish), [huobi global](https://t.me/huobiglobalofficial), [Kucoin](https://t.me/Kucoin_Exchange), [OKEx](https://t.me/OKExOfficial_English).
 Data from 16.11.2020 to 30.01.2021. Courtesy of [Anton](https://www.kaggle.com/datasets/aagghh/crypto-telegram-groups).
 
 (3) Reddit - 172K comments from various crypto investing threads, collected from May 2021 to May 2022
 
-(4) Twitter - 496K posts with hashtags XBT, Bitcoin or BTC. Collected for May 2018. Courtesy of [Paul](https://www.kaggle.com/datasets/paul92s/bitcoin-tweets-14m).
+(4) Twitter - 496K posts with hashtags XBT, Bitcoin or BTC. Collected for May 2018. Courtesy of [Paul](https://www.kaggle.com/datasets/paul92s/bitcoin-tweets-14m).
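
Note on the corpus rule quoted in the README text above ("Only non-duplicate posts of length above 4 words were considered"): the README does not say how this filter was implemented. The following is a minimal sketch of one way it could be done, not the authors' preprocessing code. The function name `filter_posts`, splitting words on whitespace, and treating two posts as duplicates only when they match exactly after trimming are all assumptions made for illustration.

```python
from typing import Iterable, List


def filter_posts(posts: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each post that has more than 4 words.

    Assumptions (not stated in the README): words are whitespace-separated
    tokens, and duplicates are exact string matches after trimming.
    """
    seen = set()
    kept = []
    for post in posts:
        text = post.strip()
        # "Length above 4 words": drop posts with 4 words or fewer.
        if len(text.split()) <= 4:
            continue
        # "Non-duplicate": skip posts already seen.
        if text in seen:
            continue
        seen.add(text)
        kept.append(text)
    return kept


if __name__ == "__main__":
    sample = [
        "btc to the moon",                # 4 words, dropped
        "btc is going to the moon",       # kept
        "btc is going to the moon",       # duplicate, dropped
        "  eth looks weak after the merge news ",  # kept after trimming
    ]
    for post in filter_posts(sample):
        print(post)
```

A real pipeline might also normalise case, URLs, or user mentions before comparing posts; the README gives no detail on this, so the sketch leaves the text untouched apart from trimming.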