---
license: apache-2.0
language:
- en
metrics:
- accuracy
pipeline_tag: text-classification
tags:
- Crypto
- Bitcoin
- Sentiment Analysis
- RoBERTa
- NLP
- Cryptocurrency
---

# CryptoBERTRefined

CryptoBERTRefined is a fine-tuned version of the [CryptoBERT model by ElKulako](https://huggingface.co/ElKulako/cryptobert).

# Classification Example

Input:

```python
# Install the dependency first. In a notebook: !pip -q install transformers
from transformers import TextClassificationPipeline, AutoModelForSequenceClassification, AutoTokenizer

# Load the tokenizer and the 3-class model from the same checkpoint.
model_name = "AfterRain007/cryptobertRefined"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Pad or truncate every input to 128 tokens.
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, max_length=128, truncation=True, padding='max_length')

post_3 = "Because Forex Markets have years of solidity and millions in budget, not to mention that they use their own datacenters. These lame cryptomarkets are all supported by some Amazon-cloud-style system. They delegate and delegate their security and in the end, get buttfucked..."
post_2 = "Russian crypto market worth $500B despite bad regulation, says exec https://t.co/MZFoZIr2cN #CryptoCurrencies #Bitcoin #Technical Analysis"
post_1 = "I really wouldn't be asking strangers such an important question. I'm sure you'd get well meaning answers but you probably need professional advice."

df_posts = [post_1, post_2, post_3]
preds = pipe(df_posts)
print(preds)
```

Output:

```python
[{'label': 'Neutral', 'score': 0.8427615165710449}, {'label': 'Bullish', 'score': 0.5444369912147522}, {'label': 'Bearish', 'score': 0.8388379812240601}]
```

# Training Corpus

A total of 3,803 texts were labelled manually to fine-tune the model. Only non-duplicate texts with at least 4 words after cleaning were kept. The training dataset was drawn from the following sources:

1. Bitcoin tweet dataset from [Kaggle Datasets](https://www.kaggle.com/datasets/kaushiksuresh147/bitcoin-tweets) (randomly sampled).
2. Labelled crypto sentiment dataset from [SurgeAI](https://www.surgehq.ai/datasets/crypto-sentiment-dataset).
3. Reddit r/Bitcoin threads with the topic "Daily Discussion" (randomly sampled).

Data augmentation was also performed to enrich the dataset: back-translation with the Google Translate API through 10 languages ('it', 'fr', 'sv', 'da', 'pt', 'id', 'pl', 'hr', 'bg', 'fi').

# Source Code

See [GitHub](https://github.com/AfterRain007/cryptobertRefined) for the source code used to fine-tune the CryptoBERT model into CryptoBERTRefined.

# Credit

Credit where credit is due. Thank you all!

1. Muhaza Liebenlito, M.Si and Prof. Dr. Nur Inayah, M.Si. as my academic advisors.
2. Risky Amalia Marhariyadi for helping label the dataset.
3. SurgeAI for the dataset.
4. Mikolaj Kulakowski and Flavius Frasincar for the original CryptoBERT model.
5. Kaushik Suresh for the bitcoin tweets.
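
For readers who want to reproduce the corpus preparation described under Training Corpus, here is a minimal sketch of the two steps mentioned there: filtering (non-duplicate, at least 4 words after cleaning) and back-translation through the ten listed languages. It is not the code from the repository. The `clean` rules (URL removal, whitespace collapsing), the case-insensitive duplicate check, and the `translate(text, source, target)` callable are all assumptions made for illustration; any real translation client would have to be wrapped to match that hypothetical signature.

```python
import re
from typing import Callable, Iterable, List

# The ten pivot languages listed in the Training Corpus section.
PIVOT_LANGS = ["it", "fr", "sv", "da", "pt", "id", "pl", "hr", "bg", "fi"]


def clean(text: str) -> str:
    """Hypothetical cleaning step: drop URLs and collapse whitespace."""
    text = re.sub(r"https?://\S+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def filter_corpus(texts: Iterable[str], min_words: int = 4) -> List[str]:
    """Keep cleaned texts with at least `min_words` words, dropping duplicates."""
    seen = set()
    kept = []
    for raw in texts:
        cleaned = clean(raw)
        key = cleaned.lower()  # assumption: duplicates are compared case-insensitively
        if len(cleaned.split()) < min_words or key in seen:
            continue
        seen.add(key)
        kept.append(cleaned)
    return kept


def back_translate(
    text: str,
    translate: Callable[[str, str, str], str],
    langs: Iterable[str] = PIVOT_LANGS,
) -> List[str]:
    """Translate en -> pivot -> en per language and collect new paraphrases."""
    variants: List[str] = []
    for lang in langs:
        pivot = translate(text, "en", lang)
        restored = translate(pivot, lang, "en")
        # Skip round trips that return the input or an already seen variant.
        if restored != text and restored not in variants:
            variants.append(restored)
    return variants
```

Passing the translator in as a callable keeps the sketch independent of any particular Google Translate client, and lets the loop be exercised offline with a stub function.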