--- license: apache-2.0 language: - en pipeline_tag: feature-extraction --- # Model Card for Model ID This model is an example on how to handle multi-target regression problem using llms. Model takes in tweet,stock ticker, month, last_price and volume for a stock (around the tweet was publish) and returns 1,2,3 and 7 day returns and 10 day annualized volatility. Model uses feature vectors output by the tweet text (mobile-bert output), numerical (last price and volume), and categorical(stock ticker and month) sub-components then are concatenated into a single feature vector which is fed into a final ouput layers. Used [google/mobilebert-uncased](https://huggingface.co/google/mobilebert-uncased) for text feature extraction (MobileBERT is a thin version of BERT_LARGE, while equipped with bottleneck structures and a carefully designed balance between self-attentions and feed-forward networks). This is again a very very light model (100mb), used following dataset from [Kaggle](www.kaggle.com) called [Tweet Sentiment's Impact on Stock Returns (by THE DEVASTATOR)](https://www.kaggle.com/datasets/thedevastator/tweet-sentiment-s-impact-on-stock-returns). **Disclaimer: This model should not be used for trading. Data source is not verified, assumption is that data is synthetically generated. This is just an example how to handle multi-target regression problem**. Contact us for more info: support@cloudsummary.com ## Model Details ### Model Description Model takes in tweet,stock ticker, month, last_price and volume for a stock (around the tweet was publish) and returns 1,2,3 and 7 day returns and 10 day annualized volatility. Model uses feature vectors output by the tweet text (mobile-bert output), numerical (last price and volume), and categorical(stock ticker and month) sub-components then are concatenated into a single feature vector which is fed into a final ouput layers. Model is trainded on 600k rows. - **Developed by:** cssupport (support@cloudsummary.com) - **Model type:** Language model - **Language(s) (NLP):** English - **License:** Apache 2.0 - **Finetuned from model :** [google/mobilebert-uncased](https://huggingface.co/google/mobilebert-uncased) ### Model Sources Please refer [google/mobilebert-uncased](https://huggingface.co/google/mobilebert-uncased) for Model Sources. ## How to Get Started with the Model Use the code below to get started with the model. **hugging face library is currently not working (will fix it an update the model card). You will need to download the model and then load it manually. Use the code below** ```python from transformers import AutoTokenizer from sklearn.preprocessing import LabelEncoder import torch from sklearn.preprocessing import LabelEncoder import joblib # Initialize the BERT tokenizer tokenizer = AutoTokenizer.from_pretrained('google/mobilebert-uncased') # Load the model model = torch.load('pytorch_model.pt') #load the stock enoder #list of ticker supported - ['21CF', 'ASOS', 'AT&T', 'Adobe', 'Allianz', 'Amazon', 'American Express', 'Apple', 'AstraZeneca', 'Audi', 'Aviva', 'BASF', 'BMW', 'BP', 'Bank of America', 'Bayer', 'BlackRock', 'Boeing', 'Burberry', 'CBS', 'CVS Health', 'Cardinal Health', 'Carrefour', 'Chevron', 'Cisco', 'Citigroup', 'CocaCola', 'Colgate', 'Comcast', 'Costco', 'Danone', 'Deutsche Bank', 'Disney', 'Equinor', 'Expedia', 'Exxon', 'Facebook', 'FedEx', 'Ford', 'GSK', 'General Electric', 'Gillette', 'Goldman Sachs', 'Google', 'Groupon', 'H&M', 'HP', 'HSBC', 'Heineken', 'Home Depot', 'Honda', 'Hyundai', 'IBM', 'Intel', 'JPMorgan', 'John Deere', "Kellogg's", 'Kroger', "L'Oreal", 'Mastercard', "McDonald's", 'Microsoft', 'Morgan Stanley', 'Nestle', 'Netflix', 'Next', 'Nike', 'Nissan', 'Oracle', 'P&G', 'PayPal', 'Pepsi', 'Pfizer', 'Reuters', 'Ryanair', 'SAP', 'Samsung', 'Santander', 'Shell', 'Siemens', 'Sony', 'Starbucks', 'TMobile', 'Tesco', 'Thales', 'Toyota', 'TripAdvisor', 'UPS', 'Verizon', 'Viacom', 'Visa', 'Vodafone', 'Volkswagen', 'Walmart', 'Wells Fargo', 'Yahoo', 'adidas', 'bookingcom', 'eBay', 'easyJet', 'salesforce.com'] stock_encoder = joblib.load("stock_encoder.pkl") device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") def preprocess_text(raw_text): tweet_data = tokenizer.batch_encode_plus( [raw_text], padding=True, return_attention_mask=True, truncation=True, max_length=512 ) return tweet_data['input_ids'][0], tweet_data['attention_mask'][0] def make_prediction(tweet, stock, month, last_price, volume): # Preprocess the Data input_ids, attention_mask = preprocess_text(tweet) # LAST_PRICE, PX_VOLUME numerical_data = np.array([last_price, volume]) #STOCK and MONTH categorical_data = np.array([stock_encoder.transform([stock])[0], month]) # Convert them into PyTorch tensors input_ids = torch.tensor([input_ids]).to(device) attention_mask = torch.tensor([attention_mask]).to(device) numerical_data = torch.tensor([numerical_data], dtype=torch.float32).to(device) categorical_data = torch.tensor([categorical_data], dtype=torch.float32).to(device) # Run the model with torch.no_grad(): output_one_day, output_two_day, output_three_day, output_seven_day, output_vol_10d = model( input_ids=input_ids, attention_mask=attention_mask, numerical_data=numerical_data, categorical_data=categorical_data ) # Convert to readable format (in this example, convert to percentages) output_one_day = (output_one_day.item() * 100) # Convert tensor to Python float and then to percentage output_two_day = (output_two_day.item() * 100) output_three_day = (output_three_day.item() * 100) output_seven_day = (output_seven_day.item() * 100) return output_one_day, output_two_day, output_three_day, output_seven_day, output_vol_10d tweet = "Check out BURGUNDY REED AND BARTON 13 PC SET OF SLIVERWARE FORKS SPOONS KNIFES GRAVY SPOON via @eBay" stock = "eBay" month = 9 last_price = 38.46 volume = 9964979.0 output_one_day, output_two_day, output_three_day, output_seven_day, output_vol_10d = make_prediction( tweet, stock, month, last_price, volume) # Print outputs print(f"1 Day Return: {output_one_day}%") print(f"2 Day Return: {output_two_day}%") print(f"3 Day Return: {output_three_day}%") print(f"7 Day Return: {output_seven_day}%") print(f"10 Day Volatility: {output_vol_10d}") ``` ## Uses **Disclaimer: This model should not be used for trading. Data source is not verified, assumption is that data is synthetically generated. This is just an example how to handle multi-target regression problem**. ### Direct Use Could used in application where natural language is to be converted into SQL queries. **Disclaimer: This model should not be used for trading. Data source is not verified, assumption is that data is synthetically generated. This is just an example how to handle multi-target regression problem**. ### Out-of-Scope Use **Disclaimer: This model should not be used for trading. Data source is not verified, assumption is that data is synthetically generated. This is just an example how to handle multi-target regression problem**. [More Information Needed] ## Bias, Risks, and Limitations **Disclaimer: This model should not be used for trading. Data source is not verified, assumption is that data is synthetically generated. This is just an example how to handle multi-target regression problem**. [More Information Needed] ### Recommendations Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations. **Disclaimer: This model should not be used for trading. Data source is not verified, assumption is that data is synthetically generated. This is just an example how to handle multi-target regression problem**. ## Technical Specifications ### Model Architecture and Objective [google/mobilebert-uncased](https://huggingface.co/google/mobilebert-uncased) ### Compute Infrastructure #### Hardware one P6000 GPU #### Software Pytorch and HuggingFace