	## Predictive Power of Sentiment Analysis from Headlines for Crude Oil Prices
	### Understanding and Exploiting Deep Learning-based Sentiment Analysis from News Headlines for Predicting Price Movements of WTI Crude Oil

	CrudeBERT is a language model that emerged from my master's thesis and introduces a novel sentiment analysis method.
	It was developed by fine-tuning [FinBERT: Financial Sentiment Analysis with Pre-trained Language Models](https://arxiv.org/pdf/1908.10063.pdf).

	In essence, CrudeBERT is a pre-trained NLP model that analyzes the sentiment of news headlines relevant to the value of crude oil.
	Here is an award-winning paper derived from this thesis which describes it in more detail: [CrudeBERT: Applying Economic Theory towards fine-tuning Transformer-based Sentiment Analysis Models to the Crude Oil Market](https://arxiv.org/abs/2305.06140.pdf)

	![CrudeBERT comparison_white_2](https://user-images.githubusercontent.com/42164041/135273552-4a9c4457-70e4-48d0-ac97-169daefab79e.png)

	Performing sentiment analysis on the news regarding a specific asset requires domain adaptation.
	Domain adaptation requires training data consisting of text examples and their associated sentiment polarity.
	The experiments show that pre-trained deep learning-based sentiment analysis can be further fine-tuned, and the conclusions of these experiments are as follows:

	* Deep learning-based sentiment analysis models from the general financial domain, such as FinBERT, have little to no significance for the price development of crude oil. The reason is a lack of domain adaptation of the sentiment: the polarity of sentiment cannot be generalized and depends heavily on the properties of its target.

	* The properties of crude oil prices are, according to the literature, determined by changes in supply and demand.
	News can convey information about the direction of these changes, can be broadly identified through query searches, and can serve as a foundation for creating a training dataset for domain adaptation. For this purpose, news headlines tend to be rich enough in content to provide insights into supply and demand changes, even when the number of headlines is significantly reduced to those from more reputable sources.

	* Domain adaptation can be achieved to some extent by analyzing the properties of the target through a literature review and creating a corresponding training dataset to fine-tune the model. For example, considering supply and demand changes regarding crude oil seems to be a suitable component for a domain adaptation.
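
The supply-and-demand reasoning in the conclusions above can be sketched as a simple labeling rule. This is only an illustration under textbook assumptions (less supply or more demand puts upward pressure on prices), not the procedure used to build the actual training dataset, which is described in the thesis and paper. The function name `expected_price_polarity` and the convention that "positive" means upward price pressure are assumptions made for this example.

```python
def expected_price_polarity(factor: str, direction: str) -> str:
    """Map a reported supply/demand change to a price-oriented polarity.

    Hypothetical helper: 'positive' here means upward pressure on the
    crude oil price, 'negative' means downward pressure.
    """
    if direction == "unchanged":
        return "neutral"
    if direction not in ("increase", "decrease"):
        raise ValueError(f"unknown direction: {direction!r}")
    if factor == "supply":
        # More supply tends to lower prices, less supply tends to raise them.
        return "negative" if direction == "increase" else "positive"
    if factor == "demand":
        # More demand tends to raise prices, less demand tends to lower them.
        return "positive" if direction == "increase" else "negative"
    raise ValueError(f"unknown factor: {factor!r}")


# A refinery accident reduces supply; rising imports signal more demand.
print(expected_price_polarity("supply", "decrease"))
print(expected_price_polarity("demand", "increase"))
print(expected_price_polarity("supply", "unchanged"))
```

The same headline would often be scored differently by a general financial model, which is the point of the domain adaptation: polarity is defined relative to the target asset.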

	To advance sentiment analysis applications in the domain of crude oil, this paper presents CrudeBERT.
	In general, sentiment analysis of crude oil headlines through CrudeBERT could be a viable source of insight into the price behavior of WTI crude oil.
	However, further research is required to determine whether CrudeBERT is beneficial for predicting oil prices.
	For this reason, the code and the thesis are publicly available on [GitHub](https://github.com/Captain-1337/Master-Thesis).


	Here is a quick guide on how you can use CrudeBERT:

	# Step one:
	Download the two files (crude_bert_config.json and crude_bert_model.bin)
	from https://huggingface.co/Captain-1337/CrudeBERT/tree/main
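
If you prefer to fetch the two files from a script instead of the browser, the following is a minimal sketch using only the Python standard library. It assumes the files are reachable through the Hub's usual `resolve/main/<filename>` download URLs; the helper names `build_url` and `download_crudebert_files` are made up for this example.

```python
from pathlib import Path
from urllib.request import urlretrieve

REPO_ID = "Captain-1337/CrudeBERT"
FILES = ("crude_bert_config.json", "crude_bert_model.bin")


def build_url(filename: str, repo_id: str = REPO_ID, revision: str = "main") -> str:
    """Build the direct download URL for one file (assumed URL layout)."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def download_crudebert_files(target_dir: str = ".") -> list:
    """Download both files into target_dir, skipping files already present."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for name in FILES:
        destination = target / name
        if not destination.exists():
            urlretrieve(build_url(name), str(destination))
        paths.append(destination)
    return paths


# Usage (needs network access; the model file is large):
# download_crudebert_files(".")
```

Downloading into the notebook's folder keeps the relative paths used in step two valid.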

	# Step two:
	Create a Jupyter notebook in the same folder where the files are stored and include the code below:

	## Code:
	import torch
	from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer
	import numpy as np
	import pandas as pd

	### List of example headlines
	headlines = [
	"Major Explosion, Fire at Oil Refinery in Southeast Philadelphia",
	"PETROLEOS confirms Gulf of Mexico oil platform accident",
	"CASUALTIES FEARED AT OIL ACCIDENT NEAR IRANS BORDER",
	"EIA Chief expects Global Oil Demand Growth 1 M B/D to 2011",
	"Turkey Jan-Oct Crude Imports +98.5% To 57.9M MT",
	"China’s crude oil imports up 78.30% in February 2019",
	"Russia Energy Agency: Sees Oil Output put Flat In 2005",
	"Malaysia Oil Production Steady This Year At 700,000 B/D",
	"ExxonMobil:Nigerian Oil Output Unaffected By Union Threat",
	"Yukos July Oil Output Flat On Mo, 1.73M B/D - Prime-Tass",
	"2nd UPDATE: Mexico’s Oil Output Unaffected By Hurricane",
	"UPDATE: Ecuador July Oil Exports Flat On Mo At 337,000 B/D",
	"China February Crude Imports -16.0% On Year",
	"Turkey May Crude Imports down 11.0% On Year",
	"Japan June Crude Oil Imports decrease 10.9% On Yr",
	"Iran’s Feb Oil Exports +20.9% On Mo at 1.56M B/D - Official",
	"Apache announces large petroleum discovery in Philadelphia",
	"Turkey finds oil near Syria, Iraq border"
	]

	config_path = './crude_bert_config.json'
	model_path = './crude_bert_model.bin'

	#### Load the configuration
	config = AutoConfig.from_pretrained(config_path)

	#### Create the model from the configuration
	model = AutoModelForSequenceClassification.from_config(config)

	#### Load the model's state dictionary (mapped to CPU so it also loads on machines without a GPU)
	state_dict = torch.load(model_path, map_location="cpu")

	#### If "bert.embeddings.position_ids" is reported as an unexpected key, drop it from the state dictionary
	state_dict.pop("bert.embeddings.position_ids", None)

	#### Load the adjusted state dictionary into the model
	model.load_state_dict(state_dict, strict=False) # strict=False skips mismatched keys instead of raising an error

	#### Load the tokenizer
	tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

	### Define the prediction function
	def predict_to_df(texts, model, tokenizer):
	    model.eval()  # disable dropout for inference
	    class_names = ['positive', 'negative', 'neutral']  # label order of the model's output logits
	    data = []
	    for text in texts:
	        inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=64)
	        with torch.no_grad():
	            outputs = model(**inputs)
	        logits = outputs.logits
	        softmax_scores = torch.nn.functional.softmax(logits, dim=-1)
	        pred_label_id = torch.argmax(softmax_scores, dim=-1).item()
	        predicted_label = class_names[pred_label_id]
	        data.append([text, predicted_label])
	    df = pd.DataFrame(data, columns=["Headline", "Classification"])
	    return df


	### Create DataFrame
	example_headlines = pd.DataFrame(headlines, columns=["Headline"])

	### Apply classification
	result_df = predict_to_df(example_headlines['Headline'].tolist(), model, tokenizer)
	result_df
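
To make the last steps of `predict_to_df` more concrete, here is a small standard-library sketch of how three logits turn into a label and a confidence value. The logits are made-up numbers for illustration, not real model output, and `label_with_confidence` is a hypothetical helper; the class order is the one used in the function above.

```python
import math

CLASS_NAMES = ["positive", "negative", "neutral"]  # same order as in predict_to_df


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits):
    """Return the most probable class name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASS_NAMES[best], probs[best]


# Illustrative logits only; a real run would take them from outputs.logits.
label, confidence = label_with_confidence([2.0, 0.5, -1.0])
print(label, round(confidence, 3))
```

Keeping the probability next to the label can help when you later want to filter out low-confidence classifications.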

	# Step three:
	Execute the cells of the Jupyter Notebook.
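
Because the state dictionary is loaded with `strict=False`, mismatched keys do not raise an error. It can be worth looking at the value returned by `load_state_dict`, which lists missing and unexpected keys. The sketch below shows this on a tiny stand-in module (`TinyClassifier` is hypothetical and unrelated to the CrudeBERT architecture); the same check should apply to the model loaded in step two.

```python
import torch
from torch import nn


class TinyClassifier(nn.Module):
    """Stand-in model used only to demonstrate key checking."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)
        self.head = nn.Linear(4, 3)

    def forward(self, x):
        return self.head(torch.relu(self.encoder(x)))


model = TinyClassifier()

# Simulate a checkpoint with one extra key and one missing key.
state_dict = dict(model.state_dict())
state_dict["encoder.position_ids"] = torch.arange(4)
del state_dict["head.bias"]

result = model.load_state_dict(state_dict, strict=False)
print("missing keys:", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)
```

If the list of missing keys contains anything other than non-essential buffers, some weights were left at their random initial values and the classifications are unlikely to be meaningful.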

	If you face any difficulties or have other questions, contact me here or on LinkedIn.

	FYI: I took the example headlines from one of our recent publications:
	![image.png](https://cdn-uploads.huggingface.co/production/uploads/6115fd952999876a45605b05/rFMJjRIxsNqPqinqiq5QY.png)

	So, your classification output should reflect this as well.