	---
	language:
	- en

	---
	# Model Card for ESG-BERT
	Domain Specific BERT Model for Text Mining in Sustainable Investing



	# Model Details

	## Model Description



	- Developed by: [Charan Pothireddi](https://www.linkedin.com/in/sree-charan-pothireddi-6a0a3587/) and [Parabole.ai](https://www.linkedin.com/in/sree-charan-pothireddi-6a0a3587/)
	- Shared by [Optional]: HuggingFace
	- Model type: Language model
	- Language(s) (NLP): en
	- License: More information needed
	- Related Models:
	  - Parent Model: BERT
	- Resources for more information:
	  - [GitHub Repo](https://github.com/mukut03/ESG-BERT)
	  - [Blog Post](https://towardsdatascience.com/nlp-meets-sustainable-investing-d0542b3c264b?source=friends_link&sk=1f7e6641c3378aaff319a81decf387bf)

	# Uses


	## Direct Use

	Text Mining in Sustainable Investing

	## Downstream Use [Optional]

	The applications of ESG-BERT extend well beyond text classification: it can be fine-tuned for other downstream NLP tasks in the domain of Sustainable Investing.

	## Out-of-Scope Use

	The model should not be used to intentionally create hostile or alienating environments for people.

	# Bias, Risks, and Limitations


	Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.


	## Recommendations


	Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.


	# Training Details

	## Training Data

	More information needed

	## Training Procedure

	<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

	### Preprocessing

	More information needed

	### Speeds, Sizes, Times

	More information needed

	# Evaluation



	## Testing Data, Factors & Metrics

	### Testing Data

	The fine-tuned model for text classification is also available [here](https://drive.google.com/drive/folders/1Qz4HP3xkjLfJ6DGCFNeJ7GmcPq65_HVe?usp=sharing). It can be used directly to make predictions in just a few steps. First, download the fine-tuned `pytorch_model.bin`, `config.json`, and `vocab.txt` files and place them in a `bert_model` directory, which is the path the archiving command in the getting-started section expects.

	### Factors

	More information needed

	### Metrics

	More information needed

	## Results

	ESG-BERT was further trained on unstructured text data, reaching accuracies of 100% and 98% on the Next Sentence Prediction and Masked Language Modelling tasks, respectively. Fine-tuning ESG-BERT for text classification yielded an F1 score of 0.90. For comparison, the general BERT (BERT-base) model scored 0.79 after fine-tuning, and the scikit-learn approach scored 0.67.
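
For readers less familiar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it per class and averages across classes (macro averaging). The source does not say which averaging was used for the scores above, so treat the averaging choice, the function names, and the toy labels as assumptions made purely for illustration.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], cls: int) -> float:
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy labels (hypothetical, NOT taken from the ESG-BERT evaluation data).
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]
score = macro_f1(y_true, y_pred)
```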

	# Model Examination

	More information needed

	# Environmental Impact


	Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

	- Hardware Type: More information needed
	- Hours used: More information needed
	- Cloud Provider: More information needed
	- Compute Region: More information needed
	- Carbon Emitted: More information needed
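
Once the fields above are known, a rough estimate follows from energy used multiplied by the carbon intensity of the grid. The sketch below shows only that arithmetic. The function name is hypothetical and the input values are placeholders, not measurements for ESG-BERT; the calculator linked above may apply additional factors.

```python
def estimate_co2_kg(power_watts: float, hours: float, kg_co2_per_kwh: float) -> float:
    """Energy used (kWh) multiplied by the grid's carbon intensity (kg CO2eq/kWh)."""
    energy_kwh = power_watts / 1000.0 * hours
    return energy_kwh * kg_co2_per_kwh


# Placeholder inputs only: a 250 W accelerator running for 10 hours
# on a grid emitting 0.4 kg CO2eq per kWh.
example = estimate_co2_kg(power_watts=250, hours=10, kg_co2_per_kwh=0.4)
```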

	# Technical Specifications [optional]

	## Model Architecture and Objective

	More information needed

	## Compute Infrastructure

	More information needed

	### Hardware

	More information needed

	### Software

	JDK 11 is required to serve the model with TorchServe.
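
One way to check the Java requirement before starting the server is to read the banner printed by `java -version`. The helper names below are hypothetical, and the parsing assumes the usual `version "..."` banner format; it is a sketch, not part of the original project.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_java_major(banner: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output.

    Handles the legacy scheme ("1.8.0_281" means Java 8) and the current
    scheme ("11.0.2" means Java 11). Returns None if no version is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if match is None:
        return None
    major = int(match.group(1))
    if major == 1 and match.group(2) is not None:
        # Legacy numbering: the real major version is the second component.
        return int(match.group(2))
    return major


def installed_java_major() -> Optional[int]:
    """Return the major version of the `java` found on PATH, or None if absent."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` writes its banner to stderr, not stdout.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr)
```

A caller could compare the result of `installed_java_major()` against 11 and print a warning when it is `None` or lower.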

	# Citation

	<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

	BibTeX:

	More information needed

	APA:

	More information needed

	# Glossary [optional]

	<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

	More information needed

	# More Information [optional]

	More information needed

	# Model Card Authors [optional]
	[Charan Pothireddi](https://www.linkedin.com/in/sree-charan-pothireddi-6a0a3587/) and [Parabole.ai](https://www.linkedin.com/in/sree-charan-pothireddi-6a0a3587/), in collaboration with Ezi Ozoani and the Hugging Face team


	# Model Card Contact

	More information needed

	# How to Get Started with the Model

	Use the code below to get started with the model.

	<details>
	<summary> Click to expand </summary>

	```
	pip install torchserve torch-model-archiver
	pip install torchvision
	pip install transformers
	```

	Next up, we'll set up the handler script. It is a basic handler for text classification that can be improved upon. Save this script as "handler.py" in your directory. [1]

	```
	from abc import ABC
	import json
	import logging
	import os

	import torch
	from transformers import AutoModelForSequenceClassification, AutoTokenizer
	from ts.torch_handler.base_handler import BaseHandler

	logger = logging.getLogger(__name__)


	class TransformersClassifierHandler(BaseHandler, ABC):
	    """
	    Transformers text classifier handler class. This handler takes a text (string)
	    as input and returns the classification text based on the serialized transformers checkpoint.
	    """

	    def __init__(self):
	        super(TransformersClassifierHandler, self).__init__()
	        self.initialized = False
	        # Explicit default so inference() can test it even if no mapping file is found.
	        self.mapping = None

	    def initialize(self, ctx):
	        self.manifest = ctx.manifest
	        properties = ctx.system_properties
	        model_dir = properties.get("model_dir")
	        self.device = torch.device("cuda:" + str(properties.get("gpu_id")) if torch.cuda.is_available() else "cpu")

	        # Read model serialize/pt file
	        self.model = AutoModelForSequenceClassification.from_pretrained(model_dir)
	        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
	        self.model.to(self.device)
	        self.model.eval()
	        logger.debug('Transformer model from path {0} loaded successfully'.format(model_dir))

	        # Read the mapping file, index to object name
	        mapping_file_path = os.path.join(model_dir, "index_to_name.json")
	        if os.path.isfile(mapping_file_path):
	            with open(mapping_file_path) as f:
	                self.mapping = json.load(f)
	        else:
	            logger.warning('Missing the index_to_name.json file. Inference output will not include class name.')

	        self.initialized = True

	    def preprocess(self, data):
	        """ Very basic preprocessing code - only tokenizes.
	            Extend with your own preprocessing steps as needed.
	        """
	        text = data[0].get("data")
	        if text is None:
	            text = data[0].get("body")
	        # The request body normally arrives as bytes; accept an already-decoded str too.
	        sentences = text.decode('utf-8') if isinstance(text, (bytes, bytearray)) else text
	        logger.info("Received text: '%s'", sentences)

	        inputs = self.tokenizer.encode_plus(
	            sentences,
	            add_special_tokens=True,
	            return_tensors="pt"
	        )
	        return inputs

	    def inference(self, inputs):
	        """
	        Predict the class of a text using a trained transformer model.
	        """
	        # NOTE: This makes the assumption that your model expects text to be tokenized
	        # with "input_ids" and "token_type_ids" - which is true for some popular transformer models, e.g. bert.
	        # If your transformer model expects different tokenization, adapt this code to suit
	        # its expected input format.
	        with torch.no_grad():  # gradients are not needed at inference time
	            prediction = self.model(
	                inputs['input_ids'].to(self.device),
	                token_type_ids=inputs['token_type_ids'].to(self.device)
	            )[0].argmax().item()
	        logger.info("Model predicted: '%s'", prediction)

	        if self.mapping:
	            prediction = self.mapping[str(prediction)]

	        return [prediction]

	    def postprocess(self, inference_output):
	        # TODO: Add any needed post-processing of the model predictions here
	        return inference_output


	_service = TransformersClassifierHandler()


	def handle(data, context):
	    try:
	        if not _service.initialized:
	            _service.initialize(context)

	        if data is None:
	            return None

	        data = _service.preprocess(data)
	        data = _service.inference(data)
	        data = _service.postprocess(data)

	        return data
	    except Exception as e:
	        raise e
	```
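
The handler above returns only the single most likely class. One possible improvement is to return the top few classes with their probabilities. The sketch below shows that step in plain Python, independent of TorchServe. `top_k_labels` is a hypothetical helper, and it assumes the raw logits have already been converted to a list of floats, for example by calling `.tolist()` on the first row of the model output.

```python
import math
from typing import Dict, List, Sequence, Tuple


def top_k_labels(logits: Sequence[float], mapping: Dict[str, str],
                 k: int = 3) -> List[Tuple[str, float]]:
    """Softmax the logits and return the k most probable (label, probability) pairs."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Fall back to the bare index when no name is known for a class.
    return [(mapping.get(str(i), str(i)), probs[i]) for i in ranked]


# Toy example with three classes and made-up logits.
result = top_k_labels([0.0, 2.0, 1.0], {"0": "A", "1": "B", "2": "C"}, k=2)
```

The `mapping` argument uses string indices as keys, matching how the handler looks up `self.mapping[str(prediction)]`.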

	TorchServe uses a format called MAR (Model Archive). We can convert our PyTorch model to a .mar file using this command:

	```
	torch-model-archiver --model-name "bert" --version 1.0 --serialized-file ./bert_model/pytorch_model.bin --extra-files "./bert_model/config.json,./bert_model/vocab.txt" --handler "./handler.py"
	```
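
The handler looks for an optional `index_to_name.json` in the model directory, so it may be useful to bundle that file through `--extra-files` as well. The sketch below only assembles the argument list, using the same flags as the command above. The helper name is hypothetical, and the presence of `index_to_name.json` inside `bert_model` is an assumption, since the download does not necessarily include it.

```python
from pathlib import Path
from typing import List, Sequence


def archiver_args(model_dir: str, handler: str,
                  extra_names: Sequence[str] = ("config.json", "vocab.txt"),
                  model_name: str = "bert", version: str = "1.0") -> List[str]:
    """Build the torch-model-archiver command as an argument list."""
    base = Path(model_dir)
    extra_files = ",".join(str(base / name) for name in extra_names)
    return [
        "torch-model-archiver",
        "--model-name", model_name,
        "--version", version,
        "--serialized-file", str(base / "pytorch_model.bin"),
        "--extra-files", extra_files,
        "--handler", handler,
    ]


# Hypothetical usage: also bundle the label mapping read by the handler.
args = archiver_args(
    "bert_model", "handler.py",
    extra_names=("config.json", "vocab.txt", "index_to_name.json"),
)
# Passing `args` to subprocess.run(args, check=True) would build bert.mar.
```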

	Move the .mar file into a new directory:

	```
	mkdir model_store && mv bert.mar model_store
	```

	Finally, we can start TorchServe using the command:

	```
	torchserve --start --model-store model_store --models bert=bert.mar
	```
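
Before sending predictions, it can help to confirm that the server is up. TorchServe's inference API exposes a `/ping` health endpoint. The sketch below queries it with the standard library; the function names are hypothetical, and the base URL assumes the default inference port on the local machine.

```python
import json
import urllib.error
import urllib.request


def is_healthy(body: str) -> bool:
    """True if a /ping response body reports the status 'Healthy'."""
    try:
        return json.loads(body).get("status") == "Healthy"
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object.
        return False


def ping(base_url: str = "http://127.0.0.1:8080", timeout: float = 5.0) -> bool:
    """Query the /ping endpoint; returns False if the server is unreachable."""
    try:
        with urllib.request.urlopen(base_url + "/ping", timeout=timeout) as resp:
            return is_healthy(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        return False
```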

	We can now query the model from another terminal window using the Inference API. We pass a text file containing text that the model will try to classify.

	```
	curl -X POST http://127.0.0.1:8080/predictions/bert -T predict.txt
	```
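
The same request can be sent from Python with the standard library. This is a sketch with hypothetical function names, not part of the original project. It sets `Content-Type: application/octet-stream` on the assumption that the server then passes the body to the handler as raw bytes, which is what the handler's `text.decode('utf-8')` call expects.

```python
import urllib.request

# Same endpoint as the curl call above.
INFERENCE_URL = "http://127.0.0.1:8080/predictions/bert"


def build_request(text: str, url: str = INFERENCE_URL) -> urllib.request.Request:
    """Create a POST request whose body is the UTF-8 encoded text."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )


def classify(text: str, url: str = INFERENCE_URL, timeout: float = 10.0) -> str:
    """Send the text to a running TorchServe instance and return the raw reply."""
    with urllib.request.urlopen(build_request(text, url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

With the server running, `classify(open("predict.txt", encoding="utf-8").read())` would send the contents of `predict.txt`.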

	The request returns a label number that maps to a textual label. The mapping is stored in the `label_dict.txt` dictionary file:

	```
	__label__Business_Ethics : 0
	__label__Data_Security : 1
	__label__Access_And_Affordability : 2
	__label__Business_Model_Resilience : 3
	__label__Competitive_Behavior : 4
	__label__Critical_Incident_Risk_Management : 5
	__label__Customer_Welfare : 6
	__label__Director_Removal : 7
	__label__Employee_Engagement_Inclusion_And_Diversity : 8
	__label__Employee_Health_And_Safety : 9
	__label__Human_Rights_And_Community_Relations : 10
	__label__Labor_Practices : 11
	__label__Management_Of_Legal_And_Regulatory_Framework : 12
	__label__Physical_Impacts_Of_Climate_Change : 13
	__label__Product_Quality_And_Safety : 14
	__label__Product_Design_And_Lifecycle_Management : 15
	__label__Selling_Practices_And_Product_Labeling : 16
	__label__Supply_Chain_Management : 17
	__label__Systemic_Risk_Management : 18
	__label__Waste_And_Hazardous_Materials_Management : 19
	__label__Water_And_Wastewater_Management : 20
	__label__Air_Quality : 21
	__label__Customer_Privacy : 22
	__label__Ecological_Impacts : 23
	__label__Energy_Management : 24
	__label__GHG_Emissions : 25
	```
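
The handler can return the textual label directly if it finds an `index_to_name.json` file whose keys are the label numbers as strings. The sketch below converts lines in the `NAME : INDEX` format shown above into that shape. The function name is hypothetical, and stripping the `__label__` prefix and replacing underscores with spaces are cosmetic choices, not something the original project prescribes.

```python
import json
from typing import Dict, Iterable


def parse_label_dict(lines: Iterable[str], prefix: str = "__label__") -> Dict[str, str]:
    """Turn 'NAME : INDEX' lines into an {index: readable name} mapping."""
    mapping: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last colon so the index is always the final field.
        name, _, index = line.rpartition(":")
        name = name.strip()
        if name.startswith(prefix):
            name = name[len(prefix):]
        mapping[str(int(index))] = name.replace("_", " ")
    return mapping


# Two sample lines in the format of label_dict.txt.
sample = ["__label__Business_Ethics : 0", "", "__label__GHG_Emissions : 25"]
mapping = parse_label_dict(sample)
print(json.dumps(mapping, indent=2))
```

To use the result, read `label_dict.txt` line by line, pass the lines to `parse_label_dict`, and write the mapping with `json.dump` to `index_to_name.json` next to the model files before building the archive.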

	</details>