	---
	language:
	- en
	widget:
	- text: "In fiscal year 2019, we reduced our comprehensive carbon footprint for the fourth consecutive year—down 35 percent compared to 2015, when Apple’s carbon emissions peaked, even as net revenue increased by 11 percent over that same period. In the past year, we avoided over 10 million metric tons from our emissions reduction initiatives—like our Supplier Clean Energy Program, which lowered our footprint by 4.4 million metric tons. "
	example_title: "Reduced carbon footprint"
	- text: "We believe it is essential to establish validated conflict-free sources of 3TG within the Democratic Republic of the Congo (the “DRC”) and adjoining countries (together, with the DRC, the “Covered Countries”), so that these minerals can be procured in a way that contributes to economic growth and development in the region. To aid in this effort, we have established a conflict minerals policy and an internal team to implement the policy."
	example_title: "Conflict minerals policy"
	---
	# Model Card for ESG-BERT
	Domain Specific BERT Model for Text Mining in Sustainable Investing



	# Model Details

	## Model Description



	- Developed by: [Mukut Mukherjee](https://www.linkedin.com/in/mukutm/), [Charan Pothireddi](https://www.linkedin.com/in/sree-charan-pothireddi-6a0a3587/) and [Parabole.ai](https://www.linkedin.com/in/sree-charan-pothireddi-6a0a3587/)
	- Shared by [Optional]: HuggingFace
	- Model type: Language model
	- Language(s) (NLP): en
	- License: More information needed
	- Related Models:
	  - Parent Model: BERT
	- Resources for more information:
	  - [GitHub Repo](https://github.com/mukut03/ESG-BERT)
	  - [Blog Post](https://towardsdatascience.com/nlp-meets-sustainable-investing-d0542b3c264b?source=friends_link&sk=1f7e6641c3378aaff319a81decf387bf)

	# Uses


	## Direct Use

	Text Mining in Sustainable Investing

	## Downstream Use [Optional]

	The applications of ESG-BERT can be expanded way beyond just text classification. It can be fine-tuned to perform various other downstream NLP tasks in the domain of Sustainable Investing.

	## Out-of-Scope Use

	The model should not be used to intentionally create hostile or alienating environments for people.

	# Bias, Risks, and Limitations


	Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.


	## Recommendations


	Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.


	# Training Details

	## Training Data

	More information needed

	## Training Procedure

	<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

	### Preprocessing

	More information needed

	### Speeds, Sizes, Times

	More information needed

	# Evaluation



	## Testing Data, Factors & Metrics

	### Testing Data

	The fine-tuned model for text classification is also available [here](https://drive.google.com/drive/folders/1Qz4HP3xkjLfJ6DGCFNeJ7GmcPq65_HVe?usp=sharing). It can be used directly to make predictions in just a few steps. First, download the fine-tuned pytorch_model.bin, config.json, and vocab.txt files.
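As a quick local check before serving anything, the downloaded files can also be loaded directly with the transformers library. The following is a minimal sketch, not the authors' code: it assumes the three files sit in a directory named ./bert_model (the path used by the archiver command later in this card), that the 512-token limit of BERT-base applies, and the function name `predict_label_index` is hypothetical.

```python
def predict_label_index(text, model_dir="./bert_model"):
    """Return the index of the highest-scoring class for `text`."""
    # Imported lazily so this module can be loaded without the heavy dependencies.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Both calls read config.json, vocab.txt and pytorch_model.bin from model_dir.
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()

    # Assumption: inputs longer than 512 tokens are truncated.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1).item())
```

The returned integer is a class index; the label listing at the end of this card shows which textual label each index stands for.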

	### Factors

	More information needed

	### Metrics

	More information needed

	## Results

	ESG-BERT was further trained on unstructured text data, reaching accuracies of 100% on the Next Sentence Prediction task and 98% on the Masked Language Modelling task. Fine-tuning ESG-BERT for text classification yielded an F1 score of 0.90. For comparison, the general BERT (BERT-base) model scored 0.79 after fine-tuning, and the scikit-learn approach scored 0.67.
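For readers unfamiliar with the metric, the sketch below shows how an F1 score can be computed for a multi-class classifier. The card does not state whether the reported figures are micro-, macro-, or weighted-averaged, so macro averaging is used here purely as an assumption for illustration; the function names are hypothetical.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Compute F1 = 2*TP / (2*TP + FP + FN) for every label that occurs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (0.0 for empty input)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores) if scores else 0.0
```

With macro averaging every class counts equally regardless of how many examples it has, which matters for a label set as uneven as ESG topics are likely to be.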

	# Model Examination

	More information needed

	# Environmental Impact


	Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

	- Hardware Type: More information needed
	- Hours used: More information needed
	- Cloud Provider: More information needed
	- Compute Region: More information needed
	- Carbon Emitted: More information needed

	# Technical Specifications [optional]

	## Model Architecture and Objective

	More information needed

	## Compute Infrastructure

	More information needed

	### Hardware

	More information needed

	### Software

	JDK 11 is needed to serve the model

	# Citation

	<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

	BibTeX:

	More information needed

	APA:

	More information needed

	# Glossary [optional]

	<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

	More information needed

	# More Information [optional]

	More information needed

	# Model Card Authors [optional]

	[Mukut Mukherjee](https://www.linkedin.com/in/mukutm/), [Charan Pothireddi](https://www.linkedin.com/in/sree-charan-pothireddi-6a0a3587/) and [Parabole.ai](https://www.linkedin.com/in/sree-charan-pothireddi-6a0a3587/), in collaboration with Ezi Ozoani and the Hugging Face team


	# Model Card Contact

	More information needed

	# How to Get Started with the Model

	Use the code below to get started with the model.

	<details>
	<summary> Click to expand </summary>

	```
	pip install torchserve torch-model-archiver
	pip install torchvision
	pip install transformers
	```

	Next up, we'll set up the handler script. It is a basic handler for text classification that can be improved upon. Save this script as "handler.py" in your directory. [1]

	```

	from abc import ABC
	import json
	import logging
	import os

	import torch
	from transformers import AutoModelForSequenceClassification, AutoTokenizer
	from ts.torch_handler.base_handler import BaseHandler

	logger = logging.getLogger(__name__)


	class TransformersClassifierHandler(BaseHandler, ABC):
	    """
	    Transformers text classifier handler class. This handler takes a text (string)
	    as input and returns the classification based on the serialized transformers checkpoint.
	    """

	    def __init__(self):
	        super(TransformersClassifierHandler, self).__init__()
	        self.initialized = False
	        # Explicit default so inference() can test it even when
	        # index_to_name.json is not part of the archive.
	        self.mapping = None

	    def initialize(self, ctx):
	        self.manifest = ctx.manifest
	        properties = ctx.system_properties
	        model_dir = properties.get("model_dir")
	        self.device = torch.device("cuda:" + str(properties.get("gpu_id")) if torch.cuda.is_available() else "cpu")

	        # Load the serialized model and its tokenizer from the model directory
	        self.model = AutoModelForSequenceClassification.from_pretrained(model_dir)
	        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
	        self.model.to(self.device)
	        self.model.eval()
	        logger.debug('Transformer model from path {0} loaded successfully'.format(model_dir))

	        # Read the mapping file, index to object name
	        mapping_file_path = os.path.join(model_dir, "index_to_name.json")
	        if os.path.isfile(mapping_file_path):
	            with open(mapping_file_path) as f:
	                self.mapping = json.load(f)
	        else:
	            logger.warning('Missing the index_to_name.json file. Inference output will not include class name.')

	        self.initialized = True

	    def preprocess(self, data):
	        """ Very basic preprocessing code - only tokenizes.
	            Extend with your own preprocessing steps as needed.
	        """
	        text = data[0].get("data")
	        if text is None:
	            text = data[0].get("body")
	        # The request payload normally arrives as bytes; accept str as well.
	        sentences = text.decode('utf-8') if isinstance(text, (bytes, bytearray)) else text
	        logger.info("Received text: '%s'", sentences)

	        inputs = self.tokenizer.encode_plus(
	            sentences,
	            add_special_tokens=True,
	            return_tensors="pt"
	        )
	        return inputs

	    def inference(self, inputs):
	        """
	        Predict the class of a text using a trained transformer model.
	        """
	        # NOTE: This makes the assumption that your model expects text to be tokenized
	        # with "input_ids" and "token_type_ids" - which is true for some popular transformer models, e.g. bert.
	        # If your transformer model expects different tokenization, adapt this code to suit
	        # its expected input format.
	        with torch.no_grad():  # no gradients are needed at inference time
	            prediction = self.model(
	                inputs['input_ids'].to(self.device),
	                token_type_ids=inputs['token_type_ids'].to(self.device)
	            )[0].argmax().item()
	        logger.info("Model predicted: '%s'", prediction)

	        # Translate the class index into a name when a mapping was loaded
	        if self.mapping:
	            prediction = self.mapping[str(prediction)]

	        return [prediction]

	    def postprocess(self, inference_output):
	        # TODO: Add any needed post-processing of the model predictions here
	        return inference_output


	_service = TransformersClassifierHandler()


	def handle(data, context):
	    # Entry point called by TorchServe for every request
	    if not _service.initialized:
	        _service.initialize(context)

	    if data is None:
	        return None

	    data = _service.preprocess(data)
	    data = _service.inference(data)
	    data = _service.postprocess(data)

	    return data

	```

	TorchServe uses a format called MAR (Model Archive). We can convert our PyTorch model to a .mar file using this command:

	```

	torch-model-archiver --model-name "bert" --version 1.0 --serialized-file ./bert_model/pytorch_model.bin --extra-files "./bert_model/config.json,./bert_model/vocab.txt" --handler "./handler.py"

	```

	Move the .mar file into a new directory:

	```

	mkdir model_store && mv bert.mar model_store

	```

	Finally, we can start TorchServe using the command:

	```

	torchserve --start --model-store model_store --models bert=bert.mar

	```

	We can now query the model from another terminal window using the Inference API. We pass a text file containing text that the model will try to classify.




	```

	curl -X POST http://127.0.0.1:8080/predictions/bert -T predict.txt

	```
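The same query can be sent from Python instead of curl. This is a minimal sketch using only the standard library; it assumes TorchServe is listening on 127.0.0.1:8080 and that the model was registered as "bert", as in the commands above. The function names are hypothetical.

```python
import urllib.request


def build_prediction_request(text, host="http://127.0.0.1:8080", model_name="bert"):
    """Build a POST request for the /predictions/<model_name> endpoint."""
    url = f"{host}/predictions/{model_name}"
    # The handler decodes the raw request body as UTF-8 text.
    return urllib.request.Request(url, data=text.encode("utf-8"), method="POST")


def classify_remote(text, **kwargs):
    """Send `text` to a running TorchServe instance and return the response body."""
    request = build_prediction_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

Calling `classify_remote("We reduced our carbon footprint for the fourth consecutive year.")` would return whatever the handler produces: a label index, or a label name if index_to_name.json was packaged with the model.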

	This returns a label number that corresponds to a textual label. The mapping between labels and numbers is stored in the label_dict.txt dictionary file:

	```

	__label__Business_Ethics : 0
	__label__Data_Security : 1
	__label__Access_And_Affordability : 2
	__label__Business_Model_Resilience : 3
	__label__Competitive_Behavior : 4
	__label__Critical_Incident_Risk_Management : 5
	__label__Customer_Welfare : 6
	__label__Director_Removal : 7
	__label__Employee_Engagement_Inclusion_And_Diversity : 8
	__label__Employee_Health_And_Safety : 9
	__label__Human_Rights_And_Community_Relations : 10
	__label__Labor_Practices : 11
	__label__Management_Of_Legal_And_Regulatory_Framework : 12
	__label__Physical_Impacts_Of_Climate_Change : 13
	__label__Product_Quality_And_Safety : 14
	__label__Product_Design_And_Lifecycle_Management : 15
	__label__Selling_Practices_And_Product_Labeling : 16
	__label__Supply_Chain_Management : 17
	__label__Systemic_Risk_Management : 18
	__label__Waste_And_Hazardous_Materials_Management : 19
	__label__Water_And_Wastewater_Management : 20
	__label__Air_Quality : 21
	__label__Customer_Privacy : 22
	__label__Ecological_Impacts : 23
	__label__Energy_Management : 24
	__label__GHG_Emissions : 25
	```
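The handler looks for an index_to_name.json file in the model directory and, if it finds one, returns label names instead of numbers. The sketch below shows one way such a file could be derived from label_dict.txt. It assumes the file has exactly the "name : index" layout listed above; stripping the `__label__` prefix is a cosmetic choice, and the function names are hypothetical.

```python
import json


def parse_label_dict(lines):
    """Turn lines like '__label__GHG_Emissions : 25' into {'25': 'GHG_Emissions'}."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, _, index = line.rpartition(":")
        name = name.strip()
        if name.startswith("__label__"):
            name = name[len("__label__"):]
        # Keys are strings because the handler looks up str(prediction).
        mapping[str(int(index))] = name
    return mapping


def write_index_to_name(label_dict_path, output_path="index_to_name.json"):
    """Read label_dict.txt and write the JSON mapping the handler expects."""
    with open(label_dict_path, encoding="utf-8") as f:
        mapping = parse_label_dict(f)
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(mapping, f, indent=2)
    return mapping
```

For the handler to pick the file up, it would also need to be listed in the `--extra-files` argument of the torch-model-archiver command so that it ends up in the model directory.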

	</details>