LizaKovtun committed on
Commit
eb9285d
1 Parent(s): 1129006

Update README.md

Files changed (1)
  1. README.md +60 -71
README.md CHANGED
@@ -5,17 +5,46 @@ tags:
5
  - finance
6
  language:
7
  - en
8
- pipeline_tag: text-classification
9
  ---
10
- ## Main information
11
- We introduce the model for multilabel ESG risks classification. There is 47 classes methodology with granularial risk definition.
12
 
13
- ## Usage
14
  ```python
15
  from collections import OrderedDict
16
  from transformers import MPNetPreTrainedModel, MPNetModel, AutoTokenizer
17
  import torch
18
- #Mean Pooling - Take attention mask into account for correct averaging
 
19
  def mean_pooling(model_output, attention_mask):
20
      token_embeddings = model_output #First element of model_output contains all token embeddings
21
      input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
@@ -42,8 +71,6 @@ class ESGify(MPNetPreTrainedModel):
42
 
43
 
44
      def forward(self, input_ids, attention_mask):
45
-
46
-
47
          # Feed input to mpnet model
48
          outputs = self.mpnet(input_ids=input_ids,
49
                               attention_mask=attention_mask)
@@ -54,65 +81,21 @@ class ESGify(MPNetPreTrainedModel):
54
          # apply sigmoid
55
          logits = 1.0 / (1.0 + torch.exp(-logits))
56
          return logits
 
 
 
57
 
 
58
  model = ESGify.from_pretrained('ai-lab/ESGify')
59
  tokenizer = AutoTokenizer.from_pretrained('ai-lab/ESGify')
60
- texts = ['text1','text2']
61
- to_model = tokenizer.batch_encode_plus(
62
- texts,
63
- add_special_tokens=True,
64
- max_length=512,
65
- return_token_type_ids=False,
66
- padding="max_length",
67
- truncation=True,
68
- return_attention_mask=True,
69
- return_tensors='pt',
70
- )
71
- results = model(**to_model)
72
-
73
 
74
- # We also recommend preprocess texts with using FLAIR model
75
-
76
- from flair.data import Sentence
77
- from flair.nn import Classifier
78
- from torch.utils.data import DataLoader
79
- from nltk.corpus import stopwords
80
- from nltk.tokenize import word_tokenize
81
-
82
- stop_words = set(stopwords.words('english'))
83
- tagger = Classifier.load('ner-ontonotes-large')
84
- tag_list = ['FAC','LOC','ORG','PERSON']
85
- texts_with_masks = []
86
- for example_sent in texts:
87
- filtered_sentence = []
88
- word_tokens = word_tokenize(example_sent)
89
- # converts the words in word_tokens to lower case and then checks whether
90
- #they are present in stop_words or not
91
- for w in word_tokens:
92
- if w.lower() not in stop_words:
93
- filtered_sentence.append(w)
94
- # make a sentence
95
- sentence = Sentence(' '.join(filtered_sentence))
96
- # run NER over sentence
97
- tagger.predict(sentence)
98
- sent = ' '.join(filtered_sentence)
99
- k = 0
100
- new_string = ''
101
- start_t = 0
102
- for i in sentence.get_labels():
103
- info = i.to_dict()
104
- val = info['value']
105
- if info['confidence']>0.8 and val in tag_list :
106
-
107
- if i.data_point.start_position>start_t :
108
- new_string+=sent[start_t:i.data_point.start_position]
109
- start_t = i.data_point.end_position
110
- new_string+= f'<{val}>'
111
- new_string+=sent[start_t:-1]
112
- texts_with_masks.append(new_string)
113
 
 
 
114
  to_model = tokenizer.batch_encode_plus(
115
- texts_with_masks,
116
  add_special_tokens=True,
117
  max_length=512,
118
  return_token_type_ids=False,
@@ -124,21 +107,27 @@ to_model = tokenizer.batch_encode_plus(
124
  results = model(**to_model)
125
  ```
126
 
127
- ------
128
-
129
- ## Background
130
-
131
- The project aims to develop the ESG Risks classification model with a custom ESG risks definition methodology.
132
 
 
 
 
 
133
 
134
- ## Training procedure
 
 
 
 
 
135
 
136
- ### Pre-training
 
137
 
138
- We use the pretrained [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) model.
139
- Next, we do the domain-adaptation procedure by Mask Language Modeling pertaining with using texts of ESG reports.
140
 
 
141
 
142
- #### Training data
 
 
143
 
144
- We use the ESG news dataset of 2000 texts with manually annotation of ESG specialists.
 
5
  - finance
6
  language:
7
  - en
8
+
9
  ---
10
+ # About ESGify
11
+ **ESGify** is a model for multilabel classification of news with respect to ESG risks. Our custom methodology includes 46 ESG classes and one class for texts not relevant to ESG, giving 47 classes in total:
12
+
13
+ | E | S | G |
14
+ | ----------- | ----------- | ----------- |
15
+ | **Biodiversity** | **Communities Health and Safety** | **Legal Proceedings & Law Violations** |
16
+ | **Emergencies (Environmental)** | **Land Acquisition and Resettlement (S)** | **Corporate Governance** |
17
+ | **Hazardous Materials Management** | **Emergencies (Social)** | **Responsible Investment & Greenwashing** |
18
+ | **Environmental Management** | **Human Rights** | **Economic Crime** |
19
+ | **Landscape Transformation** | **Labor Relations Management** | **Disclosure** |
20
+ | **Climate Risks** | **Freedom of Association and Right to Organise** | **Values and Ethics** |
21
+ | **Surface Water Pollution** | **Employee Health and Safety** | **Risk Management and Internal Control** |
22
+ | **Animal Welfare** | **Product Safety and Quality** | **Strategy Implementation** |
23
+ | **Water Consumption** | **Indigenous People** | **Supply Chain (Economic / Governance)** |
24
+ | **Greenhouse Gas Emissions** | **Cultural Heritage** ||
25
+ | **Air Pollution** | **Forced Labour** ||
26
+ | **Waste Management** | **Supply Chain (Social)** ||
27
+ | **Soil and Groundwater Impact** | **Discrimination** ||
28
+ | **Wastewater Management** | **Minimum Age and Child Labour** ||
29
+ | **Natural Resources** | **Data Safety** ||
30
+ | **Physical Impacts** | **Retrenchment** ||
31
+ | **Supply Chain (Environmental)** |||
32
+ | **Planning Limitations** |||
33
+ | **Energy Efficiency and Renewables** |||
34
+ | **Land Acquisition and Resettlement (E)** |||
35
+ | **Land Rehabilitation** |||
36
+
37
+
38
+ # Usage
39
+
40
+ ESGify is based on the MPNet architecture with a custom classification head. The ESGify class is defined as follows.
41
 
 
42
  ```python
43
  from collections import OrderedDict
44
  from transformers import MPNetPreTrainedModel, MPNetModel, AutoTokenizer
45
  import torch
46
+
47
+ # Mean Pooling - Take attention mask into account for correct averaging
48
  def mean_pooling(model_output, attention_mask):
49
      token_embeddings = model_output #First element of model_output contains all token embeddings
50
      input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
 
71
 
72
 
73
      def forward(self, input_ids, attention_mask):
74
          # Feed input to mpnet model
75
          outputs = self.mpnet(input_ids=input_ids,
76
                               attention_mask=attention_mask)

81
          # apply sigmoid
82
          logits = 1.0 / (1.0 + torch.exp(-logits))
83
          return logits
84
+ ```
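The diff shows only the first lines of `mean_pooling`, so the following is a plain-Python sketch of what attention-masked averaging computes for a single sequence, not the model's tensor implementation. The function name `mean_pool` and the guard against an all-padding sequence are assumptions added for illustration.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    n_real = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            n_real += 1
            for j, value in enumerate(vector):
                summed[j] += value
    n_real = max(n_real, 1)  # assumption: avoid division by zero on all-padding input
    return [s / n_real for s in summed]


# Two real tokens followed by one padding token (mask 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector does not affect the result, which is why the attention mask is needed when inputs are padded to a fixed length.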
85
+
86
+ After defining the model class, we initialize ESGify and the tokenizer with the pre-trained weights:
87
 
88
+ ```python
89
  model = ESGify.from_pretrained('ai-lab/ESGify')
90
  tokenizer = AutoTokenizer.from_pretrained('ai-lab/ESGify')
91
+ ```
92
 
93
+ Getting results from the model:
94
 
95
+ ```python
96
+ texts = ['text1','text2']
97
  to_model = tokenizer.batch_encode_plus(
98
+ texts,
99
  add_special_tokens=True,
100
  max_length=512,
101
  return_token_type_ids=False,
 
107
  results = model(**to_model)
108
  ```
109
 
110
+ To identify the top-3 classes by relevance and their scores for the first text:
111
 
112
+ ```python
113
+ for i in torch.topk(results, k=3).indices.tolist()[0]:
114
+     print(f"{model.id2label[i]}: {round(results[0][i].item(), 3)}")
115
+ ```
116
 
117
+ For example, for the news "She faced employment rejection because of her gender", we get the following top-3 labels:
118
+ ```
119
+ Discrimination: 0.944
120
+ Strategy Implementation: 0.82
121
+ Indigenous People: 0.499
122
+ ```
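Because the model applies a sigmoid to each class independently, the scores do not sum to 1 and several labels can apply to one text. One possible way to turn scores into a label set is a fixed cutoff. The sketch below is a plain-Python illustration: the `select_labels` helper, the 0.5 threshold, and the three-entry label map are assumptions for this example (the real model has 47 classes), and the scores are the three values from the example above.

```python
def select_labels(scores, id2label, threshold=0.5):
    """Return (label, score) pairs whose score reaches the threshold, best first."""
    picked = [(id2label[i], s) for i, s in enumerate(scores) if s >= threshold]
    return sorted(picked, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy label map; indices do not match the real model.
id2label = {0: "Discrimination", 1: "Strategy Implementation", 2: "Indigenous People"}
scores = [0.944, 0.82, 0.499]

selected = select_labels(scores, id2label)
print(selected)  # [('Discrimination', 0.944), ('Strategy Implementation', 0.82)]
```

The cutoff is a tunable choice: a higher value trades recall for precision, and a suitable value would need to be checked on annotated data.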
123
 
124
+ Before training our model, we masked words related to Organisation, Date, Country, and Person to prevent false associations between these entities and risks. Hence, we recommend processing texts with the FLAIR NER model before inference.
125
+ An example of such preprocessing is given in https://colab.research.google.com/drive/15YcTW9KPSWesZ6_L4BUayqW_omzars0l?usp=sharing.
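The masking step itself can be sketched without any NER dependency, assuming the entity spans have already been produced by a tagger. The `<LABEL>` placeholder format and the 0.8 confidence cutoff mirror the snippet removed in this commit. The `mask_entities` helper and the `(start, end, label, confidence)` tuple layout are hypothetical stand-ins for the tagger output, and the linked notebook remains the reference for the actual preprocessing.

```python
def mask_entities(text, spans, min_confidence=0.8):
    """Replace each confident entity span with a <LABEL> placeholder.

    spans: iterable of (start, end, label, confidence) with character
    offsets, where `end` is exclusive.
    """
    pieces = []
    cursor = 0
    for start, end, label, confidence in sorted(spans):
        if confidence < min_confidence or start < cursor:
            continue  # skip low-confidence or overlapping spans
        pieces.append(text[cursor:start])
        pieces.append(f"<{label}>")
        cursor = end
    pieces.append(text[cursor:])  # keep the tail, including the last character
    return "".join(pieces)


text = "Acme Corp fined in Germany"
spans = [(0, 9, "ORG", 0.99), (19, 26, "LOC", 0.95)]
masked = mask_entities(text, spans)
print(masked)  # <ORG> fined in <LOC>
```

The masked strings would then be passed to the tokenizer in place of the raw texts.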
126
 
 
 
127
 
128
+ # Training procedure
129
 
130
+ We use the pretrained [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) model.
131
+ Next, we perform domain adaptation via masked language modeling on texts from ESG reports.
132
+ Finally, we fine-tune the model on 2000 texts manually annotated by ESG specialists.
133