Alexeym12 committed on
Commit
917903a
1 Parent(s): 683c601

Update README.md

  1. README.md +134 -0
README.md CHANGED
@@ -1,3 +1,137 @@
  ---
  license: apache-2.0
+ tags:
+ - ESG
  ---
+ ## Main information
+ We introduce a model for multilabel classification of ESG risks. The methodology defines 47 classes, giving a granular definition of each risk.
+
+ ## Usage
+ ```python
+ from collections import OrderedDict
+
+ import torch
+ from transformers import AutoTokenizer, MPNetModel, MPNetPreTrainedModel
+
+
+ # Mean pooling: average the token embeddings, using the attention mask to ignore padding
+ def mean_pooling(model_output, attention_mask):
+     token_embeddings = model_output  # last hidden state, shape (batch, seq_len, hidden)
+     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+
+
+ # ESGify is defined as a custom class because it combines a sentence-transformers-like
+ # mean pooling function with a classifier head
+ class ESGify(MPNetPreTrainedModel):
+     """Model for classifying ESG risks from text."""
+
+     def __init__(self, config):
+         super().__init__(config)
+         # Instantiate the parts of the model
+         self.mpnet = MPNetModel(config, add_pooling_layer=False)
+         self.classifier = torch.nn.Sequential(OrderedDict([
+             ('norm', torch.nn.BatchNorm1d(768)),
+             ('linear', torch.nn.Linear(768, 512)),
+             ('act', torch.nn.ReLU()),
+             ('batch_n', torch.nn.BatchNorm1d(512)),
+             ('drop_class', torch.nn.Dropout(0.2)),
+             ('class_l', torch.nn.Linear(512, 47)),
+         ]))
+
+     def forward(self, input_ids, attention_mask):
+         # Feed the input to the MPNet model
+         outputs = self.mpnet(input_ids=input_ids,
+                              attention_mask=attention_mask)
+         # Mean-pool the token embeddings, then feed them to the classifier to compute logits
+         logits = self.classifier(mean_pooling(outputs['last_hidden_state'], attention_mask))
+         return logits
+
+
+ model = ESGify.from_pretrained('ai-lab/ESGify')
+ tokenizer = AutoTokenizer.from_pretrained('ai-lab/ESGify')
+ texts = ['text1', 'text2']
+ to_model = tokenizer.batch_encode_plus(
+     texts,
+     add_special_tokens=True,
+     max_length=512,
+     return_token_type_ids=False,
+     padding="max_length",
+     truncation=True,
+     return_attention_mask=True,
+     return_tensors='pt',
+ )
+ results = model(**to_model)  # raw logits, shape (len(texts), 47)
+
+
+ # We also recommend preprocessing texts with a FLAIR NER model
+
+ from flair.data import Sentence
+ from flair.nn import Classifier
+ from nltk.corpus import stopwords
+ from nltk.tokenize import word_tokenize
+
+ # Note: the NLTK stop word list and tokenizer data must be downloaded beforehand
+ stop_words = set(stopwords.words('english'))
+ tagger = Classifier.load('ner-ontonotes-large')
+ tag_list = ['FAC', 'LOC', 'ORG', 'PERSON']
+ texts_with_masks = []
+ for example_sent in texts:
+     word_tokens = word_tokenize(example_sent)
+     # Lowercase each word and keep it only if it is not a stop word
+     filtered_sentence = []
+     for w in word_tokens:
+         if w.lower() not in stop_words:
+             filtered_sentence.append(w)
+     # Make a sentence
+     sent = ' '.join(filtered_sentence)
+     sentence = Sentence(sent)
+     # Run NER over the sentence
+     tagger.predict(sentence)
+     new_string = ''
+     start_t = 0
+     # Replace confidently tagged entities with a placeholder such as <ORG>
+     for i in sentence.get_labels():
+         info = i.to_dict()
+         val = info['value']
+         if info['confidence'] > 0.8 and val in tag_list:
+             if i.data_point.start_position > start_t:
+                 new_string += sent[start_t:i.data_point.start_position]
+             start_t = i.data_point.end_position
+             new_string += f'<{val}>'
+     # Append the remaining text after the last entity (the whole tail, including the final character)
+     new_string += sent[start_t:]
+     texts_with_masks.append(new_string)
+
+ to_model = tokenizer.batch_encode_plus(
+     texts_with_masks,
+     add_special_tokens=True,
+     max_length=512,
+     return_token_type_ids=False,
+     padding="max_length",
+     truncation=True,
+     return_attention_mask=True,
+     return_tensors='pt',
+ )
+ results = model(**to_model)
+ ```
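The model returns one raw logit per class. For multilabel classification, a common way to read such outputs is to apply a sigmoid to each logit independently and keep the classes whose probability exceeds a threshold. The sketch below illustrates this in plain Python. The 0.5 threshold, the `decode_multilabel` helper, and the placeholder label names `class_0` to `class_46` are assumptions made for illustration; they are not part of the released model.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_multilabel(logits_row, label_names, threshold=0.5):
    """Return (label, probability) pairs whose probability reaches the threshold.

    Each class is scored independently, so zero, one, or many labels may be returned.
    """
    probs = [sigmoid(v) for v in logits_row]
    return [(name, p) for name, p in zip(label_names, probs) if p >= threshold]


# Hypothetical label names; the real 47 class names come from the ESG methodology.
label_names = [f"class_{i}" for i in range(47)]

# Hypothetical logits for one text: two classes stand out from the rest.
example_logits = [-3.0] * 47
example_logits[5] = 2.0
example_logits[12] = 0.1

predicted = decode_multilabel(example_logits, label_names)
for name, prob in predicted:
    print(f"{name}: {prob:.3f}")
```

With the tensors from the usage snippet, one row of logits could be obtained with `results[0].detach().tolist()`. The threshold trades precision against recall, so it may be worth tuning on held-out data.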
+
+ ------
+
+ ## Background
+
+ The project aims to develop an ESG risk classification model with a custom methodology for defining ESG risks.
+
+
+ ## Training procedure
+
+ ### Pre-training
+
+ We use the pretrained [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) model.
+ Next, we perform domain adaptation via masked language modeling (MLM) pretraining on texts from ESG reports.
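To make the masked language modeling step concrete, here is a minimal sketch of how input tokens might be corrupted for an MLM objective. It assumes the widely used BERT-style recipe (select about 15% of tokens; of those, replace 80% with the mask token, 10% with a random token, and leave 10% unchanged). The README does not state the exact masking scheme used, so the `mask_tokens` helper, its parameters, and the token ids below are hypothetical.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(), mask_prob=0.15, seed=None):
    """Return (corrupted_inputs, labels) for one sequence.

    labels hold the original id at selected positions and -100 elsewhere;
    -100 is the default ignore index of PyTorch's cross-entropy loss.
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Never select special tokens; select other tokens with probability mask_prob
        if tok in special_ids or rng.random() >= mask_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # replace with a random token
        # otherwise keep the original token unchanged
    return inputs, labels


# Hypothetical ids: 0 and 2 stand in for the start and end tokens, 99 for the mask token.
token_ids = [0, 11, 12, 13, 14, 15, 16, 2]
masked_ids, labels = mask_tokens(token_ids, mask_id=99, vocab_size=100,
                                 special_ids=(0, 2), seed=0)
print(masked_ids)
print(labels)
```

The model is then trained to predict the original token at every position whose label is not -100, which adapts the encoder to ESG report vocabulary before the classifier head is trained.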
+
+
+ #### Training data
+
+ We use an ESG news dataset of 2000 texts, manually annotated by ESG specialists.