---
inference: false
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
---

<div style="clear: both;">
  <div style="float: left; margin-right: 1em;">
    <h1><strong>FinISH (Finance-Identifying SRoBERTa for Hypernyms)</strong></h1>
  </div>
  <div>
    <h2><img src="https://pbs.twimg.com/profile_images/1333760924914753538/fQL4zLUw_400x400.png" alt="" width="25" height="25"></h2>
  </div>
  </div>
  
We present FinISH, a [SRoBERTa](https://huggingface.co/sentence-transformers/nli-roberta-base-v2) base model fine-tuned on the [FIBO ontology](https://spec.edmcouncil.org/fibo/) dataset for domain-specific representation learning on the [**Semantic Search**](https://www.sbert.net/examples/applications/semantic-search/README.html) downstream task.

## SRoBERTa Model Architecture
Sentence-RoBERTa (SRoBERTa) is a modification of the pretrained RoBERTa network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. According to the Sentence-BERT paper, this reduces the effort of finding the most similar pair in a collection of 10,000 sentences from about 65 hours with RoBERTa to about 5 seconds with SRoBERTa, while maintaining RoBERTa's accuracy. SRoBERTa was evaluated on common STS and transfer learning tasks, where the paper reports that it outperforms other state-of-the-art sentence embedding methods.

Paper: [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/pdf/1908.10084.pdf).

Authors: *Nils Reimers and Iryna Gurevych*.

## Details on the downstream task (Semantic Search for Text Classification)
The objective of this task is to correctly classify a given term in the financial domain according to its prototypical hypernym in a list of available hypernyms:
* Bonds
* Forward
* Funds
* Future
* MMIs (Money Market Instruments)
* Option
* Stocks
* Swap
* Equity Index
* Credit Index
* Securities restrictions
* Parametric schedules
* Debt pricing and yields
* Credit Events
* Stock Corporation
* Central Securities Depository
* Regulatory Agency

This kind-based approach assigns each term the hypernym that is semantically closest to it, even if the term shares properties with other hypernyms.
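The closest-hypernym idea can be sketched without any model: rank the candidate labels by cosine similarity to the term's vector and take the top one. The snippet below is a minimal illustration, not the model's actual pipeline. The 3-dimensional vectors, the helper names `cosine_similarity` and `rank_hypernyms`, and the reduced label set are all made up for the example; real embeddings come from the model and have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_hypernyms(term_vector, hypernym_vectors, top_k=5):
    """Return up to top_k (label, score) pairs, most similar first."""
    scored = [
        (label, cosine_similarity(term_vector, vector))
        for label, vector in hypernym_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical toy vectors, for illustration only (not real model embeddings)
toy_hypernyms = {
    "Bonds": [0.9, 0.1, 0.0],
    "Swap": [0.1, 0.9, 0.1],
    "Stocks": [0.2, 0.1, 0.9],
}
toy_term = [0.8, 0.2, 0.1]  # stands in for an embedded term

ranking = rank_hypernyms(toy_term, toy_hypernyms)
predicted_label = ranking[0][0]  # the label with the highest cosine similarity
print(predicted_label, ranking)
```

Returning a ranked list instead of a single label keeps the lower-ranked candidates available, which matters when a term shares properties with several hypernyms.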

#### Data Description
The data is a list of term definitions scraped from the FIBO ontology website, where each definition has been mapped to its closest hypernym from the proposed labels.
Multi-sentence definitions were split into sentences at punctuation delimiters, and all input data was lowercased.
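The two preprocessing steps described above (sentence splitting and lowercasing) might look roughly like the following. This is a sketch under assumptions: the exact delimiters and splitting rules used for the dataset are not documented here, so the regular expression, the function name `preprocess_definition`, and the second sentence of the example are illustrative only.

```python
import re

def preprocess_definition(text):
    """Split a definition into sentences and lowercase each one.

    Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    """
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return [piece.lower() for piece in pieces if piece]

# The second sentence is invented to show the splitting behaviour
example = "Callable convertible bond is a kind of callable bond. It can be redeemed early."
sentences = preprocess_definition(example)
print(sentences)
```

A split this naive also breaks on abbreviations such as "U.S.", so a production pipeline would likely need a more careful sentence segmenter.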

#### Data Instances
The dataset contains a label representing the hypernym of the given definition.
```json
{
  "label": "bonds",
  "definition": "callable convertible bond is a kind of callable bond, convertible bond."
}
```

#### Data Fields
**label**: Can be one of the 17 predefined hypernyms.

**definition**: Financial term definition relating to a concept or object in the financial domain.

#### Data Splits
The training data contains **317101** entries.

#### Test set metrics
The representation learning model is evaluated on a representative test set containing 20% of the entries. The test set is scored with the following metrics:
* Average Accuracy
* Mean Rank (position of the correct label in a set of 5 model predictions)

We evaluate FinISH according to these metrics, where it outperforms other state-of-the-art sentence embedding methods on this task.
* Average Accuracy: **0.73**
* Mean Rank: **1.61**
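As a rough sketch of how these two metrics can be computed from ranked predictions: accuracy is taken here as the share of examples whose top prediction is correct, and mean rank as the average 1-based position of the correct label in the prediction list. Both readings are assumptions, as is the choice to assign rank `len(ranked) + 1` when the correct label is missing from the list; the original evaluation code is not shown in this card. The function name `evaluate` and the toy data are hypothetical.

```python
def evaluate(ranked_predictions, gold_labels):
    """Return (accuracy, mean_rank) for ranked label lists, best first."""
    hits = 0
    ranks = []
    for ranked, gold in zip(ranked_predictions, gold_labels):
        if ranked and ranked[0] == gold:
            hits += 1
        # 1-based position of the gold label; a miss gets len(ranked) + 1 (assumption)
        rank = ranked.index(gold) + 1 if gold in ranked else len(ranked) + 1
        ranks.append(rank)
    total = len(gold_labels)
    return hits / total, sum(ranks) / total

# Toy predictions (5 labels each), invented for illustration
toy_predictions = [
    ["bonds", "funds", "swap", "option", "stocks"],
    ["swap", "option", "future", "forward", "bonds"],
    ["funds", "stocks", "bonds", "mmis", "option"],
    ["future", "forward", "swap", "option", "funds"],
]
toy_gold = ["bonds", "option", "credit index", "future"]

accuracy, mean_rank = evaluate(toy_predictions, toy_gold)
print(accuracy, mean_rank)  # ranks are 1, 2, 6, 1 -> accuracy 0.5, mean rank 2.5
```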

## Usage (Sentence-Transformers)
The model is easy to use once [sentence-transformers](https://www.SBERT.net) is installed:
```bash
git clone https://github.com/huggingface/transformers.git
pip install -q ./transformers
pip install -U sentence-transformers
```
Then you can use the model like this:
```python
from sentence_transformers import SentenceTransformer, util
import torch
model = SentenceTransformer('yseop/roberta-base-finance-hypernym-identification')
# Our corpus containing the list of hypernym labels
hypernyms = ['Bonds',
             'Forward',
             'Funds',
             'Future',
             'MMIs',
             'Option',
             'Stocks',
             'Swap',
             'Equity Index',
             'Credit Index',
             'Securities restrictions',
             'Parametric schedules',
             'Debt pricing and yields',
             'Credit Events',
             'Stock Corporation',
             'Central Securities Depository',
             'Regulatory Agency']
hypernym_embeddings = model.encode(hypernyms, convert_to_tensor=True)
# Query sentences are financial terms to match to the predefined labels
queries = ['Convertible bond', 'weighted average coupon', 'Restriction 144-A']
# Find the closest 5 hypernyms of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(hypernyms))
for query in queries:
    query_embedding = model.encode(query, convert_to_tensor=True)
    # We use cosine-similarity and torch.topk to find the highest 5 scores
    cos_scores = util.pytorch_cos_sim(query_embedding, hypernym_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)
    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar hypernyms:")
    for score, idx in zip(top_results[0], top_results[1]):
        print(hypernyms[idx], "(Score: {:.4f})".format(score))
```
 
## Usage (HuggingFace Transformers)
Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
```python
from transformers import AutoTokenizer, AutoModel
import torch
# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
# Query sentences are financial terms to match to the predefined labels
queries = ['Convertible bond', 'weighted average coupon', 'Restriction 144-A']
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('yseop/roberta-base-finance-hypernym-identification')
model = AutoModel.from_pretrained('yseop/roberta-base-finance-hypernym-identification')
# Tokenize sentences
encoded_input = tokenizer(queries, padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling
query_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Query embeddings:")
print(query_embeddings)
```

**Created by:** [Yseop](https://www.yseop.com/) | Pioneer in Natural Language Generation (NLG) technology. Scaling human expertise through Natural Language Generation.