model documentation

#8
by nazneen - opened
Files changed (1)
  1. README.md +189 -10
README.md CHANGED
@@ -1,13 +1,192 @@
- `FinBERT` is a BERT model pre-trained on financial communication text. The purpose is to enhance financial NLP research and practice. It is trained on the following three financial communication corpus. The total corpora size is 4.9B tokens.
-
- - Corporate Reports 10-K & 10-Q: 2.5B tokens
- - Earnings Call Transcripts: 1.3B tokens
- - Analyst Reports: 1.1B tokens
-
- If you use the model in your academic work, please cite the following papers:
-
- Huang, Allen H., Hui Wang, and Yi Yang. "FinBERT: A Large Language Model for Extracting Information from Financial Text." *Contemporary Accounting Research* (2022).
-
- Yang, Yi, Mark Christopher Siy Uy, and Allen Huang. "Finbert: A pretrained language model for financial communications." *arXiv preprint arXiv:2006.08097* (2020).
-
- `FinBERT` can be further fine-tuned on downstream tasks. Specifically, we have fine-tuned `FinBERT` for financial sentiment analysis, ESG classification, Forward-looking statement classification and etc. Visit [FinBERT.AI](https://finbert.ai/) for more details on these task-specific models and recent development of FinBERT.
+ ---
+ language:
+ - en
+ ---
+ # Model Card for FinBERT
+
+ # Model Details
+
+ ## Model Description
+ `FinBERT` is a BERT model pre-trained on financial communication text. The purpose is to enhance financial NLP research and practice.
+
+ - **Developed by:** Yi Yang
+ - **Shared by [Optional]:** Hugging Face
+ - **Model type:** Fill-Mask
+ - **Language(s) (NLP):** en
+ - **License:** More information needed
+ - **Related Models:** More information needed
+ - **Parent Model:** BERT
+ - **Resources for more information:**
+   - [GitHub Repo](https://github.com/yya518/FinBERT)
+   - [Associated Paper](https://arxiv.org/abs/2006.08097)
+   - [Website](https://finbert.ai/)
+
+ # Uses
+
+ ## Direct Use
+
+ The model can be used directly for fill-mask (masked-token prediction) on financial text, without further fine-tuning.
+
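+ As a minimal sketch of direct use (the checkpoint ID `yiyanghkust/finbert-pretrain` is an assumption here; substitute whichever ID the pre-trained weights are published under), the model can be queried through the `fill-mask` pipeline:
+
+ ```python
+ from transformers import pipeline
+
+ # Checkpoint ID is an assumption; point this at the released FinBERT pre-trained weights.
+ fill_mask = pipeline("fill-mask", model="yiyanghkust/finbert-pretrain")
+
+ # Predict candidates for the masked token in a financial sentence.
+ for pred in fill_mask("The company reported a [MASK] in quarterly revenue."):
+     print(pred["token_str"], round(pred["score"], 3))
+ ```
+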
+ ## Downstream Use [Optional]
+
+ `FinBERT` can be further fine-tuned on downstream tasks. Specifically, we have fine-tuned `FinBERT` for financial sentiment analysis, ESG classification, and forward-looking statement classification, among other tasks. A minimal fine-tuning sketch follows.
+
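+ The sketch below illustrates one way to start such a fine-tune. It assumes the pre-trained checkpoint is available under the Hugging Face ID `yiyanghkust/finbert-pretrain` (an assumption) and uses a toy three-sentence sentiment dataset purely for illustration:
+
+ ```python
+ import torch
+ from transformers import (BertTokenizer, BertForSequenceClassification,
+                           Trainer, TrainingArguments)
+
+ # Checkpoint ID is an assumption; substitute the released FinBERT pre-trained weights.
+ model_id = "yiyanghkust/finbert-pretrain"
+ tokenizer = BertTokenizer.from_pretrained(model_id)
+ model = BertForSequenceClassification.from_pretrained(model_id, num_labels=3)
+
+ # Toy labeled data (0=neutral, 1=positive, 2=negative) standing in for a real financial dataset.
+ texts = ["profits are flat",
+          "growth is strong and we have plenty of liquidity",
+          "there are doubts about our finances"]
+ labels = [0, 1, 2]
+
+ class ToyDataset(torch.utils.data.Dataset):
+     """Wraps tokenized sentences and labels in the format Trainer expects."""
+     def __init__(self, texts, labels):
+         self.enc = tokenizer(texts, truncation=True, padding=True)
+         self.labels = labels
+     def __len__(self):
+         return len(self.labels)
+     def __getitem__(self, i):
+         item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
+         item["labels"] = torch.tensor(self.labels[i])
+         return item
+
+ trainer = Trainer(
+     model=model,
+     args=TrainingArguments(output_dir="finbert-sentiment", num_train_epochs=1),
+     train_dataset=ToyDataset(texts, labels),
+ )
+ trainer.train()
+ ```
+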
+ ## Out-of-Scope Use
+
+ The model should not be used to intentionally create hostile or alienating environments for people.
+
+ # Bias, Risks, and Limitations
+
+ Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
+
+ ## Recommendations
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+ # Training Details
+
+ ## Training Data
+
+ FinBERT is trained on the following three financial communication corpora, with a total size of 4.9B tokens:
+
+ - **Corporate Reports 10-K & 10-Q:** 2.5B tokens
+ - **Earnings Call Transcripts:** 1.3B tokens
+ - **Analyst Reports:** 1.1B tokens
+
+ ## Training Procedure
+
+ ### Preprocessing
+
+ More information needed
+
+ ### Speeds, Sizes, Times
+
+ FinVocab is a new WordPiece vocabulary built from our financial corpora using the SentencePiece library. We produce both cased and uncased versions of FinVocab, with sizes of 28,573 and 30,873 tokens, respectively. This is very similar to the 28,996 and 30,522 token sizes of the original BERT cased and uncased BaseVocab.
+
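+ As an illustrative check (the checkpoint ID below is an assumption; use whichever ID the released weights are published under), the FinVocab size can be read directly from the tokenizer bundled with a FinBERT checkpoint:
+
+ ```python
+ from transformers import BertTokenizer
+
+ # Checkpoint ID is an assumption; the tokenizer shipped with a FinBERT checkpoint carries FinVocab.
+ tokenizer = BertTokenizer.from_pretrained("yiyanghkust/finbert-pretrain")
+
+ # Vocabulary size should be close to the ~30.9K uncased FinVocab reported above.
+ print(len(tokenizer))
+ ```
+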
+ # Evaluation
+
+ ## Testing Data, Factors & Metrics
+
+ ### Testing Data
+
+ More information needed
+
+ ### Factors
+
+ More information needed
+
+ ### Metrics
+
+ More information needed
+
+ ## Results
+
+ More information needed
+
+ # Model Examination
+
+ More information needed
+
+ # Environmental Impact
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** More information needed
+ - **Hours used:** More information needed
+ - **Cloud Provider:** More information needed
+ - **Compute Region:** More information needed
+ - **Carbon Emitted:** More information needed
+
+ # Technical Specifications [optional]
+
+ ## Model Architecture and Objective
+
+ More information needed
+
+ ## Compute Infrastructure
+
+ More information needed
+
+ ### Hardware
+
+ More information needed
+
+ ### Software
+
+ More information needed
+
+ # Citation
+
+ **BibTeX:**
+
+ ```bibtex
+ @misc{yang2020finbert,
+     title={FinBERT: A Pretrained Language Model for Financial Communications},
+     author={Yi Yang and Mark Christopher Siy UY and Allen Huang},
+     year={2020},
+     eprint={2006.08097},
+     archivePrefix={arXiv},
+ }
+ ```
+
+ # Glossary [optional]
+
+ More information needed
+
+ # More Information [optional]
+
+ Please post a GitHub issue or contact [imyiyang@ust.hk](mailto:imyiyang@ust.hk) if you have any questions.
+
+ # Model Card Authors [optional]
+
+ Yi Yang in collaboration with Ezi Ozoani and the Hugging Face team.
+
+ # Model Card Contact
+
+ More information needed
+
+ # How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ <details>
+ <summary> Click to expand </summary>
+
+ ```python
+ from transformers import BertTokenizer, BertForSequenceClassification
+ import numpy as np
+
+ # Load the sentiment (tone) fine-tuned FinBERT model and its tokenizer.
+ finbert = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone', num_labels=3)
+ tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')
+
+ sentences = ["there is a shortage of capital, and we need extra financing",
+              "growth is strong and we have plenty of liquidity",
+              "there are doubts about our finances",
+              "profits are flat"]
+
+ # Tokenize the batch and run it through the model to get classification logits.
+ inputs = tokenizer(sentences, return_tensors="pt", padding=True)
+ outputs = finbert(**inputs)[0]
+
+ # Map the highest-scoring logit to its sentiment label.
+ labels = {0: 'neutral', 1: 'positive', 2: 'negative'}
+ for idx, sent in enumerate(sentences):
+     print(sent, '----', labels[np.argmax(outputs.detach().numpy()[idx])])
+ ```
+ </details>