rajistics committed
Commit 0675fa9 (1 parent: fdadd07)

first push of pretrain model

Files changed (5)
  1. .gitattributes +0 -4
  2. README.md +38 -1
  3. config.json +16 -0
  4. pytorch_model.bin +3 -0
  5. vocab.txt +0 -0
.gitattributes CHANGED
@@ -9,14 +9,10 @@
  *.lfs.* filter=lfs diff=lfs merge=lfs -text
  *.model filter=lfs diff=lfs merge=lfs -text
  *.msgpack filter=lfs diff=lfs merge=lfs -text
- *.npy filter=lfs diff=lfs merge=lfs -text
- *.npz filter=lfs diff=lfs merge=lfs -text
  *.onnx filter=lfs diff=lfs merge=lfs -text
  *.ot filter=lfs diff=lfs merge=lfs -text
  *.parquet filter=lfs diff=lfs merge=lfs -text
  *.pb filter=lfs diff=lfs merge=lfs -text
- *.pickle filter=lfs diff=lfs merge=lfs -text
- *.pkl filter=lfs diff=lfs merge=lfs -text
  *.pt filter=lfs diff=lfs merge=lfs -text
  *.pth filter=lfs diff=lfs merge=lfs -text
  *.rar filter=lfs diff=lfs merge=lfs -text
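Note on the `.gitattributes` hunk: it drops the `*.npy`, `*.npz`, `*.pickle` and `*.pkl` rules, so files with those extensions would no longer be routed through Git LFS. The sketch below illustrates the effect. It is only an approximation: it uses Python's shell-style `fnmatchcase` on the file name rather than Git's own attribute matching, the helper `tracked_by_lfs` is hypothetical, and the pattern list covers only the lines visible in this hunk (rules elsewhere in the file, including whichever one covers `pytorch_model.bin`, are not shown here).

```python
from fnmatch import fnmatchcase

# Patterns that remain in the visible hunk after this commit (new lines 9-18).
LFS_PATTERNS = [
    "*.lfs.*", "*.model", "*.msgpack", "*.onnx", "*.ot",
    "*.parquet", "*.pb", "*.pt", "*.pth", "*.rar",
]

# Patterns removed by this commit.
REMOVED_PATTERNS = ["*.npy", "*.npz", "*.pickle", "*.pkl"]


def tracked_by_lfs(filename, patterns=LFS_PATTERNS):
    """Return True if the file name matches any LFS pattern.

    Assumption: shell-style globbing on the bare file name stands in
    for Git's real gitattributes matching, which has more rules.
    """
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


if __name__ == "__main__":
    for name in ["weights.pt", "embeddings.npy", "scaler.pkl"]:
        before = tracked_by_lfs(name, LFS_PATTERNS + REMOVED_PATTERNS)
        after = tracked_by_lfs(name)
        print(f"{name}: LFS before={before}, after={after}")
```

If NumPy arrays or pickles are later added to the repository, they would be stored as regular Git blobs unless a matching rule is restored.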
README.md CHANGED
@@ -1,3 +1,40 @@
  ---
- license: apache-2.0
+ tags:
+ - autotrain
+ - pre-trained
+ - finbert
+ - fill-mask
+ language: unk
+ widget:
+ - text: Tesla remains one of the highest [MASK] stocks on the market. Meanwhile, Aurora Innovation is a pre-revenue upstart that shows promise.
+ - text: Asian stocks [MASK] from a one-year low on Wednesday as U.S. share futures and oil recovered from the previous day's selloff, but uncertainty over the impact of the Omicron
+ - text: U.S. stocks were set to rise on Monday, led by [MASK] in Apple which neared $3 trillion in market capitalization, while investors braced for a Federal Reserve meeting later this week.
  ---
+
+ `FinBERT` is a BERT model pre-trained on financial communication text. The purpose is to enhance financial NLP research and practice.
+
+ ### Pre-training
+ It is trained on the following financial communication corpora. The three corpora with listed token counts total 4.9B tokens.
+
+ - Corporate Reports 10-K & 10-Q: 2.5B tokens
+ - Earnings Call Transcripts: 1.3B tokens
+ - Analyst Reports: 1.1B tokens
+ - Demo.org Proprietary Reports
+ - Additional purchased data from FactSet
+
+ All training was done on an **NVIDIA DGX-1** machine. The server has 4 Tesla P100 GPUs, providing a total of 128 GB of GPU memory. This machine enables us to train the BERT models using a batch size of 128. We use the Horovod framework for multi-GPU training. Pre-training one model takes approximately **2 days**.
+
+
+ More details on `FinBERT`'s pre-training process can be found at: https://arxiv.org/abs/2006.08097
+
+ `FinBERT` can be further fine-tuned on downstream tasks. Specifically, we have fine-tuned `FinBERT` on an analyst sentiment classification task, and the fine-tuned model is shared at [https://huggingface.co/demo-org/auditor_review_model](https://huggingface.co/demo-org/auditor_review_model).
+
+ ### Usage
+ Load the model directly from Transformers:
+ ```
+ from transformers import AutoModelForMaskedLM
+ model = AutoModelForMaskedLM.from_pretrained("demo-org/finbert-pretrain", use_auth_token=True)
+ ```
+
+ ### Questions
+ Please contact the Data Science COE if you have more questions about this pre-trained model
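Note on the README's Pre-training section: the list has five entries, but only three carry token counts. A quick check, sketched below, shows that those three alone add up to the stated 4.9B tokens, so the two entries without counts do not appear to be included in the total. The dictionary and the helper name `corpus_shares` are illustrative only; the numbers are copied from the README bullets.

```python
# Token counts as listed in the README; the two bullets without
# counts (proprietary reports, purchased data) are left out.
CORPUS_TOKENS = {
    "Corporate Reports 10-K & 10-Q": 2_500_000_000,
    "Earnings Call Transcripts": 1_300_000_000,
    "Analyst Reports": 1_100_000_000,
}


def corpus_shares(tokens):
    """Return the total token count and each corpus's share of it."""
    total = sum(tokens.values())
    shares = {name: count / total for name, count in tokens.items()}
    return total, shares


total, shares = corpus_shares(CORPUS_TOKENS)

if __name__ == "__main__":
    print(f"total: {total / 1e9:.1f}B tokens")
    for name, share in shares.items():
        print(f"{name}: {share:.1%}")
```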
config.json ADDED
@@ -0,0 +1,16 @@
+ {
+   "architectures": [
+     "BertForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "max_position_embeddings": 512,
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "type_vocab_size": 2,
+   "vocab_size": 30873
+ }
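Note on `config.json`: the values describe a BERT-base-sized encoder with a custom 30,873-entry vocabulary. As a rough sanity check, the sketch below estimates the parameter count from the config and compares it with the 441,551,705-byte size recorded in the `pytorch_model.bin` LFS pointer in this commit. Several things are assumptions rather than facts from the commit: the standard BERT parameter layout (as in Hugging Face's `BertForMaskedLM`, with the output decoder tied to the word embeddings), a pooler layer being present in the checkpoint, and float32 storage. The function name `bert_mlm_param_count` is hypothetical.

```python
import json

# config.json exactly as added in this commit.
CONFIG = json.loads("""
{
  "architectures": ["BertForMaskedLM"],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30873
}
""")

# Size field from the pytorch_model.bin LFS pointer in this commit.
POINTER_SIZE_BYTES = 441551705


def bert_mlm_param_count(cfg, with_pooler=True):
    """Estimate parameters for a standard BERT masked-LM layout (assumed)."""
    h = cfg["hidden_size"]
    i = cfg["intermediate_size"]
    v = cfg["vocab_size"]
    layer_norm = 2 * h  # weight + bias

    embeddings = (
        v * h
        + cfg["max_position_embeddings"] * h
        + cfg["type_vocab_size"] * h
        + layer_norm
    )
    attention = 4 * (h * h + h) + layer_norm          # Q, K, V, output projections
    feed_forward = (h * i + i) + (i * h + h) + layer_norm
    encoder = cfg["num_hidden_layers"] * (attention + feed_forward)
    # MLM head: dense transform, LayerNorm, and an output bias per vocab entry.
    # Assumption: decoder weights are tied to the word embeddings.
    mlm_head = (h * h + h) + layer_norm + v
    pooler = (h * h + h) if with_pooler else 0        # assumption: pooler saved
    return embeddings + encoder + mlm_head + pooler


params = bert_mlm_param_count(CONFIG)
approx_bytes = params * 4  # assumption: float32 weights

if __name__ == "__main__":
    print(f"estimated parameters: {params:,}")
    print(f"estimated float32 size: {approx_bytes:,} bytes")
    print(f"LFS pointer size:       {POINTER_SIZE_BYTES:,} bytes")
```

Under these assumptions the estimate lands within about 0.1% of the pointer size, which is consistent with a float32 checkpoint plus serialization overhead.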
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:46dd5b5cbf7141b0c5d882516243abf71cfa4a27e57023c43290d655fccfb48c
+ size 441551705
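Note on `pytorch_model.bin`: what the commit adds is a three-line Git LFS pointer, not the weights themselves. A minimal sketch of reading such a pointer and checking downloaded bytes against it follows. It assumes the simple `key value` line layout seen above; the helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical and not part of any Git LFS tooling.

```python
import hashlib
import re

# Pointer contents exactly as added in this commit.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:46dd5b5cbf7141b0c5d882516243abf71cfa4a27e57023c43290d655fccfb48c
size 441551705
"""


def parse_lfs_pointer(text):
    """Parse 'key value' lines of an LFS pointer into a small dict."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    algorithm, digest = fields["oid"].split(":", 1)
    if algorithm != "sha256" or not re.fullmatch(r"[0-9a-f]{64}", digest):
        raise ValueError("unexpected oid format")
    return {
        "version": fields["version"],
        "sha256": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data, pointer):
    """Check that raw bytes have the size and SHA-256 the pointer records."""
    return (
        len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["sha256"]
    )


pointer = parse_lfs_pointer(POINTER_TEXT)

if __name__ == "__main__":
    print(f"oid:  {pointer['sha256']}")
    print(f"size: {pointer['size'] / 2**20:.1f} MiB")
```

For a file of this size, hashing in chunks from disk would be preferable to loading all bytes at once; `matches_pointer` takes bytes only to keep the sketch short.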
vocab.txt ADDED
The diff for this file is too large to render. See raw diff