<img align="center" src="https://i.ibb.co/0yz81K9/sec-bert-logo.png" alt="sec-bert-logo" width="400"/>

SEC-BERT is a family of BERT models for the financial domain, intended to assist financial NLP research and FinTech applications.
SEC-BERT consists of the following models:

* [**SEC-BERT-BASE**](https://huggingface.co/nlpaueb/sec-bert-base): the same architecture as BERT-BASE, trained on financial documents.
* [**SEC-BERT-NUM**](https://huggingface.co/nlpaueb/sec-bert-num): the same as SEC-BERT-BASE, but every number token is replaced with a [NUM] pseudo-token, handling all numeric expressions in a uniform manner and disallowing their fragmentation.
* **SEC-BERT-SHAPE** (this model): the same as SEC-BERT-BASE, but numbers are replaced with pseudo-tokens that represent each number's shape, so numeric expressions (of known shapes) are no longer fragmented, e.g., '53.2' becomes '[XX.X]' and '40,200.5' becomes '[XX,XXX.X]'.
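
For example, under these conventions an illustrative input such as "Revenue grew from 53.2 million to 40,200.5 million" would be fed to SEC-BERT-NUM as "Revenue grew from [NUM] million to [NUM] million" and to SEC-BERT-SHAPE as "Revenue grew from [XX.X] million to [XX,XXX.X] million" (the sentence is invented here purely for illustration).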
 
## Pre-training corpus

The model was pre-trained on 260,773 10-K filings from 1993-2019, publicly available at the U.S. Securities and Exchange Commission (SEC).
 
## Pre-training details

* We created a new vocabulary of 30k subwords by training a [BertWordPieceTokenizer](https://github.com/huggingface/tokenizers) from scratch on the pre-training corpus (see the sketch after this list).
* We trained BERT using the official code provided in [Google BERT's GitHub repository](https://github.com/google-research/bert).
* We then used [Hugging Face](https://huggingface.co)'s [Transformers](https://github.com/huggingface/transformers) conversion script to convert the TF checkpoint into the desired format, so that both PyTorch and TF2 users can load the model in two lines of code.
* We release a model similar to the English BERT-BASE model (12-layer, 768-hidden, 12-heads, 110M parameters).
* We chose to follow the same training set-up: 1 million training steps with batches of 256 sequences of length 512 and an initial learning rate of 1e-4.
* We were able to use a single Google Cloud TPU v3-8, provided for free by the [TensorFlow Research Cloud (TRC)](https://sites.research.google/trc) program, while also utilizing [GCP research credits](https://edu.google.com/programs/credits/research). Huge thanks to both Google programs for supporting us!
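
As a concrete illustration of the first step above, the sketch below trains a 30k-subword WordPiece vocabulary with the [tokenizers](https://github.com/huggingface/tokenizers) library; the corpus file names, lower-casing choice, and output directory are placeholder assumptions, not the exact SEC-BERT configuration.

```python
import os

from tokenizers import BertWordPieceTokenizer

# Train a 30k-subword WordPiece vocabulary from scratch on the pre-training corpus.
# The corpus file names and the lowercase setting are placeholders.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["10k_filings_part1.txt", "10k_filings_part2.txt"],
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Writes sec-bert-vocab/vocab.txt, which the BERT pre-training code consumes.
os.makedirs("sec-bert-vocab", exist_ok=True)
tokenizer.save_model("sec-bert-vocab")
```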
 

## Load Pretrained Model
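
The model and its tokenizer can be loaded with the [Transformers](https://github.com/huggingface/transformers) Auto classes (the import and tokenizer lines are included here for completeness, following the usual pattern for these model cards):

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/sec-bert-shape")
model = AutoModel.from_pretrained("nlpaueb/sec-bert-shape")
```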

## Pre-process Text

To use SEC-BERT-SHAPE, you must pre-process texts, replacing every numeric token with the corresponding shape pseudo-token from a list of 214 predefined shape pseudo-tokens. If a numeric token does not correspond to any of these shapes, it is replaced with the [NUM] pseudo-token instead.
Below is an example of how to pre-process a simple sentence. The approach is deliberately simple; feel free to modify it as you see fit.
 

```python
import re
# ... (the rest of the original example is omitted here) ...
print(tokenized_sentence)
```
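
Since the original example is only partially shown above, here is a self-contained sketch of the same idea. The `SHAPE_PSEUDO_TOKENS` set, the regular expressions, the helper names, and the sample sentence are illustrative stand-ins; the card defines 214 shape pseudo-tokens, only a handful of which are listed here.

```python
import re

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/sec-bert-shape")

# Illustrative subset of the 214 predefined shape pseudo-tokens in the SEC-BERT-SHAPE vocabulary.
SHAPE_PSEUDO_TOKENS = {"[X]", "[X.X]", "[XX.X]", "[XXXX]", "[XX,XXX.X]"}

NUMBER_PATTERN = re.compile(r"\d[\d,.]*")  # crude matcher for numeric tokens


def number_to_pseudo_token(token):
    """Map a numeric token to its shape pseudo-token, e.g. '53.2' -> '[XX.X]'.

    Falls back to '[NUM]' when the shape is not one of the predefined pseudo-tokens.
    """
    shape = "[" + re.sub(r"\d", "X", token) + "]"
    return shape if shape in SHAPE_PSEUDO_TOKENS else "[NUM]"


def preprocess(sentence):
    """Replace numbers with pseudo-tokens and tokenize everything else as usual."""
    tokens = []
    for word in sentence.split():
        if NUMBER_PATTERN.fullmatch(word):
            # The pseudo-token is kept whole; it exists as a single entry in the vocabulary.
            tokens.append(number_to_pseudo_token(word))
        else:
            tokens.extend(tokenizer.tokenize(word))
    return tokens


sentence = "Total revenue increased to 40,200.5 million from 53.2 million"
tokenized_sentence = preprocess(sentence)
print(tokenized_sentence)
```

The resulting token list can then be mapped to input IDs with `tokenizer.convert_tokens_to_ids(tokenized_sentence)` before being passed to the model.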

## Publication

If you use this model, please cite the following article:<br>
[**FiNER: Financial Numeric Entity Recognition for XBRL Tagging**](https://arxiv.org/abs/2203.06482)<br>
Lefteris Loukas, Manos Fergadiotis, Ilias Chalkidis, Eirini Spyropoulou, Prodromos Malakasiotis, Ion Androutsopoulos and George Paliouras<br>
In the Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022, Long Papers), Dublin, Republic of Ireland, May 22-27, 2022

```
@inproceedings{loukas-etal-2022-finer,
    title = {FiNER: Financial Numeric Entity Recognition for XBRL Tagging},
    author = {Loukas, Lefteris and Fergadiotis, Manos and Chalkidis, Ilias and Spyropoulou, Eirini and Malakasiotis, Prodromos and Androutsopoulos, Ion and Paliouras, George},
    booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022)},
    address = {Dublin, Republic of Ireland},
    year = {2022},
    url = {https://arxiv.org/abs/2203.06482}
}
```

## About Us

[AUEB's Natural Language Processing Group](http://nlp.cs.aueb.gr) develops algorithms, models, and systems that allow computers to process and generate natural language texts.

The group's current research interests include, among others, text classification (including filtering spam and abusive content) and machine learning in natural language processing, especially deep learning.

The group is part of the Information Processing Laboratory of the Department of Informatics of the Athens University of Economics and Business.

[Manos Fergadiotis](https://manosfer.github.io) on behalf of [AUEB's Natural Language Processing Group](http://nlp.cs.aueb.gr)
 