---
language: en
pipeline_tag: fill-mask
license: cc-by-sa-4.0
thumbnail: https://i.ibb.co/0yz81K9/sec-bert-logo.png
tags:
  - finance
  - financial
widget:
  - text: Total net sales [MASK] 2% or $5.4 billion during 2019 compared to 2018.
  - text: Total net sales decreased 2% or $5.4 [MASK] during 2019 compared to 2018.
  - text: >-
      During 2019, the Company [MASK] $67.1 billion of its common stock and paid
      dividend equivalents of $14.1 billion.
  - text: >-
      During 2019, the Company repurchased $67.1 billion of its common [MASK]
      and paid dividend equivalents of $14.1 billion.
  - text: >-
      During 2019, the Company repurchased $67.1 billion of its common stock and
      paid [MASK] equivalents of $14.1 billion.
  - text: >-
      During 2019, the Company repurchased $67.1 billion of its common stock and
      paid dividend [MASK] of $14.1 billion.
---

# SEC-BERT

![SEC-BERT logo](https://i.ibb.co/0yz81K9/sec-bert-logo.png)

SEC-BERT is a family of BERT models for the financial domain, intended to assist financial NLP research and FinTech applications.
SEC-BERT consists of the following models:

  • SEC-BERT-BASE (this model)
  • SEC-BERT-NUM: We replace every number token with a [NUM] pseudo-token, handling all numeric expressions in a uniform manner and disallowing their fragmentation.
  • SEC-BERT-SHAPE: We replace numbers with pseudo-tokens that represent the number’s shape, so numeric expressions (of known shapes) are no longer fragmented; e.g., '53.2' becomes '[XX.X]' and '40,200.5' becomes '[XX,XXX.X]'. A rough preprocessing sketch is shown after this list.
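
The snippet below is a minimal, regex-based sketch of the preprocessing idea behind SEC-BERT-NUM and SEC-BERT-SHAPE. It is illustrative only: the helper names are ours, and the official preprocessing for the released variants may handle edge cases (such as shapes missing from the vocabulary) differently.

```python
import re

# Matches integers and decimals, with optional thousands separators.
NUM_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")

def to_num_tokens(text: str) -> str:
    """Replace every number with the [NUM] pseudo-token (SEC-BERT-NUM style)."""
    return NUM_RE.sub("[NUM]", text)

def to_shape_tokens(text: str) -> str:
    """Replace every number with a shape pseudo-token, e.g. 53.2 -> [XX.X] (SEC-BERT-SHAPE style)."""
    def shape(match: re.Match) -> str:
        return "[" + re.sub(r"\d", "X", match.group(0)) + "]"
    return NUM_RE.sub(shape, text)

sentence = "Total net sales decreased 2% or $5.4 billion during 2019 compared to 2018."
print(to_num_tokens(sentence))    # ... decreased [NUM]% or $[NUM] billion during [NUM] ...
print(to_shape_tokens(sentence))  # ... decreased [X]% or $[X.X] billion during [XXXX] ...
```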

## Pre-training corpus

The model was pre-trained on 260,773 10-K filings from 1993-2019 that are publicly available from the U.S. Securities and Exchange Commission (SEC).

## Pre-training details

  • We created a new vocabulary of 30k subwords by training a BertWordPieceTokenizer from scratch on the pre-training corpus (a rough sketch of this step is shown after this list).
  • We trained BERT using the official code provided in Google BERT's GitHub repository.
  • We then used Hugging Face's Transformers conversion script to convert the TF checkpoint into the desired format, so that the model can be loaded in two lines of code by both PyTorch and TF2 users.
  • We release a model similar to the English BERT-BASE model (12-layer, 768-hidden, 12-heads, 110M parameters).
  • We chose to follow the same training set-up: 1 million training steps with batches of 256 sequences of length 512 and an initial learning rate of 1e-4.
  • We were able to use a single Google Cloud TPU v3-8 provided for free by the TensorFlow Research Cloud (TFRC), while also utilizing GCP research credits. Huge thanks to both Google programs for supporting us!
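
As a rough illustration of the vocabulary step mentioned above, the sketch below trains a 30k-subword WordPiece vocabulary with the Hugging Face tokenizers library. The corpus path and all settings other than the vocabulary size are placeholders, not the authors' exact configuration.

```python
from tokenizers import BertWordPieceTokenizer

# Placeholder path to plain-text files holding the pre-training corpus (the 10-K filings).
corpus_files = ["data/10k_filings.txt"]

# Train a WordPiece vocabulary of 30k subwords from scratch.
# lowercase=True is an assumption here, mirroring BERT-BASE-UNCASED.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(files=corpus_files, vocab_size=30000, min_frequency=2)

# Writes vocab.txt, which the BERT pre-training code consumes.
tokenizer.save_model(".")
```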

## Load Pretrained Model

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/sec-bert-base")
model = AutoModel.from_pretrained("nlpaueb/sec-bert-base")
```
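
The same checkpoint can also be queried through the fill-mask pipeline, which is how masked-token predictions like those in the next section can be reproduced. A minimal usage sketch (the example sentence is one of the samples below):

```python
from transformers import pipeline

# Fill-mask pipeline on top of the same checkpoint.
fill_mask = pipeline("fill-mask", model="nlpaueb/sec-bert-base")

# Top predictions for the [MASK] position, with their probabilities.
predictions = fill_mask(
    "Total net sales [MASK] 2% or $5.4 billion during 2019 compared to 2018."
)
for p in predictions:
    print(p["token_str"], round(p["score"], 3))
```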

## Use SEC-BERT variants as Language Models

| Sample | Masked Token |
| --- | --- |
| Total net sales [MASK] 2% or $5.4 billion during 2019 compared to 2018. | decreased |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | increased (0.221), were (0.131), are (0.103), rose (0.075), of (0.058) |
| SEC-BERT-BASE | increased (0.678), decreased (0.282), declined (0.017), grew (0.016), rose (0.004) |
| SEC-BERT-NUM | increased (0.665), decreased (0.281), grew (0.028), declined (0.015), rose (0.008) |
| SEC-BERT-SHAPE | increased (0.793), decreased (0.145), grew (0.042), declined (0.011), rose (0.003) |

| Sample | Masked Token |
| --- | --- |
| Total net sales decreased 2% or $5.4 [MASK] during 2019 compared to 2018. | billion |
| During 2019, the Company [MASK] $67.1 billion of its common stock and paid dividend equivalents of $14.1 billion. | repurchased |
| During 2019, the Company repurchased $67.1 billion of its common [MASK] and paid dividend equivalents of $14.1 billion. | stock |
| During 2019, the Company repurchased $67.1 billion of its common stock and paid [MASK] equivalents of $14.1 billion. | dividend |
| During 2019, the Company repurchased $67.1 billion of its common stock and paid dividend [MASK] of $14.1 billion. | equivalents |
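
To query the SEC-BERT-NUM or SEC-BERT-SHAPE variants on these samples, the sentence first has to be preprocessed as described earlier. The sketch below spells out the shape pseudo-tokens by hand for the first sample, assuming these particular shapes exist in the SEC-BERT-SHAPE vocabulary.

```python
from transformers import pipeline

# SEC-BERT-SHAPE expects numbers to be replaced by shape pseudo-tokens,
# so the sample sentence is written here in its preprocessed form.
fill_mask_shape = pipeline("fill-mask", model="nlpaueb/sec-bert-shape")

sample = "Total net sales [MASK] [X]% or $[X.X] billion during [XXXX] compared to [XXXX]."
for p in fill_mask_shape(sample):
    print(p["token_str"], round(p["score"], 3))
```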

## About Us

AUEB's Natural Language Processing Group develops algorithms, models, and systems that allow computers to process and generate natural language texts.

The group's current research interests include:

  • question answering systems for databases, ontologies, document collections, and the Web, especially biomedical question answering,
  • natural language generation from databases and ontologies, especially Semantic Web ontologies,
  • text classification, including filtering spam and abusive content,
  • information extraction and opinion mining, including legal text analytics and sentiment analysis,
  • natural language processing tools for Greek, for example parsers and named-entity recognizers,
  • machine learning in natural language processing, especially deep learning.

The group is part of the Information Processing Laboratory of the Department of Informatics of the Athens University of Economics and Business.

Manos Fergadiotis on behalf of AUEB's Natural Language Processing Group