SEC-EDGAR GPT-2 124M

A GPT-2 (124M) language model trained from scratch on SEC EDGAR filings (10-K, 10-Q, 8-K, etc.).

Model Details

Property            Value
------------------  ---------------------------------------------
Architecture        GPT-2 124M (12 layers, 12 heads, 768 hidden size)
Parameters          124,475,904
Context Length      1,024 tokens
Tokenizer           GPT-2 BPE (tiktoken)
Training Tokens     ~1.55B (1 epoch)
Training Steps      47,000
Validation Loss     2.28
Training Framework  nanoGPT
Training Hardware   NVIDIA RTX 4070 12GB
Training Time       ~8 hours
Bias                No (bias=False)
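
The figures above can be cross-checked with some quick arithmetic. The sketch below assumes the validation loss is a per-token cross-entropy in nats (the usual convention in nanoGPT, though the card does not say so explicitly) and that each training sequence fills the full 1,024-token context; the batch composition it derives is an inference, not a reported setting.

```python
import math

# Values copied from the model details above.
val_loss = 2.28          # assumed to be cross-entropy in nats per token
train_tokens = 1.55e9    # approximate, as reported
train_steps = 47_000
context_len = 1_024

# Perplexity is the exponential of the per-token cross-entropy.
perplexity = math.exp(val_loss)
# The same loss expressed in bits per token.
bits_per_token = val_loss / math.log(2)

# Average tokens consumed per optimizer step, and how many
# full-context sequences that corresponds to (an inferred figure).
tokens_per_step = train_tokens / train_steps
sequences_per_step = tokens_per_step / context_len

print(f"perplexity         ~ {perplexity:.2f}")
print(f"bits per token     ~ {bits_per_token:.2f}")
print(f"tokens per step    ~ {tokens_per_step:,.0f}")
print(f"sequences per step ~ {sequences_per_step:.1f}")
```

Under these assumptions the loss corresponds to a perplexity a little under 10, and each step processes roughly 32 sequences of 1,024 tokens.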

Training Data

SEC EDGAR filings sourced from the SEC-EDGAR corpus on HuggingFace, covering annual reports (10-K), quarterly reports (10-Q), current reports (8-K), and other filing types. Tokenized with GPT-2 BPE into ~1.55B tokens across 16 shards.
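
As an illustration of the tokenize-and-shard step, here is a minimal sketch. It is not the repository's actual preprocessing script: the function names, the shard file naming, and the `shard_size` value are hypothetical, and it assumes the common nanoGPT-style layout of raw uint16 token ids with an `<|endoftext|>` token appended to each document.

```python
from array import array
from pathlib import Path
from typing import Iterable, Iterator, List

GPT2_EOT = 50256  # id of <|endoftext|> in the GPT-2 vocabulary


def tokenize_documents(texts: Iterable[str]) -> Iterator[List[int]]:
    """Yield GPT-2 BPE ids for each document, terminated by <|endoftext|>."""
    import tiktoken  # imported lazily so write_shards works without it

    enc = tiktoken.get_encoding("gpt2")
    for text in texts:
        ids = enc.encode_ordinary(text)  # treats special-token strings as plain text
        ids.append(GPT2_EOT)
        yield ids


def write_shards(docs: Iterable[List[int]], out_dir: str, shard_size: int) -> List[Path]:
    """Pack token ids into uint16 shards of shard_size tokens (last may be shorter)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    buf = array("H")  # unsigned 16-bit; every GPT-2 id fits

    def flush(chunk: array) -> None:
        path = out / f"shard_{len(paths):04d}.bin"  # hypothetical naming scheme
        with open(path, "wb") as f:
            chunk.tofile(f)
        paths.append(path)

    for ids in docs:
        buf.extend(ids)
        while len(buf) >= shard_size:
            flush(buf[:shard_size])
            buf = buf[shard_size:]
    if len(buf) > 0:
        flush(buf)
    return paths
```

With ~1.55B tokens in 16 shards, each shard would hold on the order of 97M tokens, so a call might look like `write_shards(tokenize_documents(filings), "data/", shard_size=97_000_000)`, where `filings` is any iterable of filing texts. The resulting files can be read back with `numpy.fromfile(path, dtype=numpy.uint16)`.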

Usage

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the weights and tokenizer from the Hugging Face Hub
model = GPT2LMHeadModel.from_pretrained("lzwjava/sec-edgar-gpt-124m")
tokenizer = GPT2Tokenizer.from_pretrained("lzwjava/sec-edgar-gpt-124m")

prompt = "UNITED STATES SECURITIES AND EXCHANGE COMMISSION"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Sample up to 200 new tokens; GPT-2 has no pad token, so reuse EOS for padding
output = model.generate(
    input_ids,
    max_new_tokens=200,
    temperature=0.8,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0]))
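
Because the context length is 1,024 tokens, a long prompt plus the requested new tokens must fit inside that window. The helper below is a hypothetical sketch (the name `fit_prompt` is not part of the model or of transformers) that keeps only the most recent prompt tokens so generation has room to run.

```python
from typing import List

CONTEXT_LEN = 1024  # context length listed in the model details


def fit_prompt(prompt_ids: List[int], max_new_tokens: int,
               context_len: int = CONTEXT_LEN) -> List[int]:
    """Keep the last tokens of the prompt so prompt + new tokens fit the context."""
    budget = context_len - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than the context length")
    # Slicing from the end keeps the text closest to where generation starts.
    return prompt_ids[-budget:]
```

A typical use would be to call `tokenizer.encode(long_text)` without `return_tensors`, pass the resulting list through `fit_prompt(ids, 200)`, and wrap the result in `torch.tensor([...])` before calling `model.generate`.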

Limitations

  • Trained for only 1 epoch: output stays coherent for ~200-500 tokens before falling into repetitive loops
  • No instruction tuning or RLHF: this is a raw language model
  • 124M parameters is small; don't expect state-of-the-art quality
  • GPT-2 tokenizer may not handle all financial notation optimally
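
One way to cope with the repetitive loops noted above is to detect where the output starts repeating and cut it there. The sketch below is a simple heuristic, not something shipped with the model: the function name and the default n-gram length of 8 are assumptions chosen for illustration.

```python
from typing import List, Optional


def first_repeat_index(token_ids: List[int], n: int = 8) -> Optional[int]:
    """Return the index where an n-gram of tokens first reappears, else None."""
    seen = set()
    for i in range(len(token_ids) - n + 1):
        gram = tuple(token_ids[i:i + n])
        if gram in seen:
            return i  # the text from here on repeats an earlier n-gram
        seen.add(gram)
    return None


def trim_repetition(token_ids: List[int], n: int = 8) -> List[int]:
    """Drop everything from the first repeated n-gram onward."""
    cut = first_repeat_index(token_ids, n)
    return token_ids if cut is None else token_ids[:cut]
```

Applied to `output[0].tolist()` from the usage example, `trim_repetition` returns token ids that can be passed back to `tokenizer.decode`. Boilerplate phrases that legitimately recur in filings may trigger it, so a larger `n` may be safer. As an alternative at generation time, `model.generate` in transformers accepts `no_repeat_ngram_size` and `repetition_penalty`.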

Source Code

Training code and development notes: github.com/lzwjava/sec-edgar-gpt

Citation

@misc{sec-edgar-gpt-124m,
  author = {Zhiwei Li},
  title = {SEC-EDGAR GPT-2 124M},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/lzwjava/sec-edgar-gpt}
}