Financial Filings Sparse Encoder v1

This is a Sentence Transformers SparseEncoder / SPLADE-style model fine-tuned for financial filing retrieval.

The practical recommendation from the experiments below is to index document vectors after top-128 pruning. In the current proxy retrieval benchmark, top-128 keeps almost all of the unpruned quality (Recall@10 of 67.2% versus 67.5%) while reducing the average document from about 320 to about 126 active sparse terms.

Naming note: the model is not document-only. It is an asymmetric query/document sparse encoder, following the OpenSearch model-family convention where the heavier document-side path is emphasized. Use encode_query(...) for online queries and encode_document(...) for offline document/chunk indexing.

TL;DR

| setting | value |
|---|---|
| Base model | opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill |
| Dataset | oneryalcin/financial-filings-sparse-retrieval-training, combined config |
| Training recipe | SpladeLoss(SparseMultipleNegativesRankingLoss) |
| Final steps | 1500 |
| Recommended serving | document top_k=128 |
| Triplet accuracy, top-128 | 78.0% |
| Retrieval proxy Recall@10, top-128 | 67.2% |
| Retrieval proxy nDCG@10, top-128 | 0.521 |

This is an experiment report as much as a model card. It describes what was tried, why decisions were made, what worked, and what remains unproven.

Why this exists

Financial filing search is not generic semantic similarity. Queries often refer to company events, accounting concepts, risk factors, segment details, and filing-specific language. A learned sparse model is attractive because it can improve ranking while preserving an Elasticsearch/OpenSearch-style sparse retrieval path.

The goal here was to train a domain-adapted sparse encoder that can be deployed as weighted sparse terms, then test whether the learned sparse signal is meaningfully better than both the base sparse model and a lexical BM25 baseline.

Starting point

This experiment started after the Sentence Transformers v5.5.0 release and its new train-sentence-transformers agent skill. The release made it easier to create a complete sparse-encoder training workflow: base model choice, loss selection, sparse regularization, top-k pruning checks, and model-card packaging.

Before training, the important constraints were:

  • Use a sparse encoder suitable for Elasticsearch/OpenSearch-style retrieval.
  • Keep vectors sparse enough to index in practice.
  • Use the user's existing Hugging Face dataset and local Apple Silicon machine.
  • Avoid overclaiming from pairwise training accuracy alone.

Base model decision

Chosen base model:

opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill

Reason: it already has separate query/document sparse encoding behavior and is aligned with OpenSearch neural sparse retrieval. Starting here means fine-tuning adapts a serving-compatible sparse model rather than building a new retrieval stack from scratch.

The "doc" in the base model name does not mean that queries are encoded with the document encoder. Use the model through the routed Sentence Transformers sparse API:

query_vectors = model.encode_query(queries)
document_vectors = model.encode_document(documents)

In practice, query encoding is the lightweight online side, while document encoding is the heavier offline/indexing side. The top-k pruning recommendation applies to document vectors before indexing.

Dataset

Dataset:

oneryalcin/financial-filings-sparse-retrieval-training

Config:

combined

Each usable row is treated as:

query
positive filing chunk
first non-empty hard negative filing chunk

Up to 20,000 training rows were requested. After dropping rows without a non-empty negative, 18,247 usable training triplets remained.

Evaluation used the first 1,000 usable rows from the held-out test split.

Known dataset fields include query, positive, negatives, query_type, company, and doc_type.

The exact v1 dataset preparation step is intentionally small and reproducible:

  1. Load the configured split from the Hugging Face dataset.
  2. Drop rows missing query, positive, or a non-empty negatives value.
  3. Select the first non-empty negative from negatives.
  4. Emit query, positive, negative, and available metadata.
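The filtering steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the contents of scripts/prepare_fin_sparse_dataset.py: the helper names are hypothetical, and it assumes negatives is either a single string or a list of strings.

```python
from typing import Iterable, Iterator, Optional

# Metadata fields named in this report; copied through when present.
METADATA_FIELDS = ("query_type", "company", "doc_type")


def first_non_empty(values) -> Optional[str]:
    """Return the first non-blank string in `values`, or None if there is none."""
    if isinstance(values, str):
        values = [values]
    for value in values or []:
        if isinstance(value, str) and value.strip():
            return value
    return None


def to_triplets(rows: Iterable[dict]) -> Iterator[dict]:
    """Yield query/positive/negative triplets, skipping unusable rows."""
    for row in rows:
        query = row.get("query")
        positive = row.get("positive")
        negative = first_non_empty(row.get("negatives"))
        if not query or not positive or negative is None:
            continue  # missing query, positive, or non-empty negative
        triplet = {"query": query, "positive": positive, "negative": negative}
        for field in METADATA_FIELDS:
            if row.get(field) is not None:
                triplet[field] = row[field]
        yield triplet
```

Because the function only takes the first non-empty negative, any additional negatives in a row are ignored, which matches the v1 description above.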

No metadata prefixes, query rewrites, false-negative adjudication, mined negatives, or slice rebalancing were used for v1 training.

Training recipe

Model type:

sentence_transformers.sparse_encoder.SparseEncoder

Loss:

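from sentence_transformers.sparse_encoder.losses import SparseMultipleNegativesRankingLoss, SpladeLoss

loss = SpladeLoss(
    model=model,  # the SparseEncoder being fine-tuned
    loss=SparseMultipleNegativesRankingLoss(model=model),
    query_regularizer_weight=1e-4,
    document_regularizer_weight=1e-2,
)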

Important settings:

| setting | value |
|---|---|
| max steps | 1500 |
| batch size | 8 |
| train rows requested | 20,000 |
| usable train rows | 18,247 |
| query regularization | 1e-4 |
| document regularization | 1e-2 |
| sampler | no-duplicates batch sampler |
| platform | local Apple Silicon, MPS where supported |
| package cutoff | uv run --exclude-newer=2026-05-13T00:00:00Z |

Why this recipe:

  • SparseMultipleNegativesRankingLoss teaches the model to rank the paired positive above in-batch negatives.
  • SpladeLoss adds FLOPS-style sparse regularization so the model does not emit unindexably dense document vectors.
  • A no-duplicates sampler keeps identical texts out of the same batch, which reduces accidental false negatives among the in-batch negatives.
  • Document regularization matters more than query regularization for index footprint.
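To make the regularization term concrete, here is a toy sketch of the FLOPS penalty as it is usually defined for SPLADE-style models: for each vocabulary term, take the mean absolute weight across the batch, square it, and sum over terms. The dict-based vectors and function names are illustrative assumptions; the library computes this on tensors and may differ in detail.

```python
from typing import Dict, List

SparseVector = Dict[str, float]  # term -> weight (toy representation)


def flops_penalty(batch: List[SparseVector]) -> float:
    """Sum over terms of (mean absolute weight across the batch) squared."""
    if not batch:
        return 0.0
    totals: Dict[str, float] = {}
    for vector in batch:
        for term, weight in vector.items():
            totals[term] = totals.get(term, 0.0) + abs(weight)
    n = len(batch)
    return sum((total / n) ** 2 for total in totals.values())


def regularized_loss(
    ranking_loss: float,
    query_batch: List[SparseVector],
    doc_batch: List[SparseVector],
    query_weight: float = 1e-4,  # query regularization used in this report
    doc_weight: float = 1e-2,    # document regularization used in this report
) -> float:
    """Ranking loss plus weighted FLOPS penalties for queries and documents."""
    return (
        ranking_loss
        + query_weight * flops_penalty(query_batch)
        + doc_weight * flops_penalty(doc_batch)
    )
```

The penalty grows fastest for terms that are active in many vectors of the batch, so a larger document weight pushes the model toward fewer, more selective document terms.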

Decision log

| decision point | choice | why |
|---|---|---|
| Retrieval family | SparseEncoder / SPLADE-style sparse retrieval | Best fit for Elasticsearch/OpenSearch sparse retrieval. |
| Base model | OpenSearch sparse doc v2 distill | Already query/document routed and serving-aligned. |
| Vocab | Keep base tokenizer/vocab | Changing vocabulary would be a larger pretraining-style project, not a fine-tune. |
| Loss | SpladeLoss + SparseMultipleNegativesRankingLoss | Combines ranking pressure with sparse regularization. |
| Sampler | no-duplicates | Reduces accidental false negatives in batch. |
| Doc regularization | increased to 1e-2 | Lower regularization gave high scores but thousands of active doc terms. |
| Serving vector size | top_k=128 | Preserved almost all quality with much smaller document vectors. |
| Final run length | 1500 steps | 500-step runs were promising; longer training improved both triplet and retrieval proxy scores. |
| Evaluation strategy | triplet + retrieval proxy + base sparse + BM25 | Avoids relying on a single easy metric. |

Regularization sweep

The first 500-step sweep showed the main tradeoff. Low document regularization produced good triplet accuracy but extremely dense document vectors.

| doc reg | triplet accuracy | positive doc dims | interpretation |
|---|---|---|---|
| 8e-5 | 73.8% | 4204.9 | Too dense. |
| 1.5e-4 | 74.8% | 3214.8 | Best raw 500-step accuracy, still too dense. |
| 3e-4 | 73.4% | 2101.4 | Still too dense. |
| 1e-3 | 72.7% | 1340.5 | Better, still high. |
| 3e-3 | 73.4% | 657.1 | Practical direction. |
| 1e-2 | 73.2% | 296.7 | Chosen for longer run. |
| 3e-2 | 72.3% | 160.9 | Compact, lower quality. |

The final model uses document_regularizer_weight=1e-2 and 1500 steps.
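One way to read the sweep is as a constrained choice: pick the highest triplet accuracy among settings whose mean document size fits a budget. The sketch below applies that rule to the sweep numbers above. The budget values are hypothetical illustrations; the report does not state a numeric budget, and this is not necessarily how the choice was actually made.

```python
from typing import List, Optional, Tuple

# (doc_reg, triplet_accuracy_pct, mean_positive_doc_dims) from the 500-step sweep.
SWEEP: List[Tuple[float, float, float]] = [
    (8e-5, 73.8, 4204.9),
    (1.5e-4, 74.8, 3214.8),
    (3e-4, 73.4, 2101.4),
    (1e-3, 72.7, 1340.5),
    (3e-3, 73.4, 657.1),
    (1e-2, 73.2, 296.7),
    (3e-2, 72.3, 160.9),
]


def best_under_budget(
    sweep: List[Tuple[float, float, float]], max_doc_dims: float
) -> Optional[Tuple[float, float, float]]:
    """Return the most accurate setting whose mean doc dims fit the budget."""
    eligible = [row for row in sweep if row[2] <= max_doc_dims]
    if not eligible:
        return None
    return max(eligible, key=lambda row: row[1])


# Hypothetical budgets: a tight one and a looser one.
tight = best_under_budget(SWEEP, max_doc_dims=300)
loose = best_under_budget(SWEEP, max_doc_dims=700)
```

With a budget of 300 active dimensions the rule selects 1e-2; relaxing it to 700 selects 3e-3, which gains 0.2 points of accuracy at more than twice the document size.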

Evaluation 1: triplet accuracy

For each held-out row:

score(query, positive) > score(query, hard_negative)

Metric:

accuracy = fraction of rows where positive scores higher than the paired negative
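A minimal sketch of this metric, using toy term-to-weight dicts in place of the model's sparse tensors. The function names are hypothetical, and "mean margin" is assumed here to be the average of positive score minus negative score, which the report does not define explicitly.

```python
from typing import Dict, Iterable, Tuple

SparseVector = Dict[str, float]  # term -> weight (toy representation)


def sparse_dot(query: SparseVector, document: SparseVector) -> float:
    """Dot product over the terms the two sparse vectors share."""
    return sum(weight * document.get(term, 0.0) for term, weight in query.items())


def triplet_accuracy(
    triplets: Iterable[Tuple[SparseVector, SparseVector, SparseVector]]
) -> Tuple[float, float]:
    """Return (accuracy, mean margin) over (query, positive, negative) triplets."""
    margins = [sparse_dot(q, pos) - sparse_dot(q, neg) for q, pos, neg in triplets]
    if not margins:
        raise ValueError("need at least one triplet")
    accuracy = sum(margin > 0 for margin in margins) / len(margins)
    mean_margin = sum(margins) / len(margins)
    return accuracy, mean_margin


example = [
    ({"risk": 1.0, "liquidity": 2.0}, {"liquidity": 1.5, "cash": 0.5}, {"risk": 0.5}),
    ({"revenue": 1.0}, {"revenue": 1.0}, {"revenue": 2.0}),
]
accuracy, mean_margin = triplet_accuracy(example)
```

Ties count as failures in this sketch because the comparison is strict, matching the `>` in the definition above.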

| model | doc pruning | accuracy | mean margin | query dims | positive doc dims |
|---|---|---|---|---|---|
| base sparse | unpruned | 49.6% | 0.451 | 15.2 | 371.1 |
| fine-tuned sparse | unpruned | 78.1% | 3.004 | 15.2 | 340.9 |
| fine-tuned sparse | top-128 | 78.0% | 2.992 | 15.2 | 126.7 |
| fine-tuned sparse | top-64 | 75.9% | 2.988 | 15.2 | 64.0 |

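Interpretation: fine-tuning raised held-out triplet accuracy from 49.6% for the base model to 78.1%, so the model learned a strong domain signal on this pairwise task. Top-128 pruning preserved almost all of that quality (78.0%), while top-64 cost about two points (75.9%).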

Evaluation 2: in-memory retrieval proxy

A retrieval candidate pool was built from the held-out test split:

  • 1,000 held-out queries
  • all unique positives from those rows
  • all unique first hard negatives from those rows
  • 1,912 unique candidate chunks total

For each query, the known positive chunk is the only labeled relevant document. The model ranks all 1,912 candidate chunks by sparse dot product. BM25 ranks the same candidate corpus with a local lexical implementation.
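The pool construction and ranking can be sketched as follows. This is an illustration rather than the contents of scripts/eval_fin_sparse_retrieval_proxy.py: it assumes chunks are de-duplicated by exact text, represents sparse vectors as toy term-to-weight dicts, and ranks ties optimistically.

```python
from typing import Dict, List, Tuple

SparseVector = Dict[str, float]  # term -> weight (toy representation)


def build_candidate_pool(rows: List[dict]) -> Tuple[List[str], Dict[str, int]]:
    """Collect unique positives and first hard negatives in first-seen order."""
    pool: List[str] = []
    index: Dict[str, int] = {}
    for row in rows:
        for chunk in (row["positive"], row["negative"]):
            if chunk not in index:
                index[chunk] = len(pool)
                pool.append(chunk)
    return pool, index


def positive_rank(
    query_vec: SparseVector, doc_vecs: List[SparseVector], positive_idx: int
) -> int:
    """1-based rank of the positive when candidates are sorted by dot product."""
    scores = [
        sum(weight * doc.get(term, 0.0) for term, weight in query_vec.items())
        for doc in doc_vecs
    ]
    target = scores[positive_idx]
    # Only strictly higher scores push the positive down (optimistic ties).
    return 1 + sum(score > target for score in scores)
```

De-duplication is why 1,000 rows yield fewer than 2,000 candidates: a chunk that appears as a positive for one query and a negative for another is stored once.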

Metrics:

  • Recall@1
  • Recall@5
  • Recall@10
  • Recall@20
  • MRR@10
  • nDCG@10
  • median rank
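With exactly one labeled relevant chunk per query, all of these metrics reduce to simple functions of the positive's rank. The sketch below shows that reduction; the function name and the dict keys are illustrative, not taken from the evaluation scripts.

```python
import math
from statistics import median
from typing import Dict, Sequence


def retrieval_metrics(
    ranks: Sequence[int], cutoffs: Sequence[int] = (1, 5, 10, 20)
) -> Dict[str, float]:
    """Compute metrics from the 1-based rank of the single positive per query."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    n = len(ranks)
    metrics = {f"recall@{k}": sum(r <= k for r in ranks) / n for k in cutoffs}
    # Reciprocal rank, counted as zero when the positive is below position 10.
    metrics["mrr@10"] = sum(1.0 / r for r in ranks if r <= 10) / n
    # With one relevant document the ideal DCG is 1, so nDCG is 1 / log2(rank + 1).
    metrics["ndcg@10"] = sum(1.0 / math.log2(r + 1) for r in ranks if r <= 10) / n
    metrics["median_rank"] = float(median(ranks))
    return metrics


example = retrieval_metrics([1, 2, 4, 11])
```

Since nDCG@10 discounts logarithmically and MRR@10 discounts as 1/rank, nDCG@10 is at least as large as MRR@10 for the same ranks, which is consistent with the pattern in the results.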

| model | pruning | Recall@1 | Recall@5 | Recall@10 | Recall@20 | MRR@10 | nDCG@10 | median rank |
|---|---|---|---|---|---|---|---|---|
| fine-tuned sparse | unpruned | 39.0% | 58.9% | 67.5% | 75.0% | 0.479 | 0.526 | 3 |
| fine-tuned sparse | top-128 | 38.6% | 57.8% | 67.2% | 73.9% | 0.473 | 0.521 | 3 |
| fine-tuned sparse | top-64 | 35.0% | 55.7% | 64.8% | 72.4% | 0.442 | 0.491 | 4 |
| base sparse | unpruned | 32.1% | 56.5% | 63.7% | 69.9% | 0.431 | 0.481 | 3 |
| base sparse | top-128 | 31.4% | 54.4% | 62.7% | 69.7% | 0.422 | 0.472 | 3 |
| base sparse | top-64 | 29.5% | 52.6% | 59.5% | 66.5% | 0.396 | 0.444 | 4 |
| BM25 | lexical | 24.0% | 58.2% | 64.1% | 68.6% | 0.397 | 0.457 | 3 |

Interpretation:

  • Top-128 fine-tuned sparse is the best current deployment candidate.
  • BM25 is competitive at Recall@5 and Recall@10 (58.2% and 64.1%), but clearly worse at Recall@1 (24.0%), MRR@10, and nDCG@10.
  • Fine-tuning mostly improved early ranking: Recall@1 rose from 32.1% for the base sparse model to 39.0% unpruned, and 38.6% at top-128.
  • Top-64 is usable only if index size or latency dominates quality.

Recommended serving configuration

Use document top-k pruning:

document_top_k = 128

Rationale:

| setting | Recall@10 | nDCG@10 | doc active dims |
|---|---|---|---|
| unpruned | 67.5% | 0.526 | 319.8 |
| top-128 | 67.2% | 0.521 | 126.5 |
| top-64 | 64.8% | 0.491 | 64.0 |

Top-128 gives almost the same retrieval quality as unpruned (Recall@10 drops by 0.3 points and nDCG@10 by 0.005) while cutting mean active document dimensions from 319.8 to 126.5, about 60% fewer.

Usage

from sentence_transformers.sparse_encoder import SparseEncoder

model = SparseEncoder("oneryalcin/financial-filings-sparse-encoder-v1")

query_vectors = model.encode_query([
    "What does the company say about liquidity risk?"
])

document_vectors = model.encode_document([
    "The company discusses liquidity and capital resources in the MD&A section..."
])

For production-style sparse indexing, keep the highest-weighted 128 dimensions per document vector before indexing.
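As a rough sketch of that pruning step, assuming each document vector has already been converted to a term-to-weight mapping (the model itself returns sparse tensors, and the helper name here is hypothetical):

```python
import heapq
from typing import Dict

SparseVector = Dict[str, float]  # term -> weight (toy representation)


def prune_top_k(vector: SparseVector, k: int = 128) -> SparseVector:
    """Keep only the k highest-weighted terms of a sparse vector."""
    if k <= 0:
        raise ValueError("k must be positive")
    if len(vector) <= k:
        return dict(vector)  # already small enough; nothing to drop
    top = heapq.nlargest(k, vector.items(), key=lambda item: item[1])
    return dict(top)


document = {"liquidity": 2.0, "capital": 1.0, "resources": 0.5, "the": 0.1}
pruned = prune_top_k(document, k=2)
```

Vectors with fewer than k active terms pass through unchanged, which is why the mean size after top-128 pruning can sit slightly below 128.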

Reproduction

The local runs used uv with a package-date cutoff:

uv run --exclude-newer=2026-05-13T00:00:00Z scripts/train_fin_sparse_encoder_v2.py \
  --train-size 20000 \
  --eval-size 1000 \
  --max-steps 1500 \
  --batch-size 8 \
  --eval-batch-size 16 \
  --query-reg 1e-4 \
  --doc-reg 1e-2 \
  --run-name fin-sparse-encoder-doc-v2-reg1e-2-1500

Triplet top-k evaluation:

uv run --exclude-newer=2026-05-13T00:00:00Z scripts/eval_fin_sparse_topk.py \
  --eval-size 1000 \
  --batch-size 16 \
  --topks none,128,64 \
  --models ./

Retrieval proxy evaluation:

uv run --exclude-newer=2026-05-13T00:00:00Z scripts/eval_fin_sparse_retrieval_proxy.py \
  --model ./ \
  --eval-size 1000 \
  --batch-size 16 \
  --topks none,128,64

BM25 baseline:

uv run --exclude-newer=2026-05-13T00:00:00Z scripts/eval_fin_bm25_retrieval_proxy.py \
  --eval-size 1000

Dataset preparation audit/export:

uv run --exclude-newer=2026-05-13T00:00:00Z scripts/prepare_fin_sparse_dataset.py \
  --split train \
  --output prepared/train_triplets.jsonl

The script also has an experimental --add-metadata-prefix option for future v2 data experiments. That option was not used for this v1 checkpoint.

Files in this repo

| path | purpose |
|---|---|
| scripts/prepare_fin_sparse_dataset.py | Dataset filtering/export script that makes the v1 preprocessing explicit. |
| scripts/train_fin_sparse_encoder_v2.py | Training script used for the final run. |
| scripts/eval_fin_sparse_topk.py | Pairwise triplet accuracy with document top-k pruning. |
| scripts/eval_fin_sparse_retrieval_proxy.py | In-memory retrieval proxy over held-out positives and negatives. |
| scripts/eval_fin_bm25_retrieval_proxy.py | Local BM25 baseline over the same held-out candidate pool. |
| benchmarks/regularization_sweep_500_steps.json | 500-step document-regularization sweep. |
| benchmarks/triplet_topk_1000.json | Final triplet/top-k results. |
| benchmarks/retrieval_proxy_1000.json | Final retrieval proxy and baseline results. |

Limitations

These results are promising but not final production proof.

  • The retrieval benchmark is a proxy, not a full production OpenSearch/Elasticsearch benchmark.
  • Each query has only one labeled positive, so other genuinely relevant chunks may be counted as false competitors.
  • The candidate pool has 1,912 chunks, not millions.
  • Evaluation used the first 1,000 usable held-out examples, not the full test split.
  • BM25 is a local baseline, not a tuned OpenSearch BM25 setup.
  • End-to-end index size, latency, shard behavior, and hybrid retrieval have not been measured yet.

Next steps

  • Run the retrieval proxy on the full held-out test split.
  • Run a real OpenSearch/Elasticsearch index benchmark with top-128 sparse vectors.
  • Test hybrid BM25 + learned sparse retrieval.
  • Add query-type, company, and filing-type slice analysis.
  • Add human-labeled financial filing retrieval judgments.