Financial Filings Sparse Encoder v1

This is a Sentence Transformers SparseEncoder / SPLADE-style model fine-tuned for financial filing retrieval.

The practical recommendation from the experiments below is to index document vectors after top-128 pruning. In the current proxy retrieval benchmark, top-128 keeps almost all unpruned quality while reducing each document to about 126 active sparse terms.

Naming note: the model is not document-only. It is an asymmetric query/document sparse encoder, following the OpenSearch model-family convention where the heavier document-side path is emphasized. Use encode_query(...) for online queries and encode_document(...) for offline document/chunk indexing.

TL;DR

setting	value
Base model	`opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill`
Dataset	`oneryalcin/financial-filings-sparse-retrieval-training`, `combined` config
Training recipe	`SpladeLoss(SparseMultipleNegativesRankingLoss)`
Final steps	1500
Recommended serving	document `top_k=128`
Triplet accuracy, top-128	78.0%
Retrieval proxy Recall@10, top-128	67.2%
Retrieval proxy nDCG@10, top-128	0.521

This is an experiment report as much as a model card. It describes what was tried, why decisions were made, what worked, and what remains unproven.

Why this exists

Financial filing search is not generic semantic similarity. Queries often refer to company events, accounting concepts, risk factors, segment details, and filing-specific language. A learned sparse model is attractive because it can improve ranking while preserving an Elasticsearch/OpenSearch-style sparse retrieval path.

The goal here was to train a domain-adapted sparse encoder that can be deployed as weighted sparse terms, then test whether the learned sparse signal is meaningfully better than both the base sparse model and a lexical BM25 baseline.

Starting point

This experiment started after the Sentence Transformers v5.5.0 release and its new train-sentence-transformers agent skill. The release made it easier to create a complete sparse-encoder training workflow: base model choice, loss selection, sparse regularization, top-k pruning checks, and model-card packaging.

Before training, the important constraints were:

Use a sparse encoder suitable for Elasticsearch/OpenSearch-style retrieval.
Keep vectors sparse enough to index in practice.
Use the user's existing Hugging Face dataset and local Apple Silicon machine.
Avoid overclaiming from pairwise training accuracy alone.

Base model decision

Chosen base model:

opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill

Reason: it already has separate query/document sparse encoding behavior and is aligned with OpenSearch neural sparse retrieval. Starting here means fine-tuning adapts a serving-compatible sparse model rather than building a new retrieval stack from scratch.

The doc wording in the base model name does not mean queries are encoded with the document encoder. This model should be used with the routed Sentence Transformers sparse API:

query_vectors = model.encode_query(queries)
document_vectors = model.encode_document(documents)

In practice, query encoding is the lightweight online side, while document encoding is the heavier offline/indexing side. The top-k pruning recommendation applies to document vectors before indexing.

Dataset

Dataset:

oneryalcin/financial-filings-sparse-retrieval-training

Config:

combined

Each usable row is treated as:

query
positive filing chunk
first non-empty hard negative filing chunk

Training requested up to 20,000 rows. After dropping rows without a non-empty negative, 18,247 usable training triplets remained.

Evaluation used the first 1,000 usable rows from the held-out test split.

Known dataset fields include query, positive, negatives, query_type, company, and doc_type.

The exact v1 dataset preparation step is intentionally small and reproducible:

Load the configured split from the Hugging Face dataset.
Drop rows missing query, positive, or a non-empty negatives value.
Select the first non-empty negative from negatives.
Emit query, positive, negative, and available metadata.

No metadata prefixes, query rewrites, false-negative adjudication, mined negatives, or slice rebalancing were used for v1 training.
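
The filtering steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the contents of scripts/prepare_fin_sparse_dataset.py: the helper names first_non_empty and to_triplets are hypothetical, negatives is assumed to be a list of strings, and whitespace-only strings are assumed to count as empty.

```python
from typing import Iterable, Iterator, Optional


def first_non_empty(values) -> Optional[str]:
    """Return the first non-blank string in `values`, or None if there is none."""
    for value in values or []:
        if isinstance(value, str) and value.strip():
            return value
    return None


def to_triplets(rows: Iterable[dict]) -> Iterator[dict]:
    """Yield one triplet per usable row; unusable rows are dropped."""
    for row in rows:
        query = first_non_empty([row.get("query")])
        positive = first_non_empty([row.get("positive")])
        # Assumption: `negatives` is a list of strings; only the first non-blank one is used.
        negative = first_non_empty(row.get("negatives"))
        if query is None or positive is None or negative is None:
            continue
        triplet = {"query": query, "positive": positive, "negative": negative}
        # Carry along whichever metadata fields the row happens to have.
        for key in ("query_type", "company", "doc_type"):
            if row.get(key) is not None:
                triplet[key] = row[key]
        yield triplet


# Toy rows (made up for illustration): only the first one is usable.
rows = [
    {"query": "q1", "positive": "p1", "negatives": ["", "n1", "n2"], "company": "ACME"},
    {"query": "q2", "positive": "p2", "negatives": []},
    {"query": "", "positive": "p3", "negatives": ["n3"]},
]
triplets = list(to_triplets(rows))
```

Rows two and three are dropped because they lack a non-empty negative or query, which is the same kind of filtering that reduced the 20,000 requested rows to 18,247 triplets.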

Training recipe

Model type:

sentence_transformers.sparse_encoder.SparseEncoder

Loss:

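from sentence_transformers.sparse_encoder.losses import SparseMultipleNegativesRankingLoss, SpladeLoss

# model is the SparseEncoder being fine-tuned
loss = SpladeLoss(
    model=model,
    loss=SparseMultipleNegativesRankingLoss(model),
    query_regularizer_weight=1e-4,
    document_regularizer_weight=1e-2,
)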

Important settings:

setting	value
max steps	1500
batch size	8
train rows requested	20,000
usable train rows	18,247
query regularization	`1e-4`
document regularization	`1e-2`
sampler	no-duplicates batch sampler
platform	local Apple Silicon, MPS where supported
package cutoff	`uv run --exclude-newer=2026-05-13T00:00:00Z`

Why this recipe:

SparseMultipleNegativesRankingLoss teaches the model to rank the paired positive above in-batch negatives.
SpladeLoss adds FLOPS-style sparse regularization so the model does not emit unindexably dense document vectors.
A no-duplicates sampler keeps the same text from appearing twice in a batch, which reduces accidental false negatives among the in-batch negatives.
Document regularization matters more than query regularization for index footprint.
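
To make the regularization term concrete, here is a minimal sketch of a FLOPS-style penalty as it is commonly defined for SPLADE models: the squared mean activation of each vocabulary dimension, summed over dimensions. The function names and toy vectors are illustrative assumptions; the library computes the equivalent on tensors and may differ in details.

```python
def flops_penalty(batch):
    """Sum over vocabulary dims of the squared mean activation across the batch."""
    n = len(batch)
    dims = len(batch[0])
    return sum((sum(vec[j] for vec in batch) / n) ** 2 for j in range(dims))


def total_loss(ranking_loss, query_batch, doc_batch, query_weight=1e-4, doc_weight=1e-2):
    """Ranking loss plus weighted penalties (weights default to the values in this report)."""
    return (
        ranking_loss
        + query_weight * flops_penalty(query_batch)
        + doc_weight * flops_penalty(doc_batch)
    )


# Toy document batches over a 4-term vocabulary (made-up numbers).
dense_docs = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
sparse_docs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

dense_penalty = flops_penalty(dense_docs)    # every dim is active in every doc
sparse_penalty = flops_penalty(sparse_docs)  # each doc activates a single dim
```

The dense batch is penalized eight times more than the sparse one, and raising doc_weight scales that pressure, which is the lever explored in the regularization sweep.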

Decision log

decision point	choice	why
Retrieval family	`SparseEncoder` / SPLADE-style sparse retrieval	Best fit for Elasticsearch/OpenSearch sparse retrieval.
Base model	OpenSearch sparse doc v2 distill	Already query/document routed and serving-aligned.
Vocab	Keep base tokenizer/vocab	Changing vocabulary would be a larger pretraining-style project, not a fine-tune.
Loss	`SpladeLoss + SparseMultipleNegativesRankingLoss`	Combines ranking pressure with sparse regularization.
Sampler	no-duplicates	Reduces accidental false negatives in batch.
Doc regularization	increased to `1e-2`	Lower regularization gave high scores but thousands of active doc terms.
Serving vector size	`top_k=128`	Preserved almost all quality with much smaller document vectors.
Final run length	1500 steps	500-step runs were promising; longer training improved both triplet and retrieval proxy scores.
Evaluation strategy	triplet + retrieval proxy + base sparse + BM25	Avoids relying on a single easy metric.

Regularization sweep

The first 500-step sweep showed the main tradeoff. Low document regularization produced good triplet accuracy but extremely dense document vectors.

doc reg	triplet accuracy	positive doc dims	interpretation
`8e-5`	73.8%	4204.9	Too dense.
`1.5e-4`	74.8%	3214.8	Best raw 500-step accuracy, still too dense.
`3e-4`	73.4%	2101.4	Still too dense.
`1e-3`	72.7%	1340.5	Better, still high.
`3e-3`	73.4%	657.1	Practical direction.
`1e-2`	73.2%	296.7	Chosen for longer run.
`3e-2`	72.3%	160.9	Compact, lower quality.

The final model uses document_regularizer_weight=1e-2 and 1500 steps.

Evaluation 1: triplet accuracy

For each held-out row:

score(query, positive) > score(query, hard_negative)

Metric:

accuracy = fraction of rows where positive scores higher than the paired negative
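
A minimal sketch of this metric, assuming each sparse vector is held as a dict from term to weight and that "mean margin" means the average of score(query, positive) minus score(query, hard_negative). Both are assumptions made for illustration; the evaluation script may represent vectors and margins differently.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(weight * b[term] for term, weight in a.items() if term in b)


def triplet_metrics(triplets):
    """Return (accuracy, mean margin) over (query, positive, negative) vector triplets."""
    margins = [sparse_dot(q, p) - sparse_dot(q, n) for q, p, n in triplets]
    accuracy = sum(1 for m in margins if m > 0) / len(margins)
    mean_margin = sum(margins) / len(margins)
    return accuracy, mean_margin


# Toy triplets with made-up weights: the first is ranked correctly, the second is not.
triplets = [
    (
        {"liquidity": 1.0, "risk": 2.0},
        {"liquidity": 1.5, "risk": 1.0, "capital": 0.5},
        {"risk": 0.5, "revenue": 2.0},
    ),
    ({"dividend": 1.0}, {"dividend": 0.5}, {"dividend": 1.0}),
]
accuracy, mean_margin = triplet_metrics(triplets)
```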

model	doc pruning	accuracy	mean margin	query dims	positive doc dims
base sparse	unpruned	49.6%	0.451	15.2	371.1
fine-tuned sparse	unpruned	78.1%	3.004	15.2	340.9
fine-tuned sparse	top-128	78.0%	2.992	15.2	126.7
fine-tuned sparse	top-64	75.9%	2.988	15.2	64.0

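Interpretation: fine-tuning raised triplet accuracy from 49.6% to 78.1% and the mean margin from 0.451 to 3.004. Top-128 pruning preserved almost all of that pairwise quality (78.0% accuracy) while cutting positive document vectors from 340.9 to 126.7 active dimensions on average.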

Evaluation 2: in-memory retrieval proxy

A retrieval candidate pool was built from the held-out test split:

1,000 held-out queries
all unique positives from those rows
all unique first hard negatives from those rows
1,912 unique candidate chunks total

For each query, the known positive chunk is the only labeled relevant document. The model ranks all 1,912 candidate chunks by sparse dot product. BM25 ranks the same candidate corpus with a local lexical implementation.
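
The pool construction and per-query ranking can be sketched as follows. The helper names are hypothetical, scores are assumed to be precomputed sparse dot products keyed by chunk, and the tie-breaking rule (ties count against the positive) is an assumption rather than something the evaluation script is known to do.

```python
def build_pool(rows):
    """Unique positives and first hard negatives, in first-seen order."""
    texts = (text for row in rows for text in (row["positive"], row["negative"]))
    return list(dict.fromkeys(texts))


def rank_of_positive(scores, positive_id):
    """1-based rank of the labeled positive among all scored candidates."""
    target = scores[positive_id]
    # Assumed tie-breaking: candidates that tie with the positive count as ranked ahead of it.
    ahead = sum(
        1 for doc_id, score in scores.items() if doc_id != positive_id and score >= target
    )
    return ahead + 1


# Toy rows: "chunk A" appears twice but enters the pool once.
rows = [
    {"positive": "chunk A", "negative": "chunk B"},
    {"positive": "chunk C", "negative": "chunk A"},
]
pool = build_pool(rows)

# Made-up scores of one query against every candidate in the pool.
scores = {"chunk A": 2.0, "chunk B": 3.5, "chunk C": 1.0}
rank = rank_of_positive(scores, "chunk A")
```

Deduplication of this kind is why 1,000 rows yield 1,912 candidates rather than 2,000.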

Metrics:

Recall@1
Recall@5
Recall@10
Recall@20
MRR@10
nDCG@10
median rank
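
Because each query has exactly one labeled relevant chunk, all of these metrics reduce to simple functions of that chunk's rank. The sketch below assumes 1-based ranks and binary relevance; with a single relevant document the ideal DCG is 1, so nDCG@k is 1 / log2(rank + 1) inside the cutoff and 0 outside it. The function names are illustrative.

```python
import math
from statistics import median


def recall_at_k(ranks, k):
    """Fraction of queries whose positive is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mrr_at_k(ranks, k=10):
    """Mean reciprocal rank, counting 0 for positives ranked below k."""
    return sum(1.0 / r for r in ranks if r <= k) / len(ranks)


def ndcg_at_k(ranks, k=10):
    """nDCG with one relevant document per query (ideal DCG is 1)."""
    return sum(1.0 / math.log2(r + 1) for r in ranks if r <= k) / len(ranks)


# Made-up ranks of the positive chunk for four queries.
ranks = [1, 3, 12, 2]
report = {
    "Recall@1": recall_at_k(ranks, 1),
    "Recall@10": recall_at_k(ranks, 10),
    "MRR@10": mrr_at_k(ranks),
    "nDCG@10": ndcg_at_k(ranks),
    "median rank": median(ranks),
}
```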

model	pruning	Recall@1	Recall@5	Recall@10	Recall@20	MRR@10	nDCG@10	median rank
fine-tuned sparse	unpruned	39.0%	58.9%	67.5%	75.0%	0.479	0.526	3
fine-tuned sparse	top-128	38.6%	57.8%	67.2%	73.9%	0.473	0.521	3
fine-tuned sparse	top-64	35.0%	55.7%	64.8%	72.4%	0.442	0.491	4
base sparse	unpruned	32.1%	56.5%	63.7%	69.9%	0.431	0.481	3
base sparse	top-128	31.4%	54.4%	62.7%	69.7%	0.422	0.472	3
base sparse	top-64	29.5%	52.6%	59.5%	66.5%	0.396	0.444	4
BM25	lexical	24.0%	58.2%	64.1%	68.6%	0.397	0.457	3

Interpretation:

Top-128 fine-tuned sparse is the best current deployment candidate.
BM25 is competitive at Recall@5 and Recall@10 (58.2% and 64.1%), but clearly worse at the top of the ranking: its Recall@1 is 24.0% and its nDCG@10 is 0.457.
Fine-tuning mostly improved early ranking: Recall@1 rose from 32.1% for the base sparse model to 39.0% unpruned, and 38.6% at top-128.
Top-64 is worth using only when index size or latency matters more than retrieval quality.

Recommended serving configuration

Use document top-k pruning:

document_top_k = 128

Rationale:

setting	Recall@10	nDCG@10	doc active dims
unpruned	67.5%	0.526	319.8
top-128	67.2%	0.521	126.5
top-64	64.8%	0.491	64.0

Top-128 gives almost the same retrieval quality as unpruned (Recall@10 drops by 0.3 points and nDCG@10 by 0.005) while keeping about 60% fewer active dimensions per document (126.5 versus 319.8).

Usage

from sentence_transformers.sparse_encoder import SparseEncoder

model = SparseEncoder("oneryalcin/financial-filings-sparse-encoder-v1")

query_vectors = model.encode_query([
    "What does the company say about liquidity risk?"
])

document_vectors = model.encode_document([
    "The company discusses liquidity and capital resources in the MD&A section..."
])

For production-style sparse indexing, keep the highest-weighted 128 dimensions per document vector before indexing.
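
A minimal sketch of that pruning step, assuming a document vector has already been converted to a dict of term to weight. The conversion from the encoder output is not shown, and prune_top_k is a hypothetical helper rather than part of the Sentence Transformers API.

```python
import heapq


def prune_top_k(vector, k=128):
    """Keep the k highest-weighted active terms of a {term: weight} sparse vector."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Zero-weight terms carry no signal, so they are dropped before selection.
    active = [(term, weight) for term, weight in vector.items() if weight > 0]
    return dict(heapq.nlargest(k, active, key=lambda item: item[1]))


# Toy document vector with made-up weights.
doc_vector = {"liquidity": 2.1, "capital": 1.4, "md&a": 0.9, "the": 0.1, "unused": 0.0}
top2 = prune_top_k(doc_vector, k=2)
top128 = prune_top_k(doc_vector)  # fewer than 128 active terms, so all active terms are kept
```

Only document vectors are pruned here; query vectors are left as produced by encode_query, in line with the recommendation that pruning applies before indexing.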

Reproduction

The local runs used uv with a package-date cutoff:

uv run --exclude-newer=2026-05-13T00:00:00Z scripts/train_fin_sparse_encoder_v2.py \
  --train-size 20000 \
  --eval-size 1000 \
  --max-steps 1500 \
  --batch-size 8 \
  --eval-batch-size 16 \
  --query-reg 1e-4 \
  --doc-reg 1e-2 \
  --run-name fin-sparse-encoder-doc-v2-reg1e-2-1500

Triplet top-k evaluation:

uv run --exclude-newer=2026-05-13T00:00:00Z scripts/eval_fin_sparse_topk.py \
  --eval-size 1000 \
  --batch-size 16 \
  --topks none,128,64 \
  --models ./

Retrieval proxy evaluation:

uv run --exclude-newer=2026-05-13T00:00:00Z scripts/eval_fin_sparse_retrieval_proxy.py \
  --model ./ \
  --eval-size 1000 \
  --batch-size 16 \
  --topks none,128,64

BM25 baseline:

uv run --exclude-newer=2026-05-13T00:00:00Z scripts/eval_fin_bm25_retrieval_proxy.py \
  --eval-size 1000

Dataset preparation audit/export:

uv run --exclude-newer=2026-05-13T00:00:00Z scripts/prepare_fin_sparse_dataset.py \
  --split train \
  --output prepared/train_triplets.jsonl

The script also has an experimental --add-metadata-prefix option for future v2 data experiments. That option was not used for this v1 checkpoint.

Files in this repo

path	purpose
`scripts/prepare_fin_sparse_dataset.py`	Dataset filtering/export script that makes the v1 preprocessing explicit.
`scripts/train_fin_sparse_encoder_v2.py`	Training script used for the final run.
`scripts/eval_fin_sparse_topk.py`	Pairwise triplet accuracy with document top-k pruning.
`scripts/eval_fin_sparse_retrieval_proxy.py`	In-memory retrieval proxy over held-out positives and negatives.
`scripts/eval_fin_bm25_retrieval_proxy.py`	Local BM25 baseline over the same held-out candidate pool.
`benchmarks/regularization_sweep_500_steps.json`	500-step document-regularization sweep.
`benchmarks/triplet_topk_1000.json`	Final triplet/top-k results.
`benchmarks/retrieval_proxy_1000.json`	Final retrieval proxy and baseline results.

Limitations

These results are promising but not final production proof.

The retrieval benchmark is a proxy, not a full production OpenSearch/Elasticsearch benchmark.
Each query has only one labeled positive, so other genuinely relevant chunks may be counted as false competitors.
The candidate pool has 1,912 chunks, not millions.
Evaluation used the first 1,000 usable held-out examples, not the full test split.
BM25 is a local baseline, not a tuned OpenSearch BM25 setup.
End-to-end index size, latency, shard behavior, and hybrid retrieval have not been measured yet.

Next steps

Run the retrieval proxy on the full held-out test split.
Run a real OpenSearch/Elasticsearch index benchmark with top-128 sparse vectors.
Test hybrid BM25 + learned sparse retrieval.
Add query-type, company, and filing-type slice analysis.
Add human-labeled financial filing retrieval judgments.
