Financial Filings Sparse Encoder v1
This is a Sentence Transformers SparseEncoder / SPLADE-style model fine-tuned for financial filing retrieval.
The practical recommendation from the experiments below is to index document vectors after top-128 pruning. In the current proxy retrieval benchmark, top-128 keeps almost all unpruned quality while reducing each document to about 126 active sparse terms on average.
Naming note: the model is not document-only. It is an asymmetric query/document sparse encoder, following the OpenSearch model-family convention where the heavier document-side path is emphasized. Use encode_query(...) for online queries and encode_document(...) for offline document/chunk indexing.
TL;DR
| setting | value |
|---|---|
| Base model | opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill |
| Dataset | oneryalcin/financial-filings-sparse-retrieval-training, combined config |
| Training recipe | SpladeLoss(SparseMultipleNegativesRankingLoss) |
| Final steps | 1500 |
| Recommended serving | document top_k=128 |
| Triplet accuracy, top-128 | 78.0% |
| Retrieval proxy Recall@10, top-128 | 67.2% |
| Retrieval proxy nDCG@10, top-128 | 0.521 |
This is an experiment report as much as a model card. It describes what was tried, why decisions were made, what worked, and what remains unproven.
Why this exists
Financial filing search is not generic semantic similarity. Queries often refer to company events, accounting concepts, risk factors, segment details, and filing-specific language. A learned sparse model is attractive because it can improve ranking while preserving an Elasticsearch/OpenSearch-style sparse retrieval path.
The goal here was to train a domain-adapted sparse encoder that can be deployed as weighted sparse terms, then test whether the learned sparse signal is meaningfully better than both the base sparse model and a lexical BM25 baseline.
Starting point
This experiment started after the Sentence Transformers v5.5.0 release and its new train-sentence-transformers agent skill. The release made it easier to create a complete sparse-encoder training workflow: base model choice, loss selection, sparse regularization, top-k pruning checks, and model-card packaging.
Before training, the important constraints were:
- Use a sparse encoder suitable for Elasticsearch/OpenSearch-style retrieval.
- Keep vectors sparse enough to index in practice.
- Use the user's existing Hugging Face dataset and local Apple Silicon machine.
- Avoid overclaiming from pairwise training accuracy alone.
Base model decision
Chosen base model:
opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill
Reason: it already has separate query/document sparse encoding behavior and is aligned with OpenSearch neural sparse retrieval. Starting here means fine-tuning adapts a serving-compatible sparse model rather than building a new retrieval stack from scratch.
The doc wording in the base model name does not mean queries are encoded with the document encoder. This model should be used with the routed Sentence Transformers sparse API:
query_vectors = model.encode_query(queries)
document_vectors = model.encode_document(documents)
In practice, query encoding is the lightweight online side, while document encoding is the heavier offline/indexing side. The top-k pruning recommendation applies to document vectors before indexing.
Dataset
Dataset:
oneryalcin/financial-filings-sparse-retrieval-training
Config:
combined
Each usable row is treated as:
query
positive filing chunk
first non-empty hard negative filing chunk
Training requested up to 20,000 rows. After dropping rows without a non-empty negative, 18,247 usable training triplets remained.
Evaluation used the first 1,000 usable rows from the held-out test split.
Known dataset fields include query, positive, negatives, query_type, company, and doc_type.
The exact v1 dataset preparation step is intentionally small and reproducible:
- Load the configured split from the Hugging Face dataset.
- Drop rows missing query, positive, or a non-empty negatives value.
- Select the first non-empty negative from negatives.
- Emit query, positive, negative, and available metadata.
No metadata prefixes, query rewrites, false-negative adjudication, mined negatives, or slice rebalancing were used for v1 training.
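The filtering steps above can be sketched in a few lines of plain Python. This is an illustration, not the repo's prepare_fin_sparse_dataset.py: the helper names `first_non_empty` and `to_triplets` are hypothetical, and the sketch assumes `negatives` is a list of strings and that "non-empty" means non-blank after stripping whitespace.

```python
from typing import Iterable, Iterator, Optional, Sequence


def first_non_empty(values: Optional[Sequence[str]]) -> Optional[str]:
    """Return the first non-blank string in `values`, or None if there is none."""
    for value in values or []:
        if isinstance(value, str) and value.strip():
            return value
    return None


def to_triplets(rows: Iterable[dict]) -> Iterator[dict]:
    """Yield query/positive/negative triplets, skipping unusable rows."""
    for row in rows:
        query = (row.get("query") or "").strip()
        positive = (row.get("positive") or "").strip()
        negative = first_non_empty(row.get("negatives"))
        if not query or not positive or negative is None:
            continue  # drop rows missing any of the three required parts
        triplet = {"query": query, "positive": positive, "negative": negative}
        # Carry over metadata fields only when the row provides them.
        for key in ("query_type", "company", "doc_type"):
            if row.get(key) is not None:
                triplet[key] = row[key]
        yield triplet


# Toy rows (made up for illustration): the second has no usable negative.
rows = [
    {"query": "liquidity risk", "positive": "MD&A text",
     "negatives": ["", "risk factor text"], "company": "ACME"},
    {"query": "segment revenue", "positive": "segment note", "negatives": []},
]
triplets = list(to_triplets(rows))
```

Only the first toy row survives, and its negative is the second list entry because the first one is blank.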
Training recipe
Model type:
sentence_transformers.sparse_encoder.SparseEncoder
Loss:
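from sentence_transformers.sparse_encoder.losses import SpladeLoss, SparseMultipleNegativesRankingLoss

loss = SpladeLoss(
    model=model,  # the SparseEncoder being fine-tuned
    loss=SparseMultipleNegativesRankingLoss(model),
    query_regularizer_weight=1e-4,
    document_regularizer_weight=1e-2,
)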
Important settings:
| setting | value |
|---|---|
| max steps | 1500 |
| batch size | 8 |
| train rows requested | 20,000 |
| usable train rows | 18,247 |
| query regularization | 1e-4 |
| document regularization | 1e-2 |
| sampler | no-duplicates batch sampler |
| platform | local Apple Silicon, MPS where supported |
| package cutoff | uv run --exclude-newer=2026-05-13T00:00:00Z |
Why this recipe:
- SparseMultipleNegativesRankingLoss teaches the model to rank the paired positive above in-batch negatives.
- SpladeLoss adds FLOPS-style sparse regularization so the model does not emit unindexably dense document vectors.
- A no-duplicates sampler reduces accidental false negatives inside contrastive batches.
- Document regularization matters more than query regularization for index footprint.
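To make the FLOPS-style term concrete, here is a minimal pure-Python sketch of the usual FLOPS penalty: the sum over vocabulary dimensions of the squared mean absolute activation in a batch. It assumes the total loss is the ranking loss plus the weighted query and document penalties. The function names and toy batches are hypothetical, and the library's tensor implementation may differ in detail.

```python
from typing import List


def flops_penalty(batch: List[List[float]]) -> float:
    """Sum over dimensions of (mean absolute activation across the batch) squared."""
    n_rows = len(batch)
    n_dims = len(batch[0])
    total = 0.0
    for j in range(n_dims):
        mean_abs = sum(abs(row[j]) for row in batch) / n_rows
        total += mean_abs ** 2
    return total


def regularized_loss(
    ranking_loss: float,
    query_batch: List[List[float]],
    doc_batch: List[List[float]],
    query_weight: float = 1e-4,  # query regularization from the settings table
    doc_weight: float = 1e-2,    # document regularization from the settings table
) -> float:
    """Ranking loss plus weighted FLOPS penalties (assumed additive combination)."""
    return (
        ranking_loss
        + query_weight * flops_penalty(query_batch)
        + doc_weight * flops_penalty(doc_batch)
    )


# Same per-term weight, but the second batch activates every dimension.
few_terms = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
many_terms = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]

penalty_few = flops_penalty(few_terms)    # 0.5
penalty_many = flops_penalty(many_terms)  # 4.0
```

The batch that activates more terms at the same weight pays a larger penalty, which is the pressure that keeps document vectors indexable.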
Decision log
| decision point | choice | why |
|---|---|---|
| Retrieval family | SparseEncoder / SPLADE-style sparse retrieval | Best fit for Elasticsearch/OpenSearch sparse retrieval. |
| Base model | OpenSearch sparse doc v2 distill | Already query/document routed and serving-aligned. |
| Vocab | Keep base tokenizer/vocab | Changing vocabulary would be a larger pretraining-style project, not a fine-tune. |
| Loss | SpladeLoss + SparseMultipleNegativesRankingLoss | Combines ranking pressure with sparse regularization. |
| Sampler | no-duplicates | Reduces accidental false negatives in batch. |
| Doc regularization | increased to 1e-2 | Lower regularization gave high scores but thousands of active doc terms. |
| Serving vector size | top_k=128 | Preserved almost all quality with much smaller document vectors. |
| Final run length | 1500 steps | 500-step runs were promising; longer training improved both triplet and retrieval proxy scores. |
| Evaluation strategy | triplet + retrieval proxy + base sparse + BM25 | Avoids relying on a single easy metric. |
Regularization sweep
The first 500-step sweep showed the main tradeoff. Low document regularization produced good triplet accuracy but extremely dense document vectors.
| doc reg | triplet accuracy | positive doc dims | interpretation |
|---|---|---|---|
| 8e-5 | 73.8% | 4204.9 | Too dense. |
| 1.5e-4 | 74.8% | 3214.8 | Best raw 500-step accuracy, still too dense. |
| 3e-4 | 73.4% | 2101.4 | Still too dense. |
| 1e-3 | 72.7% | 1340.5 | Better, still high. |
| 3e-3 | 73.4% | 657.1 | Practical direction. |
| 1e-2 | 73.2% | 296.7 | Chosen for longer run. |
| 3e-2 | 72.3% | 160.9 | Compact, lower quality. |
The final model uses document_regularizer_weight=1e-2 and 1500 steps.
Evaluation 1: triplet accuracy
For each held-out row:
score(query, positive) > score(query, hard_negative)
Metric:
accuracy = fraction of rows where positive scores higher than the paired negative
| model | doc pruning | accuracy | mean margin | query dims | positive doc dims |
|---|---|---|---|---|---|
| base sparse | unpruned | 49.6% | 0.451 | 15.2 | 371.1 |
| fine-tuned sparse | unpruned | 78.1% | 3.004 | 15.2 | 340.9 |
| fine-tuned sparse | top-128 | 78.0% | 2.992 | 15.2 | 126.7 |
| fine-tuned sparse | top-64 | 75.9% | 2.988 | 15.2 | 64.0 |
Interpretation: fine-tuning raised held-out triplet accuracy from 49.6% (base) to 78.1%. Top-128 pruning preserved almost all pairwise quality (78.0%), while top-64 cost about two points (75.9%).
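The accuracy and mean margin columns in the table above can be reproduced in outline as follows. This is a sketch, not the repo's eval_fin_sparse_topk.py: it assumes each sparse vector is held as a term-to-weight dict, that the score is a sparse dot product, and that "mean margin" is the average of score(query, positive) minus score(query, negative). The toy vectors are made up.

```python
from typing import Dict, List, Tuple

SparseVec = Dict[str, float]


def sparse_dot(a: SparseVec, b: SparseVec) -> float:
    """Dot product over the terms the two sparse vectors share."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(weight * b[term] for term, weight in a.items() if term in b)


def triplet_metrics(
    triplets: List[Tuple[SparseVec, SparseVec, SparseVec]]
) -> Tuple[float, float]:
    """Return (accuracy, mean margin) over (query, positive, negative) vectors."""
    margins = [sparse_dot(q, pos) - sparse_dot(q, neg) for q, pos, neg in triplets]
    accuracy = sum(m > 0 for m in margins) / len(margins)
    return accuracy, sum(margins) / len(margins)


toy = [
    # margin = 2.0 - 0.5 = 1.5 (positive wins)
    ({"liquidity": 1.0, "risk": 0.5}, {"liquidity": 2.0, "capital": 1.0}, {"risk": 1.0}),
    # margin = 0.0 - 0.5 = -0.5 (negative wins)
    ({"revenue": 1.0}, {"segment": 1.0}, {"revenue": 0.5}),
]
accuracy, mean_margin = triplet_metrics(toy)  # 0.5, 0.5
```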
Evaluation 2: in-memory retrieval proxy
A retrieval candidate pool was built from the held-out test split:
- 1,000 held-out queries
- all unique positives from those rows
- all unique first hard negatives from those rows
- 1,912 unique candidate chunks total
For each query, the known positive chunk is the only labeled relevant document. The model ranks all 1,912 candidate chunks by sparse dot product. BM25 ranks the same candidate corpus with a local lexical implementation.
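The pool construction and ranking step described above can be sketched as follows. The helper names are hypothetical, the chunk texts and weights are made up, and the sketch assumes candidates are deduplicated by exact text and that each chunk's sparse vector is available as a term-to-weight dict.

```python
from typing import Dict, List

SparseVec = Dict[str, float]


def sparse_dot(a: SparseVec, b: SparseVec) -> float:
    """Dot product over shared terms."""
    return sum(weight * b[term] for term, weight in a.items() if term in b)


def build_pool(rows: List[dict]) -> List[str]:
    """Collect unique positives and first hard negatives, preserving first-seen order."""
    pool: List[str] = []
    seen = set()
    for row in rows:
        for text in (row["positive"], row["negative"]):
            if text not in seen:
                seen.add(text)
                pool.append(text)
    return pool


def rank_of_positive(query_vec: SparseVec, positive: str, doc_vecs: Dict[str, SparseVec]) -> int:
    """1-based rank of the labeled positive; ties keep pool order (stable sort)."""
    scores = {text: sparse_dot(query_vec, vec) for text, vec in doc_vecs.items()}
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(positive) + 1


rows = [
    {"positive": "cash and liquidity discussion", "negative": "debt covenant note"},
    {"positive": "long-term debt maturity table", "negative": "debt covenant note"},
]
pool = build_pool(rows)  # 3 unique chunks: the shared negative is counted once

doc_vecs = {
    "cash and liquidity discussion": {"cash": 1.0, "liquidity": 1.2},
    "debt covenant note": {"debt": 1.0, "covenant": 1.5},
    "long-term debt maturity table": {"debt": 1.1, "maturity": 1.4},
}
query_vec = {"debt": 1.0, "maturity": 1.0}
rank = rank_of_positive(query_vec, "long-term debt maturity table", doc_vecs)  # 1
```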
Metrics:
- Recall@1
- Recall@5
- Recall@10
- Recall@20
- MRR@10
- nDCG@10
- median rank
| model | pruning | Recall@1 | Recall@5 | Recall@10 | Recall@20 | MRR@10 | nDCG@10 | median rank |
|---|---|---|---|---|---|---|---|---|
| fine-tuned sparse | unpruned | 39.0% | 58.9% | 67.5% | 75.0% | 0.479 | 0.526 | 3 |
| fine-tuned sparse | top-128 | 38.6% | 57.8% | 67.2% | 73.9% | 0.473 | 0.521 | 3 |
| fine-tuned sparse | top-64 | 35.0% | 55.7% | 64.8% | 72.4% | 0.442 | 0.491 | 4 |
| base sparse | unpruned | 32.1% | 56.5% | 63.7% | 69.9% | 0.431 | 0.481 | 3 |
| base sparse | top-128 | 31.4% | 54.4% | 62.7% | 69.7% | 0.422 | 0.472 | 3 |
| base sparse | top-64 | 29.5% | 52.6% | 59.5% | 66.5% | 0.396 | 0.444 | 4 |
| BM25 | lexical | 24.0% | 58.2% | 64.1% | 68.6% | 0.397 | 0.457 | 3 |
Interpretation:
- Top-128 fine-tuned sparse is the best current deployment candidate.
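- BM25 is competitive at Recall@5 and Recall@10 (58.2% and 64.1%), but it trails the fine-tuned model on early precision and ranking quality: Recall@1 is 24.0%, MRR@10 is 0.397, and nDCG@10 is 0.457.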
- Fine-tuning mostly improved early ranking: Recall@1 rose from 32.1% for the base sparse model to 39.0% unpruned, and 38.6% at top-128.
- Top-64 is usable only if index size or latency dominates quality.
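Because each query has exactly one labeled positive, every metric in the table above is a function of that positive's 1-based rank. The sketch below assumes binary relevance and the standard log2 discount, so the ideal DCG is 1 and nDCG@k reduces to 1 / log2(rank + 1) when the positive is inside the cutoff. The card does not state which nDCG variant the evaluation script uses, and the ranks here are made up.

```python
import math
from statistics import median
from typing import List


def recall_at_k(ranks: List[int], k: int) -> float:
    """Fraction of queries whose positive appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def mrr_at_k(ranks: List[int], k: int) -> float:
    """Mean reciprocal rank, counting 0 when the positive is below the cutoff."""
    return sum(1.0 / r for r in ranks if r <= k) / len(ranks)


def ndcg_at_k(ranks: List[int], k: int) -> float:
    """nDCG with a single relevant document of gain 1 (ideal DCG is 1)."""
    return sum(1.0 / math.log2(r + 1) for r in ranks if r <= k) / len(ranks)


ranks = [1, 3, 12, 2]  # hypothetical ranks of the positive for four queries

recall_1 = recall_at_k(ranks, 1)    # 0.25
recall_10 = recall_at_k(ranks, 10)  # 0.75
mrr_10 = mrr_at_k(ranks, 10)        # (1 + 1/3 + 1/2) / 4
ndcg_10 = ndcg_at_k(ranks, 10)      # (1 + 0.5 + 1/log2(3)) / 4
median_rank = median(ranks)         # 2.5
```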
Recommended serving configuration
Use document top-k pruning:
document_top_k = 128
Rationale:
| setting | Recall@10 | nDCG@10 | doc active dims |
|---|---|---|---|
| unpruned | 67.5% | 0.526 | 319.8 |
| top-128 | 67.2% | 0.521 | 126.5 |
| top-64 | 64.8% | 0.491 | 64.0 |
Top-128 gives almost the same retrieval quality as unpruned (0.3 points lower Recall@10, 0.005 lower nDCG@10) while cutting mean active document dimensions from 319.8 to 126.5.
Usage
from sentence_transformers.sparse_encoder import SparseEncoder

model = SparseEncoder("oneryalcin/financial-filings-sparse-encoder-v1")
query_vectors = model.encode_query([
    "What does the company say about liquidity risk?"
])
document_vectors = model.encode_document([
    "The company discusses liquidity and capital resources in the MD&A section..."
])
For production-style sparse indexing, keep the highest-weighted 128 dimensions per document vector before indexing.
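As an illustration of that pruning step, the sketch below keeps the k highest-weighted terms of one document vector. It assumes the vector has already been converted to a term-to-weight dict, for example while preparing an index payload. `prune_top_k` is a hypothetical helper, not part of the library or of this repo's scripts, and the weights are made up.

```python
import heapq
from typing import Dict


def prune_top_k(vec: Dict[str, float], k: int = 128) -> Dict[str, float]:
    """Keep the k highest-weighted terms of a sparse vector."""
    if k <= 0:
        raise ValueError("k must be positive")
    if len(vec) <= k:
        return dict(vec)  # already small enough; return a copy
    top = heapq.nlargest(k, vec.items(), key=lambda item: item[1])
    return dict(top)


doc_vec = {"liquidity": 2.1, "capital": 1.4, "the": 0.1, "resources": 0.9}
pruned = prune_top_k(doc_vec, k=2)  # {"liquidity": 2.1, "capital": 1.4}
```

Only document vectors are pruned this way. Query vectors are left untouched, which matches the constant query dims in the evaluation tables.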
Reproduction
The local runs used uv with a package-date cutoff:
uv run --exclude-newer=2026-05-13T00:00:00Z scripts/train_fin_sparse_encoder_v2.py \
--train-size 20000 \
--eval-size 1000 \
--max-steps 1500 \
--batch-size 8 \
--eval-batch-size 16 \
--query-reg 1e-4 \
--doc-reg 1e-2 \
--run-name fin-sparse-encoder-doc-v2-reg1e-2-1500
Triplet top-k evaluation:
uv run --exclude-newer=2026-05-13T00:00:00Z scripts/eval_fin_sparse_topk.py \
--eval-size 1000 \
--batch-size 16 \
--topks none,128,64 \
--models ./
Retrieval proxy evaluation:
uv run --exclude-newer=2026-05-13T00:00:00Z scripts/eval_fin_sparse_retrieval_proxy.py \
--model ./ \
--eval-size 1000 \
--batch-size 16 \
--topks none,128,64
BM25 baseline:
uv run --exclude-newer=2026-05-13T00:00:00Z scripts/eval_fin_bm25_retrieval_proxy.py \
--eval-size 1000
Dataset preparation audit/export:
uv run --exclude-newer=2026-05-13T00:00:00Z scripts/prepare_fin_sparse_dataset.py \
--split train \
--output prepared/train_triplets.jsonl
The script also has an experimental --add-metadata-prefix option for future v2 data experiments. That option was not used for this v1 checkpoint.
Files in this repo
| path | purpose |
|---|---|
| scripts/prepare_fin_sparse_dataset.py | Dataset filtering/export script that makes the v1 preprocessing explicit. |
| scripts/train_fin_sparse_encoder_v2.py | Training script used for the final run. |
| scripts/eval_fin_sparse_topk.py | Pairwise triplet accuracy with document top-k pruning. |
| scripts/eval_fin_sparse_retrieval_proxy.py | In-memory retrieval proxy over held-out positives and negatives. |
| scripts/eval_fin_bm25_retrieval_proxy.py | Local BM25 baseline over the same held-out candidate pool. |
| benchmarks/regularization_sweep_500_steps.json | 500-step document-regularization sweep. |
| benchmarks/triplet_topk_1000.json | Final triplet/top-k results. |
| benchmarks/retrieval_proxy_1000.json | Final retrieval proxy and baseline results. |
Limitations
These results are promising but not final production proof.
- The retrieval benchmark is a proxy, not a full production OpenSearch/Elasticsearch benchmark.
- Each query has only one labeled positive, so other genuinely relevant chunks may be counted as false competitors.
- The candidate pool has 1,912 chunks, not millions.
- Evaluation used the first 1,000 usable held-out examples, not the full test split.
- BM25 is a local baseline, not a tuned OpenSearch BM25 setup.
- End-to-end index size, latency, shard behavior, and hybrid retrieval have not been measured yet.
Next steps
- Run the retrieval proxy on the full held-out test split.
- Run a real OpenSearch/Elasticsearch index benchmark with top-128 sparse vectors.
- Test hybrid BM25 + learned sparse retrieval.
- Add query-type, company, and filing-type slice analysis.
- Add human-labeled financial filing retrieval judgments.