static-retrieval-cve-en-v1

A 1024-dim StaticEmbedding model fine-tuned for retrieving CVE descriptions by vulnerability class (same CWE). 41M parameters, runs ~185× faster on CPU than all-mpnet-base-v2 (measured throughput is in the inference speed section).


Architecture	`StaticEmbedding(EmbeddingBag(40522, 1024))`
Output dim	1024 (Matryoshka-truncatable to 768 / 512 / 256 / 128 / 64 / 32)
Vocab	bert-base-uncased (30,522) + 10,000 corpus-mined security-specific tokens
Training data	521k pairs from real CVE corpus (same-CWE + CWE-definition pairs)
Trained on	M4 Max MPS, 12 min wall time
Best for	balanced general use — strong on common-CWE CVEs

This is v1 of a multi-phase ablation series to push static-embedding quality on the CVE corpus. The released pair (v1 / v2-longtail) covers a Pareto frontier of recall vs precision trade-offs; this model is the strongest baseline before synthetic-data augmentation. See oneryalcin/static-retrieval-cve-en-v2-longtail for the rare-class-improved variant.

⚠️ Scope and limitations (read first)

This model is specialized for one narrow task and does NOT generalize to all CVE retrieval use cases. Please read this section before adopting it.

What this IS for

Given an NVD-formatted CVE description (~300-char paragraph, NVD prose style), retrieve other CVEs sharing the same MITRE CWE class from a corpus of CVE descriptions. On that task it dominates general-purpose models — see numbers below.

Real deployment fits include:

First-stage CWE-tagging — untagged CVE in, same-CWE peers out, vote on class
Active learning / borderline-case mining for CWE classifiers
CWE-balanced retrieval when you want diverse vulnerability classes rather than a flood of one type
Cheap full-corpus embedding on CPU (~185× faster than mpnet; encodes 1M CVEs in ~70 sec, or the ~300k-entry NVD catalog in under 25 sec)
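
The first deployment fit above ("same-CWE peers out, vote on class") can be sketched as a similarity-weighted vote over retrieved neighbors. This is a minimal illustration, not part of the released code: the function name `vote_cwe`, the `(cwe_id, similarity)` input shape, and the clamping of negative similarities are all assumptions.

```python
from collections import defaultdict

def vote_cwe(neighbors, top_k=10):
    """Similarity-weighted vote over retrieved peers.

    neighbors: list of (cwe_id, similarity) pairs (hypothetical retrieval output).
    Returns (winning_cwe_id, share_of_total_vote_weight).
    """
    ranked = sorted(neighbors, key=lambda pair: pair[1], reverse=True)[:top_k]
    weights = defaultdict(float)
    for cwe_id, similarity in ranked:
        # Clamp at zero so dissimilar peers cannot cast negative votes.
        weights[cwe_id] += max(similarity, 0.0)
    total = sum(weights.values())
    if total == 0.0:
        return None, 0.0
    winner = max(weights, key=weights.get)
    return winner, weights[winner] / total

# Toy neighbors for one untagged CVE (made-up similarities).
peers = [("CWE-79", 0.82), ("CWE-79", 0.77), ("CWE-89", 0.74), ("CWE-79", 0.60)]
label, confidence = vote_cwe(peers)
print(label, round(confidence, 3))
```

A low vote share is a natural signal for the active-learning fit listed above: route those CVEs to a human or a stronger classifier instead of auto-tagging them.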

What this is NOT for

We measured these explicitly and our model underperforms off-the-shelf alternatives here:

Use case	Our model	Off-the-shelf alternative	Recommendation
Famous-vulnerability lookup ("log4shell", "heartbleed")	0% Hit@10 on CVE-ID names	All models fail (need a glossary, not embeddings)	Use a CVE-ID lookup index, not embedding retrieval
Product/keyword search ("wordpress XSS", "apache RCE")	42% Acc@1	`static-retrieval-mrl-en-v1`: 79% • mpnet: 84%	Use the off-the-shelf models
Short informal queries (security analyst typing)	41-45% overall Acc@1	`static-retrieval-mrl-en-v1`: 54% • mpnet: 54%	Use the off-the-shelf models
Cross-format retrieval (GHSA / vendor advisories with short summary → long markdown)	0.87-0.89 NDCG@10	`static-retrieval-mrl-en-v1`: 0.91	Use the off-the-shelf static model

If your use case looks like the product/keyword, informal-query, or cross-format rows above, use sentence-transformers/static-retrieval-mrl-en-v1 or BAAI/bge-large-en-v1.5 instead. Both are competitive or better on out-of-distribution security text.

Why the specialization trade-off exists

We trained this model on contrastive pairs of (CVE_i, CVE_j) where both descriptions share the same MITRE CWE class. The corpus-mined special tokens that boost in-domain performance (e.g. "stack-based buffer overflow" as a single token) help cluster CVEs by vulnerability class but reduce the model's ability to distinguish them by product, vendor, or topic. This is a classic narrow-task fine-tuning trade-off: dramatic gain on the trained metric, modest regression on adjacent tasks.
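
To make the contrastive objective concrete, here is a pure-Python sketch of a multiple-negatives ranking loss over an in-batch similarity matrix. It is an illustration of the idea, not the training code: the function name `mnr_loss` is hypothetical and the `scale=20.0` temperature is an assumption. Row i holds the similarities between anchor CVE i and every candidate in the batch; column i is its same-CWE positive and every other column acts as a negative.

```python
import math

def mnr_loss(sim_matrix, scale=20.0):
    """Mean cross-entropy where the positive for row i is column i.

    sim_matrix: square list of lists of cosine similarities.
    scale: multiplier applied before the softmax (assumed value).
    """
    losses = []
    for i, row in enumerate(sim_matrix):
        logits = [scale * value for value in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Anchors that clearly prefer their own positive give a small loss.
well_separated = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]
# An embedding space that cannot tell candidates apart gives log(batch_size).
collapsed = [[0.5, 0.5, 0.5] for _ in range(3)]
print(mnr_loss(well_separated), mnr_loss(collapsed))
```

Because every other positive in the batch serves as a negative, two CVEs from the same CWE landing in one batch would be pushed apart by mistake. This is the usual reason for avoiding duplicates within a batch.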

A v3 with mixed training objectives (same-CWE + product-similarity + cross-format pairs) could likely close the OOD gap, but does not exist yet.

Quick start

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("oneryalcin/static-retrieval-cve-en-v1")

cves = [
    "Buffer overflow in Apache HTTP Server 2.4.49 mod_proxy allows remote attackers to execute arbitrary code via crafted Content-Length headers.",
    "Cross-site scripting (XSS) vulnerability in WordPress plugin XYZ 3.1.0 allows attackers to inject arbitrary JavaScript via the search parameter.",
    "SQL injection in MyApp 5.0 login form allows authentication bypass via crafted username field.",
]
embeddings = model.encode(cves)
print(embeddings.shape)   # (3, 1024)

# Similarity
sim = model.similarity(embeddings, embeddings)
print(sim)

For lower-dim usage (faster retrieval at minimal quality cost), use Matryoshka truncation:

model = SentenceTransformer("oneryalcin/static-retrieval-cve-en-v1", truncate_dim=256)
embeddings = model.encode(cves)
print(embeddings.shape)   # (3, 256)
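
Conceptually, Matryoshka truncation keeps the leading components of each embedding and compares the shortened vectors. The sketch below shows that idea on toy 8-dim vectors in plain Python. It is not the library's internal implementation, and the helper names and vector values are made up for illustration.

```python
import math

def truncate_and_normalize(vector, dim):
    """Keep the first `dim` components, then rescale to unit length."""
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings"; real ones have 1024 dims.
full_a = [0.6, 0.2, -0.1, 0.05, 0.3, -0.4, 0.1, 0.0]
full_b = [0.5, 0.3, -0.2, 0.10, -0.2, 0.1, 0.4, 0.2]
for dim in (8, 4, 2):
    a = truncate_and_normalize(full_a, dim)
    b = truncate_and_normalize(full_b, dim)
    print(dim, round(cosine(a, b), 3))
```

A model trained with a Matryoshka-style loss is optimized so that these shortened prefixes remain useful on their own, which is why truncating costs less quality than it would for an ordinary embedding.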

Performance

Held-out CVE → same-CWE retrieval

Deterministic split: 297 queries, 5,000 corpus, seed=42, no leakage. Comparison vs the off-the-shelf sentence-transformers/static-retrieval-mrl-en-v1 (50M general-English contrastive pairs):

Model	NDCG@10	MRR@10	Acc@1
Off-the-shelf `static-retrieval-mrl-en-v1`	0.3549	0.4979	0.4276
This model (v1)	0.5811	0.6814	0.6128
Delta	+0.226 (+64%)	+0.184	+0.185
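
For readers unfamiliar with the three metrics in the table, the sketch below computes them for a single query under an assumption of binary relevance: a retrieved CVE counts as relevant if it shares the query's CWE. The function names are illustrative and this is not the repo's benchmark script; reported scores are the mean over all queries.

```python
import math

def acc_at_1(relevance):
    """1.0 if the top-ranked result is relevant, else 0.0."""
    return 1.0 if relevance and relevance[0] else 0.0

def mrr_at_k(relevance, k=10):
    """Reciprocal rank of the first relevant result within the top k."""
    for rank, hit in enumerate(relevance[:k], start=1):
        if hit:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(relevance, n_relevant, k=10):
    """Binary-relevance NDCG: DCG of the ranking over DCG of an ideal ranking.

    n_relevant: how many relevant documents exist in the whole corpus.
    """
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, hit in enumerate(relevance[:k], start=1) if hit)
    ideal_hits = min(n_relevant, k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# One query whose same-CWE peers appear at ranks 2 and 3 (toy ranking).
ranking = [False, True, True, False, False, False, False, False, False, False]
print(acc_at_1(ranking), mrr_at_k(ranking), ndcg_at_k(ranking, n_relevant=2))
```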

Time-aware evaluation (true zero-shot)

Train on pre-2024 CVEs, eval on 2024+ CVEs (493 queries, 8000-doc mixed corpus). This is closer to real-world use:

Model	NDCG@10	Acc@1
Off-the-shelf	0.4511	0.5010
This model	0.7178	0.7241

The gap widens under time-aware eval — newer CVE descriptions follow more boilerplate phrasing that exactly matches this model's mined special tokens.
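
A time-aware split of this kind can be sketched in a few lines. The record fields (`cve_id`, `published_year`) and the toy records are hypothetical; the actual split logic lives in the repo's scripts and may differ.

```python
def time_aware_split(records, cutoff_year=2024):
    """Train on records published before the cutoff, evaluate on the rest."""
    train = [r for r in records if r["published_year"] < cutoff_year]
    evaluation = [r for r in records if r["published_year"] >= cutoff_year]
    return train, evaluation

# Toy records with placeholder IDs (not real CVEs).
records = [
    {"cve_id": "toy-1", "published_year": 2021},
    {"cve_id": "toy-2", "published_year": 2023},
    {"cve_id": "toy-3", "published_year": 2024},
    {"cve_id": "toy-4", "published_year": 2025},
]
train, evaluation = time_aware_split(records)

# Leakage check: no record may appear on both sides of the cutoff.
overlap = {r["cve_id"] for r in train} & {r["cve_id"] for r in evaluation}
print(len(train), len(evaluation), overlap)
```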

Per-CWE class-size breakdown

Class size (real-CVE pairs)  Acc@1   NDCG@10   Hit@10
very rare (<50)              0.143   0.274     0.43
rare (50-200)                0.083   0.051     0.083
med (200-1000)               0.375   0.362     0.700
large (>=1000)               0.693   0.654     0.895

Limitation: v1 performs near-baseline on rare CWE classes (<200 pairs). If long-tail recall matters for your application, prefer oneryalcin/static-retrieval-cve-en-v2-longtail.
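
The class-size breakdown above groups queries by how many real-CVE training pairs their CWE has. A minimal sketch of that bucketing follows. The half-open boundaries (a class with exactly 200 pairs counts as "med") are an assumption, since the table's ranges touch at 200, and the function names and toy data are illustrative rather than taken from eval_per_cwe.py.

```python
from collections import defaultdict

def size_bucket(n_pairs):
    """Map a CWE's training-pair count to the table's class-size bucket."""
    if n_pairs < 50:
        return "very rare (<50)"
    if n_pairs < 200:
        return "rare (50-200)"
    if n_pairs < 1000:
        return "med (200-1000)"
    return "large (>=1000)"

def acc_at_1_by_bucket(results, pairs_per_cwe):
    """results: list of (cwe_id, top1_correct) tuples, one per query."""
    hits, totals = defaultdict(int), defaultdict(int)
    for cwe_id, correct in results:
        bucket = size_bucket(pairs_per_cwe[cwe_id])
        totals[bucket] += 1
        hits[bucket] += int(correct)
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}

# Toy inputs with placeholder class names and made-up counts.
pairs_per_cwe = {"class-a": 30, "class-b": 120, "class-c": 4000}
results = [("class-a", False), ("class-b", False), ("class-c", True),
           ("class-c", True), ("class-c", False), ("class-a", True)]
print(acc_at_1_by_bucket(results, pairs_per_cwe))
```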

Training recipe

Architecture:        StaticEmbedding(40522 × 1024) ≈ 41M params
Base tokenizer:      google-bert/bert-base-uncased
Special tokens:      10,000 corpus-mined (uni-/bi-/tri-grams that fragment ≥3 subwords AND occur ≥5 times)
                       e.g. "cross-site scripting vulnerability" (3 → 1 token),
                            "stack-based buffer overflow" (5 → 1 token),
                            "x.509" (4 → 1 token), "out-of-bounds" (5 → 1 token)
Loss:                MatryoshkaLoss(MultipleNegativesRankingLoss, dims=[1024,768,512,256,128,64,32])
Training pairs:      521k total
                       469k same-CWE pairs (5 deduplicated iterations, capped 5000/CWE)
                        52k CVE→CWE-definition pairs (from MITRE CWE catalog)
Batch:               512 (in-batch negatives = 511 per anchor)
Sampler:             NO_DUPLICATES (mandatory for in-batch negatives)
LR:                  2e-1 (high, standard for StaticEmbedding random init)
Epochs:              3 (peaked late epoch 1, oscillated thereafter)
Optimizer:           AdamW + 5% linear warmup, linear decay
Precision:           fp32 (MPS-safe; bf16 unreliable on Apple silicon)
Hardware:            M4 Max MPS, 12 min wall
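
The special-token rule in the recipe (n-grams up to length 3 that fragment into ≥3 subwords and occur ≥5 times) can be sketched as below. This is an illustration under stated assumptions, not the mining code used for the model: `toy_subword_count` is a crude stand-in for a real WordPiece tokenizer, whitespace splitting is a simplification, and ranking candidates by raw frequency is a guess at how the 10,000-token budget was filled.

```python
import re
from collections import Counter

def toy_subword_count(text):
    """Stand-in for a tokenizer: counts word runs and punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))

def mine_special_tokens(texts, subword_count, max_n=3,
                        min_pieces=3, min_freq=5, budget=10_000):
    """Return up to `budget` n-grams that are both frequent and fragmented."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        for n in range(1, max_n + 1):
            for start in range(len(words) - n + 1):
                counts[" ".join(words[start:start + n])] += 1
    candidates = [(gram, freq) for gram, freq in counts.items()
                  if freq >= min_freq and subword_count(gram) >= min_pieces]
    # Most frequent first; ties broken alphabetically for determinism.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in candidates[:budget]]

# Toy corpus: one phrase repeated often enough, one too rare to qualify.
texts = ["out-of-bounds read in parser"] * 5 + ["heap overflow in parser"] * 2
mined = mine_special_tokens(texts, toy_subword_count)
print(mined)
```

Under the stand-in tokenizer, "out-of-bounds" counts as 5 pieces, which happens to match the fragmentation listed in the recipe; a real tokenizer will give different counts for other phrases.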

Retrieval quality vs general-purpose encoders

Same held-out CVE→same-CWE eval (297 queries, 5,000 corpus, seed=42), comparing to popular general-English sentence-transformers:

Model	Params	Dim	NDCG@10	MRR@10	Acc@1
`sentence-transformers/static-retrieval-mrl-en-v1` (reference static)	31M	1024	0.355	0.498	0.428
`sentence-transformers/all-mpnet-base-v2` (typical transformer)	110M	768	0.363	0.487	0.414
`sentence-transformers/all-MiniLM-L6-v2` (small transformer)	22M	384	0.374	0.505	0.434
`BAAI/bge-large-en-v1.5` (strong general-purpose, SOTA-class)	335M	1024	0.408	0.548	0.488
`oneryalcin/static-retrieval-cve-en-v1` (THIS MODEL)	41M	1024	0.581 (+42% over bge)	0.681	0.613
`oneryalcin/static-retrieval-cve-en-v2-longtail` (sibling — synth + balanced sampling)	41M	1024	0.568 (+39% over bge)	0.678	0.620

Domain training beats general-purpose architecture here. Both our 41M-param token-bag models outperform a 335M-param SOTA general-English encoder (bge-large-en-v1.5) by a wide margin on in-domain CVE retrieval, and run ~185× faster on CPU than mpnet-base-v2 (see the speed table below). This is not a quality-vs-speed trade-off: for CVE retrieval by CWE, the domain-tuned static model dominates on both axes.

Pick between v1 (best aggregate NDCG, strong on common CWEs) and v2-longtail (best Acc@1, doubles rare-class precision via synthetic-data augmentation) — see comparison details below.

Benchmark reproducer: quality_bench.py in this repo.

Inference speed (CPU)

Benchmarked on Apple M4 Max CPU, encoding 1000 real CVE descriptions (avg 328 chars), batch=128, best of 3 runs:

Model	Dim	q/s on CPU	Time for 1M CVEs	vs mpnet-base-v2
This model	1024	~14,000	~70 sec	~185×
`sentence-transformers/static-retrieval-mrl-en-v1` (reference static)	1024	~13,700	~73 sec	~178×
`sentence-transformers/all-MiniLM-L6-v2` (small transformer)	384	~600	~27 min	~8×
`sentence-transformers/all-mpnet-base-v2` (typical transformer)	768	~77	~3.6 hours	1×

This model shares its architecture with the reference static model, so its speed matches. The entire current NVD CVE catalog (~300k entries) encodes in under 25 seconds on a single CPU core, with no GPU required for deployment. (On Apple Silicon, CPU outperforms MPS for this model: StaticEmbedding is memory-bound rather than compute-bound, so the GPU transfer overhead exceeds the parallelism gain.)
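
The time and speed-up columns follow directly from the throughput column. The arithmetic is shown below using the rounded q/s values from the table; because those values are approximate, the derived ratios land close to, but not exactly on, the table's figures.

```python
def seconds_to_encode(n_docs, docs_per_second):
    """Wall time to encode n_docs at a steady throughput."""
    return n_docs / docs_per_second

# Rounded throughputs from the table above (queries per second on CPU).
throughput = {
    "this model": 14_000,
    "static-retrieval-mrl-en-v1": 13_700,
    "all-MiniLM-L6-v2": 600,
    "all-mpnet-base-v2": 77,
}

baseline = throughput["all-mpnet-base-v2"]
for name, qps in throughput.items():
    million = seconds_to_encode(1_000_000, qps)
    nvd = seconds_to_encode(300_000, qps)  # ~300k-entry NVD catalog
    print(f"{name}: 1M in {million:,.0f} s, NVD in {nvd:,.0f} s, "
          f"{qps / baseline:.0f}x vs mpnet")
```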

Benchmark reproducer: speed_bench.py in this repo.

Out-of-distribution generalization (measured)

To check whether the in-distribution wins above generalize, we ran the model on two out-of-distribution tests. Results are reported in full, including the losses, so you can judge fit for your use case.

Test 1: GHSA cross-corpus, summary → description (asymmetric, short-query)

GitHub Security Advisories (gh api /advisories) have a different style than NVD (Markdown sections, ~10× longer descriptions, ~64-char summaries). We sampled 10k GHSAs, took 300 short summaries as queries against a 5,000-doc corpus of GHSA descriptions, same-CWE retrieval:

Model	NDCG@10	Acc@1
`sentence-transformers/static-retrieval-mrl-en-v1` (baseline)	0.910	0.910
`sentence-transformers/all-mpnet-base-v2`	0.903	0.903
v2-longtail (ours)	0.887	0.887
`sentence-transformers/all-MiniLM-L6-v2`	0.873	0.873
v1 (ours)	0.867	0.867

The off-the-shelf static baseline wins. Our domain-tuned models do not lead: v2-longtail lands mid-table and v1 ranks last. The lift we get on in-distribution evals (NVD descriptions) does not transfer to short-query GHSA retrieval.

Reproducer: eval_ghsa.py + fetch_ghsa.py in this repo.

Test 2: Hand-built informal queries against NVD corpus

104 short informal queries representing how a security analyst, SOC operator, or threat-intel researcher would actually search. Categories: famous-vuln names (log4shell), CWE-class descriptions (SQL injection in login form), product mentions (wordpress plugin XSS), question form (how does TOCTOU work), typos. For each, an "expected" target (CVE-ID, CWE-ID, or keyword) was hand-specified. Searched against 30,000 real NVD CVEs.

Model	Acc@1 (top-1 hit)	Hit@10 (in top-10)
`sentence-transformers/static-retrieval-mrl-en-v1` (baseline)	0.538	0.721
`sentence-transformers/all-mpnet-base-v2`	0.538	0.731
v1 (ours)	0.452	0.654
v2-longtail (ours)	0.413	0.635

Breakdown by query type:

Query type (n)	v1	v2-longtail	baseline	mpnet
Famous vuln names ("log4shell") (27)	0.000	0.000	0.000	0.000
CWE-class queries ("SQL injection") (39)	0.692 (best)	0.692	0.667	0.615
Product/keyword ("wordpress XSS") (38)	0.421	0.421	0.789	0.842

Famous vulnerability names: every model scores zero. Pop-culture names like "log4shell", "heartbleed", "zerologon" don't appear in CVE description text. This needs a name → CVE-ID lookup table, not embedding retrieval.
CWE-class queries: our models win, with v1 and v2-longtail tied at 0.692. Class-level semantic queries are their strongest category.
Product/keyword queries: our models LOSE by 30+ percentage points. Our embedding geometry clusters by CWE-class, not by product/vendor. Asking "find wordpress vulnerabilities" returns 3-4 of 10 actual WordPress CVEs (the rest are same-vuln-class CVEs in other products). The baseline returns 7-8.
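
One way to act on the first finding is to check a name → CVE-ID table before falling back to embedding search. The sketch below assumes a hand-maintained alias dictionary and a caller-supplied `embed_search` function; both names are hypothetical and nothing like this ships in the repo.

```python
# Hand-maintained aliases; extend as needed.
KNOWN_ALIASES = {
    "log4shell": "CVE-2021-44228",
    "heartbleed": "CVE-2014-0160",
    "zerologon": "CVE-2020-1472",
}

def route_query(query, embed_search):
    """Answer famous-name queries from the table, everything else by retrieval.

    embed_search: callable taking the query string and returning ranked CVE IDs.
    """
    key = query.strip().lower()
    if key in KNOWN_ALIASES:
        return {"route": "alias", "results": [KNOWN_ALIASES[key]]}
    return {"route": "embedding", "results": embed_search(query)}

def fake_embed_search(query):
    """Placeholder retriever used only to make this example runnable."""
    return ["<ranked CVE IDs for: " + query + ">"]

print(route_query("Log4Shell", fake_embed_search))
print(route_query("SQL injection in login form", fake_embed_search))
```

The same routing idea extends to the product/keyword case: queries that name a product could be sent to a general-purpose model or a keyword index, while class-level queries stay with this model.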

Reproducer: informal_queries.py in this repo.

Takeaway

Our model is overspecialized to NVD-style same-CWE retrieval. The specialization that lets us beat bge-large by roughly 40% NDCG@10 on the trained task also makes the model less useful for general security search. For programmatic CVE→CWE retrieval pipelines this is a feature. For human-typed security search this is a bug. Pick accordingly.

Reproducibility

The training script train.py is included in this repo. To reproduce from scratch:

pip install "sentence-transformers[train]>=5.5" "datasets>=2.19" "tokenizers>=0.20" "accelerate>=0.26"
python train.py

Training is seeded (seed=42). On comparable hardware you should reach NDCG@10 within ±0.01 of 0.5811; small run-to-run differences across hardware and backends are expected.

Companion evaluation script eval_per_cwe.py is also included — runs the per-CWE breakdown table above against any saved model.

Method notes

This model was built using the train-sentence-transformers v5.5.0 agent skill. Three findings drove the recipe:

Corpus-mined special tokens dominate the quality gains. Adding 10k frequency-weighted tokens dropped query subword count by 12.8% and contributed most of the lift from 0.355 → 0.581 NDCG@10.
Multi-iteration same-CWE pair construction (5 deduplicated shuffles vs single shuffle) added meaningful contrastive variance — pairs grew from 94k → 469k for the same source data.
CWE-definition pairs (CVE description ↔ MITRE CWE name+description text) provided a "categorical anchor" signal complementing same-CWE peers.
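
The second finding, multi-iteration pair construction, can be illustrated as follows. The exact procedure in train.py is not described here, so this sketch rests on several assumptions: each iteration shuffles a CWE's CVEs and pairs adjacent items, duplicates are dropped across iterations, and the cap applies to the per-CWE total. The function name and toy IDs are hypothetical.

```python
import random

def build_same_cwe_pairs(cves_by_cwe, iterations=5, cap_per_cwe=5000, seed=42):
    """Build deduplicated (cve_a, cve_b, cwe_id) pairs from repeated shuffles."""
    rng = random.Random(seed)
    pairs = []
    for cwe_id, cve_ids in sorted(cves_by_cwe.items()):
        seen = set()
        for _ in range(iterations):
            shuffled = list(cve_ids)
            rng.shuffle(shuffled)
            # Pair neighbours in shuffled order: (0, 1), (2, 3), ...
            for a, b in zip(shuffled[0::2], shuffled[1::2]):
                key = (a, b) if a < b else (b, a)
                if len(seen) < cap_per_cwe:
                    seen.add(key)  # set membership removes repeats
        pairs.extend((a, b, cwe_id) for a, b in sorted(seen))
    return pairs

# Toy input with placeholder CVE names.
toy = {"CWE-79": ["a", "b", "c", "d", "e", "f"], "CWE-89": ["x", "y"]}
single = build_same_cwe_pairs(toy, iterations=1)
multi = build_same_cwe_pairs(toy, iterations=5)
print(len(single), len(multi))
```

A single shuffle pairs each CVE with only one partner; repeating the shuffle exposes each CVE to several different same-class partners, which is the extra contrastive variance the finding refers to.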

Related artifacts

Repo	Purpose
`oneryalcin/static-retrieval-cve-en-v2-longtail`	Successor with synthetic-data augmentation for rare CWEs (doubles rare-class Acc@1)
`oneryalcin/cve-synthetic-rare-cwes-v1`	The 4,417 synthetic CVEs used to train v2-longtail

Limitations

Rare-class blind spot. Acc@1 is 0.143 on CWE classes with <50 real pairs and 0.083 on classes with 50-200, versus 0.693 on classes with >=1000 pairs.
No negation / sentiment handling. Static embeddings cannot distinguish "exceeded expectations" from "missed expectations" — for security retrieval where directional words matter, pair with a cross-encoder reranker.
English only. All training data is English-language CVE descriptions.
CVE-ID strings are not whole tokens. CVE-2021-44228 still fragments — special-token mining favored class-vocab over per-CVE identifiers (which would balloon the embedding table).
Some "misses" reflect label ambiguity. CWE-189 ↔ CWE-190 and CWE-264 ↔ CWE-287 are confusion pairs where the ground-truth CWE labels themselves are partially overlapping in the MITRE hierarchy.

License

Apache-2.0 (matching the underlying dataset and CWE catalog).

Citation

@misc{oneryalcin_cve_static_p3,
  author = {Mehmet Oner Yalcin},
  title  = {static-retrieval-cve-en-v1: a static-embedding CVE retrieval model},
  year   = {2026},
  howpublished = {Hugging Face model},
  url    = {https://huggingface.co/oneryalcin/static-retrieval-cve-en-v1},
}
