static-retrieval-cve-en-v1
A 1024-dim StaticEmbedding model fine-tuned for retrieving CVE descriptions by vulnerability class (same CWE). 41M parameters; runs ~185× faster on CPU than sentence-transformers/all-mpnet-base-v2 (see "Inference speed (CPU)").
| Property | Value |
|---|---|
| Architecture | StaticEmbedding(EmbeddingBag(40522, 1024)) |
| Output dim | 1024 (Matryoshka-truncatable to 512 / 256 / 128 / 64 / 32) |
| Vocab | bert-base-uncased (30,522) + 10,000 corpus-mined security-specific tokens |
| Training data | 521k pairs from real CVE corpus (same-CWE + CWE-definition pairs) |
| Trained on | M4 Max MPS, 12 min wall time |
| Best for | balanced general use; strong on common-CWE CVEs |
This is v1 of a multi-phase ablation series to push static-embedding quality on the CVE corpus. The full result trio (v1 / v2-longtail) covers a Pareto frontier of recall vs precision trade-offs; this model is the strongest baseline before synthetic-data augmentation. See oneryalcin/static-retrieval-cve-en-v2-longtail for the rare-class-improved variant.
⚠️ Scope and limitations (read first)
This model is specialized for one narrow task and does NOT generalize to all CVE retrieval use cases. Please read this section before adopting it.
What this IS for
Given an NVD-formatted CVE description (~300-char paragraph, NVD prose style), retrieve other CVEs sharing the same MITRE CWE class from a corpus of CVE descriptions. On that task it dominates general-purpose models — see numbers below.
Real deployment fits include:
- First-stage CWE-tagging — untagged CVE in, same-CWE peers out, vote on class
- Active learning / borderline-case mining for CWE classifiers
- CWE-balanced retrieval when you want diverse vulnerability classes rather than a flood of one type
- Cheap full-corpus embedding on CPU (~185× faster than mpnet; encodes 1M CVEs in ~70 sec, or the ~300k-entry NVD catalog in under 25 sec)
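The first deployment fit (untagged CVE in, same-CWE peers out, vote on class) can be sketched as below. This is a minimal illustration, not the author's pipeline: the function name `vote_cwe`, the similarity-weighted voting rule, and the toy neighbour list are all assumptions. In practice the `(cwe_id, similarity)` pairs would come from `model.similarity` between the untagged CVE and an already-tagged corpus.

```python
from collections import defaultdict


def vote_cwe(neighbours, k=10):
    """Pick a CWE label from the top-k (cwe_id, similarity) neighbours.

    Each neighbour votes with its similarity score, so one very close
    peer can outweigh several distant ones (an assumed voting rule).
    """
    top = sorted(neighbours, key=lambda pair: pair[1], reverse=True)[:k]
    scores = defaultdict(float)
    for cwe_id, similarity in top:
        scores[cwe_id] += similarity
    # Iterate over sorted ids so ties resolve deterministically.
    return max(sorted(scores), key=lambda cwe_id: scores[cwe_id])


# Toy neighbours (made-up scores) standing in for real retrieval output.
neighbours = [("CWE-79", 0.81), ("CWE-89", 0.78), ("CWE-79", 0.74), ("CWE-22", 0.40)]
predicted = vote_cwe(neighbours, k=3)
print(predicted)
```

Weighting by similarity rather than counting heads is one design choice among several; a plain majority vote or a score threshold for "abstain" may suit a given corpus better.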
What this is NOT for
We measured these explicitly and our model underperforms off-the-shelf alternatives here:
| Use case | Our model | Off-the-shelf alternative | Recommendation |
|---|---|---|---|
| Famous-vulnerability lookup ("log4shell", "heartbleed") | 0% Hit@10 on CVE-ID names | All models fail (need a glossary, not embeddings) | Use a CVE-ID lookup index, not embedding retrieval |
| Product/keyword search ("wordpress XSS", "apache RCE") | 42% Acc@1 | static-retrieval-mrl-en-v1: 79% • mpnet: 84% | Use the off-the-shelf models |
| Short informal queries (security analyst typing) | 41-45% overall Acc@1 | static-retrieval-mrl-en-v1: 54% • mpnet: 54% | Use the off-the-shelf models |
| Cross-format retrieval (GHSA / vendor advisories with short summary → long markdown) | 0.87-0.89 NDCG@10 | static-retrieval-mrl-en-v1: 0.91 | Use the off-the-shelf static model |
If your use case looks like the bottom table rows, use sentence-transformers/static-retrieval-mrl-en-v1 or BAAI/bge-large-en-v1.5 instead. Both are competitive or better on out-of-distribution security text.
Why the specialization trade-off exists
We trained this model on contrastive pairs of (CVE_i, CVE_j) where both descriptions share the same MITRE CWE class. The corpus-mined special tokens that boost in-domain performance (e.g. "stack-based buffer overflow" as a single token) help cluster CVEs by vulnerability class but reduce the model's ability to distinguish them by product, vendor, or topic. This is a classic narrow-task fine-tuning trade-off: dramatic gain on the trained metric, modest regression on adjacent tasks.
A v3 with mixed training objectives (same-CWE + product-similarity + cross-format pairs) could likely close the OOD gap, but does not exist yet.
Quick start
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("oneryalcin/static-retrieval-cve-en-v1")
cves = [
"Buffer overflow in Apache HTTP Server 2.4.49 mod_proxy allows remote attackers to execute arbitrary code via crafted Content-Length headers.",
"Cross-site scripting (XSS) vulnerability in WordPress plugin XYZ 3.1.0 allows attackers to inject arbitrary JavaScript via the search parameter.",
"SQL injection in MyApp 5.0 login form allows authentication bypass via crafted username field.",
]
embeddings = model.encode(cves)
print(embeddings.shape) # (3, 1024)
# Similarity
sim = model.similarity(embeddings, embeddings)
print(sim)
For lower-dim usage (faster retrieval at minimal quality cost), use Matryoshka truncation:
model = SentenceTransformer("oneryalcin/static-retrieval-cve-en-v1", truncate_dim=256)
embeddings = model.encode(cves)
print(embeddings.shape) # (3, 256)
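If full 1024-dim vectors are already stored, a lower-dim index can likely be derived without re-encoding by slicing the leading coordinates and re-normalizing. The sketch below assumes that behaviour matches what `truncate_dim` produces up to normalization. The helper name `truncate_and_normalize` is hypothetical, and the random matrix is only a stand-in for `model.encode(cves)` output.

```python
import numpy as np


def truncate_and_normalize(embeddings, dim):
    """Keep the first `dim` coordinates and rescale each row to unit length."""
    truncated = np.asarray(embeddings, dtype=np.float32)[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    # Guard against division by zero for an all-zero row.
    return truncated / np.clip(norms, 1e-12, None)


# Random stand-in for stored 1024-dim embeddings (not real model output).
rng = np.random.default_rng(0)
full = rng.normal(size=(3, 1024)).astype(np.float32)

small = truncate_and_normalize(full, 256)
cosine = small @ small.T  # rows are unit length, so dot product equals cosine
print(small.shape)
```

Re-normalizing after the slice matters when the downstream index uses dot product rather than cosine similarity.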
Performance
Held-out CVE → same-CWE retrieval
Deterministic split: 297 queries, 5,000 corpus, seed=42, no leakage. Comparison vs the off-the-shelf sentence-transformers/static-retrieval-mrl-en-v1 (50M general-English contrastive pairs):
| Model | NDCG@10 | MRR@10 | Acc@1 |
|---|---|---|---|
| Off-the-shelf static-retrieval-mrl-en-v1 | 0.3549 | 0.4979 | 0.4276 |
| This model (v1) | 0.5811 | 0.6814 | 0.6128 |
| Delta | +0.226 (+64%) | +0.184 | +0.185 |
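For readers unfamiliar with the three metrics in the table, here is a minimal per-query sketch under binary relevance (a retrieved CVE counts as relevant when it shares the query's CWE). The exact definitions used by the benchmark script are not shown in this card, so treat the formulas below as the conventional ones rather than a copy of the evaluator. The function names are illustrative.

```python
import math


def acc_at_1(relevant_flags):
    """1.0 if the top-ranked result is relevant, else 0.0."""
    return 1.0 if relevant_flags and relevant_flags[0] else 0.0


def mrr_at_k(relevant_flags, k=10):
    """Reciprocal rank of the first relevant result within the top k."""
    for rank, is_relevant in enumerate(relevant_flags[:k], start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(relevant_flags, n_relevant_in_corpus, k=10):
    """Binary-gain NDCG: DCG of the ranking divided by the best achievable DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, is_relevant in enumerate(relevant_flags[:k], start=1)
        if is_relevant
    )
    ideal_hits = min(k, n_relevant_in_corpus)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# One toy query: relevant results at ranks 2 and 3, two relevant docs in the corpus.
ranked = [False, True, True] + [False] * 7
print(acc_at_1(ranked), mrr_at_k(ranked), round(ndcg_at_k(ranked, 2), 4))
```

The reported numbers would then be the mean of each metric over all queries.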
Time-aware evaluation (true zero-shot)
Train on pre-2024 CVEs, eval on 2024+ CVEs (493 queries, 8000-doc mixed corpus). This is closer to real-world use:
| Model | NDCG@10 | Acc@1 |
|---|---|---|
| Off-the-shelf | 0.4511 | 0.5010 |
| This model | 0.7178 | 0.7241 |
The gap widens under time-aware eval — newer CVE descriptions follow more boilerplate phrasing that exactly matches this model's mined special tokens.
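The time-aware protocol amounts to a date cutoff rather than a random split. A minimal sketch follows, assuming each record carries a `published` date; the field name and the record layout are hypothetical, and the toy CVE IDs are placeholders.

```python
from datetime import date


def time_aware_split(records, cutoff=date(2024, 1, 1)):
    """Train on records published before `cutoff`, evaluate on the rest."""
    train = [r for r in records if r["published"] < cutoff]
    held_out = [r for r in records if r["published"] >= cutoff]
    return train, held_out


# Placeholder records; only the dates matter for the split.
records = [
    {"id": "CVE-A", "published": date(2023, 11, 2)},
    {"id": "CVE-B", "published": date(2024, 1, 1)},
    {"id": "CVE-C", "published": date(2024, 6, 30)},
]
train, held_out = time_aware_split(records)
print(len(train), len(held_out))
```

Splitting on publication date keeps every evaluation query strictly newer than anything seen in training, which is what makes the setting zero-shot with respect to time.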
Per-CWE class-size breakdown
| Class size (real-CVE pairs) | Acc@1 | NDCG@10 | Hit@10 |
|---|---|---|---|
| very rare (<50) | 0.143 | 0.274 | 0.43 |
| rare (50-200) | 0.083 | 0.051 | 0.083 |
| med (200-1000) | 0.375 | 0.362 | 0.700 |
| large (>=1000) | 0.693 | 0.654 | 0.895 |
Limitation: v1 performs near-baseline on rare CWE classes (<200 pairs). If long-tail recall matters for your application, prefer oneryalcin/static-retrieval-cve-en-v2-longtail.
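A breakdown like the one above can be produced by bucketing each query's CWE by its training-pair count and averaging Acc@1 per bucket. The sketch below is not `eval_per_cwe.py`. The half-open boundaries (200 counts as "med", 1000 as "large") are an assumption, since the table's ranges overlap at their edges, and the toy counts are invented.

```python
from collections import defaultdict


def class_size_bucket(n_pairs):
    """Map a CWE's real-CVE pair count to the table's bucket labels."""
    if n_pairs < 50:
        return "very rare (<50)"
    if n_pairs < 200:
        return "rare (50-200)"
    if n_pairs < 1000:
        return "med (200-1000)"
    return "large (>=1000)"


def acc_at_1_by_bucket(results, pairs_per_cwe):
    """results: iterable of (query_cwe, top1_cwe) tuples, one per query."""
    hits, totals = defaultdict(int), defaultdict(int)
    for query_cwe, top1_cwe in results:
        bucket = class_size_bucket(pairs_per_cwe[query_cwe])
        totals[bucket] += 1
        hits[bucket] += int(query_cwe == top1_cwe)
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


# Invented pair counts and predictions, for illustration only.
pairs_per_cwe = {"CWE-79": 5000, "CWE-1188": 30}
results = [("CWE-79", "CWE-79"), ("CWE-79", "CWE-89"), ("CWE-1188", "CWE-79")]
breakdown = acc_at_1_by_bucket(results, pairs_per_cwe)
print(breakdown)
```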
Training recipe
Architecture: StaticEmbedding(40522 × 1024) ≈ 41M params
Base tokenizer: google-bert/bert-base-uncased
Special tokens: 10,000 corpus-mined (uni-/bi-/tri-grams that fragment ≥3 subwords AND occur ≥5 times)
e.g. "cross-site scripting vulnerability" (3 → 1 token),
"stack-based buffer overflow" (5 → 1 token),
"x.509" (4 → 1 token), "out-of-bounds" (5 → 1 token)
Loss: MatryoshkaLoss(MultipleNegativesRankingLoss, dims=[1024,768,512,256,128,64,32])
Training pairs: 521k total
469k same-CWE pairs (5 deduplicated iterations, capped 5000/CWE)
52k CVE→CWE-definition pairs (from MITRE CWE catalog)
Batch: 512 (in-batch negatives = 511 per anchor)
Sampler: NO_DUPLICATES (mandatory for in-batch negatives)
LR: 2e-1 (high, standard for StaticEmbedding random init)
Epochs: 3 (peaked late epoch 1, oscillated thereafter)
Optimizer: AdamW + 5% linear warmup, linear decay
Precision: fp32 (MPS-safe; bf16 unreliable on Apple silicon)
Hardware: M4 Max MPS, 12 min wall
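The special-token rule in the recipe (uni-/bi-/tri-grams that fragment into ≥3 subwords and occur ≥5 times, keeping the 10,000 most frequent) can be sketched as follows. This is an illustration of the stated rule, not the mining code from `train.py`. The whitespace word split, the frequency-first ranking, and the regex-based `toy_subword_count` are assumptions. With the real tokenizer, the counter would be something like `len(tokenizer.encode(ngram, add_special_tokens=False).ids)` on a `tokenizers.Tokenizer`.

```python
import re
from collections import Counter


def toy_subword_count(text):
    """Stand-in for WordPiece: counts alphanumeric runs and punctuation marks."""
    return len(re.findall(r"[a-z0-9]+|[^a-z0-9\s]", text.lower()))


def mine_special_tokens(texts, count_subwords, min_subwords=3, min_count=5,
                        max_tokens=10_000):
    """Return frequent 1-3 word n-grams that the tokenizer fragments heavily."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        for n in (1, 2, 3):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    candidates = [
        (ngram, count) for ngram, count in counts.items()
        if count >= min_count and count_subwords(ngram) >= min_subwords
    ]
    # Most frequent first; alphabetical order breaks ties deterministically.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [ngram for ngram, _ in candidates[:max_tokens]]


corpus = ["stack-based buffer overflow in foo"] * 5 + ["buffer overflow in bar"] * 2
mined = mine_special_tokens(corpus, toy_subword_count)
print(mined[:3])
```

In the toy corpus, "buffer overflow" is frequent but splits into only two pieces, so it is skipped, while "stack-based buffer overflow" (five pieces) qualifies.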
Retrieval quality vs general-purpose encoders
Same held-out CVE→same-CWE eval (297 queries, 5,000 corpus, seed=42), comparing to popular general-English sentence-transformers:
| Model | Params | Dim | NDCG@10 | MRR@10 | Acc@1 |
|---|---|---|---|---|---|
| sentence-transformers/static-retrieval-mrl-en-v1 (reference static) | 31M | 1024 | 0.355 | 0.498 | 0.428 |
| sentence-transformers/all-mpnet-base-v2 (typical transformer) | 110M | 768 | 0.363 | 0.487 | 0.414 |
| sentence-transformers/all-MiniLM-L6-v2 (small transformer) | 22M | 384 | 0.374 | 0.505 | 0.434 |
| BAAI/bge-large-en-v1.5 (strong general-purpose, SOTA-class) | 335M | 1024 | 0.408 | 0.548 | 0.488 |
| oneryalcin/static-retrieval-cve-en-v1 (THIS MODEL) | 41M | 1024 | 0.581 (+42% over bge) | 0.681 | 0.613 |
| oneryalcin/static-retrieval-cve-en-v2-longtail (sibling: synth + balanced sampling) | 41M | 1024 | 0.568 (+39% over bge) | 0.678 | 0.620 |
Domain training beats general-purpose architecture here. Both our 41M-param token-bag models outperform a 335M-param SOTA general-English encoder (bge-large-en-v1.5) by a wide margin on in-domain CVE retrieval, and run ~185× faster on CPU than all-mpnet-base-v2 (see the speed table below). This is not a quality-vs-speed trade-off: for CVE retrieval by CWE, the domain-tuned static model wins on both axes.
Pick between v1 (best aggregate NDCG, strong on common CWEs) and v2-longtail (best Acc@1, doubles rare-class precision via synthetic-data augmentation) — see comparison details below.
Benchmark reproducer: quality_bench.py in this repo.
Inference speed (CPU)
Benchmarked on Apple M4 Max CPU, encoding 1000 real CVE descriptions (avg 328 chars), batch=128, best of 3 runs:
| Model | Dim | q/s on CPU | Time for 1M CVEs | vs mpnet-base-v2 |
|---|---|---|---|---|
| This model | 1024 | ~14,000 | ~70 sec | ~185× |
| sentence-transformers/static-retrieval-mrl-en-v1 (reference static) | 1024 | ~13,700 | ~73 sec | ~178× |
| sentence-transformers/all-MiniLM-L6-v2 (small transformer) | 384 | ~600 | ~27 min | ~8× |
| sentence-transformers/all-mpnet-base-v2 (typical transformer) | 768 | ~77 | ~3.6 hours | 1× |
Same architecture as the reference static model so the speed matches it. The entire current NVD CVE catalog (~300k entries) encodes in under 25 seconds on a single CPU core — no GPU required for deployment. (On Apple Silicon, CPU outperforms MPS for this model since StaticEmbedding is memory-bound rather than compute-bound — the GPU transfer overhead exceeds the parallelism gain.)
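The "Time for 1M CVEs" column and the NVD estimate follow directly from the measured throughput. The sketch below redoes that arithmetic and shows a best-of-N timing helper of the kind the benchmark describes. It is not `speed_bench.py`, and the helper names are illustrative.

```python
import time


def best_of(fn, runs=3):
    """Run `fn` several times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def seconds_to_encode(n_docs, docs_per_second):
    """Projected encoding time at a fixed throughput."""
    return n_docs / docs_per_second


# Throughput figures taken from the table above.
print(round(seconds_to_encode(1_000_000, 14_000)))        # this model, 1M CVEs
print(round(seconds_to_encode(300_000, 14_000)))          # this model, ~300k NVD entries
print(round(seconds_to_encode(1_000_000, 77) / 3600, 1))  # mpnet-base-v2, hours

# With a loaded model, throughput could be measured along these lines:
#   elapsed = best_of(lambda: model.encode(texts, batch_size=128))
#   docs_per_second = len(texts) / elapsed
```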
Benchmark reproducer: speed_bench.py in this repo.
Out-of-distribution generalization (measured)
To check whether the in-distribution wins above generalize, we ran the model on three out-of-distribution tests. Results posted honestly so you can judge fit for your use case.
Test 1: GHSA cross-corpus, summary → description (asymmetric, short-query)
GitHub Security Advisories (gh api /advisories) have a different style than NVD (Markdown sections, ~10× longer descriptions, ~64-char summaries). We sampled 10k GHSAs, took 300 short summaries as queries against a 5,000-doc corpus of GHSA descriptions, same-CWE retrieval:
| Model | NDCG@10 | Acc@1 |
|---|---|---|
| sentence-transformers/static-retrieval-mrl-en-v1 (baseline) | 0.910 | 0.910 |
| sentence-transformers/all-mpnet-base-v2 | 0.903 | 0.903 |
| v2-longtail (ours) | 0.887 | 0.887 |
| sentence-transformers/all-MiniLM-L6-v2 | 0.873 | 0.873 |
| v1 (ours) | 0.867 | 0.867 |
Off-the-shelf baseline wins. Our domain-tuned models trail it: v2-longtail lands mid-table and v1 ranks last. The lift we get on in-distribution evals (NVD descriptions) does not transfer to short-query GHSA retrieval.
Reproducer: eval_ghsa.py + fetch_ghsa.py in this repo.
Test 2: Hand-built informal queries against NVD corpus
104 short informal queries representing how a security analyst, SOC operator, or threat-intel researcher would actually search. Categories: famous-vuln names (log4shell), CWE-class descriptions (SQL injection in login form), product mentions (wordpress plugin XSS), question form (how does TOCTOU work), typos. For each, an "expected" target (CVE-ID, CWE-ID, or keyword) was hand-specified. Searched against 30,000 real NVD CVEs.
| Model | Acc@1 (top-1 hit) | Hit@10 (in top-10) |
|---|---|---|
| sentence-transformers/static-retrieval-mrl-en-v1 (baseline) | 0.538 | 0.721 |
| sentence-transformers/all-mpnet-base-v2 | 0.538 | 0.731 |
| v1 (ours) | 0.452 | 0.654 |
| v2-longtail (ours) | 0.413 | 0.635 |
Breakdown by query type:
| Query type (n) | v1 | v2-longtail | baseline | mpnet |
|---|---|---|---|---|
| Famous vuln names ("log4shell") (27) | 0.000 | 0.000 | 0.000 | 0.000 |
| CWE-class queries ("SQL injection") (39) | 0.692 (best) | 0.692 | 0.667 | 0.615 |
| Product/keyword ("wordpress XSS") (38) | 0.421 | 0.421 | 0.789 | 0.842 |
- Famous vulnerability names: every model scores zero. Pop-culture names like "log4shell", "heartbleed", "zerologon" don't appear in CVE description text. This needs a name → CVE-ID lookup table, not embedding retrieval.
- CWE-class queries: our models win. v1 and v2-longtail tie at 0.692, ahead of the baseline (0.667) and mpnet (0.615). Strongest for class-level semantic queries.
- Product/keyword queries: our models LOSE by 30+ percentage points. Our embedding geometry clusters by CWE-class, not by product/vendor. Asking "find wordpress vulnerabilities" returns 3-4 of 10 actual WordPress CVEs (the rest are same-vuln-class CVEs in other products). The baseline returns 7-8.
Reproducer: informal_queries.py in this repo.
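Since famous-vulnerability names need a name → CVE-ID lookup rather than embedding retrieval, a thin resolver can sit in front of the model. The sketch below is one possible shape for it: the `ALIASES` table is a three-entry illustration (a real glossary would be far larger and curated separately), and `resolve_cve_id` is a hypothetical helper name. Queries that resolve to `None` would fall through to embedding search.

```python
import re

# Matches explicit identifiers such as "CVE-2021-44228" (case-insensitive).
CVE_ID = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)

# Tiny illustrative glossary of well-known names.
ALIASES = {
    "log4shell": "CVE-2021-44228",
    "heartbleed": "CVE-2014-0160",
    "zerologon": "CVE-2020-1472",
}


def resolve_cve_id(query):
    """Return a CVE ID for explicit IDs or known nicknames, else None."""
    match = CVE_ID.search(query)
    if match:
        return match.group(0).upper()
    for word in re.findall(r"[a-z0-9]+", query.lower()):
        if word in ALIASES:
            return ALIASES[word]
    return None


print(resolve_cve_id("is log4shell patched?"))
print(resolve_cve_id("wordpress plugin xss"))
```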
Takeaway
Our model is overspecialized to NVD-style same-CWE retrieval. The specialization that lets us beat bge-large by 40% on the trained task also makes the model less useful for general security search. For programmatic CVE→CWE retrieval pipelines this is a feature. For human-typed security search this is a bug. Pick accordingly.
Reproducibility
The training script train.py is included in this repo. To reproduce from scratch:
pip install "sentence-transformers[train]>=5.5" "datasets>=2.19" "tokenizers>=0.20" "accelerate>=0.26"
python train.py
Training is deterministic (seed=42). On comparable hardware you should reach NDCG@10 within ±0.01 of 0.5811.
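For orientation, this is roughly how the recipe's settings could map onto the sentence-transformers training API. It is a sketch under assumptions, not the contents of `train.py`. The function `build_trainer`, the `output_dir` value, and the expectation that `train_dataset` is a `datasets.Dataset` with two text columns (anchor, positive) are all hypothetical. Import paths and argument names reflect the library as commonly documented and may differ in the version pinned above.

```python
# Values copied from the training recipe in this card.
MATRYOSHKA_DIMS = [1024, 768, 512, 256, 128, 64, 32]
BATCH_SIZE = 512
IN_BATCH_NEGATIVES = BATCH_SIZE - 1  # every other positive in the batch


def build_trainer(train_dataset, extra_tokens, output_dir="static-cve-sketch"):
    """Assemble a trainer for a randomly initialised StaticEmbedding model."""
    # Local imports: the heavy training dependencies load only when called.
    from tokenizers import Tokenizer
    from sentence_transformers import (
        SentenceTransformer,
        SentenceTransformerTrainer,
        SentenceTransformerTrainingArguments,
    )
    from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
    from sentence_transformers.models import StaticEmbedding
    from sentence_transformers.training_args import BatchSamplers

    tokenizer = Tokenizer.from_pretrained("google-bert/bert-base-uncased")
    tokenizer.add_tokens(list(extra_tokens))  # corpus-mined phrases become single tokens
    model = SentenceTransformer(modules=[StaticEmbedding(tokenizer, embedding_dim=1024)])

    loss = MatryoshkaLoss(
        model, MultipleNegativesRankingLoss(model), matryoshka_dims=MATRYOSHKA_DIMS
    )
    args = SentenceTransformerTrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,
        per_device_train_batch_size=BATCH_SIZE,
        learning_rate=2e-1,
        warmup_ratio=0.05,
        lr_scheduler_type="linear",
        batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoids duplicate texts as false negatives
        seed=42,
    )
    return SentenceTransformerTrainer(
        model=model, args=args, train_dataset=train_dataset, loss=loss
    )
```

Calling `build_trainer(dataset, mined_tokens).train()` would start a run. Device placement and precision are left at library defaults here, whereas the recipe specifies fp32 on MPS.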
Companion evaluation script eval_per_cwe.py is also included — runs the per-CWE breakdown table above against any saved model.
Method notes
This model was built using the train-sentence-transformers v5.5.0 agent skill. Three findings drove the recipe:
- Corpus-mined special tokens dominate the quality gains. Adding 10k frequency-weighted tokens dropped query subword count by 12.8% and contributed most of the lift from 0.355 → 0.581 NDCG@10.
- Multi-iteration same-CWE pair construction (5 deduplicated shuffles vs single shuffle) added meaningful contrastive variance — pairs grew from 94k → 469k for the same source data.
- CWE-definition pairs (CVE description ↔ MITRE CWE name+description text) provided a "categorical anchor" signal complementing same-CWE peers.
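The multi-iteration pair construction in the second finding can be illustrated as repeated shuffle-and-pair passes with deduplication. The details below are assumptions, since the card does not spell them out: pairing adjacent items after each shuffle, treating (a, b) and (b, a) as the same pair, and applying the 5000 cap to members per CWE. The function name is hypothetical.

```python
import random


def same_cwe_pairs(cves_by_cwe, iterations=5, cap_per_cwe=5000, seed=42):
    """Build deduplicated (anchor, positive) pairs within each CWE class."""
    rng = random.Random(seed)
    seen, pairs = set(), []
    for cwe_id in sorted(cves_by_cwe):
        members = list(cves_by_cwe[cwe_id])[:cap_per_cwe]
        for _ in range(iterations):
            rng.shuffle(members)
            # Pair neighbours in the shuffled order: (0, 1), (2, 3), ...
            for a, b in zip(members[0::2], members[1::2]):
                key = frozenset((a, b))
                if len(key) == 2 and key not in seen:
                    seen.add(key)
                    pairs.append((a, b))
    return pairs


# Toy groups; real input would map each CWE ID to its CVE descriptions.
groups = {"CWE-79": ["a", "b", "c", "d"], "CWE-89": ["x", "y"]}
pairs = same_cwe_pairs(groups)
print(len(pairs))
```

Each extra pass exposes a description to a different same-class partner, which is where the additional contrastive variety comes from. On small classes the deduplication quickly exhausts the possible pairs.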
Related artifacts
| Repo | Purpose |
|---|---|
| oneryalcin/static-retrieval-cve-en-v2-longtail | Successor with synthetic-data augmentation for rare CWEs (doubles rare-class Acc@1) |
| oneryalcin/cve-synthetic-rare-cwes-v1 | The 4,417 synthetic CVEs used to train v2-longtail |
Limitations
- Rare-class blind spot. Acc@1 is 0.083 on CWE classes with 50-200 real pairs and 0.143 on classes with <50, versus 0.693 on classes with >=1000.
- No negation / sentiment handling. Static embeddings cannot distinguish "exceeded expectations" from "missed expectations" — for security retrieval where directional words matter, pair with a cross-encoder reranker.
- English only. All training data is English-language CVE descriptions.
- CVE-ID strings are not whole tokens. CVE-2021-44228 still fragments: special-token mining favored class vocabulary over per-CVE identifiers (which would balloon the embedding table).
- Some "misses" reflect label ambiguity. CWE-189 ↔ CWE-190 and CWE-264 ↔ CWE-287 are confusion pairs where the ground-truth CWE labels themselves partially overlap in the MITRE hierarchy.
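The reranker pairing suggested above is a two-stage pattern: the static model supplies a cheap candidate list, and a slower pairwise scorer reorders it. The sketch below keeps the scorer abstract. `rerank` and `toy_overlap_score` are illustrative names, and the token-overlap scorer is only a placeholder. A real setup might pass a function wrapping a sentence-transformers `CrossEncoder` and its `predict` method on `(query, text)` pairs.

```python
def rerank(query, candidates, score_fn, top_n=10):
    """Reorder first-stage candidates by a pairwise relevance score."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def toy_overlap_score(query, text):
    """Placeholder scorer: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(text.lower().split()))


# Candidates as they might come back from the static model (toy strings).
candidates = [
    "sql injection in login",
    "buffer overflow in parser",
    "overflow in image parser heap",
]
top = rerank("heap overflow in parser", candidates, toy_overlap_score, top_n=2)
print(top)
```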
License
Apache-2.0 (matching the underlying dataset and CWE catalog).
Citation
@misc{oneryalcin_cve_static_p3,
author = {Mehmet Oner Yalcin},
title = {static-retrieval-cve-en-v1: a static-embedding CVE retrieval model},
year = {2026},
howpublished = {Hugging Face model},
url = {https://huggingface.co/oneryalcin/static-retrieval-cve-en-v1},
}