MiniLM-searchkeys

MiniLM-searchkeys is a sentence-transformers encoder fine-tuned for multi-domain keyphrase ranking as part of SearchKeys, a retrieval-augmented encoder-only alternative to seq2seq keyphrase prediction. It maps documents and candidate keyphrases into a shared 384-dimensional space, where cosine similarity reflects keyphrase relevance — including for absent keyphrases (phrases that do not literally appear in the source text).

This model is the fine-tuned encoder described in:

Saber Zahhar, Nédra Mellouli, Christophe Rodrigues, Nicolas Travers. Multi-Domain Keyphrase Prediction via Retrieval-Augmented Ranking: A Resource-Efficient Alternative to Seq2Seq Generation. DKE 2026.

Code and full pipeline: github.com/saberzahhar/dke2026kp

How it works: SearchKeys

Instead of generating keyphrases token-by-token, SearchKeys retrieves and ranks:

  1. Candidate pooling — a test document is queried (via BM25) against an indexed training corpus, and the gold keyphrases of the top-D nearest training documents are pooled as candidates.
  2. Semantic ranking — this model embeds the test document and every candidate keyphrase, then candidates are ranked by cosine similarity (optionally weighted by candidate recurrence frequency across the retrieved pool).

Because candidates come from a real annotated corpus rather than a vocabulary distribution, SearchKeys can surface relevant keyphrases that are absent from the source document, which generative decoders struggle to produce.
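
The candidate pooling step can be sketched in a few lines. This is a minimal illustration, not the pipeline from the repository: the whitespace tokenizer, the BM25 parameters `k1` and `b`, the function names and the three-document toy corpus are all assumptions made for the example.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive lowercase whitespace tokenizer (assumption, for illustration only).
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with plain Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def pool_candidates(query, train_docs, train_keyphrases, depth=7):
    """Pool the gold keyphrases of the `depth` best-scoring training documents.

    Returns a Counter mapping each candidate to the number of retrieved
    documents that carry it (its recurrence frequency in the pool).
    """
    scores = bm25_scores(query, train_docs)
    top = sorted(range(len(train_docs)), key=lambda i: scores[i], reverse=True)[:depth]
    pool = Counter()
    for i in top:
        pool.update(set(train_keyphrases[i]))
    return pool

# Toy annotated corpus (hypothetical data).
train_docs = [
    "feedback vertex set in graphs of bounded degeneracy",
    "approximation algorithms for vertex cover in sparse graphs",
    "deep learning for protein structure prediction",
]
train_keyphrases = [
    ["feedback vertex set", "decycling set", "graph algorithms"],
    ["vertex cover", "graph algorithms"],
    ["protein folding", "deep learning"],
]

pool = pool_candidates(
    "a feedback vertex set of 2-degenerate graphs",
    train_docs, train_keyphrases, depth=2,
)
print(pool)
```

Note that "decycling set" enters the pool although it never occurs in the query document, which is how absent keyphrases become reachable. The returned counts are the recurrence frequencies that the ranking step can optionally use as a weight.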

Training procedure

This model starts from the pretrained sentence-transformers/all-MiniLM-L12-v2 checkpoint and is fine-tuned with a contrastive, multi-label objective tailored to keyphrase ranking.

  • Loss: Multi-Label Supervised Contrastive Learning (MulSupCon), which constructs one positive set per gold keyphrase label rather than a single anchor–positive–negative triplet, so documents sharing more keyphrases pull on each other proportionally harder. This consistently outperformed a masked Multiple Negatives Ranking (mMNR) alternative, especially on absent keyphrases.
  • Batching: retrieval-guided batches of 384 documents, built by iteratively adding the nearest unselected neighbours of a seed document, to form semantically coherent batches (and therefore more informative in-batch negatives/positives).
  • Curriculum: batches presented in order of average loss (easy → hard) outperformed random batch ordering.
  • Schedule: 10 epochs, AdamW, learning rate 2e-5, weight decay 0.01, cosine LR decay, warm-up ratio 0.1, max sequence length 512 tokens, contrastive temperature 0.05.
  • Checkpoint selection: validation F1@5 plateaus around epoch 7, so the epoch-7 checkpoint is the one released here.
  • Compute: fine-tuning took ~4.5 hours/epoch on a single NVIDIA Tesla V100 32GB.
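
To make the per-label positive sets in the loss bullet concrete, here is a small pure-Python sketch of a MulSupCon-style loss over one batch. It is one reading of the idea described above, not the authors' training code: the function name, the averaging over (anchor, label) terms and the toy embeddings are assumptions, and embeddings are assumed to be L2-normalised so that the dot product equals cosine similarity.

```python
import math

def mulsupcon_loss(embeddings, labels, temperature=0.05):
    """Per-label supervised contrastive loss over one batch.

    embeddings: list of L2-normalised vectors (lists of floats).
    labels: list of sets of keyphrase labels, one set per document.
    """
    n = len(embeddings)
    total, n_terms = 0.0, 0
    for i in range(n):
        others = [a for a in range(n) if a != i]
        if not others:
            continue
        # Temperature-scaled similarity between the anchor and every other document.
        logits = {
            a: sum(x * y for x, y in zip(embeddings[i], embeddings[a])) / temperature
            for a in others
        }
        # Log of the softmax denominator, computed with the max trick for stability.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        for label in labels[i]:
            # One positive set per gold label of the anchor.
            positives = [a for a in others if label in labels[a]]
            if not positives:
                continue
            total += -sum(logits[p] - log_denom for p in positives) / len(positives)
            n_terms += 1
    # Averaging over the (anchor, label) terms is an assumption of this sketch.
    return total / n_terms if n_terms else 0.0

# Toy batch: documents 0 and 1 share the label "a" (hypothetical data).
batch_labels = [{"a", "b"}, {"a"}, {"c"}]
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]     # positives embedded together
misaligned = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # positives embedded apart

loss_aligned = mulsupcon_loss(aligned, batch_labels)
loss_misaligned = mulsupcon_loss(misaligned, batch_labels)
print(loss_aligned, loss_misaligned)
```

A pair of documents that shares k labels appears in k positive sets, so it contributes k attraction terms. That is the sense in which documents sharing more keyphrases pull on each other harder.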

Training data

Fine-tuning combines three multi-domain keyphrase datasets, used jointly (not domain-specialized):

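| Dataset  | Domain                                                | Train docs |
|----------|-------------------------------------------------------|------------|
| kp20k    | Computer science (ACM DL, ScienceDirect, Wiley, etc.) | 530.8k     |
| kpbiomed | Biomedical (PubMed)                                   | 500k       |
| kptimes  | News (NYTimes / Japan Times)                          | 259.9k     |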

Intended uses

This model is intended to be used as the ranking encoder in a retrieval-augmented keyphrase prediction pipeline, not as a general-purpose sentence embedder. Given a document and a pool of candidate keyphrases (e.g. pooled via BM25 retrieval from an annotated corpus), it produces embeddings whose cosine similarity is a strong relevance signal for both present and absent keyphrases.

It is best paired with:

  • a lexical retriever (BM25 performed best as the retrieval backbone in our experiments, outperforming dense retrievers including all-MiniLM-L12-v2 and mxbai-embed-large-v1 on candidate recall);
  • a retrieval depth of D≈7 neighbours for candidate pooling;
  • cosine similarity weighted by candidate frequency across the retrieved pool as the final scoring function (no diversification/MMR penalty).
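
The final scoring function can be illustrated as follows. The text above does not spell out the exact form of the frequency weighting, so the `1 + log(frequency)` multiplier is an assumption chosen for illustration, as are the function name and the toy similarity values (they are not model outputs).

```python
import math

def weighted_rank(cosine, frequency, top_k=5):
    """Rank candidates by cosine similarity scaled by a log-frequency weight.

    cosine: dict mapping candidate -> cosine similarity with the document.
    frequency: dict mapping candidate -> count across the retrieved pool.
    """
    scored = {
        phrase: cosine[phrase] * (1.0 + math.log(max(1, frequency.get(phrase, 1))))
        for phrase in cosine
    }
    # No diversification or MMR penalty: sort by the weighted score only.
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Hypothetical values for three pooled candidates.
cosine = {"feedback vertex set": 0.71, "decycling set": 0.64, "rank-width": 0.66}
frequency = {"feedback vertex set": 4, "decycling set": 3, "rank-width": 1}

ranking = weighted_rank(cosine, frequency)
print([phrase for phrase, _ in ranking])
```

In this toy case "decycling set" overtakes "rank-width" despite a slightly lower cosine similarity, because it recurs in three of the retrieved neighbours instead of one.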

By default, input text longer than 512 word pieces is truncated.

Usage (Sentence-Transformers)

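Install the library:

pip install -U sentence-transformers

Then embed a document and its candidate keyphrases, and rank the candidates by cosine similarity:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("sabsab129/MiniLM-searchkeys")

document = "A feedback vertex set of 2-degenerate graphs..."
# In the full pipeline these candidates come from the retrieved pool.
candidates = [
    "feedback vertex set",
    "decycling set",
    "2-degenerate graphs",
    "rank-width",
    "fixed-parameter algorithm",
]

# One 384-dimensional embedding for the document, one per candidate.
doc_embedding = model.encode(document)
candidate_embeddings = model.encode(candidates)

# cos_sim returns a 1 x len(candidates) matrix; take its only row.
scores = cos_sim(doc_embedding, candidate_embeddings)[0]
ranked = sorted(zip(candidates, scores.tolist()), key=lambda x: x[1], reverse=True)
print(ranked)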
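
To compare such a ranking against gold keyphrases, the F1@5 metric mentioned under Training procedure can be sketched as below. This version assumes exact string matching after lowercasing and divides precision by the number of phrases actually predicted; the paper's evaluation may normalise phrases differently (for example with stemming). The gold list is hypothetical.

```python
def f1_at_k(ranked_phrases, gold_phrases, k=5):
    """F1 between the top-k predictions and the gold set (exact match, lowercased)."""
    predicted = [p.lower().strip() for p in ranked_phrases[:k]]
    gold = {g.lower().strip() for g in gold_phrases}
    if not predicted or not gold:
        return 0.0
    correct = len(set(predicted) & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

# Phrases in ranked order (scores dropped) and a hypothetical gold annotation.
ranked_phrases = [
    "feedback vertex set",
    "2-degenerate graphs",
    "decycling set",
    "rank-width",
    "fixed-parameter algorithm",
]
gold_phrases = ["feedback vertex set", "decycling set", "degenerate graphs"]

score = f1_at_k(ranked_phrases, gold_phrases)
print(score)
```

Two of the five predictions match the three gold phrases, giving a precision of 2/5 and a recall of 2/3.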

Citation

If you use this model, please cite the paper:

@inproceedings{zahhar2026searchkeys,
  title     = {Multi-Domain Keyphrase Prediction via Retrieval-Augmented Ranking: A Resource-Efficient Alternative to Seq2Seq Generation},
  author    = {Zahhar, Saber and Mellouli, N{\'e}dra and Rodrigues, Christophe and Travers, Nicolas},
  booktitle = {DKE 2026},
  year      = {2026}
}

Acknowledgements

This model is fine-tuned from sentence-transformers/all-MiniLM-L12-v2, originally developed by the Sentence-Transformers team during the Hugging Face JAX/Flax Community Week, based on microsoft/MiniLM-L12-H384-uncased.

Model size: 33.4M parameters, stored as F32 Safetensors.