metadata

pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
  - transformers

keyphrase-mpnet-v1

This is a sentence-transformers model specialized for phrases: it maps phrases to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search. In the original paper, it is used to compute semantic-based evaluation metrics for keyphrase models.

This model is based on sentence-transformers/all-mpnet-base-v2 and was further fine-tuned with SimCSE on about 1 million keyphrases.

Citing & Authors

Paper: KPEval: Towards Fine-grained Semantic-based Evaluation of Keyphrase Extraction and Generation Systems

@article{wu2023kpeval,
      title={KPEval: Towards Fine-grained Semantic-based Evaluation of Keyphrase Extraction and Generation Systems}, 
      author={Di Wu and Da Yin and Kai-Wei Chang},
      year={2023},
      eprint={2303.15422},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Usage (Sentence-Transformers)

Using this model is straightforward once sentence-transformers is installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer
phrases = ["information retrieval", "text mining", "natural language processing"]

model = SentenceTransformer('uclanlp/keyphrase-mpnet-v1')
embeddings = model.encode(phrases)
print(embeddings)
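
For the semantic search use case mentioned above, embeddings are usually compared with cosine similarity. The following is a minimal sketch of that ranking step in plain Python. The helper names `cosine_similarity` and `rank_by_similarity` and the toy 3-dimensional vectors are assumptions made for illustration; real embeddings returned by `model.encode(...)` have 768 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, candidate_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, c)) for i, c in enumerate(candidate_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors, made up for illustration (not real model output).
query = [1.0, 0.0, 1.0]
candidates = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

ranking = rank_by_similarity(query, candidates)
print(ranking)  # candidate 1 ranks first, then 2, then 0
```

With sentence-transformers installed, `sentence_transformers.util.cos_sim` computes the same pairwise scores on the real embeddings.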

Usage (HuggingFace Transformers)

Without sentence-transformers, you can use the model as follows: first pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Phrases we want embeddings for
phrases = ["information retrieval", "text mining", "natural language processing"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('uclanlp/keyphrase-mpnet-v1')
model = AutoModel.from_pretrained('uclanlp/keyphrase-mpnet-v1')

# Tokenize phrases
encoded_input = tokenizer(phrases, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Phrase embeddings:")
print(sentence_embeddings)
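
To see why the attention mask matters in the pooling step, here is a small plain-Python sketch of the same idea for a single phrase. The function name `masked_mean` and the toy 2-dimensional token vectors are assumptions for illustration, not part of the model's code; the point is only that padded positions are left out of the average.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + x for t, x in zip(total, vec)]
            count += 1
    # Guard against division by zero for an all-padding input,
    # mirroring the clamp(min=1e-9) used in mean_pooling.
    return [t / max(count, 1e-9) for t in total]


# Two real tokens followed by one padding position (toy values).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

Without the mask, the padding vector would be averaged in and distort the phrase embedding.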

Training

The model was trained on phrases from four keyphrase datasets covering a wide range of domains.

Dataset Name	Domain	Number of Phrases
KP20k	Science	715369
KPTimes	News	113456
StackEx	Online Forum	8149
OpenKP	Web	200335
Total		1037309

The model was trained with the parameters:

DataLoader:

torch.utils.data.dataloader.DataLoader of length 2025 with parameters:

{'batch_size': 512, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss:

sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss with parameters:

{'scale': 20.0, 'similarity_fct': 'cos_sim'}
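
As a rough illustration of what this loss optimizes, the sketch below computes it by hand in plain Python: each anchor is scored against every positive in the batch with cosine similarity, the scores are multiplied by `scale`, and cross-entropy is taken with the matching positive as the correct class, so all other items in the batch act as negatives. This is a simplified sketch under those assumptions, not the sentence-transformers implementation; the function names and toy vectors are hypothetical.

```python
import math


def cos_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for
    anchors[i] and every other positives[j] is an in-batch negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy 2-dimensional "embeddings", made up for illustration.
anchors = [[1.0, 0.0], [0.0, 1.0]]
matched = [[1.0, 0.1], [0.1, 1.0]]     # each positive is close to its anchor
mismatched = [[0.1, 1.0], [1.0, 0.1]]  # positives swapped

loss_matched = multiple_negatives_ranking_loss(anchors, matched)
loss_mismatched = multiple_negatives_ranking_loss(anchors, mismatched)
print(loss_matched, loss_mismatched)  # the matched pairing gives the lower loss
```

With a batch size of 512, each phrase is contrasted against 511 in-batch negatives per step.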

Parameters of the fit()-Method:

{
    "epochs": 1,
    "evaluation_steps": 0,
    "evaluator": "NoneType",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 1e-06
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 203,
    "weight_decay": 0.01
}
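
The `WarmupLinear` scheduler raises the learning rate linearly from 0 to `lr` over `warmup_steps`, then decays it linearly back to 0 by the end of training. The sketch below reproduces that shape in plain Python. It assumes the total number of steps is the DataLoader length (2025) times one epoch, which is inferred from the values listed above rather than stated by the authors; the function name `warmup_linear_lr` is hypothetical.

```python
def warmup_linear_lr(step, base_lr=1e-06, warmup_steps=203, total_steps=2025):
    """Learning rate at a given optimizer step under linear warmup
    followed by linear decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 100, 203, 1000, 2025):
    print(step, warmup_linear_lr(step))
```

Under this assumption, the 203 warmup steps cover roughly the first 10% of training.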

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 12, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)