xiaowu0162 committed f6d4e26 (parent: 0113c8b): Update README.md

README.md CHANGED
@@ -12,8 +12,7 @@ tags:
 
 This is a [sentence-transformers](https://www.SBERT.net) model specialized for phrases: It maps phrases to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
 
-This model is based on [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) and further fine-tuned on
-
+This model is based on [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) and further fine-tuned on 1 million keyphrase data with SimCSE.
 
 ## Citing & Authors
 Paper: [KPEval: Towards Fine-grained Semantic-based Evaluation of Keyphrase Extraction and Generation Systems](https://arxiv.org/abs/2303.15422)
@@ -40,10 +39,10 @@ Then you can use the model like this:
 
 ```python
 from sentence_transformers import SentenceTransformer
-
+phrases = ["information retrieval", "text mining", "natural language processing"]
 
 model = SentenceTransformer('{MODEL_NAME}')
-embeddings = model.encode(
+embeddings = model.encode(phrases)
 print(embeddings)
 ```
 
@@ -63,14 +62,14 @@ def mean_pooling(model_output, attention_mask):
 
 
 # Sentences we want sentence embeddings for
-
+phrases = ["information retrieval", "text mining", "natural language processing"]
 
 # Load model from HuggingFace Hub
 tokenizer = AutoTokenizer.from_pretrained('{MODEL_NAME}')
 model = AutoModel.from_pretrained('{MODEL_NAME}')
 
 # Tokenize sentences
-encoded_input = tokenizer(
+encoded_input = tokenizer(phrases, padding=True, truncation=True, return_tensors='pt')
 
 # Compute token embeddings
 with torch.no_grad():
@@ -86,6 +85,17 @@ print(sentence_embeddings)
 ## Training
 The model was trained with the parameters:
 
+**Datasets**:
+| Dataset Name | Number of Phrases |
+|----------------------------------------------------------------|-------------------|
+| [KP20k](https://www.aclweb.org/anthology/P17-1054/)            | 715369            |
+| [KPTimes](https://www.aclweb.org/anthology/W19-8617/)          | 113456            |
+| [StackEx](https://www.aclweb.org/anthology/2020.acl-main.710/) | 8149              |
+| [OpenKP](https://www.aclweb.org/anthology/D19-1521/)           | 200335            |
+| **Total**                                                      | **1030309**       |
+
+The model was trained with the parameters:
+
 **DataLoader**:
 
 `torch.utils.data.dataloader.DataLoader` of length 2025 with parameters:
@@ -118,7 +128,6 @@ Parameters of the fit()-Method:
 }
 ```
 
-
 ## Full Model Architecture
 ```
 SentenceTransformer(
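The mean-pooling step used in the transformers-based snippet above can be exercised standalone on dummy tensors, with no model download needed. This sketch mirrors the `mean_pooling` helper referenced in the hunk headers and assumes only PyTorch; the tensor shapes and values are illustrative:

```python
import torch

def mean_pooling(model_output, attention_mask):
    # First element of model_output contains the per-token embeddings
    token_embeddings = model_output[0]
    # Expand the mask to the embedding dimension so padding tokens contribute zero
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum over real tokens and divide by the count of real tokens (clamped to avoid /0)
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9)

# Dummy batch: 2 phrases, 3 tokens each, 4-dimensional embeddings
token_embeddings = torch.tensor(
    [[[1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0], [100.0, 100.0, 100.0, 100.0]],
     [[2.0, 2.0, 2.0, 2.0], [4.0, 4.0, 4.0, 4.0], [6.0, 6.0, 6.0, 6.0]]])
attention_mask = torch.tensor([[1, 1, 0],   # third token of phrase 0 is padding
                               [1, 1, 1]])

pooled = mean_pooling((token_embeddings,), attention_mask)
print(pooled)  # row 0 averages only the two unmasked tokens -> 2.0; row 1 -> 4.0
```

Note how the padding token (the row of 100s) is excluded from phrase 0's average; this is exactly why the README pairs `padding=True` in the tokenizer call with mask-aware pooling rather than a plain `.mean(dim=1)`.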