@@ -9,9 +9,9 @@ tags:
 # whaleloops/phrase-bert
-This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
-<!--- Describe your model here -->
 ## Usage (Sentence-Transformers)
@@ -25,69 +25,104 @@ Then you can use the model like this:
 ```python
 from sentence_transformers import SentenceTransformer
-sentences = ["This is an example sentence", "Each sentence is converted"]
 model = SentenceTransformer('whaleloops/phrase-bert')
-embeddings = model.encode(sentences)
-print(embeddings)
 ```
-## Usage (HuggingFace Transformers)
-Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.
-```python
-from transformers import AutoTokenizer, AutoModel
-import torch
-#Mean Pooling - Take attention mask into account for correct averaging
-def mean_pooling(model_output, attention_mask):
-    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
-    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
-    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
-# Sentences we want sentence embeddings for
-sentences = ['This is an example sentence', 'Each sentence is converted']
-# Load model from HuggingFace Hub
-tokenizer = AutoTokenizer.from_pretrained('whaleloops/phrase-bert')
-model = AutoModel.from_pretrained('whaleloops/phrase-bert')
-# Tokenize sentences
-encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
-# Compute token embeddings
-with torch.no_grad():
-    model_output = model(**encoded_input)
-# Perform pooling. In this case, mean pooling.
-sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
-print("Sentence embeddings:")
-print(sentence_embeddings)
-```
-## Evaluation Results
-<!--- Describe how your model was evaluated -->
-For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=whaleloops/phrase-bert)
-## Full Model Architecture
-```
-SentenceTransformer(
-  (0): Transformer({'max_seq_length': 128, 'do_lower_case': None}) with Transformer model: BertModel
-  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
-)
-```
-## Citing & Authors
-<!--- Describe where people can find more information -->

 # whaleloops/phrase-bert
+This is the official repository for the EMNLP 2021 long paper [Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration](https://arxiv.org/abs/2109.06304). We provide [code](https://github.com/sf-wa-326/phrase-bert-topic-model) for training and evaluating Phrase-BERT in addition to the datasets used in the paper.
 ## Usage (Sentence-Transformers)
 ```python
 from sentence_transformers import SentenceTransformer
+phrase_list = ['play an active role', 'participate actively', 'active lifestyle']
 model = SentenceTransformer('whaleloops/phrase-bert')
+phrase_embs = model.encode(phrase_list)  # one embedding per phrase
+p1, p2, p3 = phrase_embs
 ```
+As in Sentence-BERT, `model.encode` returns NumPy arrays by default, one embedding per input phrase:
+````
+for phrase, embedding in zip(phrase_list, phrase_embs):
+    print("Phrase:", phrase)
+    print("Embedding:", embedding)
+    print("")
+````
+An example of computing the dot product of phrase embeddings:
+````
+import numpy as np
+print(f'The dot product between phrase 1 and 2 is: {np.dot(p1, p2)}')
+print(f'The dot product between phrase 1 and 3 is: {np.dot(p1, p3)}')
+print(f'The dot product between phrase 2 and 3 is: {np.dot(p2, p3)}')
+````
+An example of computing cosine similarity of phrase embeddings:
+````
+import torch
+from torch import nn
+cos_sim = nn.CosineSimilarity(dim=0)
+print(f'The cosine similarity between phrase 1 and 2 is: {cos_sim( torch.tensor(p1), torch.tensor(p2))}')
+print(f'The cosine similarity between phrase 1 and 3 is: {cos_sim( torch.tensor(p1), torch.tensor(p3))}')
+print(f'The cosine similarity between phrase 2 and 3 is: {cos_sim( torch.tensor(p2), torch.tensor(p3))}')
+````
+The output should look like:
+````
+The dot product between phrase 1 and 2 is: 218.43600463867188
+The dot product between phrase 1 and 3 is: 165.48483276367188
+The dot product between phrase 2 and 3 is: 160.51708984375
+The cosine similarity between phrase 1 and 2 is: 0.8142536282539368
+The cosine similarity between phrase 1 and 3 is: 0.6130303144454956
+The cosine similarity between phrase 2 and 3 is: 0.584893524646759
+````
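
The two measures above are related: cosine similarity is the dot product divided by the product of the two vectors' L2 norms, so it reflects direction only, while the raw dot product also grows with vector length. A minimal NumPy sketch of that relationship follows. It uses small made-up vectors in place of real Phrase-BERT embeddings, and the `cosine_similarity` helper is illustrative, not part of the repository.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their L2 norms."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for p1, p2, p3 (hypothetical values, not model output).
p1 = np.array([3.0, 4.0, 0.0])
p2 = np.array([6.0, 8.0, 0.0])  # same direction as p1, twice the length
p3 = np.array([0.0, 0.0, 5.0])  # orthogonal to p1

print("dot(p1, p2):", np.dot(p1, p2))             # sensitive to vector length
print("cos(p1, p2):", cosine_similarity(p1, p2))  # depends only on direction
print("cos(p1, p3):", cosine_similarity(p1, p3))
```

With real embeddings, the same helper can be applied directly to `p1`, `p2`, and `p3` from `model.encode`, which avoids the conversion to `torch.tensor`.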
+## Evaluation
+Given the lack of a unified phrase embedding evaluation benchmark, we collect the following five phrase semantics evaluation tasks, which are described further in our paper:
+* Turney [[Download](https://storage.googleapis.com/phrase-bert/turney/data.txt)]
+* BiRD [[Download](https://storage.googleapis.com/phrase-bert/bird/data.txt)]
+* PPDB [[Download](https://storage.googleapis.com/phrase-bert/ppdb/examples.json)]
+* PPDB-filtered [[Download](https://storage.googleapis.com/phrase-bert/ppdb_exact/examples.json)]
+* PAWS-short [[Download Train-split](https://storage.googleapis.com/phrase-bert/paws_short/train_examples.json)] [[Download Dev-split](https://storage.googleapis.com/phrase-bert/paws_short/dev_examples.json)] [[Download Test-split](https://storage.googleapis.com/phrase-bert/paws_short/test_examples.json)]
+Update the model path in `config/model_path.py` to match your directories, then:
+* For evaluation on Turney, run `python eval_turney.py`
+* For evaluation on BiRD, run `python eval_bird.py`
+* For evaluation on PPDB / PPDB-filtered / PAWS-short, run `eval_ppdb_paws.py` with:
+    ````
+    nohup python -u eval_ppdb_paws.py \
+        --full_run_mode \
+        --task <task-name> \
+        --data_dir <input-data-dir> \
+        --result_dir <result-storage-dir> \
+        >./output.txt 2>&1 &
+    ````
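
When the same evaluation has to be launched for several tasks, it can help to assemble the command programmatically. The sketch below builds the argument list using only the flags shown in the command above. The `build_eval_command` helper is hypothetical, and the placeholder values must be replaced with a task name and directories that the script actually accepts.

```python
import shlex

def build_eval_command(task, data_dir, result_dir):
    """Return the argument list for one eval_ppdb_paws.py run."""
    return [
        "python", "-u", "eval_ppdb_paws.py",
        "--full_run_mode",
        "--task", task,
        "--data_dir", data_dir,
        "--result_dir", result_dir,
    ]

# Placeholder values copied from the command template; substitute real ones.
cmd = build_eval_command("<task-name>", "<input-data-dir>", "<result-storage-dir>")

# Print a shell-quoted version of the command for inspection.
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would then run the evaluation in the foreground instead of through `nohup`.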
+## Train your own Phrase-BERT
+If you would like to go beyond the pre-trained Phrase-BERT model, you can train your own Phrase-BERT on data from your domain of interest; see `phrase-bert/phrase_bert_finetune.py`.
+The datasets we used to fine-tune Phrase-BERT are here: [training data csv file](https://storage.googleapis.com/phrase-bert/phrase-bert-ft-data/pooled_context_para_triples_p%3D0.8_train.csv) and [validation data csv file](https://storage.googleapis.com/phrase-bert/phrase-bert-ft-data/pooled_context_para_triples_p%3D0.8_valid.csv).
+To reproduce the trained Phrase-BERT, run:
+    export INPUT_DATA_PATH=<directory-of-phrasebert-finetuning-data>
+    export TRAIN_DATA_FILE=<training-data-filename.csv>
+    export VALID_DATA_FILE=<validation-data-filename.csv>
+    export INPUT_MODEL_PATH=bert-base-nli-stsb-mean-tokens
+    export OUTPUT_MODEL_PATH=<directory-of-saved-model>
+    python -u phrase_bert_finetune.py \
+        --input_data_path $INPUT_DATA_PATH \
+        --train_data_file $TRAIN_DATA_FILE \
+        --valid_data_file $VALID_DATA_FILE \
+        --input_model_path $INPUT_MODEL_PATH \
+        --output_model_path $OUTPUT_MODEL_PATH
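
Before starting a long fine-tuning run, it may be worth confirming that the exported paths point at real files. The following is a small pre-flight sketch under that assumption: it reads the same environment variables as the commands above, and the `missing_training_files` helper is hypothetical rather than part of the repository.

```python
import os

def missing_training_files(data_dir, filenames):
    """Return the filenames that do not exist inside data_dir."""
    return [name for name in filenames
            if not os.path.isfile(os.path.join(data_dir, name))]

# Read the variables exported for phrase_bert_finetune.py (empty if unset).
data_dir = os.environ.get("INPUT_DATA_PATH", "")
files = [
    os.environ.get("TRAIN_DATA_FILE", ""),
    os.environ.get("VALID_DATA_FILE", ""),
]

missing = missing_training_files(data_dir, files)
if missing:
    print("Missing input files:", missing)
else:
    print("All input files found in", data_dir)
```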
+## Citation
+Please cite us if you find this useful:
+````
+@inproceedings{phrasebertwang2021,
+    author = {Shufan Wang and Laure Thompson and Mohit Iyyer},
+    booktitle = {Empirical Methods in Natural Language Processing},
+    year = {2021},
+    title = {Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration}
+}
+````