---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
---

# whaleloops/phrase-bert

This is the official repository for the EMNLP 2021 long paper [Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration](https://arxiv.org/abs/2109.06304). We provide [code](https://github.com/sf-wa-326/phrase-bert-topic-model) for training and evaluating Phrase-BERT, along with the datasets used in the paper.

## Usage (Sentence-Transformers)

Using this model is easy once you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

The model was tested with pytorch==1.9.0, transformers==4.8.1, and sentence-transformers==2.1.0.

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

phrase_list = ['play an active role', 'participate actively', 'active lifestyle']

model = SentenceTransformer('whaleloops/phrase-bert')
phrase_embs = model.encode(phrase_list)
[p1, p2, p3] = phrase_embs
```

As in Sentence-BERT, the default output is a list of numpy arrays:

```python
for phrase, embedding in zip(phrase_list, phrase_embs):
    print("Phrase:", phrase)
    print("Embedding:", embedding)
    print("")
```

An example of computing the dot product of phrase embeddings:

```python
import numpy as np

print(f'The dot product between phrase 1 and 2 is: {np.dot(p1, p2)}')
print(f'The dot product between phrase 1 and 3 is: {np.dot(p1, p3)}')
print(f'The dot product between phrase 2 and 3 is: {np.dot(p2, p3)}')
```

An example of computing the cosine similarity of phrase embeddings:

```python
import torch
from torch import nn

cos_sim = nn.CosineSimilarity(dim=0)
print(f'The cosine similarity between phrase 1 and 2 is: {cos_sim(torch.tensor(p1), torch.tensor(p2))}')
print(f'The cosine similarity between phrase 1 and 3 is: {cos_sim(torch.tensor(p1), torch.tensor(p3))}')
print(f'The cosine similarity between phrase 2 and 3 is: {cos_sim(torch.tensor(p2), torch.tensor(p3))}')
```

The output should look like:

```
The dot product between phrase 1 and 2 is: 218.43600463867188
The dot product between phrase 1 and 3 is: 165.48483276367188
The dot product between phrase 2 and 3 is: 160.51708984375
The cosine similarity between phrase 1 and 2 is: 0.8142536282539368
The cosine similarity between phrase 1 and 3 is: 0.6130303144454956
The cosine similarity between phrase 2 and 3 is: 0.584893524646759
```

## Evaluation

Given the lack of a unified phrase embedding evaluation benchmark, we collect the following five phrase semantics evaluation tasks, which are described further in our paper:

* Turney [[Download](https://storage.googleapis.com/phrase-bert/turney/data.txt)]
* BiRD [[Download](https://storage.googleapis.com/phrase-bert/bird/data.txt)]
* PPDB [[Download](https://storage.googleapis.com/phrase-bert/ppdb/examples.json)]
* PPDB-filtered [[Download](https://storage.googleapis.com/phrase-bert/ppdb_exact/examples.json)]
* PAWS-short [[Download Train-split](https://storage.googleapis.com/phrase-bert/paws_short/train_examples.json)] [[Download Dev-split](https://storage.googleapis.com/phrase-bert/paws_short/dev_examples.json)] [[Download Test-split](https://storage.googleapis.com/phrase-bert/paws_short/test_examples.json)]

Update `config/model_path.py` with the model path for your directories, then:

* For evaluation on Turney, run `python eval_turney.py`
* For evaluation on BiRD, run `python eval_bird.py`
* For evaluation on PPDB / PPDB-filtered / PAWS-short, run `eval_ppdb_paws.py` with:

```bash
nohup python -u eval_ppdb_paws.py \
    --full_run_mode \
    --task \
    --data_dir \
    --result_dir \
    >./output.txt 2>&1 &
```

## Train your own Phrase-BERT

If you would like to go beyond the pre-trained Phrase-BERT model, you can train your own Phrase-BERT on data from the domain you are interested in. Please refer to `phrase-bert/phrase_bert_finetune.py`.

The datasets we used to fine-tune Phrase-BERT are here: [training data csv file](https://storage.googleapis.com/phrase-bert/phrase-bert-ft-data/pooled_context_para_triples_p%3D0.8_train.csv) and [validation data csv file](https://storage.googleapis.com/phrase-bert/phrase-bert-ft-data/pooled_context_para_triples_p%3D0.8_valid.csv).

To reproduce the trained Phrase-BERT, please run:

```bash
export INPUT_DATA_PATH=
export TRAIN_DATA_FILE=
export VALID_DATA_FILE=
export INPUT_MODEL_PATH=bert-base-nli-stsb-mean-tokens
export OUTPUT_MODEL_PATH=

python -u phrase_bert_finetune.py \
    --input_data_path $INPUT_DATA_PATH \
    --train_data_file $TRAIN_DATA_FILE \
    --valid_data_file $VALID_DATA_FILE \
    --input_model_path $INPUT_MODEL_PATH \
    --output_model_path $OUTPUT_MODEL_PATH
```
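If you just want to see the shape of the training loop without the official script, here is a minimal sketch of triplet-style fine-tuning in sentence-transformers. It is an illustration under assumptions, not the paper's exact recipe: the file names suggest each CSV row holds an (anchor, positive, negative) phrase triple, and the `TripletLoss` choice, column order, and hyperparameters below are placeholders.

```python
import csv

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Same base model the official script starts from.
model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')

# Assumption: the first three columns of each row are anchor, positive, negative,
# with no header row.
train_examples = []
with open('pooled_context_para_triples_p=0.8_train.csv', newline='') as f:
    for row in csv.reader(f):
        train_examples.append(InputExample(texts=[row[0], row[1], row[2]]))

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.TripletLoss(model=model)  # illustrative loss; the paper's objective may differ

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path='./phrase-bert-finetuned',  # hypothetical output directory
)
```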
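Whichever route you take, the resulting output directory loads exactly like the released checkpoint, e.g. `SentenceTransformer('./phrase-bert-finetuned')`, so all of the usage snippets above work unchanged on your fine-tuned model.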
## Citation

Please cite us if you find this useful:

```
@inproceedings{phrasebertwang2021,
    author = {Shufan Wang and Laure Thompson and Mohit Iyyer},
    booktitle = {Empirical Methods in Natural Language Processing},
    year = {2021},
    title = {Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration}
}
```