zhichao yang committed on
Commit
1cb1b74
1 Parent(s): be4fdf4

Update README.md

  1. README.md +99 -64
README.md CHANGED
@@ -9,9 +9,9 @@ tags:
 
  # whaleloops/phrase-bert
 
- This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
 
- <!--- Describe your model here -->
 
  ## Usage (Sentence-Transformers)
 
@@ -25,69 +25,104 @@ Then you can use the model like this:
 
  ```python
  from sentence_transformers import SentenceTransformer
- sentences = ["This is an example sentence", "Each sentence is converted"]
 
  model = SentenceTransformer('whaleloops/phrase-bert')
- embeddings = model.encode(sentences)
- print(embeddings)
  ```
 
-
-
- ## Usage (HuggingFace Transformers)
- Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.
-
- ```python
- from transformers import AutoTokenizer, AutoModel
- import torch
-
-
- #Mean Pooling - Take attention mask into account for correct averaging
- def mean_pooling(model_output, attention_mask):
-     token_embeddings = model_output[0] #First element of model_output contains all token embeddings
-     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
-     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
-
-
- # Sentences we want sentence embeddings for
- sentences = ['This is an example sentence', 'Each sentence is converted']
-
- # Load model from HuggingFace Hub
- tokenizer = AutoTokenizer.from_pretrained('whaleloops/phrase-bert')
- model = AutoModel.from_pretrained('whaleloops/phrase-bert')
-
- # Tokenize sentences
- encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
-
- # Compute token embeddings
- with torch.no_grad():
-     model_output = model(**encoded_input)
-
- # Perform pooling. In this case, mean pooling.
- sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
-
- print("Sentence embeddings:")
- print(sentence_embeddings)
- ```
-
-
-
- ## Evaluation Results
-
- <!--- Describe how your model was evaluated -->
-
- For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=whaleloops/phrase-bert)
-
-
-
- ## Full Model Architecture
- ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 128, 'do_lower_case': None}) with Transformer model: BertModel
-   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
- )
- ```
-
- ## Citing & Authors
-
- <!--- Describe where people can find more information -->

 
  # whaleloops/phrase-bert
 
+ This is the official repository for the EMNLP 2021 long paper [Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration](https://arxiv.org/abs/2109.06304). We provide [code](https://github.com/sf-wa-326/phrase-bert-topic-model) for training and evaluating Phrase-BERT, in addition to the datasets used in the paper.
+
 
 
  ## Usage (Sentence-Transformers)
 

 
  ```python
  from sentence_transformers import SentenceTransformer
+ phrase_list = ['play an active role', 'participate actively', 'active lifestyle']
 
  model = SentenceTransformer('whaleloops/phrase-bert')
+ phrase_embs = model.encode(phrase_list)
+ p1, p2, p3 = phrase_embs
  ```
 
+ As in Sentence-BERT, the default output is a sequence of NumPy embeddings, one per phrase:
+ ````
+ for phrase, embedding in zip(phrase_list, phrase_embs):
+     print("Phrase:", phrase)
+     print("Embedding:", embedding)
+     print("")
+ ````
+
+ An example of computing the dot product of phrase embeddings:
+ ````
+ import numpy as np
+ print(f'The dot product between phrase 1 and 2 is: {np.dot(p1, p2)}')
+ print(f'The dot product between phrase 1 and 3 is: {np.dot(p1, p3)}')
+ print(f'The dot product between phrase 2 and 3 is: {np.dot(p2, p3)}')
+ ````
+
+ An example of computing cosine similarity of phrase embeddings:
+ ````
+ import torch
+ from torch import nn
+ cos_sim = nn.CosineSimilarity(dim=0)
+ print(f'The cosine similarity between phrase 1 and 2 is: {cos_sim(torch.tensor(p1), torch.tensor(p2))}')
+ print(f'The cosine similarity between phrase 1 and 3 is: {cos_sim(torch.tensor(p1), torch.tensor(p3))}')
+ print(f'The cosine similarity between phrase 2 and 3 is: {cos_sim(torch.tensor(p2), torch.tensor(p3))}')
+ ````
+
+ The output should look like:
+ ````
+ The dot product between phrase 1 and 2 is: 218.43600463867188
+ The dot product between phrase 1 and 3 is: 165.48483276367188
+ The dot product between phrase 2 and 3 is: 160.51708984375
+ The cosine similarity between phrase 1 and 2 is: 0.8142536282539368
+ The cosine similarity between phrase 1 and 3 is: 0.6130303144454956
+ The cosine similarity between phrase 2 and 3 is: 0.584893524646759
+ ````
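
The two scores in the README are related: cosine similarity is the dot product divided by the product of the two vector norms. The sketch below makes that explicit for all pairs at once, assuming only NumPy. The helper name `pairwise_cosine` and the toy vectors are illustrative assumptions and are not part of Phrase-BERT or sentence-transformers; with real embeddings, `phrase_embs` would take the place of `toy_embs`.

```python
import numpy as np


def pairwise_cosine(embs):
    """Return the matrix of cosine similarities between all rows of `embs`."""
    embs = np.asarray(embs, dtype=np.float64)
    # Dot product of every pair of rows.
    dots = embs @ embs.T
    # L2 norm of each row, clamped so an all-zero row does not divide by zero.
    norms = np.maximum(np.linalg.norm(embs, axis=1), 1e-12)
    # Divide each dot product by the product of the two row norms.
    return dots / np.outer(norms, norms)


# Toy 3-dimensional stand-ins for phrase embeddings (hypothetical values).
toy_embs = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
])

sims = pairwise_cosine(toy_embs)
print(np.round(sims, 4))
```

Because a raw dot product grows with the embedding norms, it is hard to compare across models, whereas cosine similarity always lies in [-1, 1].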
+
+
+
+ ## Evaluation
+ Because no unified benchmark for phrase embeddings exists, we collect the following five phrase-semantics evaluation tasks, which are described further in our paper:
+
+ * Turney [[Download](https://storage.googleapis.com/phrase-bert/turney/data.txt)]
+ * BiRD [[Download](https://storage.googleapis.com/phrase-bert/bird/data.txt)]
+ * PPDB [[Download](https://storage.googleapis.com/phrase-bert/ppdb/examples.json)]
+ * PPDB-filtered [[Download](https://storage.googleapis.com/phrase-bert/ppdb_exact/examples.json)]
+ * PAWS-short [[Download Train-split](https://storage.googleapis.com/phrase-bert/paws_short/train_examples.json)] [[Download Dev-split](https://storage.googleapis.com/phrase-bert/paws_short/dev_examples.json)] [[Download Test-split](https://storage.googleapis.com/phrase-bert/paws_short/test_examples.json)]
+
+
+ Update `config/model_path.py` with the model path for your directory layout, then:
+ * For evaluation on Turney, run `python eval_turney.py`
+ * For evaluation on BiRD, run `python eval_bird.py`
+ * For evaluation on PPDB / PPDB-filtered / PAWS-short, run `eval_ppdb_paws.py` with:
+
+ ````
+ nohup python -u eval_ppdb_paws.py \
+     --full_run_mode \
+     --task <task-name> \
+     --data_dir <input-data-dir> \
+     --result_dir <result-storage-dir> \
+     >./output.txt 2>&1 &
+ ````
+
+ ## Train your own Phrase-BERT
+ To go beyond the pre-trained Phrase-BERT model, you can train your own Phrase-BERT on data from your domain of interest. Please refer to
+ `phrase-bert/phrase_bert_finetune.py`.
+
+ The datasets we used to fine-tune Phrase-BERT are here: [training data CSV file](https://storage.googleapis.com/phrase-bert/phrase-bert-ft-data/pooled_context_para_triples_p%3D0.8_train.csv) and [validation data CSV file](https://storage.googleapis.com/phrase-bert/phrase-bert-ft-data/pooled_context_para_triples_p%3D0.8_valid.csv).
+
+ To reproduce the trained Phrase-BERT, please run:
+
+     export INPUT_DATA_PATH=<directory-of-phrasebert-finetuning-data>
+     export TRAIN_DATA_FILE=<training-data-filename.csv>
+     export VALID_DATA_FILE=<validation-data-filename.csv>
+     export INPUT_MODEL_PATH=bert-base-nli-stsb-mean-tokens
+     export OUTPUT_MODEL_PATH=<directory-of-saved-model>
+
+
+     python -u phrase_bert_finetune.py \
+         --input_data_path $INPUT_DATA_PATH \
+         --train_data_file $TRAIN_DATA_FILE \
+         --valid_data_file $VALID_DATA_FILE \
+         --input_model_path $INPUT_MODEL_PATH \
+         --output_model_path $OUTPUT_MODEL_PATH
+
+ ## Citation
+ Please cite us if you find this useful:
+ ````
+ @inproceedings{phrasebertwang2021,
+     author={Shufan Wang and Laure Thompson and Mohit Iyyer},
+     booktitle={Empirical Methods in Natural Language Processing},
+     year={2021},
+     title={Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration}
+ }
+ ````