---
license: mit
---
# UCTopic

This repository contains the code of the model UCTopic and an easy-to-use tool UCTopicTool for **Topic Mining**, **Unsupervised Aspect Extraction**, and **Phrase Retrieval**.

Our ACL 2022 paper: [UCTopic: Unsupervised Contrastive Learning for Phrase Representations and Topic Mining](https://arxiv.org/abs/2202.13469).

# Quick Links

- [Overview](#overview)
- [Pretrained Model](#pretrained-model)
- [Getting Started](#getting-started)
  - [UCTopic Model](#uctopic-model)
  - [UCTopicTool](#uctopictool)
- [Experiments in Paper](#experiments)
  - [Requirements](#requirements)
  - [Datasets](#datasets)
  - [Entity Clustering](#entity-clustering)
  - [Topic Mining](#topic-mining)
  - [Pretraining](#pretraining)
- [Contact](#contact)
- [Citation](#citation)

# Overview

We propose UCTopic, a novel unsupervised contrastive learning framework for context-aware phrase representations and topic mining. UCTopic is pretrained at a large scale to determine whether the contexts of two phrase mentions have the same semantics. The key to pretraining is positive-pair construction based on our phrase-oriented assumptions. However, we find that traditional in-batch negatives cause performance decay when finetuning on datasets with a small number of topics. Hence, we propose cluster-assisted contrastive learning (CCL), which largely reduces noisy negatives by selecting negatives from clusters and thereby further improves phrase representations for topics.

# Pretrained Model

Our released model:

| Model | Note |
|:------|:-----|
|[uctopic-base](https://drive.google.com/file/d/1XQzi4E9ctdI373CK5O-pXQyBvOONssp1/view?usp=sharing)| Pretrained UCTopic model based on [LUKE-BASE](https://arxiv.org/abs/2010.01057) |

Unzip it to get the `uctopic-base` folder.

# Getting Started

We provide an easy-to-use phrase representation tool based on our UCTopic model. To use the tool, first install the `uctopic` package from PyPI:
```bash
pip install uctopic
```
Or install it directly from our code:
```bash
python setup.py install
```

## UCTopic Model

After installing the package, you can load our model with just two lines of code:
```python
from uctopic import UCTopic
model = UCTopic.from_pretrained('JiachengLi/uctopic-base')
```
The model will automatically download pretrained parameters from [HuggingFace's models](https://huggingface.co/models). If you encounter any problem when loading the model through HuggingFace's API, you can also download it manually from the table above and use `model = UCTopic.from_pretrained({PATH TO THE DOWNLOADED MODEL})`.

To get pretrained **phrase representations**, the model inputs are the same as for [LUKE](https://huggingface.co/docs/transformers/model_doc/luke). Note: please input only **ONE** span at a time; otherwise, performance will degrade according to our empirical results.

```python
from uctopic import UCTopicTokenizer, UCTopic

tokenizer = UCTopicTokenizer.from_pretrained('JiachengLi/uctopic-base')
model = UCTopic.from_pretrained('JiachengLi/uctopic-base')

text = "Beyoncé lives in Los Angeles."
entity_spans = [(17, 28)]  # character-based entity span corresponding to "Los Angeles"

inputs = tokenizer(text, entity_spans=entity_spans, add_prefix_space=True, return_tensors="pt")
outputs, phrase_repr = model(**inputs)
```
`phrase_repr` is the phrase embedding (size `[768]`) of the phrase `Los Angeles`. `outputs` has the same format as the outputs of `LUKE`.
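
For example, you can encode a second phrase the same way and compare the two embeddings with cosine similarity. A minimal sketch using the variables defined above (the second sentence and span are illustrative):

```python
import torch

text2 = "San Francisco is a city in California."
inputs2 = tokenizer(text2, entity_spans=[(0, 13)], add_prefix_space=True, return_tensors="pt")
_, phrase_repr2 = model(**inputs2)

# Both embeddings have size [768]; a higher cosine similarity means the
# two phrase contexts are semantically closer.
sim = torch.nn.functional.cosine_similarity(phrase_repr, phrase_repr2, dim=-1)
print(sim.item())
```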

## UCTopicTool

We provide a tool, `UCTopicTool`, built on `UCTopic` for efficient phrase encoding, topic mining (or unsupervised aspect extraction), and phrase retrieval.

### Initialization

`UCTopicTool` is initialized with a `model_name_or_path` and a `device`:
```python
from uctopic import UCTopicTool

tool = UCTopicTool('JiachengLi/uctopic-base', device='cuda:0')
```

### Phrase Encoding

Phrases are encoded in batches by the method `UCTopicTool.encode`, which is more efficient than calling `UCTopic` directly:
```python
phrases = [["This place is so much bigger than others!", (0, 10)],
           ["It was totally packed and loud.", (15, 21)],
           ["Service was on the slower side.", (0, 7)],
           ["I ordered 2 mojitos: 1 lime and 1 mango.", (12, 19)],
           ["The ingredient weren't really fresh.", (4, 14)]]

embeddings = tool.encode(phrases)  # len(embeddings) is equal to len(phrases)
```
**Note**: Each instance in `phrases` contains only one sentence and one span (character-level positions) in the format `[sentence, span]`.

Arguments for `UCTopicTool.encode` are as follows (a short usage sketch follows the list):
* **phrase** (List) - A list of `[sentence, span]` pairs to be encoded.
* **return_numpy** (bool, *optional*, defaults to `False`) - Return a `numpy.array` instead of a `torch.Tensor`.
* **normalize_to_unit** (bool, *optional*, defaults to `True`) - Normalize all embeddings to unit vectors.
* **keepdim** (bool, *optional*, defaults to `True`) - Keep the dimension size `[instance_number, hidden_size]`.
* **batch_size** (int, *optional*, defaults to `64`) - The size of mini-batches in the model.
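
For example, a minimal sketch (reusing `tool` and `phrases` from above) that returns NumPy embeddings:

```python
# Batch-encode phrases into a NumPy array of unit vectors.
embeddings = tool.encode(phrases, return_numpy=True, batch_size=32)
print(embeddings.shape)  # with keepdim=True: (len(phrases), 768)
```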

### Topic Mining and Unsupervised Aspect Extraction

The method `UCTopicTool.topic_mining` can mine topical phrases or conduct aspect extraction from sentences with or without spans.

```python
sentences = ["This place is so much bigger than others!",
             "It was totally packed and loud.",
             "Service was on the slower side.",
             "I ordered 2 mojitos: 1 lime and 1 mango.",
             "The ingredient weren't really fresh."]

spans = [[(0, 10)],                      # This place
         [(15, 21), (26, 30)],           # packed; loud
         [(0, 7)],                       # Service
         [(12, 19), (21, 27), (32, 39)], # mojitos; 1 lime; 1 mango
         [(4, 14)]]                      # ingredient
# len(sentences) is equal to len(spans)
output_data, topic_phrase_dict = tool.topic_mining(sentences, spans,
                                                   n_clusters=[15, 25])

# predict topics for new phrases
phrases = [["The food here is amazing!", (4, 8)],
           ["Lovely ambiance with live music!", (21, 31)]]

topics = tool.predict_topic(phrases)
```
**Note**: If `spans` is not given, `UCTopicTool` will extract noun phrases with [spaCy](https://spacy.io/).

Arguments for `UCTopicTool.topic_mining` are as follows.

Data arguments:
* **sentences** (List) - A list of sentences for topic mining.
* **spans** (List, *optional*, defaults to `None`) - A list of span lists corresponding to the sentences, e.g., `[[(0, 9), (5, 7)], [(1, 2)]]` with `len(sentences)==len(spans)`. If `None`, phrases are mined automatically from noun chunks.

Clustering arguments:
* **n_clusters** (int or List, *optional*, defaults to `2`) - The number of topics. When `n_clusters` is a list, `n_clusters[0]` and `n_clusters[1]` are the minimum and maximum numbers to search, and `n_clusters[2]` is the search step length (if not provided, it defaults to 1).
* **metric** (str, *optional*, defaults to `"cosine"`) - The metric used to measure the distance between vectors: `"cosine"` or `"euclidean"`.
* **batch_size** (int, *optional*, defaults to `64`) - The size of mini-batches for phrase encoding.
* **max_iter** (int, *optional*, defaults to `300`) - The maximum number of k-means iterations.

CCL-finetune arguments:
* **ccl_finetune** (bool, *optional*, defaults to `True`) - Whether to conduct the CCL-finetuning described in our paper.
* **batch_size_finetune** (int, *optional*, defaults to `8`) - The size of mini-batches for finetuning.
* **max_finetune_num** (int, *optional*, defaults to `100000`) - The maximum number of training instances for finetuning.
* **finetune_step** (int, *optional*, defaults to `2000`) - The number of training steps for finetuning.
* **contrastive_num** (int, *optional*, defaults to `5`) - The number of negatives in contrastive learning.
* **positive_ratio** (float, *optional*, defaults to `0.1`) - The ratio of the most confident instances used for finetuning.
* **n_sampling** (int, *optional*, defaults to `10000`) - The number of sampled examples for cluster-number confirmation and finetuning. Set to `-1` to use the whole dataset.
* **n_workers** (int, *optional*, defaults to `8`) - The number of workers for preprocessing data.

Returns of `UCTopicTool.topic_mining` are as follows (a short inspection sketch follows the list):
* **output_data** (List) - A list of sentences with their phrases and topic numbers. Each element is `[sentence, [[start1, end1, topic1], [start2, end2, topic2]]]`.
* **topic_phrase_dict** (Dict) - A dictionary mapping each topic to its phrases, sorted by confidence score, e.g., `{topic: [[phrase1, score1], [phrase2, score2]]}`.
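
A minimal sketch for inspecting these returns, reusing the variables from the example above (the field layout follows the documentation here):

```python
# Print each mined phrase with its predicted topic id.
for sentence, span_topics in output_data:
    for start, end, topic in span_topics:
        print(topic, sentence[start:end])

# Print the most confident phrase of each topic.
for topic, scored_phrases in topic_phrase_dict.items():
    phrase, score = scored_phrases[0]  # phrases are sorted by confidence
    print(topic, phrase, score)
```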

The method `UCTopicTool.predict_topic` predicts topic ids for new phrases based on your training results from `UCTopicTool.topic_mining`. Its inputs are the same as those of `UCTopicTool.encode`, and it returns a list of topic ids (int).


### Phrase Similarities and Retrieval

The method `UCTopicTool.similarity` computes the cosine similarities between two groups of phrases:

```python
phrases_a = [["This place is so much bigger than others!", (0, 10)],
             ["It was totally packed and loud.", (15, 21)]]

phrases_b = [["Service was on the slower side.", (0, 7)],
             ["I ordered 2 mojitos: 1 lime and 1 mango.", (12, 19)],
             ["The ingredient weren't really fresh.", (4, 14)]]

similarities = tool.similarity(phrases_a, phrases_b)
```
Arguments for `UCTopicTool.similarity` are as follows:
* **queries** (List) - A list of `[sentence, span]` pairs as queries.
* **keys** (List or `numpy.array`) - A list of `[sentence, span]` pairs as keys, or phrase representations (`numpy.array`) from `UCTopicTool.encode`.
* **batch_size** (int, *optional*, defaults to `64`) - The size of mini-batches in the model.

`UCTopicTool.similarity` returns a `numpy.array` containing the similarities between the phrase pairs in the two groups.
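
For the example above, the result should be a pairwise matrix, assuming one row per query and one column per key:

```python
print(similarities.shape)  # expected: (len(phrases_a), len(phrases_b)), i.e., (2, 3)
```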

The methods `UCTopicTool.build_index` and `UCTopicTool.search` are used for phrase retrieval:
```python
phrases = [["This place is so much bigger than others!", (0, 10)],
           ["It was totally packed and loud.", (15, 21)],
           ["Service was on the slower side.", (0, 7)],
           ["I ordered 2 mojitos: 1 lime and 1 mango.", (12, 19)],
           ["The ingredient weren't really fresh.", (4, 14)]]

# query multiple phrases
query1 = [["The food here is amazing!", (4, 8)],
          ["Lovely ambiance with live music!", (21, 31)]]

# query a single phrase
query2 = ["The food here is amazing!", (4, 8)]

tool.build_index(phrases)
results = tool.search(query1, top_k=3)
# or
results = tool.search(query2, top_k=3)
```
We also support [faiss](https://github.com/facebookresearch/faiss), an efficient similarity-search library. Just install the package following these [instructions](https://github.com/facebookresearch/faiss/blob/main/INSTALL.md), and `UCTopicTool` will automatically use `faiss` for efficient search.
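
For example, the CPU-only build can typically be installed from PyPI (the linked instructions also cover conda and GPU builds):

```bash
pip install faiss-cpu
```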

`UCTopicTool.search` returns the ranked top-k phrases for each query.


### Save and Load finetuned UCTopicTool

The methods `UCTopicTool.save` and `UCTopicTool.load` are used to save and load all parameters of `UCTopicTool`.

Save:
```python
tool = UCTopicTool('JiachengLi/uctopic-base', 'cuda:0')
# finetune UCTopic with CCL
output_data, topic_phrase_dict = tool.topic_mining(sentences, spans,
                                                   n_clusters=[15, 25])

tool.save({YOUR_DIRECTORY})
```

Load:
```python
tool = UCTopicTool('JiachengLi/uctopic-base', 'cuda:0')
tool.load({YOUR_DIRECTORY})
```
The loaded parameters will be used by all methods introduced above (encoding, topic mining, phrase similarities, and retrieval).

# Experiments

In this section, we describe how to reproduce the experiments in our paper.

## Requirements

First, install PyTorch by following the instructions on [the official website](https://pytorch.org). To faithfully reproduce our results, please use version `1.9.0` built for your platform/CUDA version.

Then run the following script to install the remaining dependencies:
```bash
pip install -r requirements.txt
```

Download the `en_core_web_sm` model from spaCy:
```bash
python -m spacy download en_core_web_sm
```

## Datasets

The downstream datasets used in our experiments can be downloaded from [here](https://drive.google.com/file/d/1dVIp9li1Wdh0JgU8slsWm0ObcitbQtSL/view?usp=sharing).

## Entity Clustering

The config file for entity clustering is `clustering/consts.py`, and most arguments are self-explanatory. Please set `--gpu` and `--data_path` before running. The clustering scores will be printed.

Clustering with our pretrained phrase embeddings:
```bash
python clustering.py --gpu 0
```
Clustering with our pretrained phrase embeddings and the Cluster-Assisted Contrastive Learning (CCL) proposed in our paper:
```bash
python clustering_ccl_finetune.py --gpu 0
```

## Topic Mining

The config file for topic mining is `topic_modeling/consts.py`.

**Key Argument Table**
| Arguments | Description |
|:-----------------|:-----------:|
| --num_classes |**Min** and **max** numbers of classes, e.g., `[5, 15]`. Our model finds the class number by [silhouette_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html); a simplified sketch follows the table.|
| --sample_num_cluster |Number of sampled phrases used to confirm the class number.|
| --sample_num_finetune|Number of sampled phrases for CCL finetuning.|
| --contrastive_num|Number of negative classes for CCL finetuning.|
| --finetune_step |CCL finetuning steps (maximum global steps for finetuning).|
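
The class-number search works in the spirit of the following simplified sketch (an illustration with scikit-learn, not the repository's exact code):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_num_classes(embeddings: np.ndarray, num_classes=(5, 15)) -> int:
    """Pick the class number in [min, max] that maximizes the silhouette score."""
    best_k, best_score = num_classes[0], -1.0
    for k in range(num_classes[0], num_classes[1] + 1):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels, metric='cosine')
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```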

**Tips**: Please tune `--batch_size` or `--contrastive_num` to fit your GPU memory.

Topic mining with our pretrained phrase embeddings and the Cluster-Assisted Contrastive Learning (CCL) proposed in our paper:
```bash
python find_topic.py --gpu 0
```
**Outputs**

We output three files under `topic_results` (a loading sketch follows the table):
| File Name | Description |
|:-----------------|:-----------:|
| `merged_phraes_pred_prob.pickle` |A dictionary of phrases with their topic number and prediction probability. The topic of a phrase is merged from all of its mentions: `{phrase: [topic_id, probability]}`, e.g., `{'fair prices': [0, 0.34889686]}`.|
| `phrase_instances_pred.json`| A list of all mined phrase mentions. Each element is `[[doc_id, start, end, phrase_mention], topic_id]`.|
| `topics_phrases.json`|A dictionary of topics and their corresponding phrases sorted by probability: `{'topic_id': [[phrase1, prob1], [phrase2, prob2]]}`.|
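
A minimal sketch for loading these files (paths assume the `topic_results` directory above):

```python
import json
import pickle

with open('topic_results/merged_phraes_pred_prob.pickle', 'rb') as f:
    phrase_topics = pickle.load(f)  # {phrase: [topic_id, probability]}

with open('topic_results/phrase_instances_pred.json') as f:
    instances = json.load(f)        # [[[doc_id, start, end, phrase_mention], topic_id], ...]

with open('topic_results/topics_phrases.json') as f:
    topic_phrases = json.load(f)    # {'topic_id': [[phrase, prob], ...]}
```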

## Pretraining

**Data**

For the unsupervised pretraining of UCTopic, we use articles and spans with links from English Wikipedia and Wikidata. Our processed dataset can be downloaded from [here](https://drive.google.com/file/d/1wflsmhPI9J0ZA6aVRl2mQjHIE6JIvzAv/view?usp=sharing).

**Training scripts**

We provide example training scripts and our default training parameters for the unsupervised training of UCTopic in `run_example.sh`.

```bash
bash run_example.sh
```

Argument descriptions can be found in `pretrain.py`. All other arguments are standard HuggingFace `transformers` training arguments.

**Convert models**

Our pretrained checkpoints are slightly different from the checkpoint `uctopic-base`. Please refer to `convert_uctopic_parameters.py` to convert them.

# Contact

If you have any questions related to the code or the paper, feel free to email Jiacheng (`j9li@eng.ucsd.edu`). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please try to describe the problem in detail so we can help you better and more quickly!

# Citation

Please cite our paper if you use UCTopic in your work:

```bibtex
@article{Li2022UCTopicUC,
  title={UCTopic: Unsupervised Contrastive Learning for Phrase Representations and Topic Mining},
  author={Jiacheng Li and Jingbo Shang and Julian McAuley},
  journal={ArXiv},
  year={2022},
  volume={abs/2202.13469}
}
```