---
license: mit
---
# UCTopic

This repository contains the code of the UCTopic model and an easy-to-use tool, UCTopicTool, for <strong>Topic Mining</strong>, <strong>Unsupervised Aspect Extraction</strong> or <strong>Phrase Retrieval</strong>.

For details, see our ACL 2022 paper: [UCTopic: Unsupervised Contrastive Learning for Phrase Representations and Topic Mining](https://arxiv.org/abs/2202.13469).

# Quick Links

  - [Overview](#overview)
  - [Pretrained Model](#pretrained-model)
  - [Getting Started](#getting-started)
    - [UCTopic Model](#uctopic-model)
    - [UCTopicTool](#uctopictool)
  - [Experiments in Paper](#experiments)
    - [Requirements](#requirements)
    - [Datasets](#datasets)
    - [Entity Clustering](#entity-clustering)
    - [Topic Mining](#topic-mining)
  - [Pretraining](#pretraining)
  - [Contact](#contact)
  - [Citation](#citation)

# Overview

We propose UCTopic, a novel unsupervised contrastive learning framework for context-aware phrase representations and topic mining. UCTopic is pretrained at a large scale to distinguish whether the contexts of two phrase mentions have the same semantics. The key to pretraining is positive-pair construction based on our phrase-oriented assumptions. However, we find that traditional in-batch negatives cause performance decay when finetuning on a dataset with a small number of topics. Hence, we propose cluster-assisted contrastive learning (CCL), which largely reduces noisy negatives by selecting negatives from clusters and accordingly further improves phrase representations for topics.

# Pretrained Model
Our released model:
|              Model              | Note|
|:-------------------------------|------|
|[uctopic-base](https://drive.google.com/file/d/1XQzi4E9ctdI373CK5O-pXQyBvOONssp1/view?usp=sharing)| Pretrained UCTopic model based on [LUKE-BASE](https://arxiv.org/abs/2010.01057)|

Unzip the downloaded file to get the `uctopic-base` folder.

# Getting Started
We provide an easy-to-use phrase representation tool based on our UCTopic model. To use the tool, first install the `uctopic` package from PyPI
```bash
pip install uctopic
```
Or install it directly from our code
```bash
python setup.py install
```

## UCTopic Model
After installing the package, you can load our model with just two lines of code
```python
from uctopic import UCTopic
model = UCTopic.from_pretrained('JiachengLi/uctopic-base')
```
The model will automatically download pre-trained parameters from [HuggingFace's models](https://huggingface.co/models). If you encounter any problems when loading the model through HuggingFace's API, you can also download the model manually from the table above and use `model = UCTopic.from_pretrained({PATH TO THE DOWNLOADED MODEL})`.

To get pre-trained <strong>phrase representations</strong>, our model takes the same inputs as [LUKE](https://huggingface.co/docs/transformers/model_doc/luke). Note: please input only <strong>ONE</strong> span at a time; otherwise, performance will degrade according to our empirical results.

```python
from uctopic import UCTopicTokenizer, UCTopic

tokenizer = UCTopicTokenizer.from_pretrained('JiachengLi/uctopic-base')
model = UCTopic.from_pretrained('JiachengLi/uctopic-base')

text = "Beyoncé lives in Los Angeles."
entity_spans = [(17, 28)] # character-based entity span corresponding to "Los Angeles"

inputs = tokenizer(text, entity_spans=entity_spans, add_prefix_space=True, return_tensors="pt")
outputs, phrase_repr = model(**inputs)
```
`phrase_repr` is the phrase embedding (size `[768]`) of the phrase `Los Angeles`. `outputs` has the same format as the outputs from `LUKE`.
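
Two phrase embeddings can be compared directly. Below is a minimal sketch that reuses `tokenizer` and `model` from the snippet above; the second sentence and its span are illustrative:

```python
import torch
import torch.nn.functional as F

def embed(text, span):
    # Encode one sentence with exactly ONE span, as recommended above.
    inputs = tokenizer(text, entity_spans=[span], add_prefix_space=True, return_tensors="pt")
    with torch.no_grad():
        _, phrase_repr = model(**inputs)
    return phrase_repr.squeeze()

emb_a = embed("Beyoncé lives in Los Angeles.", (17, 28))  # "Los Angeles"
emb_b = embed("San Diego is a coastal city.", (0, 9))     # "San Diego"

# Cosine similarity between the two 768-d phrase embeddings.
print(F.cosine_similarity(emb_a, emb_b, dim=0).item())
```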

## UCTopicTool
We provide a tool `UCTopicTool` built on `UCTopic` for efficient phrase encoding, topic mining (or unsupervised aspect extraction), and phrase retrieval.

### Initialization

`UCTopicTool` is initialized by giving the `model_name_or_path` and `device`.
```python
from uctopic import UCTopicTool

tool = UCTopicTool('JiachengLi/uctopic-base', device='cuda:0')
```

### Phrase Encoding

Phrases are encoded by our method `UCTopicTool.encode` in batches, which is more efficient than `UCTopic`.
```python
phrases = [["This place is so much bigger than others!", (0, 10)],
           ["It was totally packed and loud.", (15, 21)],
           ["Service was on the slower side.", (0, 7)],
           ["I ordered 2 mojitos: 1 lime and 1 mango.", (12, 19)],
           ["The ingredient weren't really fresh.", (4, 14)]]

embeddings = tool.encode(phrases) # len(embeddings) is equal to len(phrases)
```
**Note**: Each instance in `phrases` contains only one sentence and one span (a character-level position) in the format `[sentence, span]`.

Arguments for `UCTopicTool.encode` are as follows (a usage sketch follows the list),
* **phrases** (List) - A list of `[sentence, span]` pairs to be encoded.
* **return_numpy** (bool, *optional*, defaults to `False`) - Return `numpy.array` or `torch.Tensor`.
* **normalize_to_unit** (bool, *optional*, defaults to `True`) - Normalize all embeddings to unit vectors.
* **keepdim** (bool, *optional*, defaults to `True`) - Keep dimension size `[instance_number, hidden_size]`.
* **batch_size** (int, *optional*, defaults to `64`) - The size of mini-batch in the model.
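
For example, the optional arguments can be combined as in the following sketch, which reuses `tool` and `phrases` from above; the expected output shape assumes the 768-dimensional embeddings described earlier:

```python
embeddings = tool.encode(
    phrases,
    return_numpy=True,        # numpy.array instead of torch.Tensor
    normalize_to_unit=True,   # unit vectors, so dot products are cosine similarities
    batch_size=32,
)
print(embeddings.shape)       # expected: (len(phrases), 768)
```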

### Topic Mining and Unsupervised Aspect Extraction

The method `UCTopicTool.topic_mining` can mine topical phrases or conduct aspect extraction from sentences with or without spans.

```python
sentences = ["This place is so much bigger than others!",
             "It was totally packed and loud.",
             "Service was on the slower side.",
             "I ordered 2 mojitos: 1 lime and 1 mango.",
             "The ingredient weren't really fresh."]

spans = [[(0, 10)],                       # This place
         [(15, 21), (26, 30)],            # packed; loud
         [(0, 7)],                        # Service
         [(12, 19), (21, 27), (32, 39)],  # mojitos; 1 lime; 1 mango
         [(4, 14)]]                       # ingredient
# len(sentences) is equal to len(spans)
output_data, topic_phrase_dict = tool.topic_mining(sentences, spans,
                                                   n_clusters=[15, 25])

# predict topic for new phrases
phrases = [["The food here is amazing!", (4, 8)],
           ["Lovely ambiance with live music!", (21, 31)]]

topics = tool.predict_topic(phrases)
```
**Note**: If `spans` is not given, `UCTopicTool` will extract noun phrases with [spaCy](https://spacy.io/).
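
For reference, character-level noun-chunk spans like the ones above can be produced with spaCy as in this minimal sketch (the tool's internal extraction may differ in detail):

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

sentences = ["This place is so much bigger than others!",
             "It was totally packed and loud."]

# Character-level (start, end) offsets of each noun chunk per sentence.
spans = [[(chunk.start_char, chunk.end_char) for chunk in nlp(sentence).noun_chunks]
         for sentence in sentences]
print(spans)
```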

Arguments for `UCTopicTool.topic_mining` are as follows,

Data arguments:
* **sentences** (List) - A List of sentences for topic mining.
* **spans** (List, *optional*, defaults to `None`) - A list of span lists corresponding to the sentences, e.g., `[[(0, 9), (5, 7)], [(1, 2)]]` with `len(sentences)==len(spans)`. If `None`, phrases are automatically mined from noun chunks.

Clustering arguments:
* **n_clusters** (int or List, *optional*, defaults to `2`) - The number of topics. When `n_clusters` is a list, `n_clusters[0]` and `n_clusters[1]` are the minimum and maximum numbers to search, and `n_clusters[2]` is the search step length (if not provided, it defaults to 1).
* **metric** (str, *optional*, defaults to `"cosine"`) - The metric used to measure the distance between vectors, `"cosine"` or `"euclidean"`.
* **batch_size** (int, *optional*, defaults to `64`) - The size of mini-batch for phrase encoding.
* **max_iter** (int, *optional*, defaults to `300`) - The maximum number of k-means iterations.
        
CCL-finetune arguments:
* **ccl_finetune** (bool, *optional*, defaults to `True`) - Whether to conduct the CCL-finetuning described in the paper.
* **batch_size_finetune** (int, *optional*, defaults to `8`) - The size of mini-batch for finetuning.
* **max_finetune_num** (int, *optional*, defaults to `100000`) - The maximum number of training instances for finetuning.
* **finetune_step** (int, *optional*, defaults to `2000`) - The number of training steps for finetuning.
* **contrastive_num** (int, *optional*, defaults to `5`) - The number of negatives in contrastive learning.
* **positive_ratio** (float, *optional*, defaults to `0.1`) - The ratio of the most confident instances for finetuning.
* **n_sampling** (int, *optional*, defaults to `10000`) - The number of sampled examples for cluster number confirmation and finetuning. Set to `-1` to use the whole dataset.
* **n_workers** (int, *optional*, defaults to `8`) - The number of workers for preprocessing data.

Returns of `UCTopicTool.topic_mining` are as follows (a small example of consuming them follows the list),
* **output_data** (List) - A list of sentences and corresponding phrases and topic numbers. Each element is `[sentence, [[start1, end1, topic1], [start2, end2, topic2]]]`.
* **topic_phrase_dict** (Dict) - A dictionary of topics and the list of phrases under a topic. The phrases are sorted by their confidence scores. E.g., `{topic: [[phrase1, score1], [phrase2, score2]]}`.
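
As a quick, hypothetical illustration of consuming these return values:

```python
# Walk the output format described above: [sentence, [[start, end, topic], ...]].
for sentence, span_topics in output_data:
    for start, end, topic in span_topics:
        print(f"{sentence[start:end]!r} -> topic {topic}")

# Top five phrases of one topic, sorted by confidence score.
topic = next(iter(topic_phrase_dict))
for phrase, score in topic_phrase_dict[topic][:5]:
    print(topic, phrase, score)
```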


The method `UCTopicTool.predict_topic` predicts topic ids for new phrases based on your training results from `UCTopicTool.topic_mining`. Its inputs are the same as those of `UCTopicTool.encode`, and it returns a list of topic ids (int).


### Phrase Similarities and Retrieval

The method `UCTopicTool.similarity` computes the cosine similarities between two groups of phrases:

```python
phrases_a = [["This place is so much bigger than others!", (0, 10)],
             ["It was totally packed and loud.", (15, 21)]]

phrases_b = [["Service was on the slower side.", (0, 7)],
             ["I ordered 2 mojitos: 1 lime and 1 mango.", (12, 19)],
             ["The ingredient weren't really fresh.", (4, 14)]]

similarities = tool.similarity(phrases_a, phrases_b)
```
Arguments for `UCTopicTool.similarity` are as follows,
* **queries** (List) - A list of `[sentence, span]` as queries.
* **keys** (List or `numpy.array`) - A list of `[sentence, span]` as keys or phrase representations (`numpy.array`) from `UCTopicTool.encode`.
* **batch_size** (int, *optional*, defaults to `64`) - The size of mini-batch in the model.

`UCTopicTool.similarity` returns a `numpy.array` containing the similarities between the phrase pairs of the two groups.
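
A small sketch of using the result; the pairwise shape `(len(queries), len(keys))` is our reading of the description above rather than a documented guarantee:

```python
similarities = tool.similarity(phrases_a, phrases_b)
print(similarities.shape)            # assumed: (len(phrases_a), len(phrases_b)), here (2, 3)

# For each query phrase, the index of its most similar key phrase.
print(similarities.argmax(axis=1))
```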


The methods `UCTopicTool.build_index` and `UCTopicTool.search` are used for phrase retrieval:
```python
phrases = [["This place is so much bigger than others!", (0, 10)],
           ["It was totally packed and loud.", (15, 21)],
           ["Service was on the slower side.", (0, 7)],
           ["I ordered 2 mojitos: 1 lime and 1 mango.", (12, 19)],
           ["The ingredient weren't really fresh.", (4, 14)]]

# query with multiple phrases
query1 = [["The food here is amazing!", (4, 8)],
          ["Lovely ambiance with live music!", (21, 31)]]

# query with a single phrase
query2 = ["The food here is amazing!", (4, 8)]

tool.build_index(phrases)
results = tool.search(query1, top_k=3)
# or
results = tool.search(query2, top_k=3)
```
We also support [faiss](https://github.com/facebookresearch/faiss), an efficient similarity search library. Just install the package by following the [instructions](https://github.com/facebookresearch/faiss/blob/main/INSTALL.md), and `UCTopicTool` will automatically use `faiss` for efficient search.

`UCTopicTool.search` returns the ranked top-k phrases for each query.
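
The exact structure of each hit is not specified here, so a safe first step is to print the raw results; the sketch below assumes one ranked list per query:

```python
results = tool.search(query1, top_k=3)

for query, hits in zip(query1, results):
    print("query:", query[0])
    for hit in hits:  # inspect each hit's structure directly
        print("   ", hit)
```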


### Save and Load finetuned UCTopicTool

The methods `UCTopicTool.save` and `UCTopicTool.load` are used to save and load all parameters of `UCTopicTool`.

Save:
```python
tool = UCTopicTool('JiachengLi/uctopic-base', 'cuda:0')
# finetune UCTopic with CCL
output_data, topic_phrase_dict = tool.topic_mining(sentences, spans,
                                                   n_clusters=[15, 25])

tool.save({PATH TO YOUR SAVE DIRECTORY})
```

Load:
```python
tool = UCTopicTool('JiachengLi/uctopic-base', 'cuda:0')
tool.load({PATH TO YOUR SAVE DIRECTORY})
```
The loaded parameters will be used for all methods (for encoding, topic mining, phrase similarities and retrieval) introduced above.

# Experiments
In this section, we describe how to reproduce the experiments in our paper.

## Requirements
First, install PyTorch by following the instructions from [the official website](https://pytorch.org). To faithfully reproduce our results, please use version `1.9.0` built for your platform/CUDA version.

Then run the following script to install the remaining dependencies,
```bash
pip install -r requirements.txt
```

Download the `en_core_web_sm` model from spaCy,
```bash
python -m spacy download en_core_web_sm
```

## Datasets
The downstream datasets used in our experiments can be downloaded from [here](https://drive.google.com/file/d/1dVIp9li1Wdh0JgU8slsWm0ObcitbQtSL/view?usp=sharing).

## Entity Clustering
The config file for entity clustering is `clustering/consts.py`, and most arguments are self-explanatory. Please set `--gpu` and `--data_path` before running. The clustering scores will be printed.

Clustering with our pre-trained phrase embeddings.
```bash
python clustering.py --gpu 0
```
Clustering with our pre-trained phrase embeddings and the Cluster-Assisted Contrastive Learning (CCL) proposed in our paper.
```bash
python clustering_ccl_finetune.py --gpu 0
```

## Topic Mining
The config file for topic mining is `topic_modeling/consts.py`.

**Key Argument Table**
|    Arguments     | Description |
|:-----------------|:-----------:|
| --num_classes     |**Min** and **max** number of classes, e.g., `[5, 15]`. Our model finds the class number by [silhouette_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html) (see the sketch after this table).|
| --sample_num_cluster   |Number of sampled phrases to confirm class number.|
| --sample_num_finetune|Number of sampled phrases for CCL finetuning.|
| --contrastive_num|Number of negative classes for CCL finetuning.|
| --finetune_step | CCL finetuning steps (maximum global steps for finetuning).|
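
For intuition, selecting the class number with `silhouette_score` works roughly like the sketch below; this is a simplified illustration, not the repository's exact code:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def select_num_classes(embeddings: np.ndarray, min_k: int, max_k: int) -> int:
    """Pick the class number in [min_k, max_k] with the best silhouette score."""
    best_k, best_score = min_k, -1.0
    for k in range(min_k, max_k + 1):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels, metric="cosine")
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```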

**Tips**: Please tune `--batch_size` or `--contrastive_num` to fit your GPU memory.

Topic mining with our pre-trained phrase embeddings and the Cluster-Assisted Contrastive Learning (CCL) proposed in our paper.
```bash
python find_topic.py --gpu 0
```
**Outputs**

We output three files under `topic_results` (a loading sketch follows the table):
|    File Name     | Description |
|:-----------------|:-----------:|
| `merged_phraes_pred_prob.pickle` |A dictionary of phrases with their topic number and prediction probability. The topic of a phrase is merged from all of its mentions. `{phrase: [topic_id, probability]}`, e.g., `{'fair prices': [0, 0.34889686]}`|
| `phrase_instances_pred.json`| A list of all mined phrase mentions. Each element is `[[doc_id, start, end, phrase_mention], topic_id]`.|
| `topics_phrases.json`|A dictionary of topics and corresponding phrases sorted by probability. `{'topic_id': [[phrase1, prob1], [phrase2, prob2]]}`|
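
These outputs can be loaded as in the following minimal sketch, using the file names exactly as listed above:

```python
import json
import pickle

with open("topic_results/merged_phraes_pred_prob.pickle", "rb") as f:
    phrase_topic = pickle.load(f)      # {phrase: [topic_id, probability]}

with open("topic_results/phrase_instances_pred.json") as f:
    phrase_instances = json.load(f)    # [[[doc_id, start, end, mention], topic_id], ...]

with open("topic_results/topics_phrases.json") as f:
    topics_phrases = json.load(f)      # {'topic_id': [[phrase, prob], ...]}
```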

# Pretraining

**Data**

For unsupervised pretraining of UCTopic, we use articles and spans with links from English Wikipedia and Wikidata. Our processed dataset can be downloaded from [here](https://drive.google.com/file/d/1wflsmhPI9J0ZA6aVRl2mQjHIE6JIvzAv/view?usp=sharing).

**Training scripts**

We provide example training scripts and our default training parameters for unsupervised training of UCTopic in `run_example.sh`.

```bash
bash run_example.sh
```

Descriptions of the arguments can be found in `pretrain.py`. All the other arguments are standard Hugging Face `transformers` training arguments.

**Convert models**

Our pretrained checkpoints are slightly different from the `uctopic-base` checkpoint. Please refer to `convert_uctopic_parameters.py` to convert them.

# Contact

If you have any questions related to the code or the paper, feel free to email Jiacheng (`j9li@eng.ucsd.edu`). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please describe the problem in detail so we can help you better and more quickly!

# Citation

Please cite our paper if you use UCTopic in your work:

```bibtex
@inproceedings{Li2022UCTopicUC,
    title = "{UCT}opic: Unsupervised Contrastive Learning for Phrase Representations and Topic Mining",
    author = "Li, Jiacheng  and
      Shang, Jingbo  and
      McAuley, Julian",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.426",
    doi = "10.18653/v1/2022.acl-long.426",
    pages = "6159--6169"
}
```