MaartenGr committed on
Commit
af53722
1 Parent(s): f1a31f1

Update readme

Files changed (1)
  1. README.md +66 -0
README.md CHANGED
@@ -10,6 +10,12 @@ library_name: bertopic
10
  This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
11
  BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.
12
13
  ## Usage
14
 
15
  To use this model, please install BERTopic:
@@ -27,6 +33,12 @@ topic_model = BERTopic.load("MaartenGr/BERTopic_Multimodal")
27
  topic_model.get_topic_info()
28
  ```
29
 
30
  ## Topic overview
31
 
32
  * Number of topics: 29
@@ -69,6 +81,60 @@ topic_model.get_topic_info()
69
 
70
  </details>
71
72
  ## Training hyperparameters
73
 
74
  * calculate_probabilities: False
 
10
  This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
11
  BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.
12
 
13
+ This model was trained on 8,000 images from Flickr **without** their captions. It demonstrates how BERTopic can be used for topic modeling with images as the only input.
14
+
15
+ A few examples of generated topics:
16
+
17
+ !["multimodal.png"](multimodal.png)
18
+
19
  ## Usage
20
 
21
  To use this model, please install BERTopic:
 
33
  topic_model.get_topic_info()
34
  ```
35
 
36
+ You can view all information about a topic as follows:
37
+
38
+ ```python
39
+ topic_model.get_topic(topic_id, full=True)
40
+ ```
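As a rough illustration of what to do with that result, the sketch below prints one line per aspect. It assumes that `get_topic(..., full=True)` returns a dictionary mapping aspect names (such as `Main`, `KeyBERT` and `Visual_Aspect`) to either a list of `(word, score)` pairs or a non-text object such as an image. The helper `summarize_topic` and the `example` dictionary are hypothetical and exist only for this example; the scores are made up.

```python
def summarize_topic(aspects, top_n=3):
    """Return one readable line per aspect of a topic."""
    lines = []
    for name, value in aspects.items():
        if isinstance(value, list):
            # Keyword aspects: a list of (word, score) pairs.
            words = ", ".join(word for word, _ in value[:top_n])
            lines.append(f"{name}: {words}")
        else:
            # Non-text aspects (for example an image) are only named by type.
            lines.append(f"{name}: <{type(value).__name__}>")
    return lines


# Stand-in for topic_model.get_topic(topic_id, full=True); values are made up.
example = {
    "Main": [("dog", 0.41), ("grass", 0.22), ("ball", 0.18), ("running", 0.10)],
    "KeyBERT": [("dog", 0.62), ("puppy", 0.55)],
    "Visual_Aspect": object(),
}

for line in summarize_topic(example):
    print(line)
```

With a loaded model, the same helper could be called as `summarize_topic(topic_model.get_topic(topic_id, full=True))`, provided the returned structure matches the assumption above.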
41
+
42
  ## Topic overview
43
 
44
  * Number of topics: 29
 
81
 
82
  </details>
83
 
84
+ ## Training Procedure
85
+
86
+ The data was retrieved as follows:
87
+
88
+ ```python
89
+ import os
90
+ import glob
91
+ import zipfile
92
+ import numpy as np
93
+ import pandas as pd
94
+ from tqdm import tqdm
95
+ from sentence_transformers import util
96
+
97
+ # Flickr 8k images
98
+ img_folder = 'photos/'
99
+ caps_folder = 'captions/'
100
+ if not os.path.exists(img_folder) or len(os.listdir(img_folder)) == 0:
101
+     os.makedirs(img_folder, exist_ok=True)
102
+
103
+     if not os.path.exists('Flickr8k_Dataset.zip'):  # Download the dataset if it does not exist
104
+         util.http_get('https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_Dataset.zip', 'Flickr8k_Dataset.zip')
105
+         util.http_get('https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_text.zip', 'Flickr8k_text.zip')
106
+
107
+     for folder, file in [(img_folder, 'Flickr8k_Dataset.zip'), (caps_folder, 'Flickr8k_text.zip')]:
108
+         with zipfile.ZipFile(file, 'r') as zf:
109
+             for member in tqdm(zf.infolist(), desc='Extracting'):
110
+                 zf.extract(member, folder)
111
+ images = list(glob.glob('photos/Flicker8k_Dataset/*.jpg'))
112
+ ```
113
+
114
+ Then, to perform topic modeling on multimodal data with BERTopic:
115
+
116
+ ```python
117
+ from bertopic import BERTopic
118
+ from bertopic.backend import MultiModalBackend
119
+ from bertopic.representation import VisualRepresentation, KeyBERTInspired
120
+
121
+ # Image embedding model
122
+ embedding_model = MultiModalBackend('clip-ViT-B-32', batch_size=32)
123
+
124
+ # Image to text representation model
125
+ representation_model = {
126
+     "Visual_Aspect": VisualRepresentation(image_to_text_model="nlpconnect/vit-gpt2-image-captioning", image_squares=True),
127
+     "KeyBERT": KeyBERTInspired()
128
+ }
129
+
130
+ # Train our model with images only
131
+ topic_model = BERTopic(representation_model=representation_model, verbose=True, embedding_model=embedding_model, min_topic_size=30)
132
+ topics, probs = topic_model.fit_transform(documents=None, images=images)
133
+ ```
134
+
135
+ The example above uses images as its only input. The images are clustered, and a small subset of representative images is extracted from each cluster. The representative images are captioned using `"nlpconnect/vit-gpt2-image-captioning"` to generate a small textual dataset over which we can run c-TF-IDF and the additional
136
+ `KeyBERTInspired` representation model.
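To make the last step more concrete, here is a minimal, standard-library sketch of a simplified c-TF-IDF computation over captions grouped by cluster. It is not BERTopic's implementation: the function `c_tf_idf`, the tiny `STOP_WORDS` set and the captions are all made up for illustration, and the weighting used is the commonly described form `tf(t, c) * log(1 + A / f(t))`, where `A` is the average number of words per cluster and `f(t)` is the frequency of term `t` across all clusters.

```python
import math
from collections import Counter

# Tiny, illustrative stop-word list; a real vectorizer would use a fuller one.
STOP_WORDS = {"a", "the", "on"}


def c_tf_idf(captions_per_topic):
    """Score terms per topic with tf(t, c) * log(1 + A / f(t))."""
    # Treat all captions of one topic as a single bag of words.
    counts = {}
    for topic, captions in captions_per_topic.items():
        words = " ".join(captions).lower().split()
        counts[topic] = Counter(w for w in words if w not in STOP_WORDS)

    # f(t): frequency of each term across all topics.
    total = Counter()
    for topic_counts in counts.values():
        total.update(topic_counts)

    # A: average number of words per topic.
    avg_words = sum(total.values()) / len(counts)

    scores = {}
    for topic, topic_counts in counts.items():
        n_words = sum(topic_counts.values())
        scores[topic] = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in topic_counts.items()
        }
    return scores


# Made-up captions standing in for the output of the captioning model.
captions = {
    0: ["a dog runs on the grass", "a dog catches a ball"],
    1: ["a man rides a bike", "a man on a bike jumps"],
}

scores = c_tf_idf(captions)
for topic, term_scores in scores.items():
    top_terms = sorted(term_scores, key=term_scores.get, reverse=True)[:2]
    print(topic, top_terms)
```

Terms that occur often within one cluster's captions but rarely elsewhere score highest, which is why a handful of captions per cluster can be enough to produce readable topic keywords.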
137
+
138
  ## Training hyperparameters
139
 
140
  * calculate_probabilities: False