MaartenGr committed on
Commit f1a31f1
1 parent: 7aee05e

Add BERTopic model

README.md ADDED
@@ -0,0 +1,95 @@
+
+ ---
+ tags:
+ - bertopic
+ library_name: bertopic
+ ---
+
+ # BERTopic_Multimodal
+
+ This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
+ BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.
+
+ ## Usage
+
+ To use this model, please install BERTopic:
+
+ ```
+ pip install -U bertopic
+ ```
+
+ You can use the model as follows:
+
+ ```python
+ from bertopic import BERTopic
+ topic_model = BERTopic.load("MaartenGr/BERTopic_Multimodal")
+
+ topic_model.get_topic_info()
+ ```
+
+ ## Topic overview
+
+ * Number of topics: 29
+ * Number of training documents: 8091
+
+ <details>
+ <summary>Click here for an overview of all topics.</summary>
+
+ | Topic ID | Topic Keywords | Topic Frequency | Label |
+ |----------|----------------|-----------------|-------|
+ | -1 | while - air - the - in - jumping | 34 | -1_while_air_the_in |
+ | 0 | bench - sitting - people - woman - street | 1132 | 0_bench_sitting_people_woman |
+ | 1 | grass - running - dog - grassy - field | 1693 | 1_grass_running_dog_grassy |
+ | 2 | boy - girl - little - young - holding | 1290 | 2_boy_girl_little_young |
+ | 3 | dog - frisbee - running - water - mouth | 1224 | 3_dog_frisbee_running_water |
+ | 4 | skateboard - ramp - doing - trick - cement | 415 | 4_skateboard_ramp_doing_trick |
+ | 5 | snow - dog - covered - running - through | 309 | 5_snow_dog_covered_running |
+ | 6 | mountain - range - slope - standing - person | 205 | 6_mountain_range_slope_standing |
+ | 7 | pool - blue - boy - toy - water | 189 | 7_pool_blue_boy_toy |
+ | 8 | trail - bike - down - riding - person | 166 | 8_trail_bike_down_riding |
+ | 9 | snowboarder - mid - jump - air - after | 126 | 9_snowboarder_mid_jump_air |
+ | 10 | rock - climbing - up - wall - tree | 124 | 10_rock_climbing_up_wall |
+ | 11 | wave - surfboard - top - riding - of | 112 | 11_wave_surfboard_top_riding |
+ | 12 | beach - surfboard - people - with - walking | 102 | 12_beach_surfboard_people_with |
+ | 13 | jumping - track - horse - racquet - dog | 98 | 13_jumping_track_horse_racquet |
+ | 14 | snowboard - snow - girl - hill - slope | 95 | 14_snowboard_snow_girl_hill |
+ | 15 | game - being - football - played - professional | 91 | 15_game_being_football_played |
+ | 16 | soccer - kicking - team - ball - player | 80 | 16_soccer_kicking_team_ball |
+ | 17 | dirt - bike - person - rider - going | 75 | 17_dirt_bike_person_rider |
+ | 18 | soccer - boys - field - ball - kicking | 69 | 18_soccer_boys_field_ball |
+ | 19 | baseball - player - bat - swinging - into | 63 | 19_baseball_player_bat_swinging |
+ | 20 | basketball - up - and - playing - jumping | 59 | 20_basketball_up_and_playing |
+ | 21 | bird - body - flying - over - long | 55 | 21_bird_body_flying_over |
+ | 22 | motorcycle - track - race - racer - racing | 55 | 22_motorcycle_track_race_racer |
+ | 23 | boat - sitting - water - lake - hose | 53 | 23_boat_sitting_water_lake |
+ | 24 | street - riding - down - bike - woman | 52 | 24_street_riding_down_bike |
+ | 25 | paddle - suit - paddling - water - in | 49 | 25_paddle_suit_paddling_water |
+ | 26 | pair - scissors - stage - white - shirt | 42 | 26_pair_scissors_stage_white |
+ | 27 | tennis - court - racket - racquet - swinging | 34 | 27_tennis_court_racket_racquet |
+
+ </details>
+
+ ## Training hyperparameters
+
+ * calculate_probabilities: False
+ * language: None
+ * low_memory: False
+ * min_topic_size: 30
+ * n_gram_range: (1, 1)
+ * nr_topics: None
+ * seed_topic_list: None
+ * top_n_words: 10
+ * verbose: True
+
+ ## Framework versions
+
+ * Numpy: 1.23.5
+ * HDBSCAN: 0.8.29
+ * UMAP: 0.5.3
+ * Pandas: 1.5.3
+ * Scikit-Learn: 1.2.2
+ * Sentence-transformers: 2.2.2
+ * Transformers: 4.29.2
+ * Numba: 0.56.4
+ * Plotly: 5.14.1
+ * Python: 3.10.10
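
A quick consistency check on the README's topic overview may help when reviewing this file. The sketch below uses the frequencies copied from the table and confirms that they add up to the stated number of training documents, with topic -1 (BERTopic's outlier topic) included in the count of 29. The `make_label` helper is hypothetical: it reproduces the label pattern visible in the table (topic ID plus the first four keywords joined by underscores) and is not taken from BERTopic's source.

```python
# Topic frequencies copied from the README table, ordered from topic -1 to topic 27.
frequencies = [
    34, 1132, 1693, 1290, 1224, 415, 309, 205, 189, 166,
    126, 124, 112, 102, 98, 95, 91, 80, 75, 69,
    63, 59, 55, 55, 53, 52, 49, 42, 34,
]


def make_label(topic_id, keywords, n_words=4):
    """Hypothetical helper: join the topic ID and the first n_words keywords.

    This mirrors the pattern seen in the table's Label column; it is an
    inference from the table, not BERTopic's own implementation.
    """
    return "_".join([str(topic_id)] + keywords[:n_words])


num_topics = len(frequencies)          # includes the outlier topic -1
total_documents = sum(frequencies)     # should equal the stated training set size
outlier_share = frequencies[0] / total_documents

print(num_topics, total_documents, f"{outlier_share:.2%}")
print(make_label(-1, ["while", "air", "the", "in", "jumping"]))
```

Only 34 of the documents fall into topic -1, so nearly every training document was assigned to a named topic.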
config.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "calculate_probabilities": false,
+   "language": null,
+   "low_memory": false,
+   "min_topic_size": 30,
+   "n_gram_range": [
+     1,
+     1
+   ],
+   "nr_topics": null,
+   "seed_topic_list": null,
+   "top_n_words": 10,
+   "verbose": true
+ }
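
The values in config.json match the "Training hyperparameters" list in the README, with one representational difference: JSON has no tuple type, so `n_gram_range` is stored as the list `[1, 1]` while the README shows `(1, 1)`. A minimal sketch of reading the file back into Python values follows; `load_hyperparameters` is a hypothetical helper name, and the JSON text is inlined here only so the example is self-contained.

```python
import json

# Contents of config.json from this commit, inlined for a self-contained example.
CONFIG_JSON = """
{
  "calculate_probabilities": false,
  "language": null,
  "low_memory": false,
  "min_topic_size": 30,
  "n_gram_range": [1, 1],
  "nr_topics": null,
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": true
}
"""


def load_hyperparameters(text):
    """Hypothetical helper: parse the config and restore the tuple form."""
    params = json.loads(text)
    # JSON arrays decode to lists; convert back to the (min_n, max_n) tuple
    # shown in the README.
    params["n_gram_range"] = tuple(params["n_gram_range"])
    return params


params = load_hyperparameters(CONFIG_JSON)
print(params["n_gram_range"], params["min_topic_size"], params["language"])
```

The keys correspond to keyword arguments of the `BERTopic` constructor, so the resulting dictionary should be usable as `BERTopic(**params)` when training a comparable model; that step is left out here because it requires the library to be installed.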
ctfidf.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:465603b0624b3ee3419120940aaf719522f68ab182a3e29e63a1d48381cfd2c1
+ size 10240
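
The three added lines are a Git LFS pointer, not the tensor data itself: the repository stores the spec version, the SHA-256 digest of the real file, and its size in bytes. As a rough sketch, a downloaded copy of ctfidf.safetensors could be checked against this pointer as shown below. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the code assumes the simple `key value` line layout seen in this diff.

```python
import hashlib

# Pointer text copied from the diff above.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:465603b0624b3ee3419120940aaf719522f68ab182a3e29e63a1d48381cfd2c1
size 10240
"""


def parse_lfs_pointer(text):
    """Hypothetical helper: split each 'key value' line into a dictionary."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    hash_algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data, pointer):
    """Return True when the bytes have the size and digest the pointer records."""
    if len(data) != pointer["size"]:
        return False
    return hashlib.new(pointer["hash_algo"], data).hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER)
print(pointer["hash_algo"], pointer["size"])
```

To verify a real download, read the file in binary mode and pass its bytes to `matches_pointer`.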
ctfidf_config.json ADDED
@@ -0,0 +1,237 @@
+ {
+   "ctfidf_model": {
+     "bm25_weighting": false,
+     "reduce_frequent_words": false
+   },
+   "vectorizer_model": {
+     "params": {
+       "analyzer": "word",
+       "binary": false,
+       "decode_error": "strict",
+       "encoding": "utf-8",
+       "input": "content",
+       "lowercase": true,
+       "max_df": 1.0,
+       "max_features": null,
+       "min_df": 1,
+       "ngram_range": [
+         1,
+         1
+       ],
+       "stop_words": null,
+       "strip_accents": null,
+       "token_pattern": "(?u)\\b\\w\\w+\\b",
+       "vocabulary": null
+     },
+     "vocab": {
+       "woman": 203,
+       "is": 85,
+       "playing": 131,
+       "with": 202,
+       "frisbee": 60,
+       "on": 113,
+       "the": 176,
+       "beach": 17,
+       "man": 101,
+       "in": 83,
+       "blue": 28,
+       "shirt": 152,
+       "water": 196,
+       "young": 207,
+       "boy": 33,
+       "wet": 198,
+       "suit": 169,
+       "standing": 166,
+       "person": 124,
+       "jumping": 91,
+       "air": 2,
+       "while": 199,
+       "holding": 78,
+       "surfboard": 171,
+       "rock": 148,
+       "climbing": 44,
+       "wall": 193,
+       "two": 188,
+       "boys": 34,
+       "are": 7,
+       "to": 179,
+       "catch": 39,
+       "walking": 192,
+       "down": 52,
+       "street": 167,
+       "cell": 40,
+       "phone": 125,
+       "crowd": 47,
+       "of": 112,
+       "people": 122,
+       "sitting": 153,
+       "bench": 21,
+       "outside": 116,
+       "building": 36,
+       "talking": 173,
+       "next": 110,
+       "group": 69,
+       "dog": 49,
+       "bird": 26,
+       "her": 73,
+       "head": 72,
+       "running": 150,
+       "grass": 66,
+       "through": 177,
+       "grassy": 67,
+       "field": 56,
+       "dogs": 50,
+       "each": 54,
+       "other": 114,
+       "across": 0,
+       "and": 5,
+       "cat": 38,
+       "its": 87,
+       "mouth": 108,
+       "black": 27,
+       "brown": 35,
+       "girl": 63,
+       "hugging": 82,
+       "baseball": 14,
+       "bat": 16,
+       "his": 75,
+       "hand": 70,
+       "little": 99,
+       "red": 143,
+       "umbrella": 189,
+       "metal": 103,
+       "pole": 132,
+       "wooden": 205,
+       "bed": 19,
+       "skateboard": 154,
+       "middle": 105,
+       "park": 121,
+       "throwing": 178,
+       "hill": 74,
+       "ball": 11,
+       "along": 4,
+       "body": 31,
+       "skateboarder": 155,
+       "doing": 51,
+       "trick": 187,
+       "ramp": 140,
+       "riding": 145,
+       "top": 182,
+       "cement": 41,
+       "bike": 24,
+       "snow": 159,
+       "covered": 46,
+       "snowboard": 160,
+       "snowy": 163,
+       "surface": 170,
+       "slope": 158,
+       "tongue": 181,
+       "out": 115,
+       "laying": 97,
+       "fighting": 57,
+       "over": 117,
+       "ground": 68,
+       "jacket": 88,
+       "mountain": 107,
+       "range": 141,
+       "up": 191,
+       "backpack": 10,
+       "skis": 156,
+       "children": 43,
+       "pool": 133,
+       "toy": 183,
+       "baby": 9,
+       "board": 29,
+       "white": 200,
+       "boogie": 32,
+       "dirt": 48,
+       "trail": 185,
+       "snowboarder": 161,
+       "pile": 126,
+       "mid": 104,
+       "after": 1,
+       "jump": 90,
+       "airborne": 3,
+       "going": 65,
+       "tree": 186,
+       "near": 109,
+       "rope": 149,
+       "wave": 197,
+       "ocean": 111,
+       "track": 184,
+       "horse": 80,
+       "racquet": 139,
+       "pair": 120,
+       "jockeys": 89,
+       "racing": 137,
+       "race": 135,
+       "rider": 144,
+       "snowboarding": 162,
+       "child": 42,
+       "football": 59,
+       "game": 61,
+       "being": 20,
+       "played": 128,
+       "during": 53,
+       "player": 129,
+       "plate": 127,
+       "professional": 134,
+       "soccer": 164,
+       "team": 174,
+       "watching": 195,
+       "large": 96,
+       "men": 102,
+       "kicking": 94,
+       "getting": 62,
+       "ready": 142,
+       "kick": 93,
+       "road": 147,
+       "sliding": 157,
+       "into": 84,
+       "home": 79,
+       "another": 6,
+       "watches": 194,
+       "swinging": 172,
+       "base": 13,
+       "at": 8,
+       "uniform": 190,
+       "basketball": 15,
+       "players": 130,
+       "hit": 76,
+       "court": 45,
+       "hitting": 77,
+       "flying": 58,
+       "long": 100,
+       "beak": 18,
+       "motorcycle": 106,
+       "racer": 136,
+       "boat": 30,
+       "paddling": 119,
+       "canoe": 37,
+       "lake": 95,
+       "hat": 71,
+       "paddle": 118,
+       "river": 146,
+       "bank": 12,
+       "hose": 81,
+       "women": 204,
+       "bicycles": 23,
+       "bicycle": 22,
+       "it": 86,
+       "life": 98,
+       "yellow": 206,
+       "bikini": 25,
+       "kayaker": 92,
+       "together": 180,
+       "scissors": 151,
+       "tennis": 175,
+       "performing": 123,
+       "stage": 165,
+       "wii": 201,
+       "striped": 168,
+       "elephant": 55,
+       "girls": 64,
+       "racket": 138
+     }
+   }
+ }
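
Two details of this vectorizer config explain the shape of the vocabulary. With `lowercase` set to true and the token pattern `(?u)\b\w\w+\b`, text is lowercased and only runs of two or more word characters become tokens, which is why one-letter words such as "a" are absent while "is", "in", and "of" are present (`stop_words` is null). The indices also follow alphabetical order ("across" is 0, "after" is 1, "air" is 2), which is consistent with scikit-learn's `CountVectorizer` sorting its vocabulary. The sketch below imitates that behaviour with the standard library only; the two captions are made-up examples, not training data, and `tokenize` and `build_vocab` are hypothetical helper names.

```python
import re

# Same pattern as "token_pattern" in the config (unescaped from JSON).
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and keep tokens of at least two word characters."""
    return TOKEN_PATTERN.findall(text.lower())


def build_vocab(documents):
    """Map each distinct token to its rank in alphabetical order."""
    terms = sorted({token for doc in documents for token in tokenize(doc)})
    return {term: index for index, term in enumerate(terms)}


# Made-up captions for illustration only.
captions = [
    "A dog is running across the grass",
    "A boy jumping in the air",
]

tokens = tokenize(captions[0])
vocab = build_vocab(captions)
print(tokens)
print(vocab)
```

This is a stand-in for the tokenization step only; the real model additionally computes c-TF-IDF weights over this vocabulary, which the sketch does not attempt.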
images/-1.jpg ADDED
images/0.jpg ADDED
images/1.jpg ADDED
images/10.jpg ADDED
images/11.jpg ADDED
images/12.jpg ADDED
images/13.jpg ADDED
images/14.jpg ADDED
images/15.jpg ADDED
images/16.jpg ADDED
images/17.jpg ADDED
images/18.jpg ADDED
images/19.jpg ADDED
images/2.jpg ADDED
images/20.jpg ADDED
images/21.jpg ADDED
images/22.jpg ADDED
images/23.jpg ADDED
images/24.jpg ADDED
images/25.jpg ADDED
images/26.jpg ADDED
images/27.jpg ADDED
images/3.jpg ADDED
images/4.jpg ADDED
images/5.jpg ADDED
images/6.jpg ADDED
images/7.jpg ADDED
images/8.jpg ADDED
images/9.jpg ADDED
topic_embeddings.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:905f4141e127f6fcf1b9e7a99442195b61c263b7ddfe15847f3f51b96a4ff188
+ size 59480
topics.json ADDED
The diff for this file is too large to render. See raw diff