D0men1c0 committed
Commit c8e252a
1 Parent(s): 9e89225

Add BERTopic model

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ ctfidf_config.json filter=lfs diff=lfs merge=lfs -text
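The added line routes `ctfidf_config.json` through Git LFS, next to the existing archive and TensorBoard patterns. A rough way to check which file names match the patterns visible in this hunk is sketched below. It uses Python's `fnmatch` as an approximation of gitattributes matching (the real rules differ for patterns that contain slashes or `**`), and `lfs_tracked` is a hypothetical helper name. Only the four patterns shown in the hunk are included; the earlier lines of .gitattributes are not visible here, so a `False` result only means "not matched by these four".

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Patterns visible in the hunk above; earlier lines of .gitattributes are not shown.
VISIBLE_LFS_PATTERNS = ["*.zip", "*.zst", "*tfevents*", "ctfidf_config.json"]


def lfs_tracked(path: str, patterns=VISIBLE_LFS_PATTERNS) -> bool:
    """Approximate check: match the file's basename against slash-free patterns."""
    name = PurePosixPath(path).name
    return any(fnmatch(name, pattern) for pattern in patterns)


for path in ["ctfidf_config.json", "config.json", "topics.json", "README.md"]:
    print(f"{path}: {lfs_tracked(path)}")
```

The `.safetensors` files in this commit also appear as LFS pointers, presumably because of a pattern earlier in the file that this hunk does not show.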
README.md ADDED
@@ -0,0 +1,191 @@
+
+ ---
+ tags:
+ - bertopic
+ library_name: bertopic
+ pipeline_tag: text-classification
+ ---
+
+ # ISSR_Dark_Web_121Topics
+
+ This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
+ BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.
+
+ ## Usage
+
+ To use this model, please install BERTopic:
+
+ ```
+ pip install -U bertopic
+ ```
+
+ You can use the model as follows:
+
+ ```python
+ from bertopic import BERTopic
+ topic_model = BERTopic.load("D0men1c0/ISSR_Dark_Web_121Topics")
+
+ topic_model.get_topic_info()
+ ```
+
+ ## Topic overview
+
+ * Number of topics: 122 (topics 0 to 120 plus the outlier topic -1)
+ * Number of training documents: 260996
+
+ <details>
+ <summary>Click here for an overview of all topics.</summary>
+
+ | Topic ID | Topic Keywords | Topic Frequency | Label |
+ |----------|----------------|-----------------|-------|
+ | -1 | vendor - order - nt - market - link | 509 | Outliers |
+ | 0 | cart - weed - strain - thc - bud | 91708 | Product Reviews and Purchases |
+ | 1 | deposit - address - ticket - btc - wallet | 14083 | Empire Deposit & Withdrawal Issues |
+ | 2 | key - pgp - account - pgp key - password | 5365 | PGP Key Security |
+ | 3 | order - shipped - ordered - day - week | 5696 | Order Shipping Status |
+ | 4 | scam - scammer - scam scam - scam scam scam - scammed | 3368 | Vendor Scams and Detection |
+ | 5 | thanks - thank - lol - man - bro | 3187 | Friendly Positive Talk |
+ | 6 | ship - country - eu - shipping - uk | 4584 | Shipping in EU Countries |
+ | 7 | coke - cocaine - quality - product - good | 3425 | High Quality Cocaine |
+ | 8 | card - carding - cc - gift card - gift | 2846 | Carding Strategies |
+ | 9 | pgp - begin pgp - begin - pgp signature - signature | 3283 | PGP Signature End |
+ | 10 | lsd - tab - ug - acid - blotter | 2462 | LSD Tab Marketplace Reviews |
+ | 11 | vendor - good - anyone - know - legit | 2562 | Vendor Recommendation |
+ | 12 | dispute - refund - mod - order - moderator | 2692 | Dispute Resolution |
+ | 13 | wsm - dream - market - exit - vendor | 2100 | WSM Exit Scam Warnings |
+ | 14 | drug - police - get - nt - house | 1970 | Drugs and Police Enforcement |
+ | 15 | monero - xmr - wallet - btc - exchange | 2214 | Monero Wallet Exchange and Bitcoin Use |
+ | 16 | ddos - attack - ddos attack - mirror - market | 2995 | Dread Market DDoS Attack |
+ | 17 | mdma - mda - price - quality - vendor | 1773 | MDMA Vendor Quality Prices |
+ | 18 | darknet - clearnet - dark - link - darknetmarkets | 1925 | Darknet Market |
+ | 19 | sub - post - mod - banned - link | 1887 | Dread Market Forum Rules and Bans |
+ | 20 | bar - alp - hulk - press - pack | 1942 | Alprazolam Pressed Bars Reviews |
+ | 21 | xanax - bar - alp - mg - alprazolam | 2350 | Xanax Bars and Vendors |
+ | 22 | market - new market - market market - good - new | 2036 | Market Upgrade Support |
+ | 23 | feedback - review - vendor - negative - positive | 1650 | ecommerce feedback |
+ | 24 | mirror - working - working mirror - mirror link - link | 1802 | Mirror Link Working |
+ | 25 | mg - pill - tablet - price - xtc | 1228 | Drug Sales |
+ | 26 | box - mail - package - address - po | 3778 | Mail Delivery Issues |
+ | 27 | review - thanks - thanks review - review thanks - nice review | 2829 | Positive Reviews Thank You Notes Nice |
+ | 28 | ticket - support ticket - support - se en - se | 1242 | Support Ticket Confusion |
+ | 29 | cryptonia - market - empire - nightmare - vendor | 1331 | Cryptonia Market |
+ | 30 | escrow - fe - use escrow - vendor - market | 1214 | Market Escrow Usage |
+ | 31 | onion - dot onion - dot - onion link - onion site | 1216 | Onion Links |
+ | 32 | det - er - og - har - jeg | 1226 | Kola;Vendor;Stealth Shipping;Review;Norway |
+ | 33 | tor - browser - network - javascript - tor browser | 1126 | Anonymous Browsing and Tor Networks |
+ | 34 | dread - reddit - post - dread dread - sub | 1109 | community appreciation |
+ | 35 | meth - business day - business - day - good | 1074 | Meth Vendor Quality Review |
+ | 36 | fent - fentanyl - opiate - heroin - nt | 1161 | Fentanyl Opiate Discussion |
+ | 37 | link - link link - point link comment - link point link - link comment post | 1695 | Links and Posts |
+ | 38 | pack - week - day - ordered - land | 1332 | Package Delay and Shipping |
+ | 39 | pm - interested - looking - find - please | 1499 | PM Interested Help Explanation |
+ | 40 | hugbunter - hugbunter hugbunter - link hugbunter - hugbunter link - link hugbunter hugbunter | 1157 | hugbunter links |
+ | 41 | drug - police - court - enforcement - investigation | 923 | Darknet Drug Enforcement |
+ | 42 | stealth - good - good stealth - vendor - shipping | 849 | Good Stealth Vendor Shipping |
+ | 43 | counterfeit - note - euro - bill - pen | 968 | Counterfeit Money Sales |
+ | 44 | empire - nightmare - empire empire - find empire - empire nightmare | 1010 | Empire Name Search |
+ | 45 | day - waiting - week - month - hour | 1040 | Waiting Time |
+ | 46 | id - passport - fake - license - scan | 1681 | Fake IDs & Documents |
+ | 47 | bank - account - drop - bank drop - cash | 1079 | Bank Drop Transaction |
+ | 48 | wickr - use wickr - using wickr - contact - via wickr | 1850 | Wickr Abuse Policy Protect Wickr Community |
+ | 49 | de - und - que - un - da | 751 | German Darknet Market |
+ | 50 | phishing - phishing link - link - phished - phishing site | 740 | Phishing Detection Techniques |
+ | 51 | dream - nightmare - dream dream - anyone - like | 1041 | Dream Nightmare Experience |
+ | 52 | price - sale - promo - sell - buy | 756 | Sale;Promotional Offers;Good Deals |
+ | 53 | tails - tail - usb - electrum - persistent | 1101 | Tails;Electrum;Persistent File;USB Installation |
+ | 54 | adderall - amphetamine - mg - replacement - speed | 1190 | Adderall Replacement Pills |
+ | 55 | cancel - order - auto - cancel order - day | 879 | Auto Cancel Orders |
+ | 56 | mushroom - shrooms - cubensis - psilocybin - spore | 963 | Mushroom Guide & Dosage Information |
+ | 57 | ketamine - gm - gm gm - gm gm gm - vendor | 763 | Ketamine Vendor Quality Shard |
+ | 58 | exit - exit scam - scam - exit scamming - exit scammed | 696 | Exit Scam Market |
+ | 59 | phone - burner - sim - card - number | 691 | Burner Phone Usage |
+ | 60 | dream - market - dream market - nightmare - nightmare market | 830 | Dream Market |
+ | 61 | bond - vendor bond - vendor - bond back - market | 860 | Vendor Bond Waiver Market |
+ | 62 | vpn - tor - use - using - proxy | 546 | VPN and Tor Use |
+ | 63 | jabber - telegram - xmpp - pidgin - otr | 1054 | Jabber/XMPP/OTR Chat Clients |
+ | 64 | dmt - psychedelics - per - psychedelic - changa | 588 | DMT Psychedelics Prices |
+ | 65 | captcha - captchas - page - enter - login | 697 | Captcha Issues in Darknet Market |
+ | 66 | sample - free sample - free - review - sample pack | 526 | Free Sample Order |
+ | 67 | update - issue - problem - working - fixed | 682 | Fixed Bug Issue |
+ | 68 | cgmc - invite - vendor - cgmc cgmc - link cgmc | 1410 | CGMC Invites |
+ | 69 | apollon - apollon market - market - empire - apollomarket | 483 | Apollon Market Update |
+ | 70 | paypal - transfer - account - paypal account - paypal transfer | 508 | PayPal Account Transfer |
+ | 71 | giveaway - win - number - winner - contest | 681 | Giveaways & Contests |
+ | 72 | pm - working link - link - link please - please | 739 | Working Link Requests |
+ | 73 | darkfail - link - fail - dark - dark fail | 516 | Dark Fail Links Question |
+ | 74 | empire - market - empire market - nightmare - alphabay | 598 | Empire Market Vendor Feedback |
+ | 75 | package - pack - delivery - tracking - day | 1144 | Package Delivery Tracking |
+ | 76 | bag - dog - seal - mylar - vac | 1816 | Smuggling Methods & Detection |
+ | 77 | opsec - opsec opsec - link opsec - opsec link - opsec opsec link | 542 | Opsec Guidance |
+ | 78 | link - working - working link - main link - link working | 392 | Link issues |
+ | 79 | money - pay - money back - dollar - get | 875 | Money Losses |
+ | 80 | tracking - tracking number - number - order - day | 513 | Tracking Number Concerns |
+ | 81 | guide - tutorial - outdated - thanks - method | 1078 | Guide Topic or Tutorial Help or Out |
+ | 82 | bir - bu - kai - ama - var | 357 | bu bir kai |
+ | 83 | rc - mxe - dck - rcs - fdck | 356 | RC Sources |
+ | 84 | cash - btc - bitcoin - coinbase - atm | 549 | Crypto Purchase Methods |
+ | 85 | olympus - market - fe escrow - olympus market - dream | 1271 | Olympus Market |
+ | 86 | log - logged - logging - login - page | 338 | Logging and Session Issues |
+ | 87 | vacation - vacation mode - mode - back - profile | 401 | Vacation Mode |
+ | 88 | message - contact - email - support - send | 331 | Message;Email Support |
+ | 89 | post - mod - comment - delete - thread | 748 | Moderation and Deletion of Posts |
+ | 90 | xmr - wallet - deposit - monero - payment id | 1369 | XMR Wallet Issue |
+ | 91 | image - exif - upload - exif data - data | 513 | Image Exif Data Upload |
+ | 92 | back - hope - welcome back - luck - good | 399 | hope recovery |
+ | 93 | review - template - pic - picture - table | 2240 | Review templates and images |
+ | 94 | cheer - cheer cheer - cheer mate - mate - anyone | 547 | Cheer Positivity Justification |
+ | 95 | bulk - price - kratom - good - kg | 307 | Bulk Kratom Vendors |
+ | 96 | wallstreet - wall st - wall - st - wallstreetmarket | 675 | Wallstreet Market Forum Links |
+ | 97 | product - stealth - shipping - quality - price | 379 | Product Review |
+ | 98 | listing - list - superlist - vendor - search | 2037 | Listing Management and Visibility |
+ | 99 | fuck - cunt - dick - fud fud - fud fud fud | 493 | Mom sex;Insults |
+ | 100 | empire - exit - market - scam - exit scam | 3076 | Empire Market |
+ | 101 | protonmail - protonmailcom - email - proton - secmail | 1138 | Protonmail Alternatives |
+ | 102 | wallet - node - gui - monero - remote node | 296 | Monero Wallet Update |
+ | 103 | multisig - market - transaction - escrow - use multisig | 423 | MultiSig Market Transactions |
+ | 104 | bunk - bunk bar - bar - hulk - sent bunk | 307 | Bunk and Bar |
+ | 105 | mg - benzo - benzos - alprazolam - alp | 355 | Benzodiazepine use and abuse |
+ | 106 | pelican - bird - pelicanvendor - bigbird - pelicanvendor pelicanvendor link | 2460 | Pelican Bird Giveaway |
+ | 107 | heinekenexpress - link heinekenexpress - heinekenexpress link - heinekenexpress heinekenexpress - link heinekenexpress heinekenexpress | 249 | Heineken Express Reviews |
+ | 108 | rdp - sock - vpn - ip - card | 266 | RDP Socks for Carding |
+ | 109 | dnm - dm - dread - forum - link | 401 | DNM Reddit Subs |
+ | 110 | pic - picture - photo - photoshop - post pic | 598 | Pics and posts |
+ | 111 | empire - link - empiremarket - empire link - link empire | 476 | Empire Market Links |
+ | 112 | invite - invite code - code - need invite - get invite | 398 | Darknet Market Invites |
+ | 113 | samsara - market - samsara market - sam - dream | 224 | Samsara Market |
+ | 114 | chemical - test - lab - powder - product | 290 | Chemistry Research and Supply |
+ | 115 | rapture - rapture market - rapturemarket - market - gbp | 1064 | Rapture Market GBP |
+ | 116 | water - acetone - powder - dry - ml | 214 | Acetone Recrystallization Techniques |
+ | 117 | witchman - link witchman - link - witchman link - witchman witchman | 1754 | Link Witchman Discussion |
+ | 118 | tochka - market - tochka market - tochka tochka - use tochka | 218 | Tochka market |
+ | 119 | post - know guy know - guy know guy - know guy - guy know | 210 | Read Post Discussion |
+ | 120 | subdread - sub - post - subdreads - create | 2589 | Subdread creation issues |
+
+ </details>
+
+ ## Training hyperparameters
+
+ * calculate_probabilities: True
+ * language: None
+ * low_memory: True
+ * min_topic_size: 10
+ * n_gram_range: (1, 3)
+ * nr_topics: None
+ * seed_topic_list: None
+ * top_n_words: 10
+ * verbose: True
+ * zeroshot_min_similarity: 0.7
+ * zeroshot_topic_list: None
+
+ ## Framework versions
+
+ * Numpy: 1.26.4
+ * HDBSCAN: 0.8.36
+ * UMAP: 0.5.6
+ * Pandas: 2.2.1
+ * Scikit-Learn: 1.4.1.post1
+ * Sentence-transformers: 3.0.1
+ * Transformers: 4.39.3
+ * Numba: 0.60.0
+ * Plotly: 5.22.0
+ * Python: 3.12.2
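The README's topic overview is a plain Markdown table, so it can be summarized without loading the model. The sketch below parses a three-row excerpt copied from that table and prints each topic's share of the 260996 training documents. It assumes the Topic Frequency column counts training documents, and `parse_topic_rows` is a hypothetical helper that expects data rows only (no header or separator row).

```python
TRAINING_DOCS = 260996  # "Number of training documents" from the README

# Three rows copied from the topic overview table (excerpt only).
EXCERPT = """\
| -1 | vendor - order - nt - market - link | 509 | Outliers |
| 0 | cart - weed - strain - thc - bud | 91708 | Product Reviews and Purchases |
| 1 | deposit - address - ticket - btc - wallet | 14083 | Empire Deposit & Withdrawal Issues |
"""


def parse_topic_rows(markdown: str) -> list[dict]:
    """Parse `| id | keywords | frequency | label |` data rows into dictionaries."""
    rows = []
    for line in markdown.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 4:
            continue  # skip blank or malformed lines
        topic_id, keywords, frequency, label = cells
        rows.append(
            {
                "topic_id": int(topic_id),
                "keywords": keywords.split(" - "),
                "frequency": int(frequency),
                "label": label,
            }
        )
    return rows


rows = parse_topic_rows(EXCERPT)
for row in rows:
    share = row["frequency"] / TRAINING_DOCS
    print(f"{row['topic_id']:>3}  {share:6.2%}  {row['label']}")
```

Under that assumption, topic 0 alone accounts for roughly a third of the documents, which is worth keeping in mind when reading the smaller topics.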
config.json ADDED
@@ -0,0 +1,17 @@
+ {
+   "calculate_probabilities": true,
+   "language": null,
+   "low_memory": true,
+   "min_topic_size": 10,
+   "n_gram_range": [
+     1,
+     3
+   ],
+   "nr_topics": null,
+   "seed_topic_list": null,
+   "top_n_words": 10,
+   "verbose": true,
+   "zeroshot_min_similarity": 0.7,
+   "zeroshot_topic_list": null,
+   "embedding_model": "all-MiniLM-L6-v2"
+ }
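config.json stores the same hyperparameters as the README, plus the name of the embedding model. A minimal sketch of turning that file back into constructor arguments follows. It assumes the keys map one-to-one onto `BERTopic(...)` keyword arguments, which has not been checked against the exact BERTopic version used for this commit, and `to_bertopic_kwargs` is a hypothetical helper. JSON has no tuple type, so `n_gram_range` is converted from a list back to a tuple.

```python
import json

# Contents of config.json as added in this commit.
CONFIG_TEXT = """
{
  "calculate_probabilities": true,
  "language": null,
  "low_memory": true,
  "min_topic_size": 10,
  "n_gram_range": [1, 3],
  "nr_topics": null,
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": true,
  "zeroshot_min_similarity": 0.7,
  "zeroshot_topic_list": null,
  "embedding_model": "all-MiniLM-L6-v2"
}
"""


def to_bertopic_kwargs(config: dict) -> dict:
    """Copy the config and restore the tuple that JSON serialized as a list."""
    kwargs = dict(config)
    kwargs["n_gram_range"] = tuple(kwargs["n_gram_range"])
    return kwargs


kwargs = to_bertopic_kwargs(json.loads(CONFIG_TEXT))
print(kwargs["n_gram_range"], kwargs["embedding_model"])

# With bertopic installed, an untrained model with the same settings could
# then be created along these lines (assumption, not run here):
#   from bertopic import BERTopic
#   topic_model = BERTopic(**kwargs)
```

This only recreates the settings. The fitted topics themselves come from `BERTopic.load(...)` as shown in the README.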
ctfidf.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d18e02fc04cddb360f2a3d21ee0332ad68496672ad3f31ba2227e8aaff1218b8
+ size 233761760
ctfidf_config.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a152061128403e7f7531bd3d07610aa71e55320742a7b1659e5bace1c04a7bd4
+ size 371499932
topic_embeddings.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e6bee9de4fd6336f84b3d8751801c5278fde32cb185da4168479e52a0b7b5fca
+ size 187480
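ctfidf.safetensors, ctfidf_config.json and topic_embeddings.safetensors are committed as Git LFS pointer files: three `key value` lines giving the spec version, the SHA-256 of the real file, and its size in bytes. The sketch below parses the pointer for topic_embeddings.safetensors and compares its size with a back-of-the-envelope estimate. The estimate assumes a single float32 tensor of shape (122, 384), that is, one 384-dimensional all-MiniLM-L6-v2 vector per row of the README's topic table. The diff does not show the tensor layout, so that shape is an assumption, and `parse_lfs_pointer` is a hypothetical helper.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer file into version, hash algorithm, digest and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:e6bee9de4fd6336f84b3d8751801c5278fde32cb185da4168479e52a0b7b5fca
size 187480
"""

pointer = parse_lfs_pointer(POINTER)

# Assumed layout: 122 topics x 384 dimensions x 4 bytes per float32 value.
assumed_payload = 122 * 384 * 4
leftover = pointer["size"] - assumed_payload
print(f"size={pointer['size']}  assumed payload={assumed_payload}  leftover={leftover}")
```

The 88 bytes left over are consistent with a small safetensors header (an 8-byte length prefix followed by a short JSON description of the tensor), which makes the assumed shape plausible without confirming it.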
topics.json ADDED
The diff for this file is too large to render. See raw diff