Commit cb3f3ee, committed by OSN2
1 Parent(s): 7ed86fa

Add BERTopic model

README.md ADDED
@@ -0,0 +1,240 @@
+
+ ---
+ tags:
+ - bertopic
+ library_name: bertopic
+ pipeline_tag: text-classification
+ ---
+
+ # ArXiv
+
+ This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
+ BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.
+
+ ## Usage
+
+ To use this model, please install BERTopic:
+
+ ```
+ pip install -U bertopic
+ ```
+
+ You can use the model as follows:
+
+ ```python
+ from bertopic import BERTopic
+ topic_model = BERTopic.load("OSN2/ArXiv")
+
+ topic_model.get_topic_info()
+ ```
+
+ ## Topic overview
+
+ * Number of topics: 171
+ * Number of training documents: 12693
+
+ <details>
+ <summary>Click here for an overview of all topics.</summary>
+
+ | Topic ID | Topic Keywords | Topic Frequency | Label |
+ |----------|----------------|-----------------|-------|
+ | -1 | the - and - to - of - in | 15 | -1_the_and_to_of |
+ | 0 | recipe - food - recipes - pizza - salad | 3814 | 0_recipe_food_recipes_pizza |
+ | 1 | trump - election - law - the - that | 849 | 1_trump_election_law_the |
+ | 2 | anaysa - fashion - pants - swimwear - sneakers | 393 | 2_anaysa_fashion_pants_swimwear |
+ | 3 | arsenal - liverpool - rugby - match - haaland | 382 | 3_arsenal_liverpool_rugby_match |
+ | 4 | weather - bengal - storm - west - snow | 271 | 4_weather_bengal_storm_west |
+ | 5 | crypto - bitcoin - cryptocurrency - gaming - trading | 172 | 5_crypto_bitcoin_cryptocurrency_gaming |
+ | 6 | her - she - was - on - related | 143 | 6_her_she_was_on |
+ | 7 | 420m - dog - animal - animals - dogs | 138 | 7_420m_dog_animal_animals |
+ | 8 | god - lord - prayer - jesus - church | 127 | 8_god_lord_prayer_jesus |
+ | 9 | cars - sale - used - under - for | 119 | 9_cars_sale_used_under |
+ | 10 | pro - vivo - v23 - phone - google | 117 | 10_pro_vivo_v23_phone |
+ | 11 | news - iptv - tv - interview - latest | 110 | 11_news_iptv_tv_interview |
+ | 12 | art - museum - artists - artist - of | 108 | 12_art_museum_artists_artist |
+ | 13 | my - nephews - nieces - poetry - love | 107 | 13_my_nephews_nieces_poetry |
+ | 14 | film - review - his - as - but | 102 | 14_film_review_his_as |
+ | 15 | bike - helmet - bikes - mountain - pilots | 98 | 15_bike_helmet_bikes_mountain |
+ | 16 | hair - bite - steel - care - haircut | 97 | 16_hair_bite_steel_care |
+ | 17 | police - rhonda - mcdowell - was - said | 90 | 17_police_rhonda_mcdowell_was |
+ | 18 | property - room - bedrooms - bedroom - home | 86 | 18_property_room_bedrooms_bedroom |
+ | 19 | ukraine - russia - russian - putin - news | 86 | 19_ukraine_russia_russian_putin |
+ | 20 | business - jobs - income - data - part | 86 | 20_business_jobs_income_data |
+ | 21 | vaccinated - vaccine - covid - va - unvaccinated | 84 | 21_vaccinated_vaccine_covid_va |
+ | 22 | music - band - students - orchestra - tickets | 83 | 22_music_band_students_orchestra |
+ | 23 | workout - abs - workouts - fitness - exercise | 83 | 23_workout_abs_workouts_fitness |
+ | 24 | school - teachers - dmc - 804 - children | 83 | 24_school_teachers_dmc_804 |
+ | 25 | women - robotics - bali - spanish - lutheran | 82 | 25_women_robotics_bali_spanish |
+ | 26 | lima - tourism - parks - urban - our | 79 | 26_lima_tourism_parks_urban |
+ | 27 | godzilla - movies - movie - spider - marvel | 77 | 27_godzilla_movies_movie_spider |
+ | 28 | fishing - backpacks - fish - packs - swimming | 74 | 28_fishing_backpacks_fish_packs |
+ | 29 | yoga - stretching - kru - nidra - oct | 74 | 29_yoga_stretching_kru_nidra |
+ | 30 | researchers - species - of - the - university | 72 | 30_researchers_species_of_the |
+ | 31 | wholesale - market - saree - delhi - software | 71 | 31_wholesale_market_saree_delhi |
+ | 32 | skin - acne - cream - blackheads - whitening | 70 | 32_skin_acne_cream_blackheads |
+ | 33 | rodents - pets - pest - dogs - animals | 70 | 33_rodents_pets_pest_dogs |
+ | 34 | books - book - salinger - fiction - literary | 67 | 34_books_book_salinger_fiction |
+ | 35 | class - pst - exams - preparation - test | 66 | 35_class_pst_exams_preparation |
+ | 36 | 5g - airlines - bsnl - flight - network | 64 | 36_5g_airlines_bsnl_flight |
+ | 37 | treetops - dementia - children - people - barbara | 62 | 37_treetops_dementia_children_people |
+ | 38 | lottery - thai - thailand - lotto - win | 62 | 38_lottery_thai_thailand_lotto |
+ | 39 | wedding - weddings - survival - gift - day | 61 | 39_wedding_weddings_survival_gift |
+ | 40 | quantum - solar - energy - material - light | 61 | 40_quantum_solar_energy_material |
+ | 41 | beauty - makeup - products - sephora - skin | 60 | 41_beauty_makeup_products_sephora |
+ | 42 | games - xbox - game - solitaire - free | 60 | 42_games_xbox_game_solitaire |
+ | 43 | insurance - insurers - insurer - company - aig | 59 | 43_insurance_insurers_insurer_company |
+ | 44 | green - saf - haiti - industry - solar | 58 | 44_green_saf_haiti_industry |
+ | 45 | diet - meat - foods - plant - body | 55 | 45_diet_meat_foods_plant |
+ | 46 | edinburgh - tour - royal - travel - castle | 55 | 46_edinburgh_tour_royal_travel |
+ | 47 | horses - horse - friesian - goëngamieden - post | 54 | 47_horses_horse_friesian_goëngamieden |
+ | 48 | your - you - mental - health - anal | 51 | 48_your_you_mental_health |
+ | 49 | weight - obesity - loss - lose - fat | 51 | 49_weight_obesity_loss_lose |
+ | 50 | estate - real - property - home - you | 50 | 50_estate_real_property_home |
+ | 51 | camping - surfing - guess - landmark - lego | 50 | 51_camping_surfing_guess_landmark |
+ | 52 | dorm - sex - birthday - my - joy | 50 | 52_dorm_sex_birthday_my |
+ | 53 | covid - 19 - vaccinated - vaccine - cases | 50 | 53_covid_19_vaccinated_vaccine |
+ | 54 | spain - morocco - gas - energy - industry | 49 | 54_spain_morocco_gas_energy |
+ | 55 | gardening - garden - grow - plants - fertilizer | 49 | 55_gardening_garden_grow_plants |
+ | 56 | tenant - transport - apartments - department - condos | 49 | 56_tenant_transport_apartments_department |
+ | 57 | cricket - england - engw - indw - vs | 48 | 57_cricket_england_engw_indw |
+ | 58 | trump - election - party - votes - former | 48 | 58_trump_election_party_votes |
+ | 59 | tesla - marine - electric - musk - ev | 47 | 59_tesla_marine_electric_musk |
+ | 60 | surf - surfing - ski - swimming - lessons | 47 | 60_surf_surfing_ski_swimming |
+ | 61 | disabled - disability - thailand - scholarship - scholarships | 47 | 61_disabled_disability_thailand_scholarship |
+ | 62 | programming - udemy - svelte - language - courses | 44 | 62_programming_udemy_svelte_language |
+ | 63 | diy - ideas - desk - wood - woodworking | 43 | 63_diy_ideas_desk_wood |
+ | 64 | wrestling - pearson - tiga - wwe - nfl | 43 | 64_wrestling_pearson_tiga_wwe |
+ | 65 | smart - gadgets - appliances - home - kitchen | 42 | 65_smart_gadgets_appliances_home |
+ | 66 | experiments - fu - kung - xxxtentacion - copyright | 40 | 66_experiments_fu_kung_xxxtentacion |
+ | 67 | job - small - businesses - hiring - business | 40 | 67_job_small_businesses_hiring |
+ | 68 | hiv - health - care - hospital - hospice | 40 | 68_hiv_health_care_hospital |
+ | 69 | he - was - it - empire - movie | 38 | 69_he_was_it_empire |
+ | 70 | beat - type - ringtone - lofi - beats | 37 | 70_beat_type_ringtone_lofi |
+ | 71 | castellvi - marines - marine - corps - county | 37 | 71_castellvi_marines_marine_corps |
+ | 72 | casino - xbox - game - games - poker | 37 | 72_casino_xbox_game_games |
+ | 73 | bellanaijaweddings - bride - handmadepaper - weddingplanner - makeup | 36 | 73_bellanaijaweddings_bride_handmadepaper_weddingplanner |
+ | 74 | music - jsem - bushcraft - se - festival | 36 | 74_music_jsem_bushcraft_se |
+ | 75 | gemini - tarot - horoscope - september - pisces | 35 | 75_gemini_tarot_horoscope_september |
+ | 76 | career - husni - magazines - magazine - employees | 35 | 76_career_husni_magazines_magazine |
+ | 77 | his - film - movie - review - but | 34 | 77_his_film_movie_review |
+ | 78 | gps - aircraft - trucks - vehicles - electric | 34 | 78_gps_aircraft_trucks_vehicles |
+ | 79 | raya - merch - magazines - cards - kongamidyearshoppingfestival | 34 | 79_raya_merch_magazines_cards |
+ | 80 | baby - she - birth - says - women | 34 | 80_baby_she_birth_says |
+ | 81 | covid - 19 - uk - health - interventions | 33 | 81_covid_19_uk_health |
+ | 82 | climate - gore - dm - eastman - change | 33 | 82_climate_gore_dm_eastman |
+ | 83 | buhari - anambra - apc - anyim - chief | 32 | 83_buhari_anambra_apc_anyim |
+ | 84 | orchestra - hotel - janice - chicago - symphony | 31 | 84_orchestra_hotel_janice_chicago |
+ | 85 | ramen - pierre - soulz - magic - westfieldcarousel | 31 | 85_ramen_pierre_soulz_magic |
+ | 86 | interior - design - home - decorate - bedroom | 30 | 86_interior_design_home_decorate |
+ | 87 | hindi - movie - explained - hollywood - lankybox | 30 | 87_hindi_movie_explained_hollywood |
+ | 88 | xbox - playstation - game - card - console | 30 | 88_xbox_playstation_game_card |
+ | 89 | insurance - car - policy - feener - policyworld | 30 | 89_insurance_car_policy_feener |
+ | 90 | share - nepal - stock - market - analysis | 29 | 90_share_nepal_stock_market |
+ | 91 | marketing - content - strategy - cart - your | 28 | 91_marketing_content_strategy_cart |
+ | 92 | songs - kids - song - rhymes - hindi | 28 | 92_songs_kids_song_rhymes |
+ | 93 | tax - cd - money - itr - 401 | 27 | 93_tax_cd_money_itr |
+ | 94 | inflation - housing - prices - chorley - hydrow | 27 | 94_inflation_housing_prices_chorley |
+ | 95 | venkat - spectre - spending - attacks - intel | 26 | 95_venkat_spectre_spending_attacks |
+ | 96 | band - grammys - recording - musical - doo | 26 | 96_band_grammys_recording_musical |
+ | 97 | drawing - draw - art - mandala - painting | 26 | 97_drawing_draw_art_mandala |
+ | 98 | shop - insurance - design - restaurant - food | 26 | 98_shop_insurance_design_restaurant |
+ | 99 | kamran - feride - iqiyi - drama - selim | 26 | 99_kamran_feride_iqiyi_drama |
+ | 100 | poetry - prize - mondaymotivation - publication - apologize | 26 | 100_poetry_prize_mondaymotivation_publication |
+ | 101 | jobs - tcs - part - job - work | 25 | 101_jobs_tcs_part_job |
+ | 102 | card - credit - rewards - cash - tracking | 25 | 102_card_credit_rewards_cash |
+ | 103 | vlog - vlogs - dexerto - video - blog | 25 | 103_vlog_vlogs_dexerto_video |
+ | 104 | brother - 5½ - burge - poetry - thank | 25 | 104_brother_5½_burge_poetry |
+ | 105 | anime - manga - disney - animes - recap | 25 | 105_anime_manga_disney_animes |
+ | 106 | fox - news - msnbc - biden - business | 25 | 106_fox_news_msnbc_biden |
+ | 107 | thoreau - wildness - maldives - malé - wildlife | 24 | 107_thoreau_wildness_maldives_malé |
+ | 108 | condo - minutes - rent - condominium - เช | 24 | 108_condo_minutes_rent_condominium |
+ | 109 | freshworks - sales - requirements - job - development | 24 | 109_freshworks_sales_requirements_job |
+ | 110 | insurance - management - property - company - loans | 24 | 110_insurance_management_property_company |
+ | 111 | aew - wrestling - highlights - esports - impact | 23 | 111_aew_wrestling_highlights_esports |
+ | 112 | ctv - cbc - _x000d_ - news - bridge | 23 | 112_ctv_cbc__x000d__news |
+ | 113 | ukrainian - music - lyatoshynsky - solos - concert | 23 | 113_ukrainian_music_lyatoshynsky_solos |
+ | 114 | abc - ladzinski - campaign - carlton - news | 23 | 114_abc_ladzinski_campaign_carlton |
+ | 115 | gaming - pc - headset - byte - cosmic | 23 | 115_gaming_pc_headset_byte |
+ | 116 | climate - environmental - noaa - literacy - education | 23 | 116_climate_environmental_noaa_literacy |
+ | 117 | game - players - sonic - its - the | 22 | 117_game_players_sonic_its |
+ | 118 | olympic - olympics - chen - biles - medal | 22 | 118_olympic_olympics_chen_biles |
+ | 119 | loans - loan - student - paying - naira | 22 | 119_loans_loan_student_paying |
+ | 120 | nail - art - nails - compilation - acrylic | 22 | 120_nail_art_nails_compilation |
+ | 121 | peppa - pig - wolfoo - nguyen - favorite | 21 | 121_peppa_pig_wolfoo_nguyen |
+ | 122 | jazz - music - blues - heat - waves | 21 | 122_jazz_music_blues_heat |
+ | 123 | rónán - march - composer - lyricist - tickets | 21 | 123_rónán_march_composer_lyricist |
+ | 124 | olympic - beijing - olympics - china - athletes | 21 | 124_olympic_beijing_olympics_china |
+ | 125 | smoking - breakover - smokers - heart - hind | 21 | 125_smoking_breakover_smokers_heart |
+ | 126 | pets - animals - pet - panda - dog | 21 | 126_pets_animals_pet_panda |
+ | 127 | cycling - gcn - bike - feroce - wheels | 21 | 127_cycling_gcn_bike_feroce |
+ | 128 | musique - proposée - libre - par - la | 21 | 128_musique_proposée_libre_par |
+ | 129 | male - girlfriend - roseanne - unagi - twohill | 20 | 129_male_girlfriend_roseanne_unagi |
+ | 130 | gymnastics - moana - always - drugs - week | 20 | 130_gymnastics_moana_always_drugs |
+ | 131 | musk - gambling - twitter - elon - deduction | 20 | 131_musk_gambling_twitter_elon |
+ | 132 | lichfield - google - sat - stoke - mon | 20 | 132_lichfield_google_sat_stoke |
+ | 133 | reasonable - greenhouse - accommodation - robots - ai | 20 | 133_reasonable_greenhouse_accommodation_robots |
+ | 134 | icebox - maxo - theme - kream - koo | 19 | 134_icebox_maxo_theme_kream |
+ | 135 | whio - ong - ang - canal - birds | 19 | 135_whio_ong_ang_canal |
+ | 136 | codyfight - tattooing - brothers - marriage - extreme | 19 | 136_codyfight_tattooing_brothers_marriage |
+ | 137 | nuro - gm - vehicle - vehicles - electric | 19 | 137_nuro_gm_vehicle_vehicles |
+ | 138 | kcs - railroads - cn - rail - stb | 19 | 138_kcs_railroads_cn_rail |
+ | 139 | strengths - music - grief - leisure - life | 19 | 139_strengths_music_grief_leisure |
+ | 140 | drones - drone - uae - missile - dhabi | 19 | 140_drones_drone_uae_missile |
+ | 141 | massage - dubai - jumeirah - japanese - oil | 18 | 141_massage_dubai_jumeirah_japanese |
+ | 142 | bowl - super - bengals - bet - rams | 18 | 142_bowl_super_bengals_bet |
+ | 143 | pension - 9news - pensions - pay - tax | 18 | 143_pension_9news_pensions_pay |
+ | 144 | dog - toy - pet - supplies - toys | 18 | 144_dog_toy_pet_supplies |
+ | 145 | english - travellers - students - course - syllabus | 18 | 145_english_travellers_students_course |
+ | 146 | mentoring - cbs - mentor - mentors - teachers | 18 | 146_mentoring_cbs_mentor_mentors |
+ | 147 | picnic - park - blankets - basket - acompañantes | 18 | 147_picnic_park_blankets_basket |
+ | 148 | orig - 99 - amazon - prime - dollar | 18 | 148_orig_99_amazon_prime |
+ | 149 | primary - english - genetics - wilanów - education | 18 | 149_primary_english_genetics_wilanów |
+ | 150 | hardin - film - he - she - oscar | 17 | 150_hardin_film_he_she |
+ | 151 | laptop - gaming - alienware - laptops - hp | 17 | 151_laptop_gaming_alienware_laptops |
+ | 152 | ufc - tmz - owens - onlyfans - tonight | 17 | 152_ufc_tmz_owens_onlyfans |
+ | 153 | basketball - vs - varsity - darien - canaan | 17 | 153_basketball_vs_varsity_darien |
+ | 154 | workers - hanford - state - law - doe | 17 | 154_workers_hanford_state_law |
+ | 155 | cdl - freight - broker - logistics - eldt | 17 | 155_cdl_freight_broker_logistics |
+ | 156 | builders - connell - brenton - firm - wage | 17 | 156_builders_connell_brenton_firm |
+ | 157 | bookstore - easter - my - menger - eastershelfie | 16 | 157_bookstore_easter_my_menger |
+ | 158 | prince - royal - duke - charles - queen | 16 | 158_prince_royal_duke_charles |
+ | 159 | ดตามเราได - จำก - มหาชน - voicetv - oppday | 16 | 159_ดตามเราได_จำก_มหาชน_voicetv |
+ | 160 | nba - trades - stream - espn - live | 16 | 160_nba_trades_stream_espn |
+ | 161 | school - students - science - brandon - twig | 16 | 161_school_students_science_brandon |
+ | 162 | morning - sleep - your - kaplan - routine | 16 | 162_morning_sleep_your_kaplan |
+ | 163 | kat - author - desires - louise - charmaine | 16 | 163_kat_author_desires_louise |
+ | 164 | movie - recapped - uche - academia - dizzyeight | 16 | 164_movie_recapped_uche_academia |
+ | 165 | awka - religion - suspects - anambra - echeng | 15 | 165_awka_religion_suspects_anambra |
+ | 166 | wrc - f1 - rally - championship - formula1 | 15 | 166_wrc_f1_rally_championship |
+ | 167 | hillstream - algae - scape - goby - aquarium | 15 | 167_hillstream_algae_scape_goby |
+ | 168 | skin - filler - touche - éclat - dermal | 15 | 168_skin_filler_touche_éclat |
+ | 169 | pets - cats - hopkins - cat - niblo | 15 | 169_pets_cats_hopkins_cat |
+
+ </details>
+
+ ## Training hyperparameters
+
+ * calculate_probabilities: False
+ * language: None
+ * low_memory: False
+ * min_topic_size: 10
+ * n_gram_range: (1, 1)
+ * nr_topics: None
+ * seed_topic_list: None
+ * top_n_words: 10
+ * verbose: True
+ * zeroshot_min_similarity: 0.7
+ * zeroshot_topic_list: None
+
+ ## Framework versions
+
+ * Numpy: 1.23.5
+ * HDBSCAN: 0.8.33
+ * UMAP: 0.5.5
+ * Pandas: 1.5.3
+ * Scikit-Learn: 1.2.2
+ * Sentence-transformers: 2.2.2
+ * Transformers: 4.36.0
+ * Numba: 0.58.1
+ * Plotly: 5.15.0
+ * Python: 3.10.12
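
Note on the README topic table: in every row sampled here, the `Label` cell equals the topic ID joined by underscores with the first four keywords. The sketch below checks that pattern on three rows copied from the table. `parse_row` and `expected_label` are hypothetical helper names for illustration, not part of BERTopic, and the four-keyword rule is inferred from the table rather than stated in the commit.

```python
def parse_row(row: str) -> dict:
    """Split one markdown table row into its four cells."""
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    topic_id, keywords, frequency, label = cells
    return {
        "topic_id": int(topic_id),
        "keywords": keywords.split(" - "),
        "frequency": int(frequency),
        "label": label,
    }


def expected_label(topic_id: int, keywords: list[str], n: int = 4) -> str:
    """Rebuild a label as '<id>_<kw1>_..._<kwn>' (pattern inferred from the table)."""
    return "_".join([str(topic_id), *keywords[:n]])


# Rows copied verbatim from the README diff above.
ROWS = [
    "| -1 | the - and - to - of - in | 15 | -1_the_and_to_of |",
    "| 0 | recipe - food - recipes - pizza - salad | 3814 | 0_recipe_food_recipes_pizza |",
    "| 112 | ctv - cbc - _x000d_ - news - bridge | 23 | 112_ctv_cbc__x000d__news |",
]

parsed = [parse_row(row) for row in ROWS]
mismatches = [
    p["label"] for p in parsed
    if expected_label(p["topic_id"], p["keywords"]) != p["label"]
]
print(mismatches)
```

Topic `-1` is the outlier bucket in BERTopic, so its row follows the same label pattern but does not describe a coherent topic.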
config.json ADDED
@@ -0,0 +1,16 @@
+ {
+   "calculate_probabilities": false,
+   "language": null,
+   "low_memory": false,
+   "min_topic_size": 10,
+   "n_gram_range": [
+     1,
+     1
+   ],
+   "nr_topics": null,
+   "seed_topic_list": null,
+   "top_n_words": 10,
+   "verbose": true,
+   "zeroshot_min_similarity": 0.7,
+   "zeroshot_topic_list": null
+ }
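
Note on `config.json`: it carries the same values as the README's "Training hyperparameters" list, but in JSON form, so `None` becomes `null` and the `n_gram_range` tuple becomes a two-element array. A minimal sketch of reading it back into Python values follows. `CONFIG_JSON` is copied from the diff above, and `load_hyperparameters` is a hypothetical helper for illustration, not a BERTopic function.

```python
import json

# Contents of config.json as added in this commit.
CONFIG_JSON = """
{
  "calculate_probabilities": false,
  "language": null,
  "low_memory": false,
  "min_topic_size": 10,
  "n_gram_range": [1, 1],
  "nr_topics": null,
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": true,
  "zeroshot_min_similarity": 0.7,
  "zeroshot_topic_list": null
}
"""


def load_hyperparameters(text: str) -> dict:
    """Decode the JSON and restore the tuple form shown in the README."""
    params = json.loads(text)
    # JSON arrays decode to lists; the README lists n_gram_range as (1, 1).
    params["n_gram_range"] = tuple(params["n_gram_range"])
    return params


params = load_hyperparameters(CONFIG_JSON)
for name in sorted(params):
    print(f"{name}: {params[name]!r}")
```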
ctfidf.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d72ed6bb936325fc58a2c01b5b70314996f20c6bb085a9916c325ac0d31286b9
+ size 5162328
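
Note on the three lines above: they are a Git LFS pointer, not the tensor file itself. The pointer records the SHA-256 digest and byte size of the real `ctfidf.safetensors`. A minimal sketch of parsing such a pointer and checking downloaded bytes against it follows. `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and the check assumes the whole file fits in memory.

```python
import hashlib

# Pointer text copied from the diff above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:d72ed6bb936325fc58a2c01b5b70314996f20c6bb085a9916c325ac0d31286b9
size 5162328
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data: bytes, pointer: dict) -> bool:
    """Return True if the bytes have the size and digest the pointer records."""
    if len(data) != pointer["size"]:
        return False
    return hashlib.new(pointer["algo"], data).hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
```

To verify a real download, pass the file's bytes (for example from `pathlib.Path(...).read_bytes()`) to `matches_pointer`.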
ctfidf_config.json ADDED
The diff for this file is too large to render. See raw diff
 
topic_embeddings.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9e196af4ebebcaf6d7abdf759b9a763cceaebf453011141f8a56a5e59073a9f0
+ size 262744
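
Note on the size of `topic_embeddings.safetensors`: the figure can be sanity-checked against the README's topic count. The commit does not name the embedding model or dtype, so the sketch below assumes one 384-dimensional float32 vector per topic. Both values are assumptions, not facts from the diff. A safetensors file starts with an 8-byte header length and a JSON header, followed by the raw tensor bytes, so a small remainder after the payload is expected.

```python
NUM_TOPICS = 171          # "Number of topics" in the README
FILE_SIZE = 262_744       # "size" field of the LFS pointer above
ASSUMED_DIM = 384         # assumption: embedding width, not stated in the commit
BYTES_PER_FLOAT32 = 4     # assumption: embeddings stored as float32


def payload_bytes(rows: int, dim: int, itemsize: int) -> int:
    """Raw size of a dense rows x dim matrix."""
    return rows * dim * itemsize


payload = payload_bytes(NUM_TOPICS, ASSUMED_DIM, BYTES_PER_FLOAT32)
header = FILE_SIZE - payload  # bytes left for the length prefix and JSON header

print(payload, header)
```

Under these assumptions the payload is 262,656 bytes, leaving 88 bytes for the header, which is plausible for a file holding a single tensor. This is consistent with the assumed shape but does not prove it.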
topics.json ADDED
The diff for this file is too large to render. See raw diff