metadata
tags:
  - bertopic
library_name: bertopic
pipeline_tag: text-classification

topic_docs5000

This is a BERTopic model. BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.

Usage

To use this model, please install BERTopic:

pip install -U bertopic

You can use the model as follows:

from bertopic import BERTopic

# Download the model from the Hugging Face Hub and load it
topic_model = BERTopic.load("Kamaljp/topic_docs5000")

# Summary table with one row per topic
topic_model.get_topic_info()
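
Once the model is loaded, individual topics can be inspected by ID. The sketch below wraps BERTopic's `get_topic`, which returns a list of `(word, score)` pairs for a known topic and `False` for an unknown one. `top_words` is a hypothetical helper name, not part of BERTopic, and the commented-out lines assume network access to the Hugging Face Hub.

```python
def top_words(topic_model, topic_id, n=5):
    """Return up to n keywords for one topic, dropping the c-TF-IDF scores."""
    pairs = topic_model.get_topic(topic_id)
    # get_topic returns False when the topic ID does not exist.
    if not pairs:
        return []
    return [word for word, _score in pairs[:n]]


# Example (requires bertopic and a download from the Hub):
# from bertopic import BERTopic
# topic_model = BERTopic.load("Kamaljp/topic_docs5000")
# print(top_words(topic_model, 19))
```

The helper only relies on `get_topic`, so it works with any object that exposes that method.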

Topic overview

  • Number of topics: 30 (29 topics plus the outlier topic -1)
  • Number of training documents: 5000
| Topic ID | Topic Keywords | Topic Frequency | Label |
|----------|----------------|-----------------|-------|
| -1 | the - to - of - and - is | 12 | -1_the_to_of_and |
| 0 | the - in - to - he - game | 1606 | 0_the_in_to_he |
| 1 | the - drive - to - with - for | 450 | 1_the_drive_to_with |
| 2 | the - to - that - of - and | 344 | 2_the_to_that_of |
| 3 | the - of - and - in - to | 246 | 3_the_of_and_in |
| 4 | of - to - the - is - and | 220 | 4_of_to_the_is |
| 5 | the - car - and - it - for | 203 | 5_the_car_and_it |
| 6 | the - of - that - to - is | 186 | 6_the_of_that_to |
| 7 | call - three - bittrolff - uhhhh - test | 172 | 7_call_three_bittrolff_uhhhh |
| 8 | the - to - be - of - key | 172 | 8_the_to_be_of |
| 9 | the - space - of - and - to | 169 | 9_the_space_of_and |
| 10 | the - openwindows - to - window - and | 169 | 10_the_openwindows_to_window |
| 11 | for - and - 100 - to - the | 146 | 11_for_and_100_to |
| 12 | windows - dos - the - and - to | 132 | 12_windows_dos_the_and |
| 13 | the - bike - to - my - was | 105 | 13_the_bike_to_my |
| 14 | you - that - to - of - your | 100 | 14_you_that_to_of |
| 15 | for - and - to - mail - send | 100 | 15_for_and_to_mail |
| 16 | to - that - homosexual - of - is | 94 | 16_to_that_homosexual_of |
| 17 | is - that - objective - of - science | 66 | 17_is_that_objective_of |
| 18 | printer - fonts - deskjet - hp - the | 56 | 18_printer_fonts_deskjet_hp |
| 19 | jpeg - image - gif - file - format | 45 | 19_jpeg_image_gif_file |
| 20 | points - graeme - polygon - the - lines | 44 | 20_points_graeme_polygon_the |
| 21 | radar - detector - detectors - is - the | 28 | 21_radar_detector_detectors_is |
| 22 | hotel - dj - for - ticket - price | 27 | 22_hotel_dj_for_ticket |
| 23 | insurance - health - private - the - and | 26 | 23_insurance_health_private_the |
| 24 | water - battery - temperature - the - discharge | 21 | 24_water_battery_temperature_the |
| 25 | oil - paint - it - wax - and | 17 | 25_oil_paint_it_wax |
| 26 | drugs - cocaine - lsd - drug - license | 16 | 26_drugs_cocaine_lsd_drug |
| 27 | motif - toolkit - cosecomplient - api - mean | 15 | 27_motif_toolkit_cosecomplient_api |
| 28 | maxaxaxaxaxaxaxaxaxaxaxaxaxaxax - entry - entries - rules - we | 13 | 28_maxaxaxaxaxaxaxaxaxaxaxaxaxaxax_entry_entries_rules |
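
Each label in the topic overview appears to be the topic ID followed by the first four keywords, joined with underscores, and the frequencies add up to the 5000 training documents. The sketch below reproduces both observations for a few rows copied from the overview. `make_label` and `share` are hypothetical helper names, not BERTopic functions.

```python
# A few rows copied from the topic overview: (topic_id, keywords, frequency).
ROWS = [
    (-1, ["the", "to", "of", "and", "is"], 12),
    (0, ["the", "in", "to", "he", "game"], 1606),
    (19, ["jpeg", "image", "gif", "file", "format"], 45),
]
TOTAL_DOCS = 5000  # "Number of training documents" from the overview


def make_label(topic_id, keywords, n_words=4):
    """Join the topic ID and the first n_words keywords with underscores."""
    return "_".join([str(topic_id), *keywords[:n_words]])


def share(frequency, total=TOTAL_DOCS):
    """Fraction of the training documents assigned to a topic."""
    return frequency / total


for topic_id, keywords, freq in ROWS:
    print(f"{make_label(topic_id, keywords):<24} {share(freq):.2%}")
```

By this measure topic 0 alone covers about a third of the corpus, while the outlier topic -1 holds only 12 documents.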

Training hyperparameters

  • calculate_probabilities: True
  • language: english
  • low_memory: False
  • min_topic_size: 10
  • n_gram_range: (1, 1)
  • nr_topics: 30
  • seed_topic_list: None
  • top_n_words: 10
  • verbose: True
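
As a rough sketch, the hyperparameters above map directly onto keyword arguments of the `BERTopic` constructor. `HYPERPARAMS` and `build_model` are illustrative names. The card does not state which embedding model or training documents were used, so the fitting step is left commented out with a placeholder `docs` variable.

```python
# Hyperparameters copied from the list above.
HYPERPARAMS = {
    "calculate_probabilities": True,
    "language": "english",
    "low_memory": False,
    "min_topic_size": 10,
    "n_gram_range": (1, 1),
    "nr_topics": 30,
    "seed_topic_list": None,
    "top_n_words": 10,
    "verbose": True,
}


def build_model():
    """Create an unfitted BERTopic instance with the settings above."""
    # Imported lazily so the settings can be inspected without bertopic installed.
    from bertopic import BERTopic

    return BERTopic(**HYPERPARAMS)


# Fitting needs the original documents, which this card does not provide:
# topic_model = build_model()
# topics, probs = topic_model.fit_transform(docs)
```

Note that `top_n_words` is 10, while the topic overview lists five keywords per topic, so the overview presumably shows only the first half of each topic's representation.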

Framework versions

  • Numpy: 1.22.4
  • HDBSCAN: 0.8.29
  • UMAP: 0.5.3
  • Pandas: 1.5.3
  • Scikit-Learn: 1.2.2
  • Sentence-transformers: 2.2.2
  • Transformers: 4.30.2
  • Numba: 0.56.4
  • Plotly: 5.13.1
  • Python: 3.10.12
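
When a saved model misbehaves after loading, comparing the local environment with the versions above is a reasonable first check. The sketch below uses the standard-library `importlib.metadata`. The mapping from the names above to PyPI distribution names (for example UMAP to `umap-learn`) is an assumption, and `installed_version` and `version_mismatches` are hypothetical helpers.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed above, keyed by PyPI distribution name (assumed mapping).
PINNED = {
    "numpy": "1.22.4",
    "hdbscan": "0.8.29",
    "umap-learn": "0.5.3",
    "pandas": "1.5.3",
    "scikit-learn": "1.2.2",
    "sentence-transformers": "2.2.2",
    "transformers": "4.30.2",
    "numba": "0.56.4",
    "plotly": "5.13.1",
}


def installed_version(dist_name):
    """Return the installed version string, or None if not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def version_mismatches(pinned=PINNED):
    """Map each distribution whose installed version differs to (wanted, found)."""
    return {
        name: (wanted, installed_version(name))
        for name, wanted in pinned.items()
        if installed_version(name) != wanted
    }


for name, (wanted, found) in version_mismatches().items():
    print(f"{name}: card lists {wanted}, found {found or 'not installed'}")
```

A mismatch does not necessarily mean the model will fail to load; it only flags where the environment differs from the one used for training. The Python version is not covered by this check.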