BERTopic_ArXiv / README.md
MaartenGr's picture
Update README.md
43dd6fb
metadata
tags:
  - bertopic
library_name: bertopic
pipeline_tag: text-classification

BERTopic_ArXiv

This is a BERTopic model. BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.

This pre-trained model demonstrates the use of several representation models that can be used within BERTopic. This model was trained on ~30000 ArXiv abstracts with the following topic representation methods (bertopic.representation):

  • POS
  • KeyBERTInspired
  • MaximalMarginalRelevance
  • KeyBERT + MaximalMarginalRelevance
  • ChatGPT labels
  • ChatGPT summaries

An example of the default c-TF-IDF representations:

"multiaspect.png"

An example of labels generated by ChatGPT (gpt-3.5-turbo):

"multiaspect.png"

To generate these images, you can follow along with this tutorial: Open In Colab

Usage

To use this model, please install BERTopic:

pip install -U bertopic
pip install -U safetensors

You can use the model as follows:

from bertopic import BERTopic
topic_model = BERTopic.load("MaartenGr/BERTopic_ArXiv")

topic_model.get_topic_info()

To view all different topic representations (keywords, labels, summary, etc.) you can run the following:

>>> topic_model.get_topic(0, full=True)
{'Main': [['dialogue', 0.02704485163341523],
  ['dialog', 0.01677038224466311],
  ['response', 0.011692640237477233],
  ['responses', 0.01002788412923778],
  ['intent', 0.00990720856306287],
  ['oriented', 0.009217253131615378],
  ['slot', 0.009177118721490055],
  ['conversational', 0.009129311385144046],
  ['systems', 0.009101146153425574],
  ['conversation', 0.008845392252307181]],
 'POS': [['dialogue', 0.02704485163341523],
  ['dialog', 0.01677038224466311],
  ['response', 0.011692640237477233],
  ['responses', 0.01002788412923778],
  ['intent', 0.00990720856306287],
  ['slot', 0.009177118721490055],
  ['conversational', 0.009129311385144046],
  ['systems', 0.009101146153425574],
  ['conversation', 0.008845392252307181],
  ['user', 0.008753551043296965]],
 'KeyBERTInspired': [['task oriented dialogue', 0.6559894680976868],
  ['dialogue systems', 0.6249060034751892],
  ['oriented dialogue', 0.5788208246231079],
  ['dialog systems', 0.530449628829956],
  ['dialogue state', 0.5167528390884399],
  ['response generation', 0.5143576860427856],
  ['spoken language understanding', 0.46739083528518677],
  ['oriented dialog', 0.4600704610347748],
  ['dialog', 0.4534587264060974],
  ['dialogues', 0.44082391262054443]],
 'MMR': [['dialogue', 0.02704485163341523],
  ['dialog', 0.01677038224466311],
  ['response', 0.011692640237477233],
  ['responses', 0.01002788412923778],
  ['intent', 0.00990720856306287],
  ['oriented', 0.009217253131615378],
  ['slot', 0.009177118721490055],
  ['conversational', 0.009129311385144046],
  ['systems', 0.009101146153425574],
  ['conversation', 0.008845392252307181]],
 'KeyBERT + MMR': [['task oriented dialogue', 0.6559894680976868],
  ['dialogue systems', 0.6249060034751892],
  ['oriented dialogue', 0.5788208246231079],
  ['dialog systems', 0.530449628829956],
  ['dialogue state', 0.5167528390884399],
  ['response generation', 0.5143576860427856],
  ['spoken language understanding', 0.46739083528518677],
  ['oriented dialog', 0.4600704610347748],
  ['dialog', 0.4534587264060974],
  ['dialogues', 0.44082391262054443]],
 'OpenAI_Label': [['Challenges and Approaches in Developing Task-oriented Dialogue Systems',
   1]],
 'OpenAI_Summary': [['Task-oriented dialogue systems and their components, such as dialogue policy, natural language understanding, dialogue state tracking, response generation, and end-to-end training using neural networks. These components are crucial in assisting users to complete various activities such as booking tickets and restaurant reservations through spoken language understanding dialogue. The challenge lies in tracking dialogue states of multiple domains and obtaining annotations for training. Effective SLU is achieved by utilizing context from the prior dialogue history.',
   1]]}

Topic overview

  • Number of topics: 107
  • Number of training documents: 33189
Click here for an overview of all topics.
Topic ID Topic Keywords Topic Frequency Label
-1 language - models - model - data - based 20 -1_language_models_model_data
0 dialogue - dialog - response - responses - intent 14247 0_dialogue_dialog_response_responses
1 speech - asr - speech recognition - recognition - end 1833 1_speech_asr_speech recognition_recognition
2 tuning - tasks - prompt - models - language 1369 2_tuning_tasks_prompt_models
3 summarization - summaries - summary - abstractive - document 1109 3_summarization_summaries_summary_abstractive
4 question - answer - qa - answering - question answering 893 4_question_answer_qa_answering
5 sentiment - sentiment analysis - aspect - analysis - opinion 837 5_sentiment_sentiment analysis_aspect_analysis
6 clinical - medical - biomedical - notes - patient 691 6_clinical_medical_biomedical_notes
7 translation - nmt - machine translation - neural machine - neural machine translation 586 7_translation_nmt_machine translation_neural machine
8 generation - text generation - text - language generation - nlg 558 8_generation_text generation_text_language generation
9 hate - hate speech - offensive - speech - detection 484 9_hate_hate speech_offensive_speech
10 news - fake - fake news - stance - fact 455 10_news_fake_fake news_stance
11 relation - relation extraction - extraction - relations - entity 450 11_relation_relation extraction_extraction_relations
12 ner - named - named entity - entity - named entity recognition 376 12_ner_named_named entity_entity
13 parsing - parser - dependency - treebank - parsers 370 13_parsing_parser_dependency_treebank
14 event - temporal - events - event extraction - extraction 314 14_event_temporal_events_event extraction
15 emotion - emotions - multimodal - emotion recognition - emotional 300 15_emotion_emotions_multimodal_emotion recognition
16 word - embeddings - word embeddings - embedding - words 292 16_word_embeddings_word embeddings_embedding
17 explanations - explanation - rationales - rationale - interpretability 212 17_explanations_explanation_rationales_rationale
18 morphological - arabic - morphology - languages - inflection 204 18_morphological_arabic_morphology_languages
19 topic - topics - topic models - lda - topic modeling 200 19_topic_topics_topic models_lda
20 bias - gender - biases - gender bias - debiasing 195 20_bias_gender_biases_gender bias
21 law - frequency - zipf - words - length 185 21_law_frequency_zipf_words
22 legal - court - law - legal domain - case 182 22_legal_court_law_legal domain
23 adversarial - attacks - attack - adversarial examples - robustness 181 23_adversarial_attacks_attack_adversarial examples
24 commonsense - commonsense knowledge - reasoning - knowledge - commonsense reasoning 180 24_commonsense_commonsense knowledge_reasoning_knowledge
25 quantum - semantics - calculus - compositional - meaning 171 25_quantum_semantics_calculus_compositional
26 correction - error - error correction - grammatical - grammatical error 161 26_correction_error_error correction_grammatical
27 argument - arguments - argumentation - argumentative - mining 160 27_argument_arguments_argumentation_argumentative
28 sarcasm - humor - sarcastic - detection - humorous 157 28_sarcasm_humor_sarcastic_detection
29 coreference - resolution - coreference resolution - mentions - mention 156 29_coreference_resolution_coreference resolution_mentions
30 sense - word sense - wsd - word - disambiguation 153 30_sense_word sense_wsd_word
31 knowledge - knowledge graph - graph - link prediction - entities 149 31_knowledge_knowledge graph_graph_link prediction
32 parsing - semantic parsing - amr - semantic - parser 146 32_parsing_semantic parsing_amr_semantic
33 cross lingual - lingual - cross - transfer - languages 146 33_cross lingual_lingual_cross_transfer
34 mt - translation - qe - quality - machine translation 139 34_mt_translation_qe_quality
35 sql - text sql - queries - spider - schema 138 35_sql_text sql_queries_spider
36 classification - text classification - label - text - labels 136 36_classification_text classification_label_text
37 style - style transfer - transfer - text style - text style transfer 136 37_style_style transfer_transfer_text style
38 question - question generation - questions - answer - generation 129 38_question_question generation_questions_answer
39 authorship - authorship attribution - attribution - author - authors 127 39_authorship_authorship attribution_attribution_author
40 sentence - sentence embeddings - similarity - sts - sentence embedding 123 40_sentence_sentence embeddings_similarity_sts
41 code - identification - switching - cs - code switching 121 41_code_identification_switching_cs
42 story - stories - story generation - generation - storytelling 118 42_story_stories_story generation_generation
43 discourse - discourse relation - discourse relations - rst - discourse parsing 117 43_discourse_discourse relation_discourse relations_rst
44 code - programming - source code - code generation - programming languages 117 44_code_programming_source code_code generation
45 paraphrase - paraphrases - paraphrase generation - paraphrasing - generation 114 45_paraphrase_paraphrases_paraphrase generation_paraphrasing
46 agent - games - environment - instructions - agents 111 46_agent_games_environment_instructions
47 covid - covid 19 - 19 - tweets - pandemic 108 47_covid_covid 19_19_tweets
48 linking - entity linking - entity - el - entities 107 48_linking_entity linking_entity_el
49 poetry - poems - lyrics - poem - music 103 49_poetry_poems_lyrics_poem
50 image - captioning - captions - visual - caption 100 50_image_captioning_captions_visual
51 nli - entailment - inference - natural language inference - language inference 96 51_nli_entailment_inference_natural language inference
52 keyphrase - keyphrases - extraction - document - phrases 95 52_keyphrase_keyphrases_extraction_document
53 simplification - text simplification - ts - sentence - simplified 95 53_simplification_text simplification_ts_sentence
54 empathetic - emotion - emotional - empathy - emotions 95 54_empathetic_emotion_emotional_empathy
55 depression - mental - health - mental health - social media 93 55_depression_mental_health_mental health
56 segmentation - word segmentation - chinese - chinese word segmentation - chinese word 93 56_segmentation_word segmentation_chinese_chinese word segmentation
57 citation - scientific - papers - citations - scholarly 85 57_citation_scientific_papers_citations
58 agreement - syntactic - verb - grammatical - subject verb 85 58_agreement_syntactic_verb_grammatical
59 metaphor - literal - figurative - metaphors - idiomatic 83 59_metaphor_literal_figurative_metaphors
60 srl - semantic role - role labeling - semantic role labeling - role 82 60_srl_semantic role_role labeling_semantic role labeling
61 privacy - private - federated - privacy preserving - federated learning 82 61_privacy_private_federated_privacy preserving
62 change - semantic change - time - semantic - lexical semantic 82 62_change_semantic change_time_semantic
63 bilingual - lingual - cross lingual - cross - embeddings 80 63_bilingual_lingual_cross lingual_cross
64 political - media - news - bias - articles 77 64_political_media_news_bias
65 medical - qa - question - questions - clinical 75 65_medical_qa_question_questions
66 math - mathematical - math word - word problems - problems 73 66_math_mathematical_math word_word problems
67 financial - stock - market - price - news 69 67_financial_stock_market_price
68 table - tables - tabular - reasoning - qa 69 68_table_tables_tabular_reasoning
69 readability - complexity - assessment - features - reading 65 69_readability_complexity_assessment_features
70 layout - document - documents - document understanding - extraction 64 70_layout_document_documents_document understanding
71 brain - cognitive - reading - syntactic - language 62 71_brain_cognitive_reading_syntactic
72 sign - gloss - language - signed - language translation 61 72_sign_gloss_language_signed
73 vqa - visual - visual question - visual question answering - question 59 73_vqa_visual_visual question_visual question answering
74 biased - biases - spurious - nlp - debiasing 57 74_biased_biases_spurious_nlp
75 visual - dialogue - multimodal - image - dialog 55 75_visual_dialogue_multimodal_image
76 translation - machine translation - machine - smt - statistical 54 76_translation_machine translation_machine_smt
77 multimodal - visual - image - translation - machine translation 52 77_multimodal_visual_image_translation
78 geographic - location - geolocation - geo - locations 51 78_geographic_location_geolocation_geo
79 reasoning - prompting - llms - chain thought - chain 48 79_reasoning_prompting_llms_chain thought
80 essay - scoring - aes - essay scoring - essays 45 80_essay_scoring_aes_essay scoring
81 crisis - disaster - traffic - tweets - disasters 45 81_crisis_disaster_traffic_tweets
82 graph - text classification - text - gcn - classification 44 82_graph_text classification_text_gcn
83 annotation - tools - linguistic - resources - xml 43 83_annotation_tools_linguistic_resources
84 entity alignment - alignment - kgs - entity - ea 43 84_entity alignment_alignment_kgs_entity
85 personality - traits - personality traits - evaluative - text 42 85_personality_traits_personality traits_evaluative
86 ad - alzheimer - alzheimer disease - disease - speech 40 86_ad_alzheimer_alzheimer disease_disease
87 taxonomy - hypernymy - taxonomies - hypernym - hypernyms 39 87_taxonomy_hypernymy_taxonomies_hypernym
88 active learning - active - al - learning - uncertainty 37 88_active learning_active_al_learning
89 reviews - summaries - summarization - review - opinion 36 89_reviews_summaries_summarization_review
90 emoji - emojis - sentiment - message - anonymous 35 90_emoji_emojis_sentiment_message
91 table - table text - tables - table text generation - text generation 35 91_table_table text_tables_table text generation
92 domain - domain adaptation - adaptation - domains - source 35 92_domain_domain adaptation_adaptation_domains
93 alignment - word alignment - parallel - pairs - alignments 34 93_alignment_word alignment_parallel_pairs
94 indo - languages - indo european - names - family 34 94_indo_languages_indo european_names
95 patent - claim - claim generation - chemical - technical 32 95_patent_claim_claim generation_chemical
96 agents - emergent - communication - referential - games 32 96_agents_emergent_communication_referential
97 graph - amr - graph text - graphs - text generation 31 97_graph_amr_graph text_graphs
98 moral - ethical - norms - values - social 29 98_moral_ethical_norms_values
99 acronym - acronyms - abbreviations - abbreviation - disambiguation 27 99_acronym_acronyms_abbreviations_abbreviation
100 typing - entity typing - entity - type - types 27 100_typing_entity typing_entity_type
101 coherence - discourse - discourse coherence - coherence modeling - text 26 101_coherence_discourse_discourse coherence_coherence modeling
102 pos - taggers - tagging - tagger - pos tagging 25 102_pos_taggers_tagging_tagger
103 drug - social - social media - media - health 25 103_drug_social_social media_media
104 gender - translation - bias - gender bias - mt 24 104_gender_translation_bias_gender bias
105 job - resume - skills - skill - soft 21 105_job_resume_skills_skill

Training Procedure

The model was trained as follows:

from cuml.manifold import UMAP
from cuml.cluster import HDBSCAN
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import PartOfSpeech, KeyBERTInspired, MaximalMarginalRelevance, OpenAI

# Prepare sub-models
embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
umap_model = UMAP(n_components=5, n_neighbors=50, random_state=42, metric="cosine", verbose=True)
hdbscan_model = HDBSCAN(min_samples=20, gen_min_span_tree=True, prediction_data=False, min_cluster_size=20, verbose=True)
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 3), min_df=5)

# Summarization with ChatGPT
summarization_prompt = """
I have a topic that is described by the following keywords: [KEYWORDS]
In this topic, the following documents are a small but representative subset of all documents in the topic:
[DOCUMENTS]

Based on the information above, please give a description of this topic in the following format:
topic: <description>
"""
summarization_model = OpenAI(model="gpt-3.5-turbo", chat=True, prompt=summarization_prompt, nr_docs=5, exponential_backoff=True, diversity=0.1)

# Representation models
representation_models = {
    "POS": PartOfSpeech("en_core_web_lg"),
    "KeyBERTInspired": KeyBERTInspired(),
    "MMR": MaximalMarginalRelevance(diversity=0.3),
    "KeyBERT + MMR": [KeyBERTInspired(), MaximalMarginalRelevance(diversity=0.3)],
    "OpenAI_Label": OpenAI(model="gpt-3.5-turbo", exponential_backoff=True, chat=True, diversity=0.1),
    "OpenAI_Summary": [KeyBERTInspired(), summarization_model],
}

# Fit BERTopic
topic_model= BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        vectorizer_model=vectorizer_model,
        representation_model=representation_models,
        verbose=True
).fit(docs)

Training hyperparameters

  • calculate_probabilities: False
  • language: None
  • low_memory: False
  • min_topic_size: 10
  • n_gram_range: (1, 1)
  • nr_topics: None
  • seed_topic_list: None
  • top_n_words: 10
  • verbose: True

Framework versions

  • Numpy: 1.22.4
  • HDBSCAN: 0.8.29
  • UMAP: 0.5.3
  • Pandas: 1.5.3
  • Scikit-Learn: 1.2.2
  • Sentence-transformers: 2.2.2
  • Transformers: 4.29.2
  • Numba: 0.56.4
  • Plotly: 5.13.1
  • Python: 3.10.11