Commit 55f0ce3 by seanpedrickcase
1 parent: 04a15c5

Can split passages into sentences. Improved embedding and LLM representation models; improved zero-shot capabilities.
Files changed:
- .dockerignore +24 -0
- .gitignore +2 -0
- README.md +2 -2
- app.py +51 -26
- funcs/anonymiser.py +1 -1
- funcs/bertopic_vis_documents.py +13 -5
- funcs/clean_funcs.py +12 -4
- funcs/embeddings.py +32 -9
- funcs/helper_functions.py +161 -72
- funcs/representation_model.py +39 -30
- funcs/topic_core_funcs.py +316 -134
- requirements.txt +4 -3
- requirements_gpu.txt +2 -2
.dockerignore
ADDED
@@ -0,0 +1,24 @@
+*.pyc
+*.ipynb
+*.zip
+*.npz
+*.csv
+*.xlsx
+*.xls
+*.pkl
+*.parquet
+*.png
+*.safetensors
+*.json
+*.html
+*.log
+*.spec
+*.bin
+.ipynb_checkpoints/*
+old_code/*
+model/*
+output_model/*
+data/*
+build_deps/*
+dist/*
+build/*
.gitignore
CHANGED
@@ -1,5 +1,6 @@
 *.pyc
 *.ipynb
+*.zip
 *.npz
 *.csv
 *.xlsx
@@ -12,6 +13,7 @@
 *.html
 *.log
 *.spec
+*.bin
 .ipynb_checkpoints/*
 old_code/*
 model/*
README.md
CHANGED
@@ -14,8 +14,8 @@ license: apache-2.0
 
 Generate topics from open text in tabular data, based on [BERTopic](https://maartengr.github.io/BERTopic/). Upload a data file (csv, xlsx, or parquet), then specify the open text column that you want to use to generate topics. Click 'Extract topics' after you have selected the minimum similar documents per topic and maximum total topics. Duplicate this space, or clone to your computer to avoid queues here!
 
-Uses fast TF-IDF-based embeddings by default, which are fast but not …
+Uses fast TF-IDF-based embeddings by default, which are fast but do not lead to high-quality clustering. Change to the higher-quality [mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) model embeddings (512 dimensions) for better results, but slower processing time. If you have an embeddings .npz file previously made using this model, you can load it in at the same time to skip the first modelling step. If you have a pre-defined list of topics for zero-shot modelling, you can upload this as a csv file under 'I have my own list of topics...'. Further configuration options are available, such as maximum topics allowed, minimum documents per topic, etc. Topic representation with LLMs is currently based on [Phi-3-mini-128k-instruct-GGUF](https://huggingface.co/QuantFactory/Phi-3-mini-128k-instruct-GGUF), which is quite slow on CPU, so use a GPU-enabled computer if possible, building from the requirements_gpu.txt file in the base folder.
 
 For small datasets, consider breaking up your text into sentences under 'Clean data' -> 'Split open text...' before topic modelling.
 
-I suggest [Wikipedia mini dataset](https://huggingface.co/datasets/rag-datasets/mini_wikipedia/tree/main/data) for testing the tool here, choose passages.parquet.
+I suggest the [Wikipedia mini dataset](https://huggingface.co/datasets/rag-datasets/mini_wikipedia/tree/main/data) for testing the tool here; choose the passages.parquet file for download.
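The candidate-topic file described above is just a one-column csv; a minimal sketch of producing one (the file name and topic labels here are invented for illustration):

```python
# Build a candidate-topics csv for zero-shot modelling (illustrative values).
import pandas as pd

# One column with a header; topic keywords go in the cells below it.
pd.DataFrame({"topics": ["economy", "health", "education"]}).to_csv(
    "candidate_topics.csv", index=False)
```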
app.py
CHANGED
@@ -6,10 +6,14 @@ import gradio as gr
 import pandas as pd
 import numpy as np
 
-from funcs.topic_core_funcs import pre_clean, extract_topics, reduce_outliers, represent_topics, visualise_topics, save_as_pytorch_model
-from funcs.helper_functions import initial_file_load, custom_regex_load
+from funcs.topic_core_funcs import pre_clean, optimise_zero_shot, extract_topics, reduce_outliers, represent_topics, visualise_topics, save_as_pytorch_model, change_default_vis_col
+from funcs.helper_functions import initial_file_load, custom_regex_load, ensure_output_folder_exists, output_folder, get_connection_params
 from sklearn.feature_extraction.text import CountVectorizer
 
+min_word_occurence_slider_default = 0.01
+max_word_occurence_slider_default = 0.95
+
+ensure_output_folder_exists()
 
 # Gradio app
 
@@ -17,6 +21,7 @@ block = gr.Blocks(theme = gr.themes.Base())
 
 with block:
 
+    original_data_state = gr.State(pd.DataFrame())
     data_state = gr.State(pd.DataFrame())
     embeddings_state = gr.State(np.array([]))
     embeddings_type_state = gr.State("")
@@ -26,18 +31,20 @@
    docs_state = gr.State()
    data_file_name_no_ext_state = gr.State()
    label_list_state = gr.State(pd.DataFrame())
-    vectoriser_state = gr.State(CountVectorizer(stop_words="english", ngram_range=(1, 2), min_df=…
+    vectoriser_state = gr.State(CountVectorizer(stop_words="english", ngram_range=(1, 2), min_df=min_word_occurence_slider_default, max_df=max_word_occurence_slider_default))
+
+    session_hash_state = gr.State("")
 
    gr.Markdown(
    """
    # Topic modeller
    Generate topics from open text in tabular data, based on [BERTopic](https://maartengr.github.io/BERTopic/). Upload a data file (csv, xlsx, or parquet), then specify the open text column that you want to use to generate topics. Click 'Extract topics' after you have selected the minimum similar documents per topic and maximum total topics. Duplicate this space, or clone to your computer to avoid queues here!
 
-    Uses fast TF-IDF-based embeddings by default, which are fast but not …
+    Uses fast TF-IDF-based embeddings by default, which are fast but do not lead to high-quality clustering. Change to the higher-quality [mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) model embeddings (512 dimensions) for better results, but slower processing time. If you have an embeddings .npz file previously made using this model, you can load it in at the same time to skip the first modelling step. If you have a pre-defined list of topics for zero-shot modelling, you can upload this as a csv file under 'I have my own list of topics...'. Further configuration options are available, such as maximum topics allowed, minimum documents per topic, etc. Topic representation with LLMs is currently based on [Phi-3-mini-128k-instruct-GGUF](https://huggingface.co/QuantFactory/Phi-3-mini-128k-instruct-GGUF), which is quite slow on CPU, so use a GPU-enabled computer if possible, building from the requirements_gpu.txt file in the base folder.
 
    For small datasets, consider breaking up your text into sentences under 'Clean data' -> 'Split open text...' before topic modelling.
 
-    I suggest [Wikipedia mini dataset](https://huggingface.co/datasets/rag-datasets/mini_wikipedia/tree/main/data) for testing the tool here, choose passages.parquet.
+    I suggest the [Wikipedia mini dataset](https://huggingface.co/datasets/rag-datasets/mini_wikipedia/tree/main/data) for testing the tool here; choose the passages.parquet file for download.
    """)
 
    with gr.Tab("Load files and find topics"):
@@ -48,23 +55,34 @@
 
        with gr.Accordion("Clean data", open = False):
            with gr.Row():
-                clean_text = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="…
-                drop_duplicate_text = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Remove duplicate text, drop < 50 …
-                anonymise_drop = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Anonymise data on file load. Personal details are redacted - not 100% effective …
-                split_sentence_drop = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Split …
+                clean_text = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Remove html, > 1 digit nums, emails, postcodes (UK).")
+                drop_duplicate_text = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Remove duplicate text, drop < 50 character strings.")
+                anonymise_drop = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Anonymise data on file load. Personal details are redacted - not 100% effective and slow!")
+                split_sentence_drop = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Split text into sentences. Useful for small datasets.")
            with gr.Row():
-                custom_regex = gr.UploadButton(label="Import custom regex file", file_count="multiple")
-                gr.Markdown("""Import custom regex - csv table with one column of regex patterns with no header. Example pattern: (?i)roosevelt for case insensitive removal of this term.""")
+                custom_regex = gr.UploadButton(label="Import custom regex removal file", file_count="multiple")
+                gr.Markdown("""Import custom regex - csv table with one column of regex patterns with no header. Strings matching this pattern will be removed. Example pattern: (?i)roosevelt for case insensitive removal of this term.""")
                custom_regex_text = gr.Textbox(label="Custom regex load status")
            clean_btn = gr.Button("Clean data")
 
        with gr.Accordion("I have my own list of topics (zero shot topic modelling).", open = False):
            candidate_topics = gr.File(label="Input topics from file (csv). File should have at least one column with a header and topic keywords in cells below. Topics will be taken from the first column of the file. Currently not compatible with low-resource embeddings.")
+
+            with gr.Row():
+                zero_shot_similarity = gr.Slider(minimum = 0.2, maximum = 1, value = 0.55, step = 0.001, label = "Minimum similarity value for document to be assigned to zero-shot topic. You may need to set this very low to get documents assigned to your topics!", scale=2)
+                zero_shot_optimiser_btn = gr.Button("Optimise settings to keep only zero-shot topics", scale=1)
 
        with gr.Row():
-            …
-            …
+            with gr.Accordion("Topic modelling settings - change documents per topic, max topics, frequency of terms", open = False):
+
+                with gr.Row():
+                    min_docs_slider = gr.Slider(minimum = 2, maximum = 1000, value = 3, step = 1, label = "Minimum number of similar documents needed to make a topic.")
+                    max_topics_slider = gr.Slider(minimum = 2, maximum = 500, value = 100, step = 1, label = "Maximum number of topics")
+                with gr.Row():
+                    min_word_occurence_slider = gr.Slider(minimum = 0.001, maximum = 0.9, value = min_word_occurence_slider_default, step = 0.001, label = "Keep terms that appear in this minimum proportion of documents. Avoids creating topics with very uncommon words.")
+                    max_word_occurence_slider = gr.Slider(minimum = 0.1, maximum = 1.0, value = max_word_occurence_slider_default, step = 0.01, label = "Keep terms that appear in less than this maximum proportion of documents. Avoids very common words in topic names.")
+
+            quality_mode_drop = gr.Dropdown(label = "Use high-quality transformers-based embeddings (slower)", value="No", choices=["Yes", "No"])
 
        with gr.Row():
            topics_btn = gr.Button("Extract topics", variant="primary")
@@ -78,12 +96,12 @@
            representation_type = gr.Dropdown(label = "Method for generating new topic labels", value="Default", choices=["Default", "MMR", "KeyBERT", "LLM"])
            represent_llm_btn = gr.Button("Change topic labels")
        with gr.Row():
-            reduce_outliers_btn = gr.Button("Reduce outliers")
+            reduce_outliers_btn = gr.Button("Reduce outliers (will create new topic labels)")
            save_pytorch_btn = gr.Button("Save model in Pytorch format")
 
    with gr.Tab("Visualise"):
        with gr.Row():
-            visualisation_type_radio = gr.Radio(label="Visualisation type", choices=["Topic document graph", "Hierarchical view"])
+            visualisation_type_radio = gr.Radio(label="Visualisation type", choices=["Topic document graph", "Hierarchical view"], value="Topic document graph")
            in_label = gr.Dropdown(choices=["Choose a column"], multiselect = True, label="Select column for labelling documents in output visualisations.")
            sample_slide = gr.Slider(minimum = 0.01, maximum = 1, value = 0.1, step = 0.01, label = "Proportion of data points to show on output visualisations.")
            legend_label = gr.Textbox(label="Custom legend column (optional, any column from the topic details output)", visible=False)
@@ -98,36 +116,43 @@
    with gr.Tab("Options"):
        with gr.Accordion("Data load and processing options", open = True):
            with gr.Row():
-                seed_number = gr.Number(label="Random seed to use …
+                seed_number = gr.Number(label="Random seed to use in processing", minimum=0, step=1, value=42, precision=0)
                calc_probs = gr.Dropdown(label="Calculate all topic probabilities", value="No", choices=["Yes", "No"])
            with gr.Row():
-                …
-                embedding_super_compress = gr.Dropdown(label = "Round embeddings to three dp for smaller files with less accuracy.", value="No", choices=["Yes", "No"])
-            with gr.Row():
+                embedding_super_compress = gr.Dropdown(label = "Round embeddings to three dp: smaller files but lower quality.", value="No", choices=["Yes", "No"])
                return_intermediate_files = gr.Dropdown(label = "Return intermediate processing files from file preparation.", value="Yes", choices=["Yes", "No"])
                save_topic_model = gr.Dropdown(label = "Save topic model to BERTopic format pkl file.", value="No", choices=["Yes", "No"])
 
    # Load in data. Update column names dropdown when file uploaded
-    in_files.upload(fn=initial_file_load, inputs=[in_files], outputs=[in_colnames, in_label, data_state, output_single_text, topic_model_state, embeddings_state, data_file_name_no_ext_state, label_list_state])
+    in_files.upload(fn=initial_file_load, inputs=[in_files], outputs=[in_colnames, in_label, data_state, output_single_text, topic_model_state, embeddings_state, data_file_name_no_ext_state, label_list_state, original_data_state])
+
+    # When topic modelling column is chosen, change the default visualisation column to the same
+    in_colnames.change(fn=change_default_vis_col, inputs=[in_colnames], outputs=[in_label])
 
    # Clean data
    custom_regex.upload(fn=custom_regex_load, inputs=[custom_regex], outputs=[custom_regex_text, custom_regex_state])
-    clean_btn.click(fn=pre_clean, inputs=[data_state, in_colnames, data_file_name_no_ext_state, custom_regex_state, clean_text, drop_duplicate_text, anonymise_drop, split_sentence_drop], outputs=[output_single_text, output_file, data_state, data_file_name_no_ext_state], api_name="clean")
+    clean_btn.click(fn=pre_clean, inputs=[data_state, in_colnames, data_file_name_no_ext_state, custom_regex_state, clean_text, drop_duplicate_text, anonymise_drop, split_sentence_drop], outputs=[output_single_text, output_file, data_state, data_file_name_no_ext_state, embeddings_state], api_name="clean")
+
+    # Optimise for keeping only zero-shot topics
+    zero_shot_optimiser_btn.click(fn=optimise_zero_shot, outputs=[quality_mode_drop, min_docs_slider, max_topics_slider, min_word_occurence_slider, max_word_occurence_slider, zero_shot_similarity])
 
    # Extract topics
-    topics_btn.click(fn=extract_topics, inputs=[data_state, in_files, min_docs_slider, in_colnames, max_topics_slider, candidate_topics, data_file_name_no_ext_state, label_list_state, return_intermediate_files, embedding_super_compress, …
+    topics_btn.click(fn=extract_topics, inputs=[data_state, in_files, min_docs_slider, in_colnames, max_topics_slider, candidate_topics, data_file_name_no_ext_state, label_list_state, return_intermediate_files, embedding_super_compress, quality_mode_drop, save_topic_model, embeddings_state, embeddings_type_state, zero_shot_similarity, calc_probs, vectoriser_state, min_word_occurence_slider, max_word_occurence_slider, split_sentence_drop, seed_number], outputs=[output_single_text, output_file, embeddings_state, embeddings_type_state, data_file_name_no_ext_state, topic_model_state, docs_state, vectoriser_state, assigned_topics_state], api_name="topics")
 
    # Reduce outliers
-    reduce_outliers_btn.click(fn=reduce_outliers, inputs=[topic_model_state, docs_state, embeddings_state, data_file_name_no_ext_state, assigned_topics_state, vectoriser_state, save_topic_model], outputs=[output_single_text, output_file, topic_model_state], api_name="reduce_outliers")
+    reduce_outliers_btn.click(fn=reduce_outliers, inputs=[topic_model_state, docs_state, embeddings_state, data_file_name_no_ext_state, assigned_topics_state, vectoriser_state, save_topic_model, split_sentence_drop, data_state], outputs=[output_single_text, output_file, topic_model_state], api_name="reduce_outliers")
 
    # Re-represent topic labels
-    represent_llm_btn.click(fn=represent_topics, inputs=[topic_model_state, docs_state, data_file_name_no_ext_state, …
+    represent_llm_btn.click(fn=represent_topics, inputs=[topic_model_state, docs_state, data_file_name_no_ext_state, quality_mode_drop, save_topic_model, representation_type, vectoriser_state, split_sentence_drop, data_state], outputs=[output_single_text, output_file, topic_model_state], api_name="represent_llm")
 
    # Save in Pytorch format
    save_pytorch_btn.click(fn=save_as_pytorch_model, inputs=[topic_model_state, data_file_name_no_ext_state], outputs=[output_single_text, output_file], api_name="pytorch_save")
 
    # Visualise topics
-    plot_btn.click(fn=visualise_topics, inputs=[topic_model_state, data_state, data_file_name_no_ext_state, …
+    plot_btn.click(fn=visualise_topics, inputs=[topic_model_state, data_state, data_file_name_no_ext_state, quality_mode_drop, embeddings_state, in_label, in_colnames, legend_label, sample_slide, visualisation_type_radio, seed_number], outputs=[vis_output_single_text, out_plot_file, plot, plot_2], api_name="plot")
+
+    # Get session hash from connection parameters
+    block.load(get_connection_params, inputs=None, outputs=[session_hash_state])
 
 # Launch the Gradio app
 if __name__ == "__main__":
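The new zero-shot optimiser button illustrates the wiring pattern used throughout this file: a click handler with no inputs that pushes preset values into several components at once. A stripped-down sketch (component ranges and the returned preset values are invented here, not the Space's actual optimiser settings):

```python
import gradio as gr

def optimise_for_zero_shot():
    # One return value per output component, in the same order as `outputs`
    return "Yes", 2, 500, 0.001, 1.0, 0.2

with gr.Blocks() as demo:
    quality = gr.Dropdown(choices=["Yes", "No"], value="No", label="High-quality embeddings")
    min_docs = gr.Slider(2, 1000, value=3, label="Min docs per topic")
    max_topics = gr.Slider(2, 500, value=100, label="Max topics")
    min_df = gr.Slider(0.001, 0.9, value=0.01, label="Min term proportion")
    max_df = gr.Slider(0.1, 1.0, value=0.95, label="Max term proportion")
    similarity = gr.Slider(0.2, 1.0, value=0.55, label="Zero-shot similarity")
    gr.Button("Optimise for zero-shot").click(
        fn=optimise_for_zero_shot,
        outputs=[quality, min_docs, max_topics, min_df, max_df, similarity],
    )

# demo.launch()  # uncomment to run the demo locally
```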
funcs/anonymiser.py
CHANGED
@@ -46,7 +46,7 @@ from presidio_anonymizer.entities import OperatorConfig
 # Function to Split Text and Create DataFrame using SpaCy
 def expand_sentences_spacy(df, colname, nlp=nlp):
     expanded_data = []
-    df = df.reset_index(names='index')
+    df = df.drop('index', axis = 1, errors="ignore").reset_index(names='index')
     for index, row in df.iterrows():
         doc = nlp(row[colname])
         for sent in doc.sents:
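The one-line change above guards against a pandas quirk: reset_index(names='index') raises if an 'index' column already exists, which can happen when the sentence splitter runs more than once on the same frame. A minimal reproduction with toy data:

```python
import pandas as pd

df = pd.DataFrame({"index": [0, 1], "text": ["First passage.", "Second one."]})
# df.reset_index(names="index")  # ValueError: cannot insert index, already exists
df = df.drop("index", axis=1, errors="ignore").reset_index(names="index")  # safe on repeat runs
print(df.columns.tolist())  # ['index', 'text']
```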
funcs/bertopic_vis_documents.py
CHANGED
@@ -22,7 +22,8 @@ from tqdm import tqdm
 import itertools
 import numpy as np
 
+# Following adapted from the BERTopic original implementation (Maarten Grootendorst): https://github.com/MaartenGr/BERTopic/blob/master/bertopic/plotting/_documents.py
 
 def visualize_documents_custom(topic_model,
                                docs: List[str],
@@ -168,16 +169,23 @@ def visualize_documents_custom(topic_model,
     df["y"] = embeddings_2d[:, 1]
 
     # Prepare text and names
+    trace_name_char_length = 60
     if isinstance(custom_labels, str):
         names = [[[str(topic), None]] + topic_model.topic_aspects_[custom_labels][topic] for topic in unique_topics]
         names = ["_".join([label[0] for label in labels[:4]]) for labels in names]
         names = [label if len(label) < 30 else label[:27] + "..." for label in names]
     elif topic_model.custom_labels_ is not None and custom_labels:
-        print("Using custom labels: ", topic_model.custom_labels_)
-        names = [topic_model.custom_labels_[topic + topic_model._outliers] for topic in unique_topics]
+        # Limit label length to trace_name_char_length characters
+        names = [label[:trace_name_char_length] for label in (topic_model.custom_labels_[topic + topic_model._outliers] for topic in unique_topics)]
     else:
-        print("Not using custom labels")
-        names = [f"{topic} " + ", ".join([word for word, value in topic_model.get_topic(topic)][:3]) for topic in unique_topics]
+        # Limit label length to trace_name_char_length characters
+        names = [f"{topic} " + ", ".join([word for word, value in topic_model.get_topic(topic)][:3])[:trace_name_char_length] for topic in unique_topics]
 
     #print(names)
 
funcs/clean_funcs.py
CHANGED
@@ -23,19 +23,27 @@ def initial_clean(texts, custom_regex, progress=gr.Progress()):
     text = text.str.replace_all(email_pattern_regex, ' ')
     text = text.str.replace_all(nums_two_more_regex, ' ')
     text = text.str.replace_all(postcode_pattern_regex, ' ')
+    text = text.str.replace_all(multiple_spaces_regex, ' ')
+
+    text = text.to_list()
+
+    return text
+
+def regex_clean(texts, custom_regex, progress=gr.Progress()):
+    texts = pl.Series(texts).str.strip_chars()
 
     # Allow for custom regex patterns to be removed
     if len(custom_regex) > 0:
         for pattern in custom_regex:
             raw_string_pattern = r'{}'.format(pattern)
             print("Removing regex pattern: ", raw_string_pattern)
-            text = text.str.replace_all(raw_string_pattern, ' ')
+            texts = texts.str.replace_all(raw_string_pattern, ' ')
 
-    text = text.str.replace_all(multiple_spaces_regex, ' ')
+    texts = texts.str.replace_all(multiple_spaces_regex, ' ')
 
-    text = text.to_list()
+    texts = texts.to_list()
 
-    return text
+    return texts
 
 def remove_hyphens(text_text):
     return re.sub(r'(\w+)-(\w+)-?(\w)?', r'\1 \2 \3', text_text)
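A small usage sketch of the new regex_clean path (it assumes a module-level multiple_spaces_regex along the lines of r'\s{2,}'; the sample string is invented):

```python
import polars as pl

multiple_spaces_regex = r"\s{2,}"  # assumption: mirrors the module-level pattern

texts = pl.Series(["  Roosevelt  spoke twice.  "]).str.strip_chars()
for pattern in [r"(?i)roosevelt"]:  # patterns read from the custom regex csv
    texts = texts.str.replace_all(pattern, " ")
texts = texts.str.replace_all(multiple_spaces_regex, " ")
print(texts.to_list())  # [' spoke twice.'] - the matched term is blanked, whitespace collapsed
```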
funcs/embeddings.py
CHANGED
@@ -1,15 +1,41 @@
 import time
 import numpy as np
-from torch import cuda
+from torch import cuda, backends, version
 
+# Check for torch cuda
+# If you want to disable cuda for testing purposes
+#os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
+
+print("Is CUDA enabled? ", cuda.is_available())
+print("Is a CUDA device available on this computer?", backends.cudnn.enabled)
 if cuda.is_available():
     torch_device = "gpu"
+    print("Cuda version installed is: ", version.cuda)
+    high_quality_mode = "Yes"
+    #os.system("nvidia-smi")
 else:
     torch_device = "cpu"
+    high_quality_mode = "No"
+
+print("Device used is: ", torch_device)
 
-def make_or_load_embeddings(docs, file_list, embeddings_out, embedding_model, embeddings_super_compress, low_resource_mode_opt):
+def make_or_load_embeddings(docs: list, file_list: list, embeddings_out: np.ndarray, embedding_model, embeddings_super_compress: str, high_quality_mode_opt: str) -> np.ndarray:
+    """
+    Create or load embeddings for the given documents.
+
+    Args:
+        docs (list): List of documents to embed.
+        file_list (list): List of file names to check for existing embeddings.
+        embeddings_out (np.ndarray): Array to store the embeddings.
+        embedding_model: Model used to generate embeddings.
+        embeddings_super_compress (str): Option to super compress embeddings ("Yes" or "No").
+        high_quality_mode_opt (str): Option for high quality mode ("Yes" or "No").
+
+    Returns:
+        np.ndarray: The generated or loaded embeddings.
+    """
 
     # If no embeddings found, make or load in
     if embeddings_out.size == 0:
@@ -32,7 +58,7 @@ def make_or_load_embeddings(docs, file_list, embeddings_out, embedding_model, em…
 
     # Custom model
     # If on CPU, don't resort to embedding models
-    if low_resource_mode_opt == "Yes":
+    if high_quality_mode_opt == "No":
         print("Creating simplified 'sparse' embeddings based on TfIDF")
 
         # Fit the pipeline to the text data
@@ -41,13 +67,10 @@ def make_or_load_embeddings(docs, file_list, embeddings_out, embedding_model, em…
         # Transform text data to embeddings
         embeddings_out = embedding_model.transform(docs)
 
-    elif low_resource_mode_opt == "No":
+    elif high_quality_mode_opt == "Yes":
         print("Creating dense embeddings based on transformers model")
 
-        embeddings_out = embedding_model.encode(sentences=docs, show_progress_bar = True, batch_size = 32, precision="int8") # For large …
+        embeddings_out = embedding_model.encode(sentences=docs, show_progress_bar = True, batch_size = 32) #, precision="int8") # For large …
 
     toc = time.perf_counter()
     time_out = f"The embedding took {toc - tic:0.1f} seconds"
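For orientation, the two branches that high_quality_mode_opt now selects between look roughly like this (a sketch: the TF-IDF pipeline composition is an assumption, and the transformer model name follows the repo default):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["Topic modelling groups similar documents.", "Embeddings drive the clustering."]

# "No": fast sparse-style embeddings from TF-IDF, reduced to a dense matrix
tfidf_model = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2)).fit(docs)
sparse_embeddings = tfidf_model.transform(docs)

# "Yes": dense transformer embeddings (slow on CPU)
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
# dense_embeddings = model.encode(sentences=docs, show_progress_bar=True, batch_size=32)
```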
funcs/helper_functions.py
CHANGED
@@ -10,33 +10,70 @@ import numpy as np
 from bertopic import BERTopic
 from datetime import datetime
 
+from typing import List, Tuple
+
 today = datetime.now().strftime("%d%m%Y")
 today_rev = datetime.now().strftime("%Y%m%d")
 
-…
+def get_or_create_env_var(var_name: str, default_value: str) -> str:
+    # Get the environment variable if it exists
+    value = os.environ.get(var_name)
+
+    # If it doesn't exist, set it to the default value
+    if value is None:
+        os.environ[var_name] = default_value
+        value = default_value
+
+    return value
+
+# Retrieving or setting output folder
+env_var_name = 'GRADIO_OUTPUT_FOLDER'
+default_value = 'output/'
+
+output_folder = get_or_create_env_var(env_var_name, default_value)
+print(f'The value of {env_var_name} is {output_folder}')
+
+def ensure_output_folder_exists():
+    """Checks if the 'output/' folder exists, creates it if not."""
+
+    folder_name = "output/"
+
+    if not os.path.exists(folder_name):
+        # Create the folder if it doesn't exist
+        os.makedirs(folder_name)
+        print("Created the 'output/' folder.")
+    else:
+        print("The 'output/' folder already exists.")
+
+def get_connection_params(request: gr.Request):
+    '''
+    Get connection parameter values from request object.
+    '''
+    if request:
+        # print("Request headers dictionary:", request.headers)
+        # print("IP address:", request.client.host)
+        # print("Query parameters:", dict(request.query_params))
+        print("Session hash:", request.session_hash)
+
+        if 'x-cognito-id' in request.headers:
+            out_session_hash = request.headers['x-cognito-id']
+            base_folder = "user-files/"
+        else:
+            out_session_hash = request.session_hash
+            base_folder = "temp-files/"
+
+        output_folder = base_folder + out_session_hash + "/"
+
+        return out_session_hash
+    else:
+        print("No session parameters found.")
+        return ""
 
 def detect_file_type(filename):
     """Detect the file type based on its extension."""
@@ -130,7 +167,7 @@ def initial_file_load(in_file):
 
     #The np.array([]) at the end is for clearing the embedding state when a new file is loaded
-    return gr.Dropdown(choices=concat_choices), gr.Dropdown(choices=concat_choices), df, output_text, topic_model, embeddings, data_file_name_no_ext, custom_labels
+    return gr.Dropdown(choices=concat_choices), gr.Dropdown(choices=concat_choices), df, output_text, topic_model, embeddings, data_file_name_no_ext, custom_labels, df
 
 def custom_regex_load(in_file):
     '''
@@ -157,8 +194,6 @@ def custom_regex_load(in_file):
 
     return output_text, custom_regex
 
-
-
 def get_file_path_end(file_path):
     # First, get the basename of the file (e.g., "example.txt" from "/path/to/example.txt")
     basename = os.path.basename(file_path)
@@ -177,15 +212,7 @@ def get_file_path_end_with_ext(file_path):
 
     return filename_end
 
-def dummy_function(in_colnames):
-    """
-    A dummy function that exists just so that dropdown updates work correctly.
-    """
-    return None
-
 # Zip the above to export file
-
-
 def zip_folder(folder_path, output_zip_file):
     # Create a ZipFile object in write mode
     with zipfile.ZipFile(output_zip_file, 'w', zipfile.ZIP_DEFLATED) as zipf:
@@ -215,59 +242,121 @@ def delete_files_in_folder(folder_path):
     except Exception as e:
         print(f"Failed to delete {file_path}. Reason: {e}")
 
-    topic_dets = topic_model.get_topic_info()
-
-    topic_det_output_name = "topic_details_" + data_file_name_no_ext + "_" + today_rev + ".csv"
-    topic_dets.to_csv(topic_det_output_name)
-    output_list.append(topic_det_output_name)
-
-    progress(0.8, desc= "Saving output")
-
-    doc_dets = topic_model.get_document_info(docs)[["Document", "Topic", "Name", "Probability", "Representative_document"]]
-    doc_dets.to_csv(doc_det_output_name)
-    output_list.append(doc_det_output_name)
-
-        topics_text_out_str = str(topic_dets["CustomName"])
-    else:
-        topics_text_out_str = str(topic_dets["Name"])
-    output_text = "Topics: " + topics_text_out_str
-…
+def save_topic_outputs(topic_model: BERTopic, data_file_name_no_ext: str, output_list: List[str], docs: List[str], save_topic_model: bool, prepared_docs: pd.DataFrame, split_sentence_drop: str, output_folder: str = output_folder, progress: gr.Progress = gr.Progress()) -> Tuple[List[str], str]:
+    """
+    Save the outputs of a topic model to specified files.
+
+    Args:
+        topic_model (BERTopic): The topic model object.
+        data_file_name_no_ext (str): The base name of the data file without extension.
+        output_list (List[str]): List to store the output file names.
+        docs (List[str]): List of documents.
+        save_topic_model (bool): Flag to save the topic model.
+        prepared_docs (pd.DataFrame): DataFrame containing prepared documents.
+        split_sentence_drop (str): Option to split sentences ("Yes" or "No").
+        output_folder (str, optional): Folder to save the output files. Defaults to output_folder.
+        progress (gr.Progress, optional): Progress tracker. Defaults to gr.Progress().
+
+    Returns:
+        Tuple[List[str], str]: A tuple containing the list of output file names and a status message.
+    """
+
+    progress(0.7, desc= "Checking data")
+
+    topic_dets = topic_model.get_topic_info()
+
+    if topic_dets.shape[0] == 1:
+        topic_det_output_name = output_folder + "topic_details_" + data_file_name_no_ext + "_" + today_rev + ".csv"
+        topic_dets.to_csv(topic_det_output_name)
+        output_list.append(topic_det_output_name)
+
+        return output_list, "No topics found, original file returned"
+
+    progress(0.8, desc= "Saving output")
+
+    topic_det_output_name = output_folder + "topic_details_" + data_file_name_no_ext + "_" + today_rev + ".csv"
+    topic_dets.to_csv(topic_det_output_name)
+    output_list.append(topic_det_output_name)
+
+    doc_det_output_name = output_folder + "doc_details_" + data_file_name_no_ext + "_" + today_rev + ".csv"
+
+    # Check that the following columns exist in the dataframe; keep only the ones that exist
+    columns_to_check = ["Document", "Topic", "Name", "Probability", "Representative_document"]
+    columns_found = [column for column in columns_to_check if column in topic_model.get_document_info(docs).columns]
+    doc_dets = topic_model.get_document_info(docs)[columns_found]
+
+    # If a 'sentence split' dataset was created under the cleaning options, map the sentences back to their original documents.
+    try:
+        if split_sentence_drop == "Yes":
+            doc_dets = doc_dets.merge(prepared_docs[['document_index']], how = "left", left_index=True, right_index=True)
+            doc_dets = doc_dets.rename(columns={"document_index": "parent_document_index"}, errors='ignore')
+
+            # 1. Group by parent document index
+            grouped = doc_dets.groupby('parent_document_index')
+
+            # 2. Aggregate topics and probabilities per parent document
+            def aggregate_topics(group):
+                original_text = ' '.join(group['Document'])
+                topics = group['Topic'].tolist()
+
+                if 'Name' in group.columns:
+                    topic_names = group['Name'].tolist()
+                else:
+                    topic_names = None
+
+                if 'Probability' in group.columns:
+                    probabilities = group['Probability'].tolist()
+                else:
+                    probabilities = None
+
+                return pd.Series({'Document': original_text, 'Topic numbers': topics, 'Topic names': topic_names, 'Probabilities': probabilities})
+
+            doc_det_agg = grouped.apply(lambda x: aggregate_topics(x)).reset_index()
+
+            doc_det_agg_output_name = output_folder + "doc_details_agg_" + data_file_name_no_ext + "_" + today_rev + ".csv"
+            doc_det_agg.to_csv(doc_det_agg_output_name)
+            output_list.append(doc_det_agg_output_name)
+
+    except Exception as e:
+        print("Creating aggregate document details failed, error:", e)
+
+    # Save document details to file
+    doc_dets.to_csv(doc_det_output_name)
+    output_list.append(doc_det_output_name)
+
+    if "CustomName" in topic_dets.columns:
+        topics_text_out_str = str(topic_dets["CustomName"])
+    else:
+        topics_text_out_str = str(topic_dets["Name"])
+    output_text = "Topics: " + topics_text_out_str
+
+    # Save topic model to file
+    if save_topic_model == "Yes":
+        print("Saving BERTopic model in .pkl format.")
+
+        topic_model_save_name_pkl = output_folder + data_file_name_no_ext + "_topics_" + today_rev + ".pkl"
+        topic_model.save(topic_model_save_name_pkl, serialization='pickle', save_embedding_model=False, save_ctfidf=False)
+        output_list.append(topic_model_save_name_pkl)
+
+    return output_list, output_text
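The sentence-to-parent-document mapping added in save_topic_outputs boils down to a groupby/apply; a toy illustration with invented data:

```python
import pandas as pd

doc_dets = pd.DataFrame({
    "parent_document_index": [0, 0, 1],
    "Document": ["First sentence.", "Second sentence.", "Another doc."],
    "Topic": [2, 5, 2],
})

agg = (
    doc_dets.groupby("parent_document_index")
    .apply(lambda g: pd.Series({
        "Document": " ".join(g["Document"]),   # stitch sentences back together
        "Topic numbers": g["Topic"].tolist(),  # one topic per original sentence
    }))
    .reset_index()
)
print(agg)  # one row per parent document, with per-sentence topic lists
```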
funcs/representation_model.py
CHANGED
@@ -3,29 +3,26 @@ from bertopic.representation import LlamaCPP
 from llama_cpp import Llama
 from pydantic import BaseModel
 import torch.cuda
 from huggingface_hub import hf_hub_download
 
 from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, BaseRepresentation
-from funcs.…
-
-random_seed = 42
+from funcs.embeddings import torch_device
+from funcs.prompts import phi3_prompt, phi3_start
 
 chosen_prompt = phi3_prompt #open_hermes_prompt # stablelm_prompt
 chosen_start_tag = phi3_start #open_hermes_start # stablelm_start
 
+random_seed = 42
+
 # Currently set n_gpu_layers to 0 even with cuda due to persistent bugs in implementation with cuda
-if cuda.is_available():
-    torch_device = "gpu"
+print("torch device for representation functions:", torch_device)
+if torch_device == "gpu":
     low_resource_mode = "No"
-    n_gpu_layers = …
-else:
-    torch_device = "cpu"
+    n_gpu_layers = -1 # i.e. all layers
+else: # torch_device = "cpu"
     low_resource_mode = "Yes"
     n_gpu_layers = 0
 
-#low_resource_mode = "No" # Override for testing
-
 #print("Running on device:", torch_device)
 n_threads = torch.get_num_threads()
 print("CPU n_threads:", n_threads)
@@ -37,7 +34,7 @@ top_p: float = 1
 repeat_penalty: float = 1.1
 last_n_tokens_size: int = 128
 max_tokens: int = 500
-seed: int = …
+seed: int = random_seed
 reset: bool = True
 stream: bool = False
 n_threads: int = n_threads
@@ -83,15 +80,25 @@ llm_config = LLamacppInitConfigGpu(last_n_tokens_size=last_n_tokens_size,
                                    trust_remote_code=trust_remote_code)
 
 ## Create representation model parameters ##
-# KeyBERT
 keybert = KeyBERTInspired(random_state=random_seed)
-# MMR
 mmr = MaximalMarginalRelevance(diversity=0.5)
-
 base_rep = BaseRepresentation()
 
 # Find model file
-def find_model_file(hf_model_name, hf_model_file, search_folder, sub_folder):
+def find_model_file(hf_model_name: str, hf_model_file: str, search_folder: str, sub_folder: str) -> str:
+    """
+    Finds the specified model file within the given search folder and subfolder.
+
+    Args:
+        hf_model_name (str): The name of the Hugging Face model.
+        hf_model_file (str): The specific file name of the model to find.
+        search_folder (str): The base folder to start the search.
+        sub_folder (str): The subfolder within the search folder to look into.
+
+    Returns:
+        str: The path to the found model file, or None if the file is not found.
+    """
+
     hf_loc = search_folder #os.environ["HF_HOME"]
     hf_sub_loc = search_folder + sub_folder #os.environ["HF_HOME"]
 
@@ -116,17 +123,27 @@ def find_model_file(hf_model_name, hf_model_file, search_folder, sub_folder):
 
     return found_file
 
-def create_representation_model(representation_type, llm_config, hf_model_name, hf_model_file, chosen_start_tag, low_resource_mode):
+def create_representation_model(representation_type: str, llm_config: dict, hf_model_name: str, hf_model_file: str, chosen_start_tag: str, low_resource_mode: bool) -> dict:
+    """
+    Creates a representation model based on the specified type and configuration.
+
+    Args:
+        representation_type (str): The type of representation model to create (e.g., "LLM", "KeyBERT").
+        llm_config (dict): Configuration settings for the LLM model.
+        hf_model_name (str): The name of the Hugging Face model.
+        hf_model_file (str): The specific file name of the model to find.
+        chosen_start_tag (str): The start tag to use for the model.
+        low_resource_mode (bool): Whether to enable low resource mode.
+
+    Returns:
+        dict: A dictionary containing the created representation model.
+    """
 
     if representation_type == "LLM":
         print("Generating LLM representation")
         # Use llama.cpp to load in model
 
-        # del os.environ["HF_HOME"]
-
         # Check for HF_HOME environment variable and supply a default value if it's not found (typical location for huggingface models)
-        # Get HF_HOME environment variable or default to "~/.cache/huggingface/hub"
         base_folder = "model" #"~/.cache/huggingface/hub"
         hf_home_value = os.getenv("HF_HOME", base_folder)
 
@@ -158,9 +175,10 @@ def create_representation_model(representation_type, llm_config, hf_model_name,
 
         print("Loading representation model with", llm_config.n_gpu_layers, "layers allocated to GPU.")
 
         llm = Llama(model_path=found_file, stop=chosen_start_tag, n_gpu_layers=llm_config.n_gpu_layers, n_ctx=llm_config.n_ctx, seed=seed) #**llm_config.model_dump()
         #print(llm.n_gpu_layers)
-        print("Chosen prompt:", chosen_prompt)
+        #print("Chosen prompt:", chosen_prompt)
         llm_model = LlamaCPP(llm, prompt=chosen_prompt)#, **gen_config.model_dump())
 
         # All representation models
@@ -180,15 +198,6 @@ def create_representation_model(representation_type, llm_config, hf_model_name,
     else:
         print("Generating default representation type")
         representation_model = {"Default": base_rep}
-
-    # Deprecated example using CTransformers. This package is not really used anymore
-    #model = AutoModelForCausalLM.from_pretrained('NousResearch/Nous-Capybara-7B-V1.9-GGUF', model_type='mistral', model_file='Capybara-7B-V1.9-Q5_K_M.gguf', hf=True, **vars(llm_config))
-    #tokenizer = AutoTokenizer.from_pretrained("NousResearch/Nous-Capybara-7B-V1.9")
-    #generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)
-
-    # Text generation with Llama 2
-    #mistral_capybara = TextGeneration(generator, prompt=capybara_prompt)
-    #mistral_hermes = TextGeneration(generator, prompt=open_hermes_prompt)
 
     return representation_model
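For context, the dictionary returned by create_representation_model slots straight into BERTopic's representation_model argument; a minimal sketch using the KeyBERT branch (standard BERTopic API, documents omitted):

```python
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

representation_model = {"KeyBERT": KeyBERTInspired(random_state=42)}
topic_model = BERTopic(representation_model=representation_model)
# topics, probs = topic_model.fit_transform(docs)  # docs: list of strings
# Aspect labels land in topic_model.topic_aspects_["KeyBERT"] after fitting.
```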
funcs/topic_core_funcs.py
CHANGED
@@ -8,12 +8,17 @@ import numpy as np
|
|
8 |
import time
|
9 |
from bertopic import BERTopic
|
10 |
|
11 |
-
from
|
|
|
|
|
|
|
12 |
from funcs.anonymiser import expand_sentences_spacy
|
13 |
-
from funcs.helper_functions import read_file, zip_folder, delete_files_in_folder, save_topic_outputs
|
14 |
-
from funcs.embeddings import make_or_load_embeddings
|
15 |
from funcs.bertopic_vis_documents import visualize_documents_custom, visualize_hierarchical_documents_custom, hierarchical_topics_custom, visualize_hierarchy_custom
|
|
|
16 |
|
|
|
17 |
|
18 |
from sentence_transformers import SentenceTransformer
|
19 |
from sklearn.pipeline import make_pipeline
|
@@ -22,27 +27,10 @@ from sklearn.feature_extraction.text import TfidfVectorizer
|
|
22 |
import funcs.anonymiser as anon
|
23 |
from umap import UMAP
|
24 |
|
25 |
-
|
26 |
-
|
27 |
-
|
28 |
-
|
29 |
-
|
30 |
-
# Check for torch cuda
|
31 |
-
# If you want to disable cuda for testing purposes
|
32 |
-
#os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
|
33 |
-
|
34 |
-
print("Is CUDA enabled? ", cuda.is_available())
|
35 |
-
print("Is a CUDA device available on this computer?", backends.cudnn.enabled)
|
36 |
-
if cuda.is_available():
|
37 |
-
torch_device = "gpu"
|
38 |
-
print("Cuda version installed is: ", version.cuda)
|
39 |
-
low_resource_mode = "No"
|
40 |
-
#os.system("nvidia-smi")
|
41 |
-
else:
|
42 |
-
torch_device = "cpu"
|
43 |
-
low_resource_mode = "Yes"
|
44 |
-
|
45 |
-
print("Device used is: ", torch_device)
|
46 |
|
47 |
today = datetime.now().strftime("%d%m%Y")
|
48 |
today_rev = datetime.now().strftime("%Y%m%d")
|
@@ -54,7 +42,35 @@ embeddings_name = "mixedbread-ai/mxbai-embed-large-v1" #"BAAI/large-small-en-v1.
|
|
54 |
hf_model_name = "QuantFactory/Phi-3-mini-128k-instruct-GGUF"#'second-state/stablelm-2-zephyr-1.6b-GGUF' #'TheBloke/phi-2-orange-GGUF' #'NousResearch/Nous-Capybara-7B-V1.9-GGUF'
|
55 |
hf_model_file = "Phi-3-mini-128k-instruct.Q4_K_M.gguf"#'stablelm-2-zephyr-1_6b-Q5_K_M.gguf' # 'phi-2-orange.Q5_K_M.gguf' #'Capybara-7B-V1.9-Q5_K_M.gguf'
|
56 |
|
57 |
-
|
58 |
|
59 |
output_text = ""
|
60 |
output_list = []
|
@@ -64,7 +80,7 @@ def pre_clean(data, in_colnames, data_file_name_no_ext, custom_regex, clean_text
|
|
64 |
if not in_colnames:
|
65 |
error_message = "Please enter one column name to use for cleaning and finding topics."
|
66 |
print(error_message)
|
67 |
-
return error_message, None, data_file_name_no_ext, None, None
|
68 |
|
69 |
all_tic = time.perf_counter()
|
70 |
|
@@ -77,17 +93,23 @@ def pre_clean(data, in_colnames, data_file_name_no_ext, custom_regex, clean_text
|
|
77 |
clean_tic = time.perf_counter()
|
78 |
print("Starting data clean.")
|
79 |
|
80 |
-
|
81 |
|
82 |
-
if not custom_regex.empty:
|
83 |
-
data[in_colnames_list_first] = initial_clean(data[in_colnames_list_first], custom_regex.iloc[:, 0].to_list())
|
84 |
-
else:
|
85 |
-
data[in_colnames_list_first] = initial_clean(data[in_colnames_list_first], [])
|
86 |
|
87 |
clean_toc = time.perf_counter()
|
88 |
clean_time_out = f"Cleaning the text took {clean_toc - clean_tic:0.1f} seconds."
|
89 |
print(clean_time_out)
|
90 |
|
91 |
if drop_duplicate_text == "Yes":
|
92 |
progress(0.3, desc= "Drop duplicates - remove short texts")
|
93 |
|
@@ -104,7 +126,8 @@ def pre_clean(data, in_colnames, data_file_name_no_ext, custom_regex, clean_text
|
|
104 |
if anonymise_drop == "Yes":
|
105 |
progress(0.6, desc= "Anonymising data")
|
106 |
|
107 |
-
|
|
|
108 |
|
109 |
anon_tic = time.perf_counter()
|
110 |
|
@@ -120,17 +143,19 @@ def pre_clean(data, in_colnames, data_file_name_no_ext, custom_regex, clean_text
|
|
120 |
if sentence_split_drop == "Yes":
|
121 |
progress(0.6, desc= "Splitting text into sentences")
|
122 |
|
123 |
-
|
|
|
124 |
|
125 |
anon_tic = time.perf_counter()
|
126 |
|
127 |
data = expand_sentences_spacy(data, in_colnames_list_first)
|
128 |
-
data = data[data[in_colnames_list_first].str.len() >= 25]
|
|
|
129 |
|
130 |
anon_toc = time.perf_counter()
|
131 |
time_out = f"Anonymising text took {anon_toc - anon_tic:0.1f} seconds"
|
132 |
|
133 |
-
out_data_name = data_file_name_no_ext + "_" + today_rev + ".csv"
|
134 |
data.to_csv(out_data_name)
|
135 |
output_list.append(out_data_name)
|
136 |
|
@@ -140,14 +165,84 @@ def pre_clean(data, in_colnames, data_file_name_no_ext, custom_regex, clean_text
|
|
140 |
|
141 |
output_text = "Data clean completed."
|
142 |
|
143 |
-
return output_text, output_list, data, data_file_name_no_ext
|
|
147 |
all_tic = time.perf_counter()
|
148 |
|
149 |
progress(0, desc= "Loading data")
|
150 |
|
151 |
output_list = []
|
152 |
file_list = [string.name for string in in_files]
|
153 |
|
@@ -170,10 +265,9 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
|
|
170 |
# Check if embeddings are being loaded in
|
171 |
progress(0.2, desc= "Loading/creating embeddings")
|
172 |
|
173 |
-
print("Low resource mode: ", low_resource_mode)
|
174 |
|
175 |
-
if low_resource_mode == "No":
|
176 |
-
print("Using high
|
177 |
|
178 |
# Define a list of possible local locations to search for the model
|
179 |
local_embeddings_locations = [
|
@@ -205,7 +299,7 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
|
|
205 |
embeddings_type_state = "large"
|
206 |
|
207 |
# UMAP model uses Bertopic defaults
|
208 |
-
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', low_memory=False, random_state=random_seed)
|
209 |
|
210 |
else:
|
211 |
print("Choosing low resource TF-IDF model.")
|
@@ -223,9 +317,9 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
|
|
223 |
|
224 |
#umap_model = TruncatedSVD(n_components=5, random_state=random_seed)
|
225 |
# UMAP model uses Bertopic defaults
|
226 |
-
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', low_memory=True, random_state=random_seed)
|
227 |
|
228 |
-
embeddings_out = make_or_load_embeddings(docs, file_list, embeddings_out, embedding_model, embeddings_super_compress, low_resource_mode)
|
229 |
|
230 |
# This is saved as a Gradio state object
|
231 |
vectoriser_model = vectoriser_state
|
@@ -250,7 +344,7 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
|
|
250 |
|
251 |
if calc_probs == True:
|
252 |
topics_probs_out = pd.DataFrame(topic_model.probabilities_)
|
253 |
-
topics_probs_out_name = "topic_full_probs_" + data_file_name_no_ext + "_" + today_rev + ".csv"
|
254 |
topics_probs_out.to_csv(topics_probs_out_name)
|
255 |
output_list.append(topics_probs_out_name)
|
256 |
|
@@ -258,20 +352,24 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
|
|
258 |
print(error)
|
259 |
print(fail_error_message)
|
260 |
|
261 |
-
return fail_error_message, output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, None, docs, vectoriser_model, []
|
262 |
|
263 |
|
264 |
# Do this if you have pre-defined topics
|
265 |
else:
|
266 |
-
if low_resource_mode == "Yes":
|
-
error_message = "Zero shot topic modelling currently not compatible with low-resource embeddings. Please change this option to 'No' on the options tab and retry."
|
-
print(error_message)
|
-
return error_message, output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, None, docs, vectoriser_model, []
271 |
|
272 |
zero_shot_topics = read_file(candidate_topics.name)
|
273 |
zero_shot_topics_lower = list(zero_shot_topics.iloc[:, 0].str.lower())
|
274 |
|
275 |
|
276 |
try:
|
277 |
topic_model = BERTopic( embedding_model=embedding_model, #embedding_model_pipe, # for Jina
|
@@ -288,7 +386,7 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
|
|
288 |
|
289 |
if calc_probs == True:
|
290 |
topics_probs_out = pd.DataFrame(topic_model.probabilities_)
|
291 |
-
topics_probs_out_name = "topic_full_probs_" + data_file_name_no_ext + "_" + today_rev + ".csv"
|
292 |
topics_probs_out.to_csv(topics_probs_out_name)
|
293 |
output_list.append(topics_probs_out_name)
|
294 |
|
@@ -296,14 +394,14 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
|
|
296 |
print("An exception occurred:", error)
|
297 |
print(fail_error_message)
|
298 |
|
299 |
-
return fail_error_message, output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, None, docs, vectoriser_model, []
|
300 |
|
301 |
# For some reason, zero topic modelling exports assigned topics as a np.array instead of a list. Converting it back here.
|
302 |
if isinstance(assigned_topics, np.ndarray):
|
303 |
assigned_topics = assigned_topics.tolist()
|
304 |
|
305 |
-
|
306 |
-
|
307 |
# Zero shot modelling is a model merge, which wipes the c_tf_idf part of the resulting model completely. To get hierarchical modelling to work, we need to recreate this part of the model with the CountVectorizer options used to create the initial model. Since zero shot merges two models built on exactly the same set of documents, the vocabulary should be identical, so recreating the c_tf_idf component in this way shouldn't be a problem. Discussion here, and the code below is based on Maarten's suggestion: https://github.com/MaartenGr/BERTopic/issues/1700
|
308 |
|
309 |
# Get document info
|
@@ -312,16 +410,12 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
|
|
312 |
documents_per_topic = doc_dets.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
|
313 |
|
314 |
# Assign CountVectorizer to merged model
|
315 |
-
|
316 |
topic_model.vectorizer_model = vectoriser_model
|
317 |
|
318 |
# Re-calculate c-TF-IDF
|
319 |
c_tf_idf, _ = topic_model._c_tf_idf(documents_per_topic)
|
320 |
topic_model.c_tf_idf_ = c_tf_idf
|
321 |
|
322 |
-
###
|
323 |
-
|
324 |
-
|
325 |
# Check we have topics
|
326 |
if not assigned_topics:
|
327 |
return "No topics found.", output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, topic_model, docs, vectoriser_model,[]
|
@@ -329,8 +423,14 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
|
|
329 |
print("Topic model created.")
|
330 |
|
331 |
# Tidy up topic label format a bit to have commas and spaces by default
|
332 |
-
new_topic_labels = topic_model.generate_topic_labels(nr_words=3, separator=", ")
|
-
topic_model.set_topic_labels(new_topic_labels)
|
334 |
|
335 |
# Replace current topic labels if new ones loaded in
|
336 |
if not custom_labels_df.empty:
|
@@ -342,18 +442,18 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
|
|
342 |
print("Custom topics: ", topic_model.custom_labels_)
|
343 |
|
344 |
# Outputs
|
345 |
-
output_list, output_text = save_topic_outputs(topic_model, data_file_name_no_ext, output_list, docs, save_topic_model)
|
346 |
|
347 |
# If you want to save your embedding files
|
348 |
if return_intermediate_files == "Yes":
|
349 |
print("Saving embeddings to file")
|
350 |
-
if low_resource_mode == "Yes":
|
351 |
-
embeddings_file_name = data_file_name_no_ext + '_' + 'tfidf_embeddings.npz'
|
352 |
else:
|
353 |
if embeddings_super_compress == "No":
|
354 |
-
embeddings_file_name = data_file_name_no_ext + '_' + 'large_embeddings.npz'
|
355 |
else:
|
356 |
-
embeddings_file_name = data_file_name_no_ext + '_' + 'large_embeddings_compress.npz'
|
357 |
|
358 |
np.savez_compressed(embeddings_file_name, embeddings_out)
|
359 |
|
@@ -365,7 +465,25 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
|
|
365 |
|
366 |
return output_text, output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, topic_model, docs, vectoriser_model, assigned_topics
|
367 |
|
368 |
-
def reduce_outliers(topic_model, docs, embeddings_out, data_file_name_no_ext, assigned_topics, vectoriser_model, save_topic_model, progress=gr.Progress(track_tqdm=True)):
|
369 |
|
370 |
progress(0, desc= "Preparing data")
|
371 |
|
@@ -373,13 +491,9 @@ def reduce_outliers(topic_model, docs, embeddings_out, data_file_name_no_ext, as
|
|
373 |
|
374 |
all_tic = time.perf_counter()
|
375 |
|
376 |
-
# This step not necessary?
|
377 |
-
#assigned_topics, probs = topic_model.fit_transform(docs, embeddings_out)
|
378 |
-
|
379 |
if isinstance(assigned_topics, np.ndarray):
|
380 |
assigned_topics = assigned_topics.tolist()
|
381 |
|
382 |
-
|
383 |
# Reduce outliers if required, then update representation
|
384 |
progress(0.2, desc= "Reducing outliers")
|
385 |
print("Reducing outliers.")
|
@@ -397,20 +511,9 @@ def reduce_outliers(topic_model, docs, embeddings_out, data_file_name_no_ext, as
|
|
397 |
|
398 |
print("Finished reducing outliers.")
|
399 |
|
400 |
-
#progress(0.7, desc= "Replacing topic names with LLMs if necessary")
|
401 |
-
|
402 |
-
#topic_dets = topic_model.get_topic_info()
|
403 |
-
|
404 |
-
# # Replace original labels with LLM labels
|
405 |
-
# if "LLM" in topic_model.get_topic_info().columns:
|
406 |
-
# llm_labels = [label[0][0].split("\n")[0] for label in topic_model.get_topics(full=True)["LLM"].values()]
|
407 |
-
# topic_model.set_topic_labels(llm_labels)
|
408 |
-
# else:
|
409 |
-
# topic_model.set_topic_labels(list(topic_dets["Name"]))
|
410 |
-
|
411 |
# Outputs
|
412 |
progress(0.9, desc= "Saving to file")
|
413 |
-
output_list, output_text = save_topic_outputs(topic_model, data_file_name_no_ext, output_list, docs, save_topic_model)
|
414 |
|
415 |
all_toc = time.perf_counter()
|
416 |
time_out = f"All processes took {all_toc - all_tic:0.1f} seconds"
|
@@ -418,16 +521,35 @@ def reduce_outliers(topic_model, docs, embeddings_out, data_file_name_no_ext, as
|
|
418 |
|
419 |
return output_text, output_list, topic_model
|
420 |
|
421 |
-
def represent_topics(topic_model, docs, data_file_name_no_ext, low_resource_mode, save_topic_model, representation_type, vectoriser_model, progress=gr.Progress(track_tqdm=True)):
|
422 |
-
423 |
|
424 |
output_list = []
|
425 |
|
426 |
all_tic = time.perf_counter()
|
427 |
|
428 |
-
|
|
|
|
|
429 |
|
430 |
-
representation_model = create_representation_model(representation_type, llm_config, hf_model_name, hf_model_file, chosen_start_tag,
|
431 |
|
432 |
progress(0.3, desc= "Updating existing topics")
|
433 |
topic_model.update_topics(docs, vectorizer_model=vectoriser_model, representation_model=representation_model)
|
@@ -439,7 +561,7 @@ def represent_topics(topic_model, docs, data_file_name_no_ext, low_resource_mode
|
|
439 |
llm_labels = [label[0].split("\n")[0] for label in topic_dets["LLM"]]
|
440 |
topic_model.set_topic_labels(llm_labels)
|
441 |
|
442 |
-
label_list_file_name = data_file_name_no_ext + '_llm_topic_list_' + today_rev + '.csv'
|
443 |
|
444 |
llm_labels_df = pd.DataFrame(data={"Label":llm_labels})
|
445 |
llm_labels_df.to_csv(label_list_file_name, index=None)
|
@@ -452,7 +574,7 @@ def represent_topics(topic_model, docs, data_file_name_no_ext, low_resource_mode
|
|
452 |
|
453 |
# Outputs
|
454 |
progress(0.8, desc= "Saving outputs")
|
455 |
-
output_list, output_text = save_topic_outputs(topic_model, data_file_name_no_ext, output_list, docs, save_topic_model)
|
456 |
|
457 |
all_toc = time.perf_counter()
|
458 |
time_out = f"All processes took {all_toc - all_tic:0.1f} seconds"
|
@@ -460,11 +582,51 @@ def represent_topics(topic_model, docs, data_file_name_no_ext, low_resource_mode
|
|
460 |
|
461 |
return output_text, output_list, topic_model
|
462 |
|
463 |
-
def visualise_topics(topic_model, data, data_file_name_no_ext, low_resource_mode, embeddings_out, in_label, in_colnames, legend_label, sample_prop, visualisation_type_radio, progress=gr.Progress(track_tqdm=True)):
|
|
464 |
|
465 |
progress(0, desc= "Preparing data for visualisation")
|
466 |
|
467 |
output_list = []
|
|
|
468 |
vis_tic = time.perf_counter()
|
469 |
|
470 |
|
@@ -500,30 +662,37 @@ def visualise_topics(topic_model, data, data_file_name_no_ext, low_resource_mode
|
|
500 |
topic_model.set_topic_labels(labels)
|
501 |
|
502 |
# Pre-reduce embeddings for visualisation purposes
|
503 |
-
if low_resource_mode == "No":
|
504 |
-
reduced_embeddings = UMAP(n_neighbors=15, n_components=2, min_dist=0.0, metric='cosine', random_state=random_seed).fit_transform(embeddings_out)
|
505 |
else:
|
506 |
reduced_embeddings = TruncatedSVD(2, random_state=random_seed).fit_transform(embeddings_out)
|
507 |
|
508 |
-
progress(0.3, desc= "Creating visualisations")
|
509 |
# Visualise the topics:
|
510 |
|
511 |
-
print("Creating
|
512 |
-
|
513 |
-
# "Topic document graph", "Hierarchical view"
|
514 |
|
515 |
if visualisation_type_radio == "Topic document graph":
|
516 |
-
topics_vis = visualize_documents_custom(topic_model, docs, hover_labels = label_list, reduced_embeddings=reduced_embeddings, hide_annotations=True, hide_document_hover=False, custom_labels=True, sample = sample_prop, width= 1200, height = 750)
|
-
topics_vis_name = data_file_name_no_ext + '_' + 'vis_topic_docs_' + today_rev + '.html'
|
-
topics_vis.write_html(topics_vis_name)
|
-
output_list.append(topics_vis_name)
|
-
topics_vis_2 = topic_model.visualize_heatmap(custom_labels=True, width= 1200, height = 1200)
|
-
topics_vis_2_name = data_file_name_no_ext + '_' + 'vis_heatmap_' + today_rev + '.html'
|
-
topics_vis_2.write_html(topics_vis_2_name)
|
-
output_list.append(topics_vis_2_name)
|
527 |
|
528 |
elif visualisation_type_radio == "Hierarchical view":
|
529 |
|
@@ -532,7 +701,7 @@ def visualise_topics(topic_model, data, data_file_name_no_ext, low_resource_mode
|
|
532 |
# Print topic tree - may get encoding errors, so doing try except
|
533 |
try:
|
534 |
tree = topic_model.get_topic_tree(hierarchical_topics, tight_layout = True)
|
535 |
-
tree_name = data_file_name_no_ext + '_' + 'vis_hierarchy_tree_' + today_rev + '.txt'
|
536 |
|
537 |
with open(tree_name, "w") as file:
|
538 |
# Write the string to the file
|
@@ -540,59 +709,71 @@ def visualise_topics(topic_model, data, data_file_name_no_ext, low_resource_mode
|
|
540 |
|
541 |
output_list.append(tree_name)
|
542 |
|
543 |
-
except Exception as error:
|
-
|
545 |
|
546 |
|
547 |
# Save new hierarchical topic model to file
|
548 |
-
hierarchical_topics_name = data_file_name_no_ext + '_' + 'vis_hierarchy_topics_dist_' + today_rev + '.csv'
|
-
hierarchical_topics.to_csv(hierarchical_topics_name, index = None)
|
-
output_list.append(hierarchical_topics_name)
|
-
topics_vis, hierarchy_df, hierarchy_topic_names = visualize_hierarchical_documents_custom(topic_model, docs, label_list, hierarchical_topics, hide_annotations=True, reduced_embeddings=reduced_embeddings, sample = sample_prop, hide_document_hover= False, custom_labels=True, width= 1200, height = 750)
|
-
topics_vis_2 = visualize_hierarchy_custom(topic_model, hierarchical_topics=hierarchical_topics, width= 1200, height = 750)
|
|
556 |
|
557 |
# Write hierarchical topics levels to df
|
558 |
-
hierarchy_df_name = data_file_name_no_ext + '_' + 'hierarchy_topics_df_' + today_rev + '.csv'
|
559 |
hierarchy_df.to_csv(hierarchy_df_name, index = None)
|
560 |
output_list.append(hierarchy_df_name)
|
561 |
|
562 |
# Write hierarchical topics names to df
|
563 |
-
hierarchy_topic_names_name = data_file_name_no_ext + '_' + 'hierarchy_topics_names_' + today_rev + '.csv'
|
564 |
hierarchy_topic_names.to_csv(hierarchy_topic_names_name, index = None)
|
565 |
output_list.append(hierarchy_topic_names_name)
|
566 |
|
567 |
-
#except:
|
568 |
-
# error_message = "Visualisation preparation failed. Perhaps you need more topics to create the full hierarchy (more than 10)?"
|
569 |
-
# return error_message, output_list, None, None
|
570 |
|
571 |
-
topics_vis_name = data_file_name_no_ext + '_' + 'vis_hierarchy_topic_doc_' + today_rev + '.html'
|
572 |
topics_vis.write_html(topics_vis_name)
|
573 |
output_list.append(topics_vis_name)
|
574 |
|
575 |
-
topics_vis_2_name = data_file_name_no_ext + '_' + 'vis_hierarchy_' + today_rev + '.html'
|
576 |
topics_vis_2.write_html(topics_vis_2_name)
|
577 |
output_list.append(topics_vis_2_name)
|
578 |
|
579 |
all_toc = time.perf_counter()
|
580 |
-
time_out = f"Creating visualisation took {all_toc - vis_tic:0.1f} seconds"
|
-
print(time_out)
|
-
return time_out, output_list, topics_vis, topics_vis_2
|
584 |
|
585 |
-
def save_as_pytorch_model(topic_model, data_file_name_no_ext , progress=gr.Progress(track_tqdm=True)):
|
586 |
|
587 |
if not topic_model:
|
588 |
-
|
|
|
589 |
|
590 |
progress(0, desc= "Saving topic model in Pytorch format")
|
591 |
|
592 |
-
|
593 |
-
|
594 |
-
|
595 |
-
topic_model_save_name_folder = "output_model/" + data_file_name_no_ext + "_topics_" + today_rev# + ".safetensors"
|
596 |
topic_model_save_name_zip = topic_model_save_name_folder + ".zip"
|
597 |
|
598 |
# Clear folder before replacing files
|
@@ -600,9 +781,10 @@ def save_as_pytorch_model(topic_model, data_file_name_no_ext , progress=gr.Progr
|
|
600 |
|
601 |
topic_model.save(topic_model_save_name_folder, serialization='pytorch', save_embedding_model=True, save_ctfidf=False)
|
602 |
|
603 |
-
# Zip file example
|
604 |
-
|
605 |
zip_folder(topic_model_save_name_folder, topic_model_save_name_zip)
|
606 |
output_list.append(topic_model_save_name_zip)
|
607 |
|
608 |
-
|
|
8 |
import time
|
9 |
from bertopic import BERTopic
|
10 |
|
11 |
+
from typing import List, Type, Union
|
12 |
+
PandasDataFrame = Type[pd.DataFrame]
|
13 |
+
|
14 |
+
from funcs.clean_funcs import initial_clean, regex_clean
|
15 |
from funcs.anonymiser import expand_sentences_spacy
|
16 |
+
from funcs.helper_functions import read_file, zip_folder, delete_files_in_folder, save_topic_outputs, output_folder
|
17 |
+
from funcs.embeddings import make_or_load_embeddings, torch_device
|
18 |
from funcs.bertopic_vis_documents import visualize_documents_custom, visualize_hierarchical_documents_custom, hierarchical_topics_custom, visualize_hierarchy_custom
|
19 |
+
from funcs.representation_model import create_representation_model, llm_config, chosen_start_tag, random_seed
|
20 |
|
21 |
+
from sklearn.feature_extraction.text import CountVectorizer
|
22 |
|
23 |
from sentence_transformers import SentenceTransformer
|
24 |
from sklearn.pipeline import make_pipeline
|
|
|
27 |
import funcs.anonymiser as anon
|
28 |
from umap import UMAP
|
29 |
|
30 |
+
# Default options can be changed in number selection on options page
|
31 |
+
umap_n_neighbours = 15
|
32 |
+
umap_min_dist = 0.0
|
33 |
+
umap_metric = 'cosine'
|
34 |
|
35 |
today = datetime.now().strftime("%d%m%Y")
|
36 |
today_rev = datetime.now().strftime("%Y%m%d")
|
|
|
42 |
hf_model_name = "QuantFactory/Phi-3-mini-128k-instruct-GGUF"#'second-state/stablelm-2-zephyr-1.6b-GGUF' #'TheBloke/phi-2-orange-GGUF' #'NousResearch/Nous-Capybara-7B-V1.9-GGUF'
|
43 |
hf_model_file = "Phi-3-mini-128k-instruct.Q4_K_M.gguf"#'stablelm-2-zephyr-1_6b-Q5_K_M.gguf' # 'phi-2-orange.Q5_K_M.gguf' #'Capybara-7B-V1.9-Q5_K_M.gguf'
|
44 |
|
45 |
+
# When topic modelling column is chosen, change the default visualisation column to the same
|
46 |
+
def change_default_vis_col(in_colnames:List[str]):
|
47 |
+
'''
|
48 |
+
When topic modelling column is chosen, change the default visualisation column to the same
|
49 |
+
'''
|
50 |
+
if in_colnames:
|
51 |
+
return gr.Dropdown(value=in_colnames[0])
|
52 |
+
else:
|
53 |
+
return gr.Dropdown()
|
54 |
+
|
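In the Gradio app this helper would be wired to the column dropdown's change event; a hypothetical hookup (app.py is not shown in this view, and the component names in_colnames and in_label are assumptions):

# Keep the visualisation label dropdown in sync with the chosen topic-modelling column
in_colnames.change(fn=change_default_vis_col, inputs=in_colnames, outputs=in_label)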
55 |
+
def pre_clean(data: pd.DataFrame, in_colnames: list, data_file_name_no_ext: str, custom_regex: pd.DataFrame, clean_text: str, drop_duplicate_text: str, anonymise_drop: str, sentence_split_drop: str, embeddings_state: dict, progress: gr.Progress = gr.Progress(track_tqdm=True)) -> tuple:
|
56 |
+
"""
|
57 |
+
Pre-processes the input data by cleaning text, removing duplicates, anonymizing data, and splitting sentences based on the provided options.
|
58 |
+
|
59 |
+
Args:
|
60 |
+
data (pd.DataFrame): The input data to be cleaned.
|
61 |
+
in_colnames (list): List of column names to be used for cleaning and finding topics.
|
62 |
+
data_file_name_no_ext (str): The base name of the data file without extension.
|
63 |
+
custom_regex (pd.DataFrame): Custom regex patterns for initial cleaning.
|
64 |
+
clean_text (str): Option to clean text ("Yes" or "No").
|
65 |
+
drop_duplicate_text (str): Option to drop duplicate text ("Yes" or "No").
|
66 |
+
anonymise_drop (str): Option to anonymize data ("Yes" or "No").
|
67 |
+
sentence_split_drop (str): Option to split text into sentences ("Yes" or "No").
|
68 |
+
embeddings_state (dict): State of the embeddings.
|
69 |
+
progress (gr.Progress, optional): Progress tracker for the cleaning process.
|
70 |
+
|
71 |
+
Returns:
|
72 |
+
tuple: A tuple containing the output message, the list of output files, the cleaned data, the updated file name, and the (reset) embeddings state.
|
73 |
+
"""
|
74 |
|
75 |
output_text = ""
|
76 |
output_list = []
|
|
|
80 |
if not in_colnames:
|
81 |
error_message = "Please enter one column name to use for cleaning and finding topics."
|
82 |
print(error_message)
|
83 |
+
return error_message, None, None, data_file_name_no_ext, embeddings_state
|
84 |
|
85 |
all_tic = time.perf_counter()
|
86 |
|
|
|
93 |
clean_tic = time.perf_counter()
|
94 |
print("Starting data clean.")
|
95 |
|
96 |
+
data[in_colnames_list_first] = initial_clean(data[in_colnames_list_first], [])
|
97 |
|
98 |
+
if '_clean' not in data_file_name_no_ext:
|
99 |
+
data_file_name_no_ext = data_file_name_no_ext + "_clean"
|
100 |
|
101 |
clean_toc = time.perf_counter()
|
102 |
clean_time_out = f"Cleaning the text took {clean_toc - clean_tic:0.1f} seconds."
|
103 |
print(clean_time_out)
|
104 |
|
105 |
+
# Clean custom regex if exists
|
106 |
+
if not custom_regex.empty:
|
107 |
+
data[in_colnames_list_first] = regex_clean(data[in_colnames_list_first], custom_regex.iloc[:, 0].to_list())
|
108 |
+
|
109 |
+
if '_clean' not in data_file_name_no_ext:
|
110 |
+
data_file_name_no_ext = data_file_name_no_ext + "_clean"
|
111 |
+
|
112 |
+
|
113 |
if drop_duplicate_text == "Yes":
|
114 |
progress(0.3, desc= "Drop duplicates - remove short texts")
|
115 |
|
|
|
126 |
if anonymise_drop == "Yes":
|
127 |
progress(0.6, desc= "Anonymising data")
|
128 |
|
129 |
+
if '_anon' not in data_file_name_no_ext:
|
130 |
+
data_file_name_no_ext = data_file_name_no_ext + "_anon"
|
131 |
|
132 |
anon_tic = time.perf_counter()
|
133 |
|
|
|
143 |
if sentence_split_drop == "Yes":
|
144 |
progress(0.6, desc= "Splitting text into sentences")
|
145 |
|
146 |
+
if '_split' not in data_file_name_no_ext:
|
147 |
+
data_file_name_no_ext = data_file_name_no_ext + "_split"
|
148 |
|
149 |
anon_tic = time.perf_counter()
|
150 |
|
151 |
data = expand_sentences_spacy(data, in_colnames_list_first)
|
152 |
+
data = data[data[in_colnames_list_first].str.len() >= 25] # Keep only rows with at least 25 characters
|
153 |
+
data.reset_index(inplace=True, drop=True)
|
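expand_sentences_spacy itself lives in funcs/anonymiser.py and is not shown in this commit; a minimal sketch of the idea, assuming a spaCy sentencizer plus a pandas explode (hypothetical helper, not the actual implementation):

import pandas as pd
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

def expand_sentences_spacy_sketch(df: pd.DataFrame, colname: str) -> pd.DataFrame:
    # Split each passage into sentences, then give each sentence its own row
    df = df.copy()
    df[colname] = df[colname].apply(lambda text: [s.text.strip() for s in nlp(str(text)).sents])
    return df.explode(colname).reset_index(drop=True)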
154 |
|
155 |
anon_toc = time.perf_counter()
|
156 |
time_out = f"Anonymising text took {anon_toc - anon_tic:0.1f} seconds"
|
157 |
|
158 |
+
out_data_name = output_folder + data_file_name_no_ext + "_" + today_rev + ".csv"
|
159 |
data.to_csv(out_data_name)
|
160 |
output_list.append(out_data_name)
|
161 |
|
|
|
165 |
|
166 |
output_text = "Data clean completed."
|
167 |
|
168 |
+
# Overwrite existing embeddings as they will likely have changed
|
169 |
+
return output_text, output_list, data, data_file_name_no_ext, np.array([])
|
170 |
+
|
171 |
+
def optimise_zero_shot():
|
172 |
+
"""
|
173 |
+
Return options that optimise the topic model to keep only zero-shot topics as the main topics
|
174 |
+
"""
|
175 |
+
return gr.Dropdown(value="Yes"), gr.Slider(value=2), gr.Slider(value=2), gr.Slider(value=0.01), gr.Slider(value=0.95), gr.Slider(value=0.55)
|
176 |
+
|
177 |
+
def extract_topics(
|
178 |
+
data: pd.DataFrame,
|
179 |
+
in_files: list,
|
180 |
+
min_docs_slider: int,
|
181 |
+
in_colnames: list,
|
182 |
+
max_topics_slider: int,
|
183 |
+
candidate_topics: list,
|
184 |
+
data_file_name_no_ext: str,
|
185 |
+
custom_labels_df: pd.DataFrame,
|
186 |
+
return_intermediate_files: str,
|
187 |
+
embeddings_super_compress: str,
|
188 |
+
high_quality_mode: str,
|
189 |
+
save_topic_model: str,
|
190 |
+
embeddings_out: np.ndarray,
|
191 |
+
embeddings_type_state: str,
|
192 |
+
zero_shot_similarity: float,
|
193 |
+
calc_probs: str,
|
194 |
+
vectoriser_state: CountVectorizer,
|
195 |
+
min_word_occurence_slider: float,
|
196 |
+
max_word_occurence_slider: float,
|
197 |
+
split_sentence_drop: str,
|
198 |
+
random_seed: int = random_seed,
|
199 |
+
output_folder: str = output_folder,
|
200 |
+
umap_n_neighbours:int = umap_n_neighbours,
|
201 |
+
umap_min_dist:float = umap_min_dist,
|
202 |
+
umap_metric:str = umap_metric,
|
203 |
+
progress: gr.Progress = gr.Progress(track_tqdm=True)
|
204 |
+
) -> tuple:
|
205 |
+
"""
|
206 |
+
Extract topics from the given data using various parameters and settings.
|
207 |
+
|
208 |
+
Args:
|
209 |
+
data (pd.DataFrame): The input data.
|
210 |
+
in_files (list): List of input files.
|
211 |
+
min_docs_slider (int): Minimum number of similar documents needed to make a topic.
|
212 |
+
in_colnames (list): List of column names to use for cleaning and finding topics.
|
213 |
+
max_topics_slider (int): Maximum number of topics.
|
214 |
+
candidate_topics (list): List of candidate topics.
|
215 |
+
data_file_name_no_ext (str): Data file name without extension.
|
216 |
+
custom_labels_df (pd.DataFrame): DataFrame containing custom labels.
|
217 |
+
return_intermediate_files (str): Whether to return intermediate files.
|
218 |
+
embeddings_super_compress (str): Whether to round embeddings to three decimal places.
|
219 |
+
high_quality_mode (str): Whether to use high quality (transformers based) embeddings.
|
220 |
+
save_topic_model (str): Whether to save the topic model.
|
221 |
+
embeddings_out (np.ndarray): Output embeddings.
|
222 |
+
embeddings_type_state (str): State of the embeddings type.
|
223 |
+
zero_shot_similarity (float): Zero-shot similarity threshold.
|
224 |
+
random_seed (int): Random seed for reproducibility.
|
225 |
+
calc_probs (str): Whether to calculate all topic probabilities.
|
226 |
+
vectoriser_state (CountVectorizer): Vectorizer state.
|
227 |
+
min_word_occurence_slider (float): Minimum word occurrence slider value.
|
228 |
+
max_word_occurence_slider (float): Maximum word occurrence slider value.
|
229 |
+
split_sentence_drop (str): Whether to split open text into sentences.
|
230 |
+
|
231 |
+
output_folder (str, optional): Output folder. Defaults to output_folder.
|
232 |
+
umap_n_neighbours (int): Nearest neighbours value for UMAP.
|
233 |
+
umap_min_dist (float): Minimum distance for UMAP.
|
234 |
+
umap_metric (str): Metric for UMAP.
|
235 |
+
progress (gr.Progress, optional): Progress tracker. Defaults to gr.Progress(track_tqdm=True).
|
236 |
+
|
237 |
+
Returns:
|
238 |
+
tuple: A tuple containing the output text, output file list, embeddings, embeddings type state, data file name without extension, topic model, documents, vectoriser model, and assigned topics.
|
239 |
+
"""
|
240 |
all_tic = time.perf_counter()
|
241 |
|
242 |
progress(0, desc= "Loading data")
|
243 |
|
244 |
+
vectoriser_state = CountVectorizer(stop_words="english", ngram_range=(1, 2), min_df=min_word_occurence_slider, max_df=max_word_occurence_slider)
|
245 |
+
|
246 |
output_list = []
|
247 |
file_list = [string.name for string in in_files]
|
248 |
|
|
|
265 |
# Check if embeddings are being loaded in
|
266 |
progress(0.2, desc= "Loading/creating embeddings")
|
267 |
|
|
|
268 |
|
269 |
+
if high_quality_mode == "Yes":
|
270 |
+
print("Using high quality embedding model")
|
271 |
|
272 |
# Define a list of possible local locations to search for the model
|
273 |
local_embeddings_locations = [
|
|
|
299 |
embeddings_type_state = "large"
|
300 |
|
301 |
# UMAP model uses Bertopic defaults
|
302 |
+
umap_model = UMAP(n_neighbors=umap_n_neighbours, n_components=5, min_dist=umap_min_dist, metric=umap_metric, low_memory=False, random_state=random_seed)
|
303 |
|
304 |
else:
|
305 |
print("Choosing low resource TF-IDF model.")
|
|
|
317 |
|
318 |
#umap_model = TruncatedSVD(n_components=5, random_state=random_seed)
|
319 |
# UMAP model uses Bertopic defaults
|
320 |
+
umap_model = UMAP(n_neighbors=umap_n_neighbours, n_components=5, min_dist=umap_min_dist, metric=umap_metric, low_memory=True, random_state=random_seed)
|
321 |
|
322 |
+
embeddings_out = make_or_load_embeddings(docs, file_list, embeddings_out, embedding_model, embeddings_super_compress, high_quality_mode)
|
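make_or_load_embeddings is defined in funcs/embeddings.py and elided in this view; broadly, it reuses a previously uploaded .npz file when one is present and otherwise encodes the documents. A rough sketch under those assumptions (not the actual function body):

import numpy as np

def make_or_load_embeddings_sketch(docs, file_list, embeddings_out, embedding_model, embeddings_super_compress, high_quality_mode):
    npz_files = [f for f in file_list if f.endswith(".npz")]
    if embeddings_out.size == 0 and npz_files:
        embeddings_out = np.load(npz_files[0])["arr_0"]  # reuse embeddings saved on a previous run
    elif embeddings_out.size == 0:
        embeddings_out = embedding_model.encode(list(docs), show_progress_bar=True)
        if embeddings_super_compress == "Yes":
            embeddings_out = np.round(embeddings_out, 3)  # the 'super compress' option
    return embeddings_out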
323 |
|
324 |
# This is saved as a Gradio state object
|
325 |
vectoriser_model = vectoriser_state
|
|
|
344 |
|
345 |
if calc_probs == True:
|
346 |
topics_probs_out = pd.DataFrame(topic_model.probabilities_)
|
347 |
+
topics_probs_out_name = output_folder + "topic_full_probs_" + data_file_name_no_ext + "_" + today_rev + ".csv"
|
348 |
topics_probs_out.to_csv(topics_probs_out_name)
|
349 |
output_list.append(topics_probs_out_name)
|
350 |
|
|
|
352 |
print(error)
|
353 |
print(fail_error_message)
|
354 |
|
355 |
+
out_fail_error_message = '\n'.join([fail_error_message, str(error)])
|
356 |
+
|
357 |
+
return out_fail_error_message, output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, None, docs, vectoriser_model, []
|
358 |
|
359 |
|
360 |
# Do this if you have pre-defined topics
|
361 |
else:
|
362 |
+
#if high_quality_mode == "No":
|
363 |
+
# error_message = "Zero shot topic modelling currently not compatible with low-resource embeddings. Please change this option to 'No' on the options tab and retry."
|
364 |
+
# print(error_message)
|
365 |
|
366 |
+
# return error_message, output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, None, docs, vectoriser_model, []
|
367 |
|
368 |
zero_shot_topics = read_file(candidate_topics.name)
|
369 |
zero_shot_topics_lower = list(zero_shot_topics.iloc[:, 0].str.lower())
|
370 |
|
371 |
+
print("Zero shot topics are:", zero_shot_topics_lower)
|
372 |
+
|
373 |
|
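The BERTopic constructor call in the try block below is truncated in this diff view; for zero-shot modelling, BERTopic 0.16 takes the candidate list via zeroshot_topic_list. A sketch of what the full call likely resembles (the argument pairings are assumptions drawn from this function's parameters):

topic_model = BERTopic(embedding_model=embedding_model,
                       vectorizer_model=vectoriser_model,
                       umap_model=umap_model,
                       min_topic_size=min_docs_slider,
                       nr_topics=max_topics_slider,
                       zeroshot_topic_list=zero_shot_topics_lower,
                       zeroshot_min_similarity=zero_shot_similarity,
                       calculate_probabilities=calc_probs,
                       verbose=True)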
374 |
try:
|
375 |
topic_model = BERTopic( embedding_model=embedding_model, #embedding_model_pipe, # for Jina
|
|
|
386 |
|
387 |
if calc_probs == True:
|
388 |
topics_probs_out = pd.DataFrame(topic_model.probabilities_)
|
389 |
+
topics_probs_out_name = output_folder + "topic_full_probs_" + data_file_name_no_ext + "_" + today_rev + ".csv"
|
390 |
topics_probs_out.to_csv(topics_probs_out_name)
|
391 |
output_list.append(topics_probs_out_name)
|
392 |
|
|
|
394 |
print("An exception occurred:", error)
|
395 |
print(fail_error_message)
|
396 |
|
397 |
+
out_fail_error_message = '\n'.join([fail_error_message, str(error)])
|
398 |
+
|
399 |
+
return out_fail_error_message, output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, None, docs, vectoriser_model, []
|
400 |
|
401 |
# For some reason, zero topic modelling exports assigned topics as a np.array instead of a list. Converting it back here.
|
402 |
if isinstance(assigned_topics, np.ndarray):
|
403 |
assigned_topics = assigned_topics.tolist()
|
404 |
405 |
# Zero shot modelling is a model merge, which wipes the c_tf_idf part of the resulting model completely. To get hierarchical modelling to work, we need to recreate this part of the model with the CountVectorizer options used to create the initial model. Since zero shot merges two models built on exactly the same set of documents, the vocabulary should be identical, so recreating the c_tf_idf component in this way shouldn't be a problem. Discussion here, and the code below is based on Maarten's suggestion: https://github.com/MaartenGr/BERTopic/issues/1700
|
406 |
|
407 |
# Get document info
|
|
|
410 |
documents_per_topic = doc_dets.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
|
411 |
|
412 |
# Assign CountVectorizer to merged model
|
|
|
413 |
topic_model.vectorizer_model = vectoriser_model
|
414 |
|
415 |
# Re-calculate c-TF-IDF
|
416 |
c_tf_idf, _ = topic_model._c_tf_idf(documents_per_topic)
|
417 |
topic_model.c_tf_idf_ = c_tf_idf
|
418 |
419 |
# Check we have topics
|
420 |
if not assigned_topics:
|
421 |
return "No topics found.", output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, topic_model, docs, vectoriser_model,[]
|
|
|
423 |
print("Topic model created.")
|
424 |
|
425 |
# Tidy up topic label format a bit to have commas and spaces by default
|
426 |
+
if not candidate_topics:
|
427 |
+
print("Zero shot topics found, so not renaming")
|
428 |
+
new_topic_labels = topic_model.generate_topic_labels(nr_words=3, separator=", ")
|
429 |
+
topic_model.set_topic_labels(new_topic_labels)
|
430 |
+
if candidate_topics:
|
431 |
+
print("Custom labels:", topic_model.custom_labels_)
|
432 |
+
print("Topic labels:", topic_model.topic_labels_)
|
433 |
+
topic_model.set_topic_labels(topic_model.topic_labels_)
|
434 |
|
435 |
# Replace current topic labels if new ones loaded in
|
436 |
if not custom_labels_df.empty:
|
|
|
442 |
print("Custom topics: ", topic_model.custom_labels_)
|
443 |
|
444 |
# Outputs
|
445 |
+
output_list, output_text = save_topic_outputs(topic_model, data_file_name_no_ext, output_list, docs, save_topic_model, data, split_sentence_drop)
|
446 |
|
447 |
# If you want to save your embedding files
|
448 |
if return_intermediate_files == "Yes":
|
449 |
print("Saving embeddings to file")
|
450 |
+
if high_quality_mode == "Yes":
|
451 |
+
embeddings_file_name = output_folder + data_file_name_no_ext + '_' + 'tfidf_embeddings.npz'
|
452 |
else:
|
453 |
if embeddings_super_compress == "No":
|
454 |
+
embeddings_file_name = output_folder + data_file_name_no_ext + '_' + 'large_embeddings.npz'
|
455 |
else:
|
456 |
+
embeddings_file_name = output_folder + data_file_name_no_ext + '_' + 'large_embeddings_compress.npz'
|
457 |
|
458 |
np.savez_compressed(embeddings_file_name, embeddings_out)
|
459 |
|
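As a note on these intermediate files: the .npz written here can be uploaded on a later run to skip re-embedding, and the 'super compress' variant simply stores values rounded to three decimal places. A round-trip sketch (the file name is illustrative):

emb_small = np.round(embeddings_out, 3)  # optional 3 d.p. compression
np.savez_compressed("example_large_embeddings_compress.npz", emb_small)
reloaded = np.load("example_large_embeddings_compress.npz")["arr_0"]  # default key for positional savez
assert reloaded.shape == embeddings_out.shape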
|
|
465 |
|
466 |
return output_text, output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, topic_model, docs, vectoriser_model, assigned_topics
|
467 |
|
468 |
+
def reduce_outliers(topic_model: BERTopic, docs: List[str], embeddings_out: np.ndarray, data_file_name_no_ext: str, assigned_topics: Union[np.ndarray, List[int]], vectoriser_model: CountVectorizer, save_topic_model: str, split_sentence_drop: str, data: PandasDataFrame, progress: gr.Progress = gr.Progress(track_tqdm=True)) -> tuple:
|
469 |
+
"""
|
470 |
+
Reduce outliers in the topic model and update the topic representation.
|
471 |
+
|
472 |
+
Args:
|
473 |
+
topic_model (BERTopic): The BERTopic topic model to be used.
|
474 |
+
docs (List[str]): List of documents.
|
475 |
+
embeddings_out (np.ndarray): Output embeddings.
|
476 |
+
data_file_name_no_ext (str): Data file name without extension.
|
477 |
+
assigned_topics (Union[np.ndarray, List[int]]): Assigned topics.
|
478 |
+
vectoriser_model (CountVectorizer): Vectorizer model.
|
479 |
+
save_topic_model (str): Whether to save the topic model.
|
480 |
+
split_sentence_drop (str): Dropdown result indicating whether sentences have been split.
|
481 |
+
data (PandasDataFrame): The input dataframe
|
482 |
+
progress (gr.Progress, optional): Progress tracker. Defaults to gr.Progress(track_tqdm=True).
|
483 |
+
|
484 |
+
Returns:
|
485 |
+
tuple: A tuple containing the output text, output list, and the updated topic model.
|
486 |
+
"""
|
487 |
|
488 |
progress(0, desc= "Preparing data")
|
489 |
|
|
|
491 |
|
492 |
all_tic = time.perf_counter()
|
493 |
|
|
|
|
|
|
|
494 |
if isinstance(assigned_topics, np.ndarray):
|
495 |
assigned_topics = assigned_topics.tolist()
|
496 |
|
|
|
497 |
# Reduce outliers if required, then update representation
|
498 |
progress(0.2, desc= "Reducing outliers")
|
499 |
print("Reducing outliers.")
|
|
|
511 |
|
512 |
print("Finished reducing outliers.")
|
513 |
|
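The outlier-reduction call itself sits in lines this hunk skips; with BERTopic's public API the step would look something like the following sketch (the strategy choice is an assumption):

new_topics = topic_model.reduce_outliers(docs, assigned_topics, strategy="embeddings", embeddings=embeddings_out)
topic_model.update_topics(docs, topics=new_topics, vectorizer_model=vectoriser_model)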
514 |
# Outputs
|
515 |
progress(0.9, desc= "Saving to file")
|
516 |
+
output_list, output_text = save_topic_outputs(topic_model, data_file_name_no_ext, output_list, docs, save_topic_model, data, split_sentence_drop)
|
517 |
|
518 |
all_toc = time.perf_counter()
|
519 |
time_out = f"All processes took {all_toc - all_tic:0.1f} seconds"
|
|
|
521 |
|
522 |
return output_text, output_list, topic_model
|
523 |
|
524 |
+
def represent_topics(topic_model: BERTopic, docs: List[str], data_file_name_no_ext: str, high_quality_mode: str, save_topic_model: str, representation_type: str, vectoriser_model: CountVectorizer, split_sentence_drop: str, data: PandasDataFrame, progress: gr.Progress = gr.Progress(track_tqdm=True)) -> tuple:
|
525 |
+
"""
|
526 |
+
Represents topics using the specified representation model and updates the topic labels accordingly.
|
527 |
+
|
528 |
+
Args:
|
529 |
+
topic_model (BERTopic): The topic model to be used.
|
530 |
+
docs (List[str]): List of documents to be processed.
|
531 |
+
data_file_name_no_ext (str): The base name of the data file without extension.
|
532 |
+
high_quality_mode (str): Whether to use high quality (transformers based) embeddings.
|
533 |
+
save_topic_model (str): Whether to save the topic model.
|
534 |
+
representation_type (str): The type of representation model to be used.
|
535 |
+
vectoriser_model (CountVectorizer): The vectorizer model to be used.
|
536 |
+
split_sentence_drop (str): Dropdown result indicating whether sentences have been split.
|
537 |
+
data (PandasDataFrame): The input dataframe
|
538 |
+
progress (gr.Progress, optional): Progress tracker for the process. Defaults to gr.Progress(track_tqdm=True).
|
539 |
+
|
540 |
+
Returns:
|
541 |
+
tuple: A tuple containing the output text, output list, and the updated topic model.
|
542 |
+
"""
|
543 |
|
544 |
output_list = []
|
545 |
|
546 |
all_tic = time.perf_counter()
|
547 |
|
548 |
+
# Load in representation model
|
549 |
+
|
550 |
+
progress(0.1, desc= "Loading model and creating new topic representation")
|
551 |
|
552 |
+
representation_model = create_representation_model(representation_type, llm_config, hf_model_name, hf_model_file, chosen_start_tag, high_quality_mode)
|
553 |
|
554 |
progress(0.3, desc= "Updating existing topics")
|
555 |
topic_model.update_topics(docs, vectorizer_model=vectoriser_model, representation_model=representation_model)
|
|
|
561 |
llm_labels = [label[0].split("\n")[0] for label in topic_dets["LLM"]]
|
562 |
topic_model.set_topic_labels(llm_labels)
|
563 |
|
564 |
+
label_list_file_name = output_folder + data_file_name_no_ext + '_llm_topic_list_' + today_rev + '.csv'
|
565 |
|
566 |
llm_labels_df = pd.DataFrame(data={"Label":llm_labels})
|
567 |
llm_labels_df.to_csv(label_list_file_name, index=None)
|
|
|
574 |
|
575 |
# Outputs
|
576 |
progress(0.8, desc= "Saving outputs")
|
577 |
+
output_list, output_text = save_topic_outputs(topic_model, data_file_name_no_ext, output_list, docs, save_topic_model, data, split_sentence_drop)
|
578 |
|
579 |
all_toc = time.perf_counter()
|
580 |
time_out = f"All processes took {all_toc - all_tic:0.1f} seconds"
|
|
|
582 |
|
583 |
return output_text, output_list, topic_model
|
584 |
|
585 |
+
def visualise_topics(
|
586 |
+
topic_model: BERTopic,
|
587 |
+
data: pd.DataFrame,
|
588 |
+
data_file_name_no_ext: str,
|
589 |
+
high_quality_mode: str,
|
590 |
+
embeddings_out: np.ndarray,
|
591 |
+
in_label: List[str],
|
592 |
+
in_colnames: List[str],
|
593 |
+
legend_label: str,
|
594 |
+
sample_prop: float,
|
595 |
+
visualisation_type_radio: str,
|
596 |
+
random_seed: int = random_seed,
|
597 |
+
umap_n_neighbours: int = umap_n_neighbours,
|
598 |
+
umap_min_dist: float = umap_min_dist,
|
599 |
+
umap_metric: str = umap_metric,
|
600 |
+
progress: gr.Progress = gr.Progress(track_tqdm=True)
|
601 |
+
) -> tuple:
|
602 |
+
"""
|
603 |
+
Visualize topics using the provided topic model and data.
|
604 |
+
|
605 |
+
Args:
|
606 |
+
topic_model (BERTopic): The topic model to be used for visualization.
|
607 |
+
data (pd.DataFrame): The input data containing the documents.
|
608 |
+
data_file_name_no_ext (str): The base name of the data file without extension.
|
609 |
+
high_quality_mode (str): Whether to use high quality mode for embeddings.
|
610 |
+
embeddings_out (np.ndarray): The output embeddings.
|
611 |
+
in_label (List[str]): List of labels for the input data.
|
612 |
+
in_colnames (List[str]): List of column names in the input data.
|
613 |
+
legend_label (str): The label to be used in the legend.
|
614 |
+
sample_prop (float): The proportion of data to sample for visualization.
|
615 |
+
visualisation_type_radio (str): The type of visualization to be used.
|
616 |
+
random_seed (int, optional): Random seed for reproducibility. Defaults to random_seed.
|
617 |
+
umap_n_neighbours (int, optional): Number of neighbors for UMAP. Defaults to umap_n_neighbours.
|
618 |
+
umap_min_dist (float, optional): Minimum distance for UMAP. Defaults to umap_min_dist.
|
619 |
+
umap_metric (str, optional): Metric for UMAP. Defaults to umap_metric.
|
620 |
+
progress (gr.Progress, optional): Progress tracker for the process. Defaults to gr.Progress(track_tqdm=True).
|
621 |
+
|
622 |
+
Returns:
|
623 |
+
tuple: A tuple containing the output message, output list, reduced embeddings, and topic model.
|
624 |
+
"""
|
625 |
|
626 |
progress(0, desc= "Preparing data for visualisation")
|
627 |
|
628 |
output_list = []
|
629 |
+
output_message = []
|
630 |
vis_tic = time.perf_counter()
|
631 |
|
632 |
|
|
|
662 |
topic_model.set_topic_labels(labels)
|
663 |
|
664 |
# Pre-reduce embeddings for visualisation purposes
|
665 |
+
if high_quality_mode == "Yes":
|
666 |
+
reduced_embeddings = UMAP(n_neighbors=umap_n_neighbours, n_components=2, min_dist=umap_min_dist, metric=umap_metric, random_state=random_seed).fit_transform(embeddings_out)
|
667 |
else:
|
668 |
reduced_embeddings = TruncatedSVD(2, random_state=random_seed).fit_transform(embeddings_out)
|
669 |
|
670 |
+
progress(0.3, desc= "Creating visualisations")
|
671 |
# Visualise the topics:
|
672 |
|
673 |
+
print("Creating visualisations")
|
674 |
|
675 |
if visualisation_type_radio == "Topic document graph":
|
676 |
+
try:
|
677 |
+
topics_vis = visualize_documents_custom(topic_model, docs, hover_labels = label_list, reduced_embeddings=reduced_embeddings, hide_annotations=True, hide_document_hover=False, custom_labels=True, sample = sample_prop, width= 1200, height = 750)
|
678 |
|
679 |
+
topics_vis_name = output_folder + data_file_name_no_ext + '_' + 'vis_topic_docs_' + today_rev + '.html'
|
680 |
+
topics_vis.write_html(topics_vis_name)
|
681 |
+
output_list.append(topics_vis_name)
|
682 |
+
except Exception as e:
|
683 |
+
print(e)
|
684 |
+
output_message = str(e)
|
685 |
+
return output_message, output_list, None, None
|
686 |
|
687 |
+
try:
|
688 |
+
topics_vis_2 = topic_model.visualize_heatmap(custom_labels=True, width= 1200, height = 1200)
|
689 |
|
690 |
+
topics_vis_2_name = output_folder + data_file_name_no_ext + '_' + 'vis_heatmap_' + today_rev + '.html'
|
691 |
+
topics_vis_2.write_html(topics_vis_2_name)
|
692 |
+
output_list.append(topics_vis_2_name)
|
693 |
+
except Exception as e:
|
694 |
+
print(e)
|
695 |
+
output_message.append(str(e))
|
696 |
|
697 |
elif visualisation_type_radio == "Hierarchical view":
|
698 |
|
|
|
701 |
# Print topic tree - may get encoding errors, so doing try except
|
702 |
try:
|
703 |
tree = topic_model.get_topic_tree(hierarchical_topics, tight_layout = True)
|
704 |
+
tree_name = output_folder + data_file_name_no_ext + '_' + 'vis_hierarchy_tree_' + today_rev + '.txt'
|
705 |
|
706 |
with open(tree_name, "w") as file:
|
707 |
# Write the string to the file
|
|
|
709 |
|
710 |
output_list.append(tree_name)
|
711 |
|
712 |
+
except Exception as e:
|
713 |
+
new_out_message = "An exception occurred when making topic tree document, skipped:" + str(e)
|
714 |
+
output_message.append(str(new_out_message))
|
715 |
+
print(new_out_message)
|
716 |
|
717 |
|
718 |
# Save new hierarchical topic model to file
|
719 |
+
try:
|
720 |
+
hierarchical_topics_name = output_folder + data_file_name_no_ext + '_' + 'vis_hierarchy_topics_dist_' + today_rev + '.csv'
|
721 |
+
hierarchical_topics.to_csv(hierarchical_topics_name, index = None)
|
722 |
+
output_list.append(hierarchical_topics_name)
|
723 |
+
|
724 |
+
topics_vis, hierarchy_df, hierarchy_topic_names = visualize_hierarchical_documents_custom(topic_model, docs, label_list, hierarchical_topics, hide_annotations=True, reduced_embeddings=reduced_embeddings, sample = sample_prop, hide_document_hover= False, custom_labels=True, width= 1200, height = 750)
|
725 |
+
topics_vis_2 = visualize_hierarchy_custom(topic_model, hierarchical_topics=hierarchical_topics, width= 1200, height = 750)
|
726 |
+
except Exception as e:
|
727 |
+
new_out_message = "An exception occurred when making hierarchical topic visualisation:" + str(e) + ". Maybe your model doesn't have enough topics to create a hierarchy?"
|
728 |
+
output_message.append(str(new_out_message))
|
729 |
+
print(new_out_message)
|
730 |
+
return new_out_message, output_list, None, None
|
731 |
|
732 |
# Write hierarchical topics levels to df
|
733 |
+
hierarchy_df_name = output_folder + data_file_name_no_ext + '_' + 'hierarchy_topics_df_' + today_rev + '.csv'
|
734 |
hierarchy_df.to_csv(hierarchy_df_name, index = None)
|
735 |
output_list.append(hierarchy_df_name)
|
736 |
|
737 |
# Write hierarchical topics names to df
|
738 |
+
hierarchy_topic_names_name = output_folder + data_file_name_no_ext + '_' + 'hierarchy_topics_names_' + today_rev + '.csv'
|
739 |
hierarchy_topic_names.to_csv(hierarchy_topic_names_name, index = None)
|
740 |
output_list.append(hierarchy_topic_names_name)
|
741 |
|
742 |
|
743 |
+
topics_vis_name = output_folder + data_file_name_no_ext + '_' + 'vis_hierarchy_topic_doc_' + today_rev + '.html'
|
744 |
topics_vis.write_html(topics_vis_name)
|
745 |
output_list.append(topics_vis_name)
|
746 |
|
747 |
+
topics_vis_2_name = output_folder + data_file_name_no_ext + '_' + 'vis_hierarchy_' + today_rev + '.html'
|
748 |
topics_vis_2.write_html(topics_vis_2_name)
|
749 |
output_list.append(topics_vis_2_name)
|
750 |
|
751 |
all_toc = time.perf_counter()
|
752 |
+
output_message.append(f"Creating visualisation took {all_toc - vis_tic:0.1f} seconds")
|
753 |
+
print(output_message)
|
754 |
+
|
755 |
+
return '\n'.join(output_message), output_list, topics_vis, topics_vis_2
|
756 |
|
757 |
+
def save_as_pytorch_model(topic_model: BERTopic, data_file_name_no_ext:str, progress=gr.Progress(track_tqdm=True)):
|
758 |
+
"""
|
759 |
+
Save the topic model to disk in Pytorch format and zip the output folder.
|
760 |
|
761 |
+
Args:
|
762 |
+
topic_model (BERTopic): The BERTopic topic model to be used.
|
763 |
+
data_file_name_no_ext (str): Document file name.
|
764 |
+
Returns:
|
765 |
+
tuple: A tuple containing the output text and output list.
|
766 |
+
"""
|
767 |
+
output_list = []
|
768 |
+
output_message = ""
|
769 |
|
770 |
if not topic_model:
|
771 |
+
output_message = "No Pytorch model found."
|
772 |
+
return output_message, None
|
773 |
|
774 |
progress(0, desc= "Saving topic model in Pytorch format")
|
775 |
|
776 |
+
topic_model_save_name_folder = output_folder + data_file_name_no_ext + "_topics_" + today_rev# + ".safetensors"
|
777 |
topic_model_save_name_zip = topic_model_save_name_folder + ".zip"
|
778 |
|
779 |
# Clear folder before replacing files
|
|
|
781 |
|
782 |
topic_model.save(topic_model_save_name_folder, serialization='pytorch', save_embedding_model=True, save_ctfidf=False)
|
783 |
|
784 |
+
# Zip file example
|
|
|
785 |
zip_folder(topic_model_save_name_folder, topic_model_save_name_zip)
|
786 |
output_list.append(topic_model_save_name_zip)
|
787 |
|
788 |
+
output_message = "Model saved in Pytorch format."
|
789 |
+
|
790 |
+
return output_message, output_list
|
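zip_folder comes from funcs/helper_functions.py and is not part of this commit; a minimal stand-in with the same shape, assuming it simply archives the saved model folder:

import os, zipfile

def zip_folder_sketch(folder_path: str, output_zip: str):
    # Add every file under the model folder, keeping paths relative to its root
    with zipfile.ZipFile(output_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _, files in os.walk(folder_path):
            for name in files:
                full = os.path.join(root, name)
                zf.write(full, arcname=os.path.relpath(full, folder_path))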
requirements.txt
CHANGED
@@ -1,8 +1,7 @@
|
|
1 |
-
gradio
|
2 |
transformers==4.41.2
|
3 |
accelerate==0.26.1
|
4 |
torch==2.3.1
|
5 |
-
llama-cpp-python==0.2.79
|
6 |
bertopic==0.16.2
|
7 |
spacy==3.7.4
|
8 |
en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
|
@@ -13,4 +12,6 @@ presidio_analyzer==2.2.354
|
|
13 |
presidio_anonymizer==2.2.354
|
14 |
scipy==1.11.4
|
15 |
polars==0.20.6
|
16 |
-
|
|
1 |
+
gradio # Version not pinned due to an interaction with spacy - reinstall the latest gradio after installing from this requirements file
|
2 |
transformers==4.41.2
|
3 |
accelerate==0.26.1
|
4 |
torch==2.3.1
|
|
|
5 |
bertopic==0.16.2
|
6 |
spacy==3.7.4
|
7 |
en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
|
|
|
12 |
presidio_anonymizer==2.2.354
|
13 |
scipy==1.11.4
|
14 |
polars==0.20.6
|
15 |
+
sentence-transformers==3.0.1
|
16 |
+
llama-cpp-python==0.2.79 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
|
17 |
+
numpy==1.26.4
|
requirements_gpu.txt
CHANGED
@@ -1,7 +1,6 @@
|
|
1 |
-
gradio
|
2 |
transformers==4.41.2
|
3 |
accelerate==0.26.1
|
4 |
-
torch==2.3.1
|
5 |
bertopic==0.16.2
|
6 |
spacy==3.7.4
|
7 |
en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
|
@@ -15,3 +14,4 @@ polars==0.20.6
|
|
15 |
torch --index-url https://download.pytorch.org/whl/cu121
|
16 |
llama-cpp-python==0.2.77 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
|
17 |
numpy==1.26.4
|
1 |
+
gradio # Version not pinned due to an interaction with spacy - reinstall the latest gradio after installing from this requirements file
|
2 |
transformers==4.41.2
|
3 |
accelerate==0.26.1
|
|
|
4 |
bertopic==0.16.2
|
5 |
spacy==3.7.4
|
6 |
en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
|
|
|
14 |
torch --index-url https://download.pytorch.org/whl/cu121
|
15 |
llama-cpp-python==0.2.77 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
|
16 |
numpy==1.26.4
|
17 |
+
sentence-transformers==3.0.1
|