Sean-Case committed on
Commit
a7fdf3b
1 Parent(s): 9c6425d

Switched embeddings to low resource TF-IDF by default. Some text changes.
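Note on the change: the commit makes TF-IDF the default embedding method. As a rough illustration of what a TF-IDF "embedding" is, here is a minimal standard-library sketch. It is not the app's implementation, which presumably relies on a library vectoriser; `tfidf_embeddings`, `cosine`, and the sample documents are hypothetical names and data introduced only for this example.

```python
import math
from collections import Counter


def tfidf_embeddings(docs):
    """Return (vocabulary, vectors): one L2-normalised TF-IDF vector per document.

    Illustrative only: whitespace tokenisation, no stop words, no dimensionality reduction.
    """
    tokenised = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vocab = sorted(doc_freq)
    n_docs = len(docs)
    # Smoothed inverse document frequency, so every term gets a finite, positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}

    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vec = [counts[t] * idf[t] for t in vocab]
        # Normalise to unit length; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


docs = [
    "the cat sat on the mat",
    "a cat and a dog",
    "stock markets fell sharply today",
]
vocab, vecs = tfidf_embeddings(docs)
print(cosine(vecs[0], vecs[1]))  # the two documents share "cat", so similarity is positive
print(cosine(vecs[0], vecs[2]))  # no shared terms, so similarity is zero
```

Vectors like these need no model download or GPU, which is the likely reason for calling the mode "low resource"; the trade-off is that documents only look similar when they share actual words.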

Files changed (1)
  1. app.py +6 -4
app.py CHANGED
@@ -26,9 +26,11 @@ with block:
  gr.Markdown(
  """
  # Topic modeller
- Generate topics from open text in tabular data. Upload a file (csv, xlsx, or parquet), then specify the open text column that you want to use to generate topics, and another for labels in the visualisation. If you have an embeddings .npz file of the text made using the 'BAAI/bge-small-en-v1.5' model, you can load this in at the same time to skip the first modelling step. If you have a pre-defined list of topics, you can upload this as a csv file under 'I have my own list of topics...'. Further configuration options are available under the 'Options' tab.
+ Generate topics from open text in tabular data, based on [BERTopic](https://maartengr.github.io/BERTopic/). Upload a data file (csv, xlsx, or parquet), then specify the open text column that you want to use to generate topics. Click 'Extract topics' after you have selected the minimum similar documents per topic and maximum total topics.
+
+ Uses fast TF-IDF-based embeddings by default, change to 'BAAI/bge-small-en-v1.5' model embeddings on the options page. If you have an embeddings .npz file previously made using this model, you can load this in at the same time to skip the first modelling step. If you have a pre-defined list of topics for zero-shot modelling, you can upload this as a csv file under 'I have my own list of topics...'. Further configuration options are available under the 'Options' tab.

- Suggested test dataset: https://huggingface.co/datasets/rag-datasets/mini_wikipedia/tree/main/data (passages.parquet)
+ I suggest [Wikipedia mini dataset](https://huggingface.co/datasets/rag-datasets/mini_wikipedia/tree/main/data) for testing the tool here, choose passages.parquet.
  """)

  with gr.Tab("Load files and find topics"):
@@ -41,7 +43,7 @@ with block:
  with gr.Row():
  clean_text = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Clean data - remove html, numbers with > 2 digits, emails, postcodes (UK).")
  drop_duplicate_text = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Remove duplicate text, drop < 10 char strings. May make previous embedding files incompatible due to differing lengths.")
- anonymise_drop = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Anonymise data on file load. Personal details are redacted - not 100% effective!")
+ anonymise_drop = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Anonymise data on file load. Personal details are redacted - not 100% effective. This is slow!")
  clean_btn = gr.Button("Clean data")

  with gr.Accordion("I have my own list of topics (zero shot topic modelling).", open = False):
@@ -87,7 +89,7 @@ with block:
  seed_number = gr.Number(label="Random seed to use for dimensionality reduction.", minimum=0, step=1, value=42, precision=0)
  calc_probs = gr.Dropdown(label="Calculate all topic probabilities", value="No", choices=["Yes", "No"])
  with gr.Row():
- low_resource_mode_opt = gr.Dropdown(label = "Use low resource embeddings and processing.", value="No", choices=["Yes", "No"])
+ low_resource_mode_opt = gr.Dropdown(label = "Use low resource embeddings and processing.", value="Yes", choices=["Yes", "No"])
  embedding_super_compress = gr.Dropdown(label = "Round embeddings to three dp for smaller files with less accuracy.", value="No", choices=["Yes", "No"])
  with gr.Row():
  return_intermediate_files = gr.Dropdown(label = "Return intermediate processing files from file preparation.", value="Yes", choices=["Yes", "No"])
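Note on the last hunk: the only functional change is the default of `low_resource_mode_opt` flipping from "No" to "Yes". The sketch below shows how a "Yes"/"No" dropdown value might be mapped to an embedding method downstream. This is an assumption about the shape of the logic, not code from the repository; `select_embedding_method` and the `"tf-idf"` label are hypothetical, while the model name is the one quoted in the app's intro text.

```python
# Hypothetical sketch: names below are illustrative, not taken from app.py.
DEFAULT_LOW_RESOURCE_MODE = "Yes"  # the dropdown default after this commit
TRANSFORMER_MODEL = "BAAI/bge-small-en-v1.5"  # model named in the app's intro text


def select_embedding_method(low_resource_mode: str = DEFAULT_LOW_RESOURCE_MODE) -> str:
    """Map the 'Yes'/'No' dropdown string to an embedding method label."""
    if low_resource_mode not in ("Yes", "No"):
        raise ValueError(f"Expected 'Yes' or 'No', got {low_resource_mode!r}")
    # "Yes" selects the lightweight TF-IDF path; "No" selects the transformer model.
    return "tf-idf" if low_resource_mode == "Yes" else TRANSFORMER_MODEL


print(select_embedding_method())      # default path after this commit
print(select_embedding_method("No"))  # opt back in to transformer embeddings
```

Because the dropdown passes strings rather than booleans, validating the value explicitly avoids the trap where any non-empty string, including "No", is truthy.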