seanpedrickcase committed on
Commit
99d6fba
•
1 Parent(s): d3b1ac5

Many changes to code organisation. More efficient searches from using intermediate outputs. Version 0.1

.gitignore CHANGED
@@ -13,7 +13,11 @@
13
  *.ipynb
14
  *.npy
15
  *.npz
 
 
16
  build/*
17
  dist/*
18
  __pycache__/*
19
- db/*
 
 
 
13
  *.ipynb
14
  *.npy
15
  *.npz
16
+ *.pkl
17
+ *.pkl.gz
18
  build/*
19
  dist/*
20
  __pycache__/*
21
+ db/*
22
+ experiments/*
23
+ model/*
README.md CHANGED
@@ -10,9 +10,10 @@ pinned: false
10
  license: apache-2.0
11
  ---
12
 
13
- Keyword search over your data. This is an adaptation of fast_bm25 (https://github.com/Inspirateur/Fast-BM25) to search over tabular data with a Gradio UI interface.
14
 
15
  # Guide
 
16
 
17
  1. Load in your tabular data file (.csv, .parquet, .xlsx - first sheet).
18
  2. Wait a few seconds for the file to upload, then in the dropdown menu below 'Enter the name of the text column...' choose the column from the data file that you want to search.
@@ -21,17 +22,36 @@ Keyword search over your data. This is an adaptation of fast_bm25 (https://githu
21
  5. Hit search text. You may have to wait depending on the size of the data you are searching.
22
  6. You will receive back 1. the top search result and 2. a csv of the search results found in the text ordered by relevance, joined onto the original columns from your data source.
23
 
24
  # Advanced options
25
- The search should perform well with default options, so you shouldn't need to change things here.
26
 
27
  ## Data load / save options
28
- Toggle 'Clean text during load...' to true if you want to remove html tags and lemmatise the text, i.e. remove the ends of words to retain the core of the word e.g. searched or searches becomes search. Early testing suggests that cleaning takes some time, and does not seem to improve quality of search results.
 
 
29
 
30
- ## Search options
 
 
31
  Here are a few options to modify the BM25 search parameters. If you want more information on what each parameter does, click the relevant info button to the right of the sliders.
32
 
 
 
 
33
  ## Join on additional dataframes to results
34
- I was asked to include a feature to join on additional data to the search results. This could be useful for example if you have tabular text data associated with a person ID, and after searching you would like to join on information associated with this person to aid with post-search filtering/analysis.
35
 
36
  To do this:
37
  1. Load in the tabular data you want to join in the box (.csv, .parquet, .xlsx - first sheet).
 
10
  license: apache-2.0
11
  ---
12
 
13
+ Search through long-form text fields in your tabular data, either for exact, specific terms (Keyword search) or for thematic, 'fuzzy' matches (Semantic search).
14
 
15
  # Guide
16
+ ## Keyword search
17
 
18
  1. Load in your tabular data file (.csv, .parquet, .xlsx - first sheet).
19
  2. Wait a few seconds for the file to upload, then in the dropdown menu below 'Enter the name of the text column...' choose the column from the data file that you want to search.
 
22
  5. Hit search text. You may have to wait depending on the size of the data you are searching.
23
  6. You will receive back 1. the top search result and 2. a csv of the search results found in the text ordered by relevance, joined onto the original columns from your data source.
24
 
25
+ ## Semantic search
26
+
27
+ This search type enables you to search for broader themes (e.g. happiness, nature); the search will pick out text passages that relate to these themes even if they don't contain the exact words.
28
+
29
+ 1. Load in your tabular data file (.csv, .parquet, .xlsx - first sheet).
30
+ 2. Wait a few seconds for the file to upload, then in the dropdown menu below 'Enter the name of the text column...' choose the column from the data file that you want to search.
31
+ 3. Hit 'Load data'. The 'Load progress' text box will let you know when the file is ready.
32
+ 4. In the 'Enter semantic search query here' area below this, type in the terms you would like to search for.
33
+ 5. Press 'Start semantic search'. You may have to wait depending on the size of the data you are searching.
34
+ 6. You will receive back (1) the top search result and (2) a csv of the search results ordered by relevance, joined onto the original columns from your data source.
35
+
36
+
37
  # Advanced options
38
+ The search should perform well with default options, so you shouldn't need to change things here. More details on each parameter are provided below.
39
 
40
  ## Data load / save options
41
+ Toggle 'Clean text during load...' to "Yes" if you want to remove html tags and lemmatise the text, i.e. strip word endings to retain the core of the word (e.g. 'searched' or 'searches' becomes 'search'). Early testing suggests that cleaning takes some time and does not seem to improve the quality of search results.
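For illustration, lemmatisation of this kind can be sketched with spaCy (a rough example only; the app's own cleaning functions may differ in detail):

```python
# Minimal sketch of lemmatisation with spaCy; assumes en_core_web_sm is installed.
import spacy

nlp = spacy.load("en_core_web_sm")

def lemmatise(text: str) -> str:
    # Reduce each word to its base form, e.g. 'searched' or 'searches' -> 'search'
    return " ".join(token.lemma_ for token in nlp(text))

print(lemmatise("searched searches"))
```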
42
+
43
+ 'Return intermediate files', when set to "Yes", will save a tokenised text file (for keyword search) or an embedded text file (for semantic search) during data preparation. These files can then be loaded in next time alongside the data files to save preparation time in future search sessions.
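For example, the tokenised output for keyword search is stored as a single-column parquet file that can simply be read back in on a later session (a simplified sketch; the exact file name the app writes may differ):

```python
import pandas as pd

# After the first preparation run, the tokenised corpus (a list of token lists) is saved...
corpus = [["keyword", "search", "example"], ["another", "row", "of", "text"]]
pd.DataFrame(data={"Corpus": corpus}).to_parquet("my_data_keyword_search_tokenised_data.parquet")

# ...and on a later session it can be loaded back instead of re-tokenising the text
tokenised_df = pd.read_parquet("my_data_keyword_search_tokenised_data.parquet")
corpus = tokenised_df.iloc[:, 0].tolist()
```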
44
 
45
+ 'Round embeddings to three dp...' will reduce the precision of the embedding outputs to three decimal places and multiply all values by 100, reducing the size of the output numpy array by about 50%. It seems to have minimal effect on search results according to simple comparisons, but I cannot guarantee this!
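A rough sketch of the kind of reduction described (illustrative only; the exact scaling and file names used by the app may differ):

```python
import numpy as np

embeddings = np.random.rand(1000, 512).astype(np.float32)  # stand-in for real embeddings

# Round to three decimal places and scale by 100, cutting down the number of distinct values
embeddings_small = np.round(embeddings, 3) * 100

# Fewer distinct values means the compressed .npz file ends up considerably smaller
np.savez_compressed("semantic_search_embeddings.npz", embeddings)
np.savez_compressed("semantic_search_embeddings_compressed.npz", embeddings_small)
```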
46
+
47
+ ## Keyword search options
48
  Here are a few options to modify the BM25 search parameters. If you want more information on what each parameter does, click the relevant info button to the right of the sliders.
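In terms of the underlying code, these sliders roughly map onto the parameters of the BM25 class used for keyword search (illustrative values shown):

```python
from search_funcs.bm25_functions import BM25

# The corpus is a list of token lists, one per row of the chosen text column
corpus = [["keyword", "search", "example"], ["another", "row", "of", "text"]]

# k1 controls term-frequency saturation, b controls document-length normalisation,
# and alpha sets an IDF cut-off below which terms are ignored
bm25 = BM25(corpus, k1=1.5, b=0.75, alpha=-5)
```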
49
 
50
+ ## Semantic search options
51
+ The only option here currently is the minimum similarity score needed for a result to be included. The default works quite well; in my experience, anything above 0.85 tends to return no results.
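Behind the scenes, the threshold is applied to cosine similarity scores between the embedded query and each embedded text passage, along these lines (a simplified sketch, not the app's exact code):

```python
import numpy as np

def filter_by_similarity(query_vec, doc_vecs, min_score=0.7):
    # Normalise, then take dot products to get cosine similarities
    query_norm = query_vec / np.linalg.norm(query_vec)
    doc_norms = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    similarities = doc_norms @ query_norm
    # Keep only passages scoring above the minimum threshold
    keep = similarities > min_score
    return np.flatnonzero(keep), similarities[keep]

# Example with random stand-in vectors
doc_vecs = np.random.rand(100, 512)
query_vec = np.random.rand(512)
indices, scores = filter_by_similarity(query_vec, doc_vecs, min_score=0.7)
```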
52
+
53
  ## Join on additional dataframes to results
54
+ Join additional data onto the search results. This could be useful, for example, if you have tabular text data associated with a person ID, and after searching you would like to join on information associated with that person to aid with post-search filtering/analysis (the underlying join is sketched at the end of this section).
55
 
56
  To do this:
57
  1. Load in the tabular data you want to join in the box (.csv, .parquet, .xlsx - first sheet).
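In pandas terms, the join the app performs is essentially a de-duplicated left join on the chosen key columns, roughly as follows (column names here are illustrative):

```python
import pandas as pd

results_df = pd.DataFrame({"person_id": ["1", "2"], "search_text": ["first result", "second result"]})
join_df = pd.DataFrame({"id": ["1", "2"], "team": ["A", "B"]})

# Duplicate keys are dropped first so the join does not expand the results
join_df = join_df.drop_duplicates("id")

results_joined = results_df.merge(join_df, left_on="person_id", right_on="id", how="left").drop("id", axis=1)
```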
app.py CHANGED
@@ -1,707 +1,17 @@
1
- import nltk
2
- from typing import TypeVar
3
- nltk.download('names')
4
- nltk.download('stopwords')
5
- nltk.download('wordnet')
6
- nltk.download('punkt')
7
-
8
- from search_funcs.fast_bm25 import BM25
9
- from search_funcs.clean_funcs import initial_clean, get_lemma_tokens#, stem_sentence
10
- from nltk import word_tokenize
11
- #from sentence_transformers import SentenceTransformer
12
-
13
- # Try SpaCy alternative tokeniser
14
-
15
- PandasDataFrame = TypeVar('pd.core.frame.DataFrame')
16
 
17
  import gradio as gr
18
  import pandas as pd
19
- import numpy as np
20
- import os
21
- import time
22
- import math
23
- from itertools import islice
24
- from chromadb.config import Settings
25
-
26
- from transformers import AutoModel
27
-
28
- # Load the SpaCy mode
29
- from spacy.cli import download
30
- import spacy
31
- spacy.prefer_gpu()
32
-
33
- #os.system("python -m spacy download en_core_web_sm")
34
- try:
35
- nlp = spacy.load("en_core_web_sm")
36
- except:
37
- download("en_core_web_sm")
38
- nlp = spacy.load("en_core_web_sm")
39
-
40
-
41
- # model = AutoModel.from_pretrained('./model_and_tokenizer/int8-model.onnx', use_embedding_runtime=True)
42
- # sentence_embeddings = model.generate(engine_input)['last_hidden_state:0']
43
-
44
- # print("Sentence embeddings:", sentence_embeddings)
45
-
46
- import search_funcs.ingest as ing
47
- #import search_funcs.chatfuncs as chatf
48
-
49
- # Import Chroma and instantiate a client. The default Chroma client is ephemeral, meaning it will not save to disk.
50
- import chromadb
51
- #from typing_extensions import Protocol
52
- #from chromadb import Documents, EmbeddingFunction, Embeddings
53
-
54
- from torch import cuda, backends, tensor, mm
55
-
56
- # Check for torch cuda
57
- print(cuda.is_available())
58
- print(backends.cudnn.enabled)
59
- if cuda.is_available():
60
- torch_device = "cuda"
61
- os.system("nvidia-smi")
62
-
63
- else:
64
- torch_device = "cpu"
65
-
66
- # Remove Chroma database file. If it exists as it can cause issues
67
- chromadb_file = "chroma.sqlite3"
68
-
69
- if os.path.isfile(chromadb_file):
70
- os.remove(chromadb_file)
71
-
72
-
73
- def load_embeddings(embeddings_name = "jinaai/jina-embeddings-v2-small-en"):
74
- '''
75
- Load embeddings model and create a global variable based on it.
76
- '''
77
-
78
- # Import Chroma and instantiate a client. The default Chroma client is ephemeral, meaning it will not save to disk.
79
-
80
- #else:
81
- embeddings_func = AutoModel.from_pretrained(embeddings_name, trust_remote_code=True, device_map="auto")
82
-
83
- global embeddings
84
-
85
- embeddings = embeddings_func
86
-
87
- return embeddings
88
-
89
- # Load embeddings
90
- embeddings_name = "jinaai/jina-embeddings-v2-small-en"
91
- embeddings_model = AutoModel.from_pretrained(embeddings_name, trust_remote_code=True, device_map="auto")
92
- #embeddings_model = SentenceTransformer("BAAI/bge-small-en-v1.5")
93
- #embeddings_model = SentenceTransformer("paraphrase-MiniLM-L3-v2")
94
-
95
- #tokenizer = AutoTokenizer.from_pretrained(embeddings_name, device_map = "auto")#to(torch_device) # From Jina
96
- # Construction 2 - from SpaCy - https://spacy.io/api/tokenizer
97
-
98
-
99
- #from spacy.lang.en import English
100
- #nlp = #English()
101
- # Create a Tokenizer with the default settings for English
102
- # including punctuation rules and exceptions
103
- tokenizer = nlp.tokenizer
104
-
105
- embeddings = embeddings_model#load_embeddings(embeddings_name)
106
-
107
-
108
- def prepare_input_data(in_file, text_column, clean="No", progress=gr.Progress()):
109
-
110
- file_list = [string.name for string in in_file]
111
-
112
- print(file_list)
113
-
114
- data_file_names = [string for string in file_list if "tokenised" not in string]
115
-
116
- df = read_file(data_file_names[0])
117
-
118
- ## Load in pre-tokenised corpus if exists
119
- tokenised_df = pd.DataFrame()
120
-
121
- tokenised_file_names = [string for string in file_list if "tokenised" in string]
122
-
123
- if tokenised_file_names:
124
- tokenised_df = read_file(tokenised_file_names[0])
125
- print("Tokenised df is: ", tokenised_df.head())
126
-
127
- #df = pd.read_parquet(file_in.name)
128
- df_list = list(df[text_column].astype(str).str.lower())
129
-
130
- # def get_total_batches(my_list, batch_size):
131
- # return math.ceil(len(my_list) / batch_size)
132
-
133
- # def batch(iterable, batch_size):
134
- # iterator = iter(iterable)
135
- # for first in iterator:
136
- # yield [first] + list(islice(iterator, batch_size - 1))
137
-
138
- batch_size = 256
139
-
140
- tic = time.perf_counter()
141
-
142
- if clean == "Yes":
143
- df_list_clean = initial_clean(df_list)
144
-
145
- # Save to file if you have cleaned the data
146
- out_file_name = save_prepared_data(in_file, df_list_clean, df, text_column)
147
-
148
-
149
- # Tokenize texts in batches
150
- if not tokenised_df.empty:
151
- corpus = tokenised_df.iloc[:,0].tolist()
152
- print("Corpus is: ", corpus[0:5])
153
-
154
- else:
155
- corpus = []
156
- for doc in tokenizer.pipe(progress.tqdm(df_list_clean, desc = "Tokenising text", unit = "rows"), batch_size=batch_size):
157
- corpus.append([token.text for token in doc])
158
-
159
- else:
160
-
161
- print(df_list[0])
162
-
163
- # Tokenize texts in batches
164
- if not tokenised_df.empty:
165
- corpus = tokenised_df.iloc[:,0].tolist()
166
- print("Corpus is: ", corpus[0:5])
167
-
168
- else:
169
-
170
- corpus = []
171
- for doc in tokenizer.pipe(progress.tqdm(df_list, desc = "Tokenising text", unit = "rows"), batch_size=batch_size):
172
- corpus.append([token.text for token in doc])
173
-
174
- out_file_name = None
175
-
176
- print(corpus[0])
177
-
178
-
179
- toc = time.perf_counter()
180
- tokenizer_time_out = f"Tokenising the text took {toc - tic:0.1f} seconds"
181
-
182
- print("Finished data clean. " + tokenizer_time_out)
183
-
184
- if len(df_list) >= 20:
185
- message = "Data loaded"
186
- else:
187
- message = "Data loaded. Warning: dataset may be too short to get consistent search results."
188
-
189
- tokenised_data_file_name = "keyword_search_tokenised_data.parquet"
190
- pd.DataFrame(data={"Corpus":corpus}).to_parquet(tokenised_data_file_name)
191
-
192
- return corpus, message, df, out_file_name, tokenised_data_file_name
193
-
194
- def get_file_path_end(file_path):
195
- # First, get the basename of the file (e.g., "example.txt" from "/path/to/example.txt")
196
- basename = os.path.basename(file_path)
197
-
198
- # Then, split the basename and its extension and return only the basename without the extension
199
- filename_without_extension, _ = os.path.splitext(basename)
200
-
201
- print(filename_without_extension)
202
-
203
- return filename_without_extension
204
-
205
- def save_prepared_data(in_file, prepared_text_list, in_df, in_bm25_column):
206
-
207
- # Check if the list and the dataframe have the same length
208
- if len(prepared_text_list) != len(in_df):
209
- raise ValueError("The length of 'prepared_text_list' and 'in_df' must match.")
210
-
211
- file_end = ".parquet"
212
-
213
- file_name = get_file_path_end(in_file.name) + "_cleaned" + file_end
214
-
215
- prepared_text_df = pd.DataFrame(data={in_bm25_column + "_cleaned":prepared_text_list})
216
-
217
- # Drop original column from input file to reduce file size
218
- in_df = in_df.drop(in_bm25_column, axis = 1)
219
-
220
- prepared_df = pd.concat([in_df, prepared_text_df], axis = 1)
221
-
222
- if file_end == ".csv":
223
- prepared_df.to_csv(file_name)
224
- elif file_end == ".parquet":
225
- prepared_df.to_parquet(file_name)
226
- else: file_name = None
227
-
228
-
229
- return file_name
230
-
231
- def prepare_bm25(corpus, k1=1.5, b = 0.75, alpha=-5):
232
- #bm25.save("saved_df_bm25")
233
- #bm25 = BM25.load(re.sub(r'\.pkl$', '', file_in.name))
234
-
235
- print("Preparing BM25 corpus")
236
-
237
- global bm25
238
- bm25 = BM25(corpus, k1=k1, b=b, alpha=alpha)
239
-
240
- message = "Search parameters loaded."
241
-
242
- print(message)
243
-
244
- return message
245
-
246
- def convert_query_to_tokens(free_text_query, clean="No"):
247
- '''
248
- Split open text query into tokens and then lemmatise to get the core of the word
249
- '''
250
-
251
- if clean=="Yes":
252
- split_query = word_tokenize(free_text_query.lower())
253
- out_query = get_lemma_tokens(split_query)
254
- #out_query = stem_sentence(free_text_query)
255
- else:
256
- split_query = word_tokenize(free_text_query.lower())
257
- out_query = split_query
258
-
259
- return out_query
260
-
261
- def bm25_search(free_text_query, in_no_search_results, original_data, text_column, clean = "No", in_join_file = None, in_join_column = "", search_df_join_column = ""):
262
-
263
- # Prepare query
264
- if (clean == "Yes") | (text_column.endswith("_cleaned")):
265
- token_query = convert_query_to_tokens(free_text_query, clean="Yes")
266
- else:
267
- token_query = convert_query_to_tokens(free_text_query, clean="No")
268
-
269
- print(token_query)
270
-
271
- # Perform search
272
- print("Searching")
273
-
274
- results_index, results_text, results_scores = bm25.extract_documents_and_scores(token_query, bm25.corpus, n=in_no_search_results) #bm25.corpus #original_data[text_column]
275
- if not results_index:
276
- return "No search results found", None, token_query
277
-
278
- print("Search complete")
279
-
280
- # Prepare results and export
281
- joined_texts = [' '.join(inner_list) for inner_list in results_text]
282
- results_df = pd.DataFrame(data={"index": results_index,
283
- "search_text": joined_texts,
284
- "search_score_abs": results_scores})
285
- results_df['search_score_abs'] = abs(round(results_df['search_score_abs'], 2))
286
- results_df_out = results_df[['index', 'search_text', 'search_score_abs']].merge(original_data,left_on="index", right_index=True, how="left")#.drop("index", axis=1)
287
-
288
- # Join on additional files
289
- if in_join_file:
290
- join_filename = in_join_file.name
291
-
292
- # Import data
293
- join_df = read_file(join_filename)
294
- join_df[in_join_column] = join_df[in_join_column].astype(str).str.replace("\.0$","", regex=True)
295
- results_df_out[search_df_join_column] = results_df_out[search_df_join_column].astype(str).str.replace("\.0$","", regex=True)
296
-
297
- # Duplicates dropped so as not to expand out dataframe
298
- join_df = join_df.drop_duplicates(in_join_column)
299
-
300
- results_df_out = results_df_out.merge(join_df,left_on=search_df_join_column, right_on=in_join_column, how="left").drop(in_join_column, axis=1)
301
-
302
- # Reorder results by score
303
- results_df_out = results_df_out.sort_values('search_score_abs', ascending=False)
304
-
305
- # Out file
306
- results_df_name = "search_result.csv"
307
- results_df_out.to_csv(results_df_name, index= None)
308
- results_first_text = results_df_out[text_column].iloc[0]
309
-
310
- print("Returning results")
311
-
312
- return results_first_text, results_df_name, token_query
313
-
314
- def detect_file_type(filename):
315
- """Detect the file type based on its extension."""
316
- if (filename.endswith('.csv')) | (filename.endswith('.csv.gz')) | (filename.endswith('.zip')):
317
- return 'csv'
318
- elif filename.endswith('.xlsx'):
319
- return 'xlsx'
320
- elif filename.endswith('.parquet'):
321
- return 'parquet'
322
- else:
323
- raise ValueError("Unsupported file type.")
324
-
325
- def read_file(filename):
326
- """Read the file based on its detected type."""
327
- file_type = detect_file_type(filename)
328
-
329
- if file_type == 'csv':
330
- return pd.read_csv(filename, low_memory=False).reset_index().drop(["index", "Unnamed: 0"], axis=1, errors="ignore")
331
- elif file_type == 'xlsx':
332
- return pd.read_excel(filename).reset_index().drop(["index", "Unnamed: 0"], axis=1, errors="ignore")
333
- elif file_type == 'parquet':
334
- return pd.read_parquet(filename).reset_index().drop(["index", "Unnamed: 0"], axis=1, errors="ignore")
335
-
336
- def put_columns_in_df(in_file, in_bm25_column):
337
- '''
338
- When file is loaded, update the column dropdown choices and change 'clean data' dropdown option to 'no'.
339
- '''
340
-
341
- file_list = [string.name for string in in_file]
342
-
343
- print(file_list)
344
-
345
- data_file_names = [string for string in file_list if "tokenised" not in string]
346
-
347
- new_choices = []
348
- concat_choices = []
349
-
350
-
351
- df = read_file(data_file_names[0])
352
- new_choices = list(df.columns)
353
-
354
- #print(new_choices)
355
-
356
- concat_choices.extend(new_choices)
357
-
358
- return gr.Dropdown(choices=concat_choices), gr.Dropdown(value="No", choices = ["Yes", "No"]),\
359
- gr.Dropdown(choices=concat_choices)
360
-
361
- def put_columns_in_join_df(in_file, in_bm25_column):
362
- '''
363
- When file is loaded, update the column dropdown choices and change 'clean data' dropdown option to 'no'.
364
- '''
365
-
366
- print("in_bm25_column")
367
-
368
- new_choices = []
369
- concat_choices = []
370
-
371
-
372
- df = read_file(in_file.name)
373
- new_choices = list(df.columns)
374
 
375
- print(new_choices)
376
-
377
- concat_choices.extend(new_choices)
378
-
379
- return gr.Dropdown(choices=concat_choices)
380
-
381
- def dummy_function(gradio_component):
382
- """
383
- A dummy function that exists just so that dropdown updates work correctly.
384
- """
385
- return None
386
-
387
- def display_info(info_component):
388
- gr.Info(info_component)
389
-
390
- def docs_to_chroma_save(docs_out, embeddings = embeddings, progress=gr.Progress()):
391
- '''
392
- Takes a Langchain document class and saves it into a Chroma sqlite file.
393
- '''
394
-
395
- print(f"> Total split documents: {len(docs_out)}")
396
-
397
- #print(docs_out)
398
-
399
- page_contents = [doc.page_content for doc in docs_out]
400
- page_meta = [doc.metadata for doc in docs_out]
401
- ids_range = range(0,len(page_contents))
402
- ids = [str(element) for element in ids_range]
403
-
404
- tic = time.perf_counter()
405
- #embeddings_list = []
406
- #for page in progress.tqdm(page_contents, desc = "Preparing search index", unit = "rows"):
407
- # embeddings_list.append(embeddings.encode(sentences=page, max_length=1024).tolist())
408
-
409
- embeddings_list = embeddings.encode(sentences=page_contents, max_length=256, show_progress_bar = True, batch_size = 32).tolist() # For Jina embeddings
410
- #embeddings_list = embeddings.encode(sentences=page_contents, normalize_embeddings=True).tolist() # For BGE embeddings
411
- #embeddings_list = embeddings.encode(sentences=page_contents).tolist() # For minilm
412
-
413
- toc = time.perf_counter()
414
- time_out = f"The embedding took {toc - tic:0.1f} seconds"
415
-
416
- #pd.Series(embeddings_list).to_csv("embeddings_out.csv")
417
-
418
- # Jina tiny
419
- # This takes about 300 seconds for 240,000 records = 800 / second, 1024 max length
420
- # For 50k records:
421
- # 61 seconds at 1024 max length
422
- # 55 seconds at 512 max length
423
- # 43 seconds at 256 max length
424
- # 31 seconds at 128 max length
425
-
426
- # The embedding took 1372.5 seconds at 256 max length for 655,020 case notes
427
-
428
- # BGE small
429
- # 96 seconds for 50k records at 512 length
430
-
431
- # all-MiniLM-L6-v2
432
- # 42.5 seconds at (256?) max length
433
-
434
- # paraphrase-MiniLM-L3-v2
435
- # 22 seconds for 128 max length
436
-
437
-
438
- print(time_out)
439
-
440
- chroma_tic = time.perf_counter()
441
-
442
- # Create a new Chroma collection to store the documents and metadata. We don't need to specify an embedding fuction, and the default will be used.
443
- client = chromadb.PersistentClient(path="./last_year", settings=Settings(
444
- anonymized_telemetry=False))
445
-
446
- try:
447
- print("Deleting existing collection.")
448
- #collection = client.get_collection(name="my_collection")
449
- client.delete_collection(name="my_collection")
450
- print("Creating new collection.")
451
- collection = client.create_collection(name="my_collection")
452
- except:
453
- print("Creating new collection.")
454
- collection = client.create_collection(name="my_collection")
455
-
456
- # Match batch size is about 40,000, so add that amount in a loop
457
- def create_batch_ranges(in_list, batch_size=40000):
458
- total_rows = len(in_list)
459
- ranges = []
460
-
461
- for start in range(0, total_rows, batch_size):
462
- end = min(start + batch_size, total_rows)
463
- ranges.append(range(start, end))
464
-
465
- return ranges
466
-
467
- batch_ranges = create_batch_ranges(embeddings_list)
468
- print(batch_ranges)
469
-
470
- for row_range in progress.tqdm(batch_ranges, desc = "Creating vector database", unit = "batches of 40,000 rows"):
471
-
472
- collection.add(
473
- documents = page_contents[row_range[0]:row_range[-1]],
474
- embeddings = embeddings_list[row_range[0]:row_range[-1]],
475
- metadatas = page_meta[row_range[0]:row_range[-1]],
476
- ids = ids[row_range[0]:row_range[-1]])
477
-
478
- print(collection.count())
479
-
480
- #chatf.vectorstore = vectorstore_func
481
-
482
- chroma_toc = time.perf_counter()
483
-
484
- chroma_time_out = f"Loading to Chroma db took {chroma_toc - chroma_tic:0.1f} seconds"
485
- print(chroma_time_out)
486
-
487
- out_message = "Document processing complete"
488
-
489
- return out_message, collection
490
-
491
- def docs_to_np_array(docs_out, in_file, embeddings = embeddings, progress=gr.Progress()):
492
- '''
493
- Takes a Langchain document class and saves it into a Chroma sqlite file.
494
- '''
495
-
496
- print(f"> Total split documents: {len(docs_out)}")
497
-
498
- #print(docs_out)
499
-
500
- page_contents = [doc.page_content for doc in docs_out]
501
-
502
-
503
- ## Load in pre-embedded file if exists
504
- file_list = [string.name for string in in_file]
505
-
506
- #print(file_list)
507
-
508
- embeddings_file_names = [string for string in file_list if "embedding" in string]
509
-
510
- out_message = "Document processing complete. Ready to search."
511
-
512
- if embeddings_file_names:
513
- embeddings_out = np.load(embeddings_file_names[0])['arr_0']
514
- print("embeddings loaded: ", embeddings_out)
515
-
516
- if not embeddings_file_names:
517
- tic = time.perf_counter()
518
- #embeddings_list = []
519
- #for page in progress.tqdm(page_contents, desc = "Preparing search index", unit = "rows"):
520
- # embeddings_list.append(embeddings.encode(sentences=page, max_length=1024).tolist())
521
-
522
- embeddings_out = embeddings.encode(sentences=page_contents, max_length=1024, show_progress_bar = True, batch_size = 32) # For Jina embeddings
523
- #embeddings_list = embeddings.encode(sentences=page_contents, normalize_embeddings=True).tolist() # For BGE embeddings
524
- #embeddings_list = embeddings.encode(sentences=page_contents).tolist() # For minilm
525
-
526
- print(embeddings_out)
527
- embeddings_out_round = np.round(embeddings_out, 4)
528
-
529
- toc = time.perf_counter()
530
- time_out = f"The embedding took {toc - tic:0.1f} seconds"
531
-
532
- semantic_search_file_name = 'semantic_search_embeddings.npz'
533
- semantic_search_rounded_file_name = 'semantic_search_embeddings_rounded.npz'
534
-
535
- np.savez_compressed(semantic_search_file_name, embeddings_out)
536
- np.savez_compressed(semantic_search_rounded_file_name, embeddings_out_round)
537
-
538
- return out_message, embeddings_out, semantic_search_file_name, semantic_search_rounded_file_name
539
-
540
- print(out_message)
541
-
542
- return out_message, embeddings_out, None, None
543
-
544
- def process_data_from_scores_df(df_docs, in_join_file, out_passages, vec_score_cut_off, vec_weight, orig_df_col, in_join_column, search_df_join_column):
545
-
546
- def create_docs_keep_from_df(df):
547
- dict_out = {'ids' : [df['ids']],
548
- 'documents': [df['documents']],
549
- 'metadatas': [df['metadatas']],
550
- 'distances': [round(df['distances'].astype(float), 3)],
551
- 'embeddings': None
552
- }
553
- return dict_out
554
-
555
- # Prepare the DataFrame by transposing
556
- #df_docs = df#.apply(lambda x: x.explode()).reset_index(drop=True)
557
-
558
- # Keep only documents with a certain score
559
-
560
- #print(df_docs)
561
-
562
- docs_scores = df_docs["distances"] #.astype(float)
563
-
564
- # Only keep sources that are sufficiently relevant (i.e. similarity search score below threshold below)
565
- score_more_limit = df_docs.loc[docs_scores > vec_score_cut_off, :]
566
- #docs_keep = create_docs_keep_from_df(score_more_limit) #list(compress(docs, score_more_limit))
567
-
568
- #print(docs_keep)
569
-
570
- if score_more_limit.empty:
571
- return 'No result found!', None
572
-
573
- # Only keep sources that are at least 100 characters long
574
- docs_len = score_more_limit["documents"].str.len() >= 100
575
-
576
- #print(docs_len)
577
-
578
- length_more_limit = score_more_limit.loc[docs_len == True, :] #pd.Series(docs_len) >= 100
579
- #docs_keep = create_docs_keep_from_df(length_more_limit) #list(compress(docs_keep, length_more_limit))
580
-
581
- #print(length_more_limit)
582
-
583
- if length_more_limit.empty:
584
- return 'No result found!', None
585
-
586
- length_more_limit['ids'] = length_more_limit['ids'].astype(int)
587
-
588
- #length_more_limit.to_csv("length_more_limit.csv", index = None)
589
-
590
- # Explode the 'metadatas' dictionary into separate columns
591
- df_metadata_expanded = length_more_limit['metadatas'].apply(pd.Series)
592
-
593
- #print(length_more_limit)
594
- #print(df_metadata_expanded)
595
-
596
- # Concatenate the original DataFrame with the expanded metadata DataFrame
597
- results_df_out = pd.concat([length_more_limit.drop('metadatas', axis=1), df_metadata_expanded], axis=1)
598
-
599
- results_df_out = results_df_out.rename(columns={"documents":orig_df_col})
600
-
601
- results_df_out = results_df_out.drop(["page_section", "row", "source", "id"], axis=1, errors="ignore")
602
- results_df_out['distances'] = round(results_df_out['distances'].astype(float), 3)
603
-
604
- # Join back to original df
605
- # results_df_out = orig_df.merge(length_more_limit[['ids', 'distances']], left_index = True, right_on = "ids", how="inner").sort_values("distances")
606
-
607
- # Join on additional files
608
- if in_join_file:
609
- join_filename = in_join_file.name
610
-
611
- # Import data
612
- join_df = read_file(join_filename)
613
- join_df[in_join_column] = join_df[in_join_column].astype(str).str.replace("\.0$","", regex=True)
614
-
615
- # Duplicates dropped so as not to expand out dataframe
616
- join_df = join_df.drop_duplicates(in_join_column)
617
-
618
- results_df_out[search_df_join_column] = results_df_out[search_df_join_column].astype(str).str.replace("\.0$","", regex=True)
619
-
620
- results_df_out = results_df_out.merge(join_df,left_on=search_df_join_column, right_on=in_join_column, how="left").drop(in_join_column, axis=1)
621
-
622
- return results_df_out
623
-
624
- def jina_simple_retrieval(new_question_kworded, vectorstore, docs, orig_df_col:str, k_val:int, out_passages:int,
625
- vec_score_cut_off:float, vec_weight:float, in_join_file = None, in_join_column = None, search_df_join_column = None, device = torch_device, embeddings = embeddings, progress=gr.Progress()): # ,vectorstore, embeddings
626
-
627
- print("vectorstore loaded: ", vectorstore)
628
-
629
- # Convert it to a PyTorch tensor and transfer to GPU
630
- vectorstore_tensor = tensor(vectorstore).to(device)
631
-
632
- # Load the sentence transformer model and move it to GPU
633
- embeddings = embeddings.to(device)
634
-
635
- # Encode the query using the sentence transformer and convert to a PyTorch tensor
636
- query = embeddings.encode(new_question_kworded)
637
- query_tensor = tensor(query).to(device)
638
-
639
- if query_tensor.dim() == 1:
640
- query_tensor = query_tensor.unsqueeze(0) # Reshape to 2D with one row
641
-
642
- # Normalize the query tensor and vectorstore tensor
643
- query_norm = query_tensor / query_tensor.norm(dim=1, keepdim=True)
644
- vectorstore_norm = vectorstore_tensor / vectorstore_tensor.norm(dim=1, keepdim=True)
645
-
646
- # Calculate cosine similarities (batch processing)
647
- cosine_similarities = mm(query_norm, vectorstore_norm.T)
648
-
649
- # Flatten the tensor to a 1D array
650
- cosine_similarities = cosine_similarities.flatten()
651
-
652
- # Convert to a NumPy array if it's still a PyTorch tensor
653
- cosine_similarities = cosine_similarities.cpu().numpy()
654
-
655
- # Create a Pandas Series
656
- cosine_similarities_series = pd.Series(cosine_similarities)
657
-
658
- # Pull out relevent info from docs
659
- page_contents = [doc.page_content for doc in docs]
660
- page_meta = [doc.metadata for doc in docs]
661
- ids_range = range(0,len(page_contents))
662
- ids = [str(element) for element in ids_range]
663
-
664
- df_docs = pd.DataFrame(data={"ids": ids,
665
- "documents": page_contents,
666
- "metadatas":page_meta,
667
- "distances":cosine_similarities_series}).sort_values("distances", ascending=False).iloc[0:k_val,:]
668
-
669
-
670
- results_df_out = process_data_from_scores_df(df_docs, in_join_file, out_passages, vec_score_cut_off, vec_weight, orig_df_col, in_join_column, search_df_join_column)
671
-
672
- results_df_name = "semantic_search_result.csv"
673
- results_df_out.to_csv(results_df_name, index= None)
674
- results_first_text = results_df_out.iloc[0, 1]
675
-
676
- return results_first_text, results_df_name
677
-
678
- def chroma_retrieval(new_question_kworded:str, vectorstore, docs, orig_df_col:str, k_val:int, out_passages:int,
679
- vec_score_cut_off:float, vec_weight:float, in_join_file = None, in_join_column = None, search_df_join_column = None): # ,vectorstore, embeddings
680
-
681
- query = embeddings.encode(new_question_kworded).tolist()
682
-
683
- docs = vectorstore.query(
684
- query_embeddings=query,
685
- n_results= k_val # No practical limit on number of responses returned
686
- #where={"metadata_field": "is_equal_to_this"},
687
- #where_document={"$contains":"search_string"}
688
- )
689
-
690
- df_docs = pd.DataFrame(data={'ids': docs['ids'][0],
691
- 'documents': docs['documents'][0],
692
- 'metadatas':docs['metadatas'][0],
693
- 'distances':docs['distances'][0]#,
694
- #'embeddings': docs['embeddings']
695
- })
696
-
697
- results_df_out = process_data_from_scores_df(df_docs, in_join_file, out_passages, vec_score_cut_off, vec_weight, orig_df_col, in_join_column, search_df_join_column)
698
-
699
- results_df_name = "semantic_search_result.csv"
700
- results_df_out.to_csv(results_df_name, index= None)
701
- results_first_text = results_df_out[orig_df_col].iloc[0]
702
-
703
- return results_first_text, results_df_name
704
 
 
 
 
705
 
706
  ## Gradio app - BM25 search
707
  block = gr.Blocks(theme = gr.themes.Base())
@@ -716,7 +26,6 @@ with block:
716
 
717
  k_val = gr.State(9999)
718
  out_passages = gr.State(9999)
719
- vec_score_cut_off = gr.State(0.7)
720
  vec_weight = gr.State(1)
721
 
722
  docs_keep_as_doc_state = gr.State()
@@ -740,11 +49,17 @@ depends on factors such as the type of documents or queries. Information taken f
740
 
741
  gr.Markdown(
742
  """
743
- # Fast text search
744
- Enter a text query below to search through a text data column and find relevant terms. It will only find terms containing the exact text you enter. Your data should contain at least 20 entries for the search to consistently return results.
745
  """)
746
 
747
  with gr.Tab(label="Keyword search"):
 
 
 
 
 
 
748
  with gr.Row():
749
  current_source = gr.Textbox(label="Current data source(s)", value="None")
750
 
@@ -760,7 +75,7 @@ depends on factors such as the type of documents or queries. Information taken f
760
  with gr.Accordion(label = "Search data", open=True):
761
  with gr.Row():
762
  keyword_query = gr.Textbox(label="Enter your search term")
763
- mod_query = gr.Textbox(label="Cleaned search term (the terms that are passed to the search engine)")
764
 
765
  keyword_search_button = gr.Button(value="Search text")
766
 
@@ -768,12 +83,18 @@ depends on factors such as the type of documents or queries. Information taken f
768
  output_single_text = gr.Textbox(label="Top result")
769
  output_file = gr.File(label="File output")
770
 
771
- with gr.Tab("Fuzzy/semantic search"):
 
 
 
 
 
 
772
  with gr.Row():
773
  current_source_semantic = gr.Textbox(label="Current data source(s)", value="None")
774
 
775
  with gr.Accordion("Load in data", open = True):
776
- in_semantic_file = gr.File(label="Upload data file for semantic search", file_count= 'multiple', file_types = ['.parquet', '.csv', '.npy', '.npz'])
777
 
778
  with gr.Row():
779
  in_semantic_column = gr.Dropdown(label="Enter the name of the text column in the data file to search")
@@ -789,11 +110,13 @@ depends on factors such as the type of documents or queries. Information taken f
789
  semantic_output_file = gr.File(label="File output")
790
 
791
  with gr.Tab(label="Advanced options"):
792
- with gr.Accordion(label="Data load / save options", open = False):
793
- #with gr.Row():
794
- in_clean_data = gr.Dropdown(label = "Clean text during load (remove tags, stem words). This will take some time!", value="No", choices=["Yes", "No"])
 
 
795
  #save_clean_data_button = gr.Button(value = "Save loaded data to file", scale = 1)
796
- with gr.Accordion(label="Search options", open = False):
797
  with gr.Row():
798
  in_k1 = gr.Slider(label = "k1 value", value = 1.5, minimum = 0.1, maximum = 5, step = 0.1, scale = 3)
799
  in_k1_button = gr.Button(value = "k1 value info", scale = 1)
@@ -808,6 +131,8 @@ depends on factors such as the type of documents or queries. Information taken f
808
  in_no_search_results_button = gr.Button(value = "Search results number info", scale = 1)
809
  with gr.Row():
810
  in_search_param_button = gr.Button(value="Load search parameters (Need to click this if you changed anything above)")
 
 
811
  with gr.Accordion(label = "Join on additional dataframes to results", open = False):
812
  in_join_file = gr.File(label="Upload your data to join here")
813
  in_join_column = gr.Dropdown(label="Column to join in new data frame")
@@ -823,29 +148,28 @@ depends on factors such as the type of documents or queries. Information taken f
823
 
824
  ### BM25 SEARCH ###
825
  # Update dropdowns upon initial file load
826
- in_bm25_file.upload(put_columns_in_df, inputs=[in_bm25_file, in_bm25_column], outputs=[in_bm25_column, in_clean_data, search_df_join_column])
827
  in_join_file.upload(put_columns_in_join_df, inputs=[in_join_file, in_join_column], outputs=[in_join_column])
828
 
829
  # Load in BM25 data
830
- load_bm25_data_button.click(fn=prepare_input_data, inputs=[in_bm25_file, in_bm25_column, in_clean_data], outputs=[corpus_state, load_finished_message, data_state, output_file, output_file]).\
831
- then(fn=prepare_bm25, inputs=[corpus_state, in_k1, in_b, in_alpha], outputs=[load_finished_message]).\
832
- then(fn=put_columns_in_df, inputs=[in_bm25_file, in_bm25_column], outputs=[in_bm25_column, in_clean_data, search_df_join_column])
833
 
834
  # BM25 search functions on click or enter
835
- keyword_search_button.click(fn=bm25_search, inputs=[keyword_query, in_no_search_results, data_state, in_bm25_column, in_clean_data, in_join_file, in_join_column, search_df_join_column], outputs=[output_single_text, output_file, mod_query], api_name="keyword")
836
- keyword_query.submit(fn=bm25_search, inputs=[keyword_query, in_no_search_results, data_state, in_bm25_column, in_clean_data, in_join_file, in_join_column, search_df_join_column], outputs=[output_single_text, output_file, mod_query])
837
 
838
  ### SEMANTIC SEARCH ###
839
  # Load in a csv/excel file for semantic search
840
- in_semantic_file.upload(put_columns_in_df, inputs=[in_semantic_file, in_semantic_column], outputs=[in_semantic_column, in_clean_data, search_df_join_column])
841
- load_semantic_data_button.click(ing.parse_csv_or_excel, inputs=[in_semantic_file, in_semantic_column], outputs=[ingest_text, current_source_semantic, semantic_load_progress]).\
842
- then(ing.csv_excel_text_to_docs, inputs=[ingest_text, in_semantic_column], outputs=[ingest_docs, semantic_load_progress]).\
843
- then(docs_to_np_array, inputs=[ingest_docs, in_semantic_file], outputs=[semantic_load_progress, vectorstore_state, semantic_output_file, semantic_output_file])
844
 
845
  # Semantic search query
846
- semantic_submit.click(jina_simple_retrieval, inputs=[semantic_query, vectorstore_state, ingest_docs, in_semantic_column, k_val, out_passages, vec_score_cut_off, vec_weight, in_join_file, in_join_column, search_df_join_column], outputs=[semantic_output_single_text, semantic_output_file], api_name="semantic")
847
-
848
- semantic_query.submit(jina_simple_retrieval, inputs=[semantic_query, vectorstore_state, ingest_docs, in_semantic_column, k_val, out_passages, vec_score_cut_off, vec_weight, in_join_file, in_join_column, search_df_join_column], outputs=[semantic_output_single_text, semantic_output_file])
849
 
850
  # Dummy functions just to get dropdowns to work correctly with Gradio 3.50
851
  in_bm25_column.change(dummy_function, in_bm25_column, None)
 
1
+ from typing import Type
2
+ from search_funcs.bm25_functions import prepare_bm25_input_data, prepare_bm25, bm25_search
3
+ from search_funcs.semantic_ingest_functions import parse_csv_or_excel, csv_excel_text_to_docs
4
+ from search_funcs.semantic_functions import docs_to_jina_embed_np_array, jina_simple_retrieval
5
+ from search_funcs.helper_functions import dummy_function, display_info, put_columns_in_df, put_columns_in_join_df, get_temp_folder_path, empty_folder
 
6
 
7
  import gradio as gr
8
  import pandas as pd
9
 
10
+ PandasDataFrame = Type[pd.DataFrame]
11
 
12
+ # Attempt to delete temporary files generated by previous use of the app (as the files can be very big!)
13
+ temp_folder_path = get_temp_folder_path()
14
+ empty_folder(temp_folder_path)
15
 
16
  ## Gradio app - BM25 search
17
  block = gr.Blocks(theme = gr.themes.Base())
 
26
 
27
  k_val = gr.State(9999)
28
  out_passages = gr.State(9999)
 
29
  vec_weight = gr.State(1)
30
 
31
  docs_keep_as_doc_state = gr.State()
 
49
 
50
  gr.Markdown(
51
  """
52
+ # Data text search
53
+ Search through long-form text fields in your tabular data, either for exact, specific terms (Keyword search) or for thematic, 'fuzzy' matches (Semantic search). More instructions are provided in the relevant tabs below.
54
  """)
55
 
56
  with gr.Tab(label="Keyword search"):
57
+ gr.Markdown(
58
+ """
59
+ **Exact term keyword search**
60
+
61
+ 1. Load in a data file (ideally a file with '_cleaned' at the end of the name), with (optionally) the '...tokenised_data.parquet' file in the same folder to save loading time. 2. Select the field in your data to search. Ideally this will have the suffix '_cleaned' to show that html tags have been removed. 3. Wait for the data file to be prepared for search. 4. Enter the search term in the relevant box below and press Enter/click on 'Search text'. 5. Your search results will be saved in a csv file and will be presented in the 'File output' area below.
62
+ """)
63
  with gr.Row():
64
  current_source = gr.Textbox(label="Current data source(s)", value="None")
65
 
 
75
  with gr.Accordion(label = "Search data", open=True):
76
  with gr.Row():
77
  keyword_query = gr.Textbox(label="Enter your search term")
78
+ #mod_query = gr.Textbox(label="Cleaned search term (the terms that are passed to the search engine)")
79
 
80
  keyword_search_button = gr.Button(value="Search text")
81
 
 
83
  output_single_text = gr.Textbox(label="Top result")
84
  output_file = gr.File(label="File output")
85
 
86
+ with gr.Tab("Semantic search"):
87
+ gr.Markdown(
88
+ """
89
+ **Thematic/semantic search**
90
+
91
+ This search type enables you to search for broader themes (e.g. happiness, nature); the search will pick out text passages that relate to these themes even if they don't contain the exact words. 1. Load in a data file (ideally a file with '_cleaned' at the end of the name), with (optionally) the 'semantic_search_embeddings.npz' file in the same folder to save loading time. 2. Select the field in your data to search. Ideally this will have the suffix '_cleaned' to show that html tags have been removed. 3. Wait for the data file to be prepared for search. 4. Enter the search term in the 'Enter semantic search query here' box below and press Enter/click on 'Start semantic search'. 5. Your search results will be saved in a csv file and will be presented in the 'File output' area below.
92
+ """)
93
  with gr.Row():
94
  current_source_semantic = gr.Textbox(label="Current data source(s)", value="None")
95
 
96
  with gr.Accordion("Load in data", open = True):
97
+ in_semantic_file = gr.File(label="Upload data file for semantic search", file_count= 'multiple', file_types = ['.parquet', '.csv', '.npy', '.npz', '.pkl', '.pkl.gz'])
98
 
99
  with gr.Row():
100
  in_semantic_column = gr.Dropdown(label="Enter the name of the text column in the data file to search")
 
110
  semantic_output_file = gr.File(label="File output")
111
 
112
  with gr.Tab(label="Advanced options"):
113
+ with gr.Accordion(label="Data load / save options", open = True):
114
+ with gr.Row():
115
+ in_clean_data = gr.Dropdown(label = "Clean text during load (remove html tags). For large files this may take some time!", value="No", choices=["Yes", "No"])
116
+ return_intermediate_files = gr.Dropdown(label = "Return intermediate processing files from file preparation. Files can be loaded in to save processing time in future.", value="No", choices=["Yes", "No"])
117
+ embedding_super_compress = gr.Dropdown(label = "Round embeddings to three dp for smaller files with less accuracy.", value="No", choices=["Yes", "No"])
118
  #save_clean_data_button = gr.Button(value = "Save loaded data to file", scale = 1)
119
+ with gr.Accordion(label="Keyword search options", open = False):
120
  with gr.Row():
121
  in_k1 = gr.Slider(label = "k1 value", value = 1.5, minimum = 0.1, maximum = 5, step = 0.1, scale = 3)
122
  in_k1_button = gr.Button(value = "k1 value info", scale = 1)
 
131
  in_no_search_results_button = gr.Button(value = "Search results number info", scale = 1)
132
  with gr.Row():
133
  in_search_param_button = gr.Button(value="Load search parameters (Need to click this if you changed anything above)")
134
+ with gr.Accordion(label="Semantic search options", open = False):
135
+ semantic_min_distance = gr.Slider(label = "Minimum distance score for search result to be included", value = 0.7, minimum=0, maximum=0.95, step=0.01)
136
  with gr.Accordion(label = "Join on additional dataframes to results", open = False):
137
  in_join_file = gr.File(label="Upload your data to join here")
138
  in_join_column = gr.Dropdown(label="Column to join in new data frame")
 
148
 
149
  ### BM25 SEARCH ###
150
  # Update dropdowns upon initial file load
151
+ in_bm25_file.upload(put_columns_in_df, inputs=[in_bm25_file, in_bm25_column], outputs=[in_bm25_column, in_clean_data, search_df_join_column, data_state])
152
  in_join_file.upload(put_columns_in_join_df, inputs=[in_join_file, in_join_column], outputs=[in_join_column])
153
 
154
  # Load in BM25 data
155
+ load_bm25_data_button.click(fn=prepare_bm25_input_data, inputs=[in_bm25_file, in_bm25_column, data_state, in_clean_data, return_intermediate_files], outputs=[corpus_state, load_finished_message, data_state, output_file, output_file, current_source]).\
156
+ then(fn=prepare_bm25, inputs=[corpus_state, in_k1, in_b, in_alpha], outputs=[load_finished_message])#.\
157
+ #then(fn=put_columns_in_df, inputs=[in_bm25_file, in_bm25_column], outputs=[in_bm25_column, in_clean_data, search_df_join_column])
158
 
159
  # BM25 search functions on click or enter
160
+ keyword_search_button.click(fn=bm25_search, inputs=[keyword_query, in_no_search_results, data_state, in_bm25_column, in_clean_data, in_join_file, in_join_column, search_df_join_column], outputs=[output_single_text, output_file], api_name="keyword")
161
+ keyword_query.submit(fn=bm25_search, inputs=[keyword_query, in_no_search_results, data_state, in_bm25_column, in_clean_data, in_join_file, in_join_column, search_df_join_column], outputs=[output_single_text, output_file])
162
 
163
  ### SEMANTIC SEARCH ###
164
  # Load in a csv/excel file for semantic search
165
+ in_semantic_file.upload(put_columns_in_df, inputs=[in_semantic_file, in_semantic_column], outputs=[in_semantic_column, in_clean_data, search_df_join_column, data_state])
166
+ load_semantic_data_button.click(parse_csv_or_excel, inputs=[in_semantic_file, data_state, in_semantic_column], outputs=[ingest_text, current_source_semantic, semantic_load_progress]).\
167
+ then(csv_excel_text_to_docs, inputs=[ingest_text, in_semantic_file, in_semantic_column, in_clean_data, return_intermediate_files], outputs=[ingest_docs, semantic_load_progress]).\
168
+ then(docs_to_jina_embed_np_array, inputs=[ingest_docs, in_semantic_file, return_intermediate_files, embedding_super_compress], outputs=[semantic_load_progress, vectorstore_state, semantic_output_file])
169
 
170
  # Semantic search query
171
+ semantic_submit.click(jina_simple_retrieval, inputs=[semantic_query, vectorstore_state, ingest_docs, in_semantic_column, k_val, out_passages, semantic_min_distance, vec_weight, in_join_file, in_join_column, search_df_join_column], outputs=[semantic_output_single_text, semantic_output_file], api_name="semantic")
172
+ semantic_query.submit(jina_simple_retrieval, inputs=[semantic_query, vectorstore_state, ingest_docs, in_semantic_column, k_val, out_passages, semantic_min_distance, vec_weight, in_join_file, in_join_column, search_df_join_column], outputs=[semantic_output_single_text, semantic_output_file])
 
173
 
174
  # Dummy functions just to get dropdowns to work correctly with Gradio 3.50
175
  in_bm25_column.change(dummy_function, in_bm25_column, None)
hook-en_core_web_sm.py ADDED
@@ -0,0 +1,8 @@
 
1
+ from PyInstaller.utils.hooks import collect_data_files
2
+
3
+ hiddenimports = [
4
+ 'en_core_web_sm'
5
+ ]
6
+
7
+ # Use collect_data_files to find data files. Replace 'en_core_web_sm' with the correct package name if it's different.
8
+ datas = collect_data_files('en_core_web_sm')
hook-gradio.py CHANGED
@@ -1,8 +1,7 @@
1
  from PyInstaller.utils.hooks import collect_data_files
2
 
3
  hiddenimports = [
4
- 'gradio',
5
- # Add any other submodules that PyInstaller doesn't detect
6
  ]
7
 
8
  # Use collect_data_files to find data files. Replace 'gradio' with the correct package name if it's different.
 
1
  from PyInstaller.utils.hooks import collect_data_files
2
 
3
  hiddenimports = [
4
+ 'gradio'
 
5
  ]
6
 
7
  # Use collect_data_files to find data files. Replace 'gradio' with the correct package name if it's different.
how_to_create_exe_dist.txt CHANGED
@@ -4,18 +4,26 @@
4
 
5
  3. cd to this folder. Install packages from requirements.txt using 'pip install -r requirements.txt'
6
 
 
 
7
  4. In file explorer, navigate to the miniconda/envs/new_env/Lib/site-packages/gradio-client/ folder
8
 
9
  5. Copy types.json from the gradio_client folder to the folder containing the data_text_search.py file
10
 
11
- 6. pip install pyinstaller
 
 
 
 
12
 
13
- 7. In command line, cd to this folder. Then run the following 'python -m PyInstaller --additional-hooks-dir=. --hidden-import pyarrow.vendored.version --add-data="types.json;gradio_client" --clean --onefile --clean --name DataSearchApp data_text_search.py'
 
14
 
15
- 8. A 'dist' folder will be created with the executable inside along with all dependencies('dist\data_text_search').
 
16
 
17
- 9. In file explorer, navigate to the miniconda/envs/new_env/Lib/site-packages/gradio/ folder. Copy the entire folder. Paste this into the new distributable subfolder 'dist\data_text_search\_internal'
18
 
19
- 10. In 'dist\data_text_search' try double clicking on the .exe file. After a short delay, the command prompt should inform you about the ip address of the app that is now running. Copy the ip address, but do not close this window.
20
 
21
  11. In an Internet browser, navigate to the indicated IP address. The app should now be running in your browser window.
 
4
 
5
  3. cd to this folder. Install packages from requirements.txt using 'pip install -r requirements.txt'
6
 
7
+ NOTE: to ensure that spaCy models are loaded into the program correctly via requirements.txt, follow this guide: https://spacy.io/usage/models#models-download
8
+
9
  4. In file explorer, navigate to the miniconda/envs/new_env/Lib/site-packages/gradio-client/ folder
10
 
11
  5. Copy types.json from the gradio_client folder to the folder containing the data_text_search.py file
12
 
13
+ 6. If necessary, create hook- files to tell pyinstaller to include specific packages in the exe build. Examples are provided for gradio and en_core_web_sm (a spaCy model).
14
+
15
+ 7. pip install pyinstaller
16
+
17
+ 8. In the command line, cd to the folder that contains app.py. Then run one of the following:
18
 
19
+ For one single file:
20
+ python -m PyInstaller --additional-hooks-dir=. --hidden-import pyarrow.vendored.version --add-data="types.json;gradio_client" --add-data "model;model" --onefile --clean --noconfirm --upx-dir="C:\Program Files\UPX\upx-4.2.2-win64" --name DataSearchApp_0.1 app.py
21
 
22
+ For a small exe with a folder of dependencies:
23
+ python -m PyInstaller --additional-hooks-dir=. --hidden-import pyarrow.vendored.version --add-data="types.json;gradio_client" --add-data "model;model" --clean --noconfirm --upx-dir="C:\Program Files\UPX\upx-4.2.2-win64" --name DataSearchApp_0.1 app.py
24
 
25
+ 9. A 'dist' folder will be created with the executable inside, along with all dependencies ('dist\data_text_search').
26
 
27
+ 10. In 'dist\data_text_search' try double clicking on the .exe file. After a short delay, the command prompt should inform you about the IP address of the app that is now running. Copy the IP address. **Do not close this window!**
28
 
29
  11. In an Internet browser, navigate to the indicated IP address. The app should now be running in your browser window.
requirements.txt CHANGED
@@ -1,13 +1,9 @@
1
- pandas
2
- nltk
3
- pyarrow
4
- openpyxl
5
- transformers
6
- langchain
7
- chromadb
8
- torch
9
- accelerate
10
- sentence-transformers
11
- spacy
12
- polars
13
  gradio==3.50.0
 
1
+ pandas==2.1.4
2
+ polars==0.20.3
3
+ pyarrow==14.0.2
4
+ openpyxl==3.1.2
5
+ transformers==4.32.1
6
+ torch==2.1.2
7
+ spacy==3.7.2
8
+ en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
 
 
 
 
9
  gradio==3.50.0
search_funcs/{fast_bm25.py β†’ bm25_functions.py} RENAMED
@@ -3,14 +3,44 @@ import heapq
3
  import math
4
  import pickle
5
  import sys
 
 
6
  from numpy import inf
7
  import gradio as gr
8
 
9
  PARAM_K1 = 1.5
10
  PARAM_B = 0.75
11
  IDF_CUTOFF = -inf
12
 
13
- # Built off https://github.com/Inspirateur/Fast-BM25
14
 
15
  class BM25:
16
  """Fast Implementation of Best Matching 25 ranking function.
@@ -196,3 +226,201 @@ class BM25:
196
  def load(filename):
197
  with open(f"{filename}.pkl", "rb") as fsave:
198
  return pickle.load(fsave)
3
  import math
4
  import pickle
5
  import sys
6
+ import time
7
+ import pandas as pd
8
  from numpy import inf
9
  import gradio as gr
10
 
11
+ from datetime import datetime
12
+
13
+ today_rev = datetime.now().strftime("%Y%m%d")
14
+
15
+ from search_funcs.clean_funcs import initial_clean # get_lemma_tokens, stem_sentence
16
+ from search_funcs.helper_functions import read_file, get_file_path_end_with_ext, get_file_path_end
17
+
18
+ # Load the SpaCy model
19
+ from spacy.cli import download
20
+ import spacy
21
+ spacy.prefer_gpu()
22
+
23
+ #os.system("python -m spacy download en_core_web_sm")
24
+ try:
25
+ import en_core_web_sm
26
+ nlp = en_core_web_sm.load()
27
+ print("Successfully imported spaCy model")
28
+ #nlp = spacy.load("en_core_web_sm")
29
+ #print(nlp._path)
30
+ except:
31
+ download("en_core_web_sm")
32
+ nlp = spacy.load("en_core_web_sm")
33
+ print("Successfully imported spaCy model")
34
+ #print(nlp._path)
35
+
36
+ # including punctuation rules and exceptions
37
+ tokenizer = nlp.tokenizer
38
+
39
  PARAM_K1 = 1.5
40
  PARAM_B = 0.75
41
  IDF_CUTOFF = -inf
42
 
43
+ # Class built off https://github.com/Inspirateur/Fast-BM25
44
 
45
  class BM25:
46
  """Fast Implementation of Best Matching 25 ranking function.
 
226
  def load(filename):
227
  with open(f"{filename}.pkl", "rb") as fsave:
228
  return pickle.load(fsave)
229
+
230
+ # The following functions are my own work
231
+
232
+ def prepare_bm25_input_data(in_file, text_column, data_state, clean="No", return_intermediate_files = "No", progress=gr.Progress()):
233
+
234
+ file_list = [string.name for string in in_file]
235
+
236
+ #print(file_list)
237
+
238
+ data_file_names = [string for string in file_list if "tokenised" not in string and "embeddings" not in string]
239
+
240
+ data_file_name = data_file_names[0]
241
+
242
+ df = data_state #read_file(data_file_name)
243
+ data_file_out_name = get_file_path_end_with_ext(data_file_name)
244
+ data_file_out_name_no_ext = get_file_path_end(data_file_name)
245
+
246
+ ## Load in pre-tokenised corpus if exists
247
+ tokenised_df = pd.DataFrame()
248
+
249
+ tokenised_file_names = [string for string in file_list if "tokenised" in string]
250
+
251
+ if tokenised_file_names:
252
+ tokenised_df = read_file(tokenised_file_names[0])
253
+ #print("Tokenised df is: ", tokenised_df.head())
254
+
255
+ #df = pd.read_parquet(file_in.name)
256
+
257
+ df[text_column] = df[text_column].astype(str).str.lower()
258
+
259
+ if clean == "Yes":
260
+ clean_tic = time.perf_counter()
261
+ print("Starting data clean.")
262
+
263
+ df = df.drop_duplicates(text_column)
264
+ df_list = list(df[text_column])
265
+ df_list = initial_clean(df_list)
266
+
267
+ # Save to file if you have cleaned the data
268
+ out_file_name, text_column = save_prepared_bm25_data(data_file_name, df_list, df, text_column)
269
+
270
+ clean_toc = time.perf_counter()
271
+ clean_time_out = f"Cleaning the text took {clean_toc - clean_tic:0.1f} seconds."
272
+ print(clean_time_out)
273
+
274
+ else:
275
+ # Don't clean or save file to disk
276
+ df_list = list(df[text_column])
277
+ print("No data cleaning performed.")
278
+ out_file_name = None
279
+
280
+ # Tokenise data. If tokenised df already exists, no need to do anything
281
+
282
+ if not tokenised_df.empty:
283
+ corpus = tokenised_df.iloc[:,0].tolist()
284
+ print("Tokenised data loaded from file.")
285
+ #print("Corpus is: ", corpus[0:5])
286
+
287
+ # If a tokenised file doesn't already exist, tokenise the texts in batches
288
+ else:
289
+ tokeniser_tic = time.perf_counter()
290
+ corpus = []
291
+ batch_size = 256
292
+ for doc in tokenizer.pipe(progress.tqdm(df_list, desc = "Tokenising text", unit = "rows"), batch_size=batch_size):
293
+ corpus.append([token.text for token in doc])
294
+
295
+ tokeniser_toc = time.perf_counter()
296
+ tokenizer_time_out = f"Tokenising the text took {tokeniser_toc - tokeniser_tic:0.1f} seconds."
297
+ print(tokenizer_time_out)
298
+
299
+
300
+ if len(df_list) >= 20:
301
+ message = "Data loaded"
302
+ else:
303
+ message = "Data loaded. Warning: dataset may be too short to get consistent search results."
304
+
305
+ if return_intermediate_files == "Yes":
306
+ tokenised_data_file_name = data_file_out_name_no_ext + "_" + "keyword_search_tokenised_data.parquet"
307
+ pd.DataFrame(data={"Corpus":corpus}).to_parquet(tokenised_data_file_name)
308
+
309
+ return corpus, message, df, out_file_name, tokenised_data_file_name, data_file_out_name
310
+
311
+ return corpus, message, df, out_file_name, None, data_file_out_name # tokenised_data_file_name
312
+
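For reference, the batched tokenisation step above reduces to the following standalone sketch (assuming `en_core_web_sm` is already installed; the sample texts are invented):

```python
# Minimal sketch of the batched spaCy tokenisation used in prepare_bm25_input_data.
import spacy

nlp = spacy.load("en_core_web_sm")
tokenizer = nlp.tokenizer  # tokenizer only, as in the module above

df_list = ["first short document", "a second, slightly longer document"]
corpus = []
for doc in tokenizer.pipe(df_list, batch_size=256):
    corpus.append([token.text for token in doc])

print(corpus)
```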
313
+ def save_prepared_bm25_data(in_file_name, prepared_text_list, in_df, in_bm25_column):
314
+
315
+ # Check if the list and the dataframe have the same length
316
+ if len(prepared_text_list) != len(in_df):
317
+ raise ValueError("The length of 'prepared_text_list' and 'in_df' must match.")
318
+
319
+ file_end = ".parquet"
320
+
321
+ file_name = get_file_path_end(in_file_name) + "_cleaned" + file_end
322
+
323
+ new_text_column = in_bm25_column + "_cleaned"
324
+ prepared_text_df = pd.DataFrame(data={new_text_column:prepared_text_list})
325
+
326
+ # Drop original column from input file to reduce file size
327
+ in_df = in_df.drop(in_bm25_column, axis = 1)
328
+
329
+ prepared_df = pd.concat([in_df, prepared_text_df], axis = 1)
330
+
331
+ if file_end == ".csv":
332
+ prepared_df.to_csv(file_name)
333
+ elif file_end == ".parquet":
334
+ prepared_df.to_parquet(file_name)
335
+ else: file_name = None
336
+
337
+ return file_name, new_text_column
338
+
339
+ def prepare_bm25(corpus, k1=1.5, b = 0.75, alpha=-5):
340
+ #bm25.save("saved_df_bm25")
341
+ #bm25 = BM25.load(re.sub(r'\.pkl$', '', file_in.name))
342
+
343
+ print("Preparing BM25 corpus")
344
+
345
+ global bm25
346
+ bm25 = BM25(corpus, k1=k1, b=b, alpha=alpha)
347
+
348
+ message = "Search parameters loaded."
349
+
350
+ print(message)
351
+
352
+ return message
353
+
354
+ def convert_bm25_query_to_tokens(free_text_query, clean="No"):
355
+ '''
356
+ Split the free-text query into tokens using the spaCy tokenizer. The 'clean' option currently has no effect on the output.
357
+ '''
358
+
359
+ if clean=="Yes":
360
+ split_query = tokenizer(free_text_query.lower())
361
+ out_query = [token.text for token in split_query]
362
+ #out_query = stem_sentence(out_query)
363
+ else:
364
+ split_query = tokenizer(free_text_query.lower())
365
+ out_query = [token.text for token in split_query]
366
+
367
+ print("Search query out is:", out_query)
368
+
369
+ if isinstance(out_query,str):
370
+ print("Converting string")
371
+ out_query = [out_query]
372
+
373
+ return out_query
374
+
375
+ def bm25_search(free_text_query, in_no_search_results, original_data, text_column, clean = "No", in_join_file = None, in_join_column = "", search_df_join_column = ""):
376
+
377
+ # Prepare query
378
+ if (clean == "Yes") | (text_column.endswith("_cleaned")):
379
+ token_query = convert_bm25_query_to_tokens(free_text_query, clean="Yes")
380
+ else:
381
+ token_query = convert_bm25_query_to_tokens(free_text_query, clean="No")
382
+
383
+ #print(token_query)
384
+
385
+ # Perform search
386
+ print("Searching")
387
+
388
+ results_index, results_text, results_scores = bm25.extract_documents_and_scores(token_query, bm25.corpus, n=in_no_search_results) #bm25.corpus #original_data[text_column]
389
+ if not results_index:
390
+ return "No search results found", None, token_query
391
+
392
+ print("Search complete")
393
+
394
+ # Prepare results and export
395
+ joined_texts = [' '.join(inner_list) for inner_list in results_text]
396
+ results_df = pd.DataFrame(data={"index": results_index,
397
+ "search_text": joined_texts,
398
+ "search_score_abs": results_scores})
399
+ results_df['search_score_abs'] = abs(round(results_df['search_score_abs'], 2))
400
+ results_df_out = results_df[['index', 'search_text', 'search_score_abs']].merge(original_data,left_on="index", right_index=True, how="left")#.drop("index", axis=1)
401
+
402
+ # Join on additional files
403
+ if in_join_file:
404
+ join_filename = in_join_file.name
405
+
406
+ # Import data
407
+ join_df = read_file(join_filename)
408
+ join_df[in_join_column] = join_df[in_join_column].astype(str).str.replace(r"\.0$","", regex=True)
409
+ results_df_out[search_df_join_column] = results_df_out[search_df_join_column].astype(str).str.replace(r"\.0$","", regex=True)
410
+
411
+ # Duplicates dropped so as not to expand out dataframe
412
+ join_df = join_df.drop_duplicates(in_join_column)
413
+
414
+ results_df_out = results_df_out.merge(join_df,left_on=search_df_join_column, right_on=in_join_column, how="left").drop(in_join_column, axis=1)
415
+
416
+ # Reorder results by score
417
+ results_df_out = results_df_out.sort_values('search_score_abs', ascending=False)
418
+
419
+ # Out file
420
+ results_df_name = "keyword_search_result_" + today_rev + ".csv"
421
+ results_df_out.to_csv(results_df_name, index= None)
422
+ results_first_text = results_df_out[text_column].iloc[0]
423
+
424
+ print("Returning results")
425
+
426
+ return results_first_text, results_df_name, token_query
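Putting the pieces together, a rough sketch of calling this module outside the Gradio interface might look as follows. The function names, the `k1`/`b`/`alpha` parameters and the `extract_documents_and_scores` call are taken from the code above; the toy corpus and query are purely illustrative, and importing the module assumes the spaCy model can be loaded.

```python
# Rough usage sketch (not in the repo) of the keyword search functions above.
import search_funcs.bm25_functions as bm25_funcs

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["dogs", "chase", "cats"],
    ["a", "report", "about", "housing", "repairs"],
]

bm25_funcs.prepare_bm25(corpus, k1=1.5, b=0.75, alpha=-5)  # builds the module-level bm25 object
query = bm25_funcs.convert_bm25_query_to_tokens("housing repairs", clean="No")
index, text, score = bm25_funcs.bm25.extract_documents_and_scores(query, bm25_funcs.bm25.corpus, n=2)
print(index, score)
```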
search_funcs/chatfuncs.py DELETED
@@ -1,393 +0,0 @@
1
- import re
2
- import os
3
- from typing import TypeVar, List
4
- import pandas as pd
5
-
6
-
7
- # Model packages
8
- import torch.cuda
9
-
10
- # Alternative model sources
11
- #from dataclasses import asdict, dataclass
12
-
13
- # Langchain functions
14
- from langchain.text_splitter import RecursiveCharacterTextSplitter
15
- from langchain.docstore.document import Document
16
-
17
- # For keyword extraction (not currently used)
18
- #import nltk
19
- #nltk.download('wordnet')
20
- from nltk.corpus import stopwords
21
- from nltk.tokenize import RegexpTokenizer
22
- from nltk.stem import WordNetLemmatizer
23
-
24
- # For Name Entity Recognition model
25
- #from span_marker import SpanMarkerModel # Not currently used
26
-
27
-
28
- import gradio as gr
29
-
30
- torch.cuda.empty_cache()
31
-
32
- PandasDataFrame = TypeVar('pd.core.frame.DataFrame')
33
-
34
- embeddings = None # global variable setup
35
- vectorstore = None # global variable setup
36
- model_type = None # global variable setup
37
-
38
- max_memory_length = 0 # How long should the memory of the conversation last?
39
-
40
- full_text = "" # Define dummy source text (full text) just to enable highlight function to load
41
-
42
- model = [] # Define empty list for model functions to run
43
- tokenizer = [] # Define empty list for model functions to run
44
-
45
- ## Highlight text constants
46
- hlt_chunk_size = 12
47
- hlt_strat = [" ", ". ", "! ", "? ", ": ", "\n\n", "\n", ", "]
48
- hlt_overlap = 4
49
-
50
- ## Initialise NER model ##
51
- ner_model = []#SpanMarkerModel.from_pretrained("tomaarsen/span-marker-mbert-base-multinerd") # Not currently used
52
-
53
-
54
- # Currently set gpu_layers to 0 even with cuda due to persistent bugs in implementation with cuda
55
- if torch.cuda.is_available():
56
- torch_device = "cuda"
57
- gpu_layers = 0
58
- else:
59
- torch_device = "cpu"
60
- gpu_layers = 0
61
-
62
- print("Running on device:", torch_device)
63
- threads = 6 #torch.get_num_threads()
64
- print("CPU threads:", threads)
65
-
66
- # Vectorstore funcs
67
-
68
- # Prompt functions
69
-
70
- def write_out_metadata_as_string(metadata_in):
71
- metadata_string = [f"{' '.join(f'{k}: {v}' for k, v in d.items() if k != 'page_section')}" for d in metadata_in] # ['metadata']
72
- return metadata_string
73
-
74
-
75
- def determine_file_type(file_path):
76
- """
77
- Determine the file type based on its extension.
78
-
79
- Parameters:
80
- file_path (str): Path to the file.
81
-
82
- Returns:
83
- str: File extension (e.g., '.pdf', '.docx', '.txt', '.html').
84
- """
85
- return os.path.splitext(file_path)[1].lower()
86
-
87
-
88
- def create_doc_df(docs_keep_out):
89
- # Extract content and metadata from 'winning' passages.
90
- content=[]
91
- meta=[]
92
- meta_url=[]
93
- page_section=[]
94
- score=[]
95
-
96
- doc_df = pd.DataFrame()
97
-
98
-
99
-
100
- for item in docs_keep_out:
101
- content.append(item[0].page_content)
102
- meta.append(item[0].metadata)
103
- meta_url.append(item[0].metadata['source'])
104
-
105
- file_extension = determine_file_type(item[0].metadata['source'])
106
- if (file_extension != ".csv") & (file_extension != ".xlsx"):
107
- page_section.append(item[0].metadata['page_section'])
108
- else: page_section.append("")
109
- score.append(item[1])
110
-
111
- # Create df from 'winning' passages
112
-
113
- doc_df = pd.DataFrame(list(zip(content, meta, page_section, meta_url, score)),
114
- columns =['page_content', 'metadata', 'page_section', 'meta_url', 'score'])
115
-
116
- docs_content = doc_df['page_content'].astype(str)
117
- doc_df['full_url'] = "https://" + doc_df['meta_url']
118
-
119
- return doc_df
120
-
121
-
122
- def get_expanded_passages(vectorstore, docs, width):
123
-
124
- """
125
- Extracts expanded passages based on given documents and a width for context.
126
-
127
- Parameters:
128
- - vectorstore: The primary data source.
129
- - docs: List of documents to be expanded.
130
- - width: Number of documents to expand around a given document for context.
131
-
132
- Returns:
133
- - expanded_docs: List of expanded Document objects.
134
- - doc_df: DataFrame representation of expanded_docs.
135
- """
136
-
137
- from collections import defaultdict
138
-
139
- def get_docs_from_vstore(vectorstore):
140
- vector = vectorstore.docstore._dict
141
- return list(vector.items())
142
-
143
- def extract_details(docs_list):
144
- docs_list_out = [tup[1] for tup in docs_list]
145
- content = [doc.page_content for doc in docs_list_out]
146
- meta = [doc.metadata for doc in docs_list_out]
147
- return ''.join(content), meta[0], meta[-1]
148
-
149
- def get_parent_content_and_meta(vstore_docs, width, target):
150
- #target_range = range(max(0, target - width), min(len(vstore_docs), target + width + 1))
151
- target_range = range(max(0, target), min(len(vstore_docs), target + width + 1)) # Now only selects extra passages AFTER the found passage
152
- parent_vstore_out = [vstore_docs[i] for i in target_range]
153
-
154
- content_str_out, meta_first_out, meta_last_out = [], [], []
155
- for _ in parent_vstore_out:
156
- content_str, meta_first, meta_last = extract_details(parent_vstore_out)
157
- content_str_out.append(content_str)
158
- meta_first_out.append(meta_first)
159
- meta_last_out.append(meta_last)
160
- return content_str_out, meta_first_out, meta_last_out
161
-
162
- def merge_dicts_except_source(d1, d2):
163
- merged = {}
164
- for key in d1:
165
- if key != "source":
166
- merged[key] = str(d1[key]) + " to " + str(d2[key])
167
- else:
168
- merged[key] = d1[key] # or d2[key], based on preference
169
- return merged
170
-
171
- def merge_two_lists_of_dicts(list1, list2):
172
- return [merge_dicts_except_source(d1, d2) for d1, d2 in zip(list1, list2)]
173
-
174
- # Step 1: Filter vstore_docs
175
- vstore_docs = get_docs_from_vstore(vectorstore)
176
- doc_sources = {doc.metadata['source'] for doc, _ in docs}
177
- vstore_docs = [(k, v) for k, v in vstore_docs if v.metadata.get('source') in doc_sources]
178
-
179
- # Step 2: Group by source and proceed
180
- vstore_by_source = defaultdict(list)
181
- for k, v in vstore_docs:
182
- vstore_by_source[v.metadata['source']].append((k, v))
183
-
184
- expanded_docs = []
185
- for doc, score in docs:
186
- search_source = doc.metadata['source']
187
-
188
-
189
- #if file_type == ".csv" | file_type == ".xlsx":
190
- # content_str, meta_first, meta_last = get_parent_content_and_meta(vstore_by_source[search_source], 0, search_index)
191
-
192
- #else:
193
- search_section = doc.metadata['page_section']
194
- parent_vstore_meta_section = [doc.metadata['page_section'] for _, doc in vstore_by_source[search_source]]
195
- search_index = parent_vstore_meta_section.index(search_section) if search_section in parent_vstore_meta_section else -1
196
-
197
- content_str, meta_first, meta_last = get_parent_content_and_meta(vstore_by_source[search_source], width, search_index)
198
- meta_full = merge_two_lists_of_dicts(meta_first, meta_last)
199
-
200
- expanded_doc = (Document(page_content=content_str[0], metadata=meta_full[0]), score)
201
- expanded_docs.append(expanded_doc)
202
-
203
- doc_df = pd.DataFrame()
204
-
205
- doc_df = create_doc_df(expanded_docs) # Assuming you've defined the 'create_doc_df' function elsewhere
206
-
207
- return expanded_docs, doc_df
208
-
209
- def highlight_found_text(search_text: str, full_text: str, hlt_chunk_size:int=hlt_chunk_size, hlt_strat:List=hlt_strat, hlt_overlap:int=hlt_overlap) -> str:
210
- """
211
- Highlights occurrences of search_text within full_text.
212
-
213
- Parameters:
214
- - search_text (str): The text to be searched for within full_text.
215
- - full_text (str): The text within which search_text occurrences will be highlighted.
216
-
217
- Returns:
218
- - str: A string with occurrences of search_text highlighted.
219
-
220
- Example:
221
- >>> highlight_found_text("world", "Hello, world! This is a test. Another world awaits.")
222
- 'Hello, <mark style="color:black;">world</mark>! This is a test. Another <mark style="color:black;">world</mark> awaits.'
223
- """
224
-
225
- def extract_text_from_input(text, i=0):
226
- if isinstance(text, str):
227
- return text.replace(" ", " ").strip()
228
- elif isinstance(text, list):
229
- return text[i][0].replace(" ", " ").strip()
230
- else:
231
- return ""
232
-
233
- def extract_search_text_from_input(text):
234
- if isinstance(text, str):
235
- return text.replace(" ", " ").strip()
236
- elif isinstance(text, list):
237
- return text[-1][1].replace(" ", " ").strip()
238
- else:
239
- return ""
240
-
241
- full_text = extract_text_from_input(full_text)
242
- search_text = extract_search_text_from_input(search_text)
243
-
244
-
245
-
246
- text_splitter = RecursiveCharacterTextSplitter(
247
- chunk_size=hlt_chunk_size,
248
- separators=hlt_strat,
249
- chunk_overlap=hlt_overlap,
250
- )
251
- sections = text_splitter.split_text(search_text)
252
-
253
- found_positions = {}
254
- for x in sections:
255
- text_start_pos = 0
256
- while text_start_pos != -1:
257
- text_start_pos = full_text.find(x, text_start_pos)
258
- if text_start_pos != -1:
259
- found_positions[text_start_pos] = text_start_pos + len(x)
260
- text_start_pos += 1
261
-
262
- # Combine overlapping or adjacent positions
263
- sorted_starts = sorted(found_positions.keys())
264
- combined_positions = []
265
- if sorted_starts:
266
- current_start, current_end = sorted_starts[0], found_positions[sorted_starts[0]]
267
- for start in sorted_starts[1:]:
268
- if start <= (current_end + 10):
269
- current_end = max(current_end, found_positions[start])
270
- else:
271
- combined_positions.append((current_start, current_end))
272
- current_start, current_end = start, found_positions[start]
273
- combined_positions.append((current_start, current_end))
274
-
275
- # Construct pos_tokens
276
- pos_tokens = []
277
- prev_end = 0
278
- for start, end in combined_positions:
279
- if end-start > 15: # Only combine if there is a significant amount of matched text. Avoids picking up single words like 'and' etc.
280
- pos_tokens.append(full_text[prev_end:start])
281
- pos_tokens.append('<mark style="color:black;">' + full_text[start:end] + '</mark>')
282
- prev_end = end
283
- pos_tokens.append(full_text[prev_end:])
284
-
285
- return "".join(pos_tokens)
286
-
287
-
288
- # # Chat history functions
289
-
290
- def clear_chat(chat_history_state, sources, chat_message, current_topic):
291
- chat_history_state = []
292
- sources = ''
293
- chat_message = ''
294
- current_topic = ''
295
-
296
- return chat_history_state, sources, chat_message, current_topic
297
-
298
-
299
- # Keyword functions
300
-
301
- def remove_q_stopwords(question): # Remove stopwords from question. Not used at the moment
302
- # Prepare keywords from question by removing stopwords
303
- text = question.lower()
304
-
305
- # Remove numbers
306
- text = re.sub('[0-9]', '', text)
307
-
308
- tokenizer = RegexpTokenizer(r'\w+')
309
- text_tokens = tokenizer.tokenize(text)
310
- #text_tokens = word_tokenize(text)
311
- tokens_without_sw = [word for word in text_tokens if not word in stopwords]
312
-
313
- # Remove duplicate words while preserving order
314
- ordered_tokens = set()
315
- result = []
316
- for word in tokens_without_sw:
317
- if word not in ordered_tokens:
318
- ordered_tokens.add(word)
319
- result.append(word)
320
-
321
-
322
-
323
- new_question_keywords = ' '.join(result)
324
- return new_question_keywords
325
-
326
- def remove_q_ner_extractor(question):
327
-
328
- predict_out = ner_model.predict(question)
329
-
330
-
331
-
332
- predict_tokens = [' '.join(v for k, v in d.items() if k == 'span') for d in predict_out]
333
-
334
- # Remove duplicate words while preserving order
335
- ordered_tokens = set()
336
- result = []
337
- for word in predict_tokens:
338
- if word not in ordered_tokens:
339
- ordered_tokens.add(word)
340
- result.append(word)
341
-
342
-
343
-
344
- new_question_keywords = ' '.join(result).lower()
345
- return new_question_keywords
346
-
347
- def apply_lemmatize(text, wnl=WordNetLemmatizer()):
348
-
349
- def prep_for_lemma(text):
350
-
351
- # Remove numbers
352
- text = re.sub('[0-9]', '', text)
353
- print(text)
354
-
355
- tokenizer = RegexpTokenizer(r'\w+')
356
- text_tokens = tokenizer.tokenize(text)
357
- #text_tokens = word_tokenize(text)
358
-
359
- return text_tokens
360
-
361
- tokens = prep_for_lemma(text)
362
-
363
- def lem_word(word):
364
-
365
- if len(word) > 3: out_word = wnl.lemmatize(word)
366
- else: out_word = word
367
-
368
- return out_word
369
-
370
- return [lem_word(token) for token in tokens]
371
-
372
- def keybert_keywords(text, n, kw_model):
373
- tokens_lemma = apply_lemmatize(text)
374
- lemmatised_text = ' '.join(tokens_lemma)
375
-
376
- keywords_text = KeyBERT(model=kw_model).extract_keywords(lemmatised_text, stop_words='english', top_n=n,
377
- keyphrase_ngram_range=(1, 1))
378
- keywords_list = [item[0] for item in keywords_text]
379
-
380
- return keywords_list
381
-
382
- # Gradio functions
383
- def turn_off_interactivity(user_message, history):
384
- return gr.update(value="", interactive=False), history + [[user_message, None]]
385
-
386
- def restore_interactivity():
387
- return gr.update(interactive=True)
388
-
389
- def update_message(dropdown_value):
390
- return gr.Textbox.update(value=dropdown_value)
391
-
392
- def hide_block():
393
- return gr.Radio.update(visible=False)
 
 
 
 
 
 
search_funcs/clean_funcs.py CHANGED
@@ -1,51 +1,14 @@
1
  # ## Some functions to clean text
2
 
3
- # ### Some other suggested cleaning approaches
4
- #
5
- # #### From here: https://shravan-kuchkula.github.io/topic-modeling/#interactive-plot-showing-results-of-k-means-clustering-lda-topic-modeling-and-sentiment-analysis
6
- #
7
- # - remove_hyphens
8
- # - tokenize_text
9
- # - remove_special_characters
10
- # - convert to lower case
11
- # - remove stopwords
12
- # - lemmatize the token
13
- # - remove short tokens
14
- # - keep only words in wordnet
15
- # - I ADDED ON - creating custom stopwords list
16
-
17
- # +
18
- # Create a custom stop words list
19
- import nltk
20
  import re
21
  import string
22
  import polars as pl
23
- from nltk.stem import WordNetLemmatizer
24
- from nltk.stem import PorterStemmer
25
- from nltk.corpus import wordnet as wn
26
- from nltk import word_tokenize
27
 
28
  # Add calendar months onto stop words
29
  import calendar
30
- from tqdm import tqdm
31
  import gradio as gr
32
 
33
- stemmer = PorterStemmer()
34
-
35
-
36
- nltk.download('stopwords')
37
- nltk.download('wordnet')
38
-
39
- #nltk.download('words')
40
- #nltk.download('names')
41
-
42
- #nltk.corpus.words.words('en')
43
-
44
- #from sklearn.feature_extraction import text
45
- # Adding common names to stopwords
46
-
47
- all_names = [x.lower() for x in list(nltk.corpus.names.words())]
48
-
49
  # Adding custom words to the stopwords
50
  custom_words = []
51
  my_stop_words = custom_words
@@ -58,72 +21,9 @@ cal_month = [x.lower() for x in cal_month]
58
  cal_month = [i for i in cal_month if i]
59
  #print(cal_month)
60
  custom_words.extend(cal_month)
61
-
62
- #my_stop_words = frozenset(text.ENGLISH_STOP_WORDS.union(custom_words).union(all_names))
63
- #custom_stopwords = my_stop_words
64
- # -
65
-
66
- # #### Some of my cleaning functions
67
- '''
68
- # +
69
- # Remove all html elements from the text. Inspired by this: https://stackoverflow.com/questions/9662346/python-code-to-remove-html-tags-from-a-string
70
-
71
- def remove_email_start(text):
72
- cleanr = re.compile('.*importance:|.*subject:')
73
- cleantext = re.sub(cleanr, '', text)
74
- return cleantext
75
-
76
- def remove_email_end(text):
77
- cleanr = re.compile('kind regards.*|many thanks.*|sincerely.*')
78
- cleantext = re.sub(cleanr, '', text)
79
- return cleantext
80
-
81
- def cleanhtml(text):
82
- cleanr = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});|\xa0')
83
- cleantext = re.sub(cleanr, '', text)
84
- return cleantext
85
 
86
- ## The above doesn't work when there is no > at the end of the string to match the initial <. Trying this: <[^>]+> but needs work: https://stackoverflow.com/questions/2013124/regex-matching-up-to-the-first-occurrence-of-a-character
87
-
88
- # Remove all email addresses and numbers from the text
89
-
90
- def cleanemail(text):
91
- cleanr = re.compile('\S*@\S*\s?|\xa0')
92
- cleantext = re.sub(cleanr, '', text)
93
- return cleantext
94
-
95
- def cleannum(text):
96
- cleanr = re.compile(r'[0-9]+')
97
- cleantext = re.sub(cleanr, '', text)
98
- return cleantext
99
-
100
- def cleanpostcode(text):
101
- cleanr = re.compile(r'(\b(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]? ?[0-9][A-Z]{2})|((GIR ?0A{2})\b$)|(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]? ?[0-9]{1}?)$)|(\b(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]?)\b$)')
102
- cleantext = re.sub(cleanr, '', text)
103
- return cleantext
104
-
105
- def cleanwarning(text):
106
- cleanr = re.compile('caution: this email originated from outside of the organization. do not click links or open attachments unless you recognize the sender and know the content is safe.')
107
- cleantext = re.sub(cleanr, '', text)
108
- return cleantext
109
-
110
-
111
- # -
112
-
113
- def initial_clean(texts):
114
- clean_texts = []
115
- for text in texts:
116
- text = remove_email_start(text)
117
- text = remove_email_end(text)
118
- text = cleanpostcode(text)
119
- text = remove_hyphens(text)
120
- text = cleanhtml(text)
121
- text = cleanemail(text)
122
- #text = cleannum(text)
123
- clean_texts.append(text)
124
- return clean_texts
125
- '''
126
 
 
127
  email_start_pattern_regex = r'.*importance:|.*subject:'
128
  email_end_pattern_regex = r'kind regards.*|many thanks.*|sincerely.*'
129
  html_pattern_regex = r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});|\xa0|&nbsp;'
@@ -143,130 +43,65 @@ postcode_pattern = re.compile(postcode_pattern_regex)
143
  warning_pattern = re.compile(warning_pattern_regex)
144
  nbsp_pattern = re.compile(nbsp_pattern_regex)
145
 
146
- def stem_sentence(sentence):
147
 
148
- words = sentence.split()
149
- stemmed_words = [stemmer.stem(word).lower().rstrip("'") for word in words]
150
- return stemmed_words
151
 
152
- def stem_sentences(sentences, progress=gr.Progress()):
153
- """Stem each sentence in a list of sentences."""
154
- stemmed_sentences = [stem_sentence(sentence) for sentence in progress.tqdm(sentences)]
155
- return stemmed_sentences
156
 
157
- def get_lemma_text(text):
158
- # Tokenize the input string into words
159
- tokens = word_tokenize(text)
160
 
161
- lemmas = []
162
- for word in tokens:
163
- if len(word) > 3:
164
- lemma = wn.morphy(word)
165
- else:
166
- lemma = None
167
 
168
- if lemma is None:
169
- lemmas.append(word)
170
- else:
171
- lemmas.append(lemma)
172
- return lemmas
173
 
174
- def get_lemma_tokens(tokens):
175
  # Tokenize the input string into words
176
 
177
- lemmas = []
178
- for word in tokens:
179
- if len(word) > 3:
180
- lemma = wn.morphy(word)
181
- else:
182
- lemma = None
183
 
184
- if lemma is None:
185
- lemmas.append(word)
186
- else:
187
- lemmas.append(lemma)
188
- return lemmas
189
-
190
- # def initial_clean(texts , progress=gr.Progress()):
191
- # clean_texts = []
192
-
193
- # i = 1
194
- # #progress(0, desc="Cleaning texts")
195
- # for text in progress.tqdm(texts, desc = "Cleaning data", unit = "rows"):
196
- # #print("Cleaning row: ", i)
197
- # text = re.sub(email_start_pattern, '', text)
198
- # text = re.sub(email_end_pattern, '', text)
199
- # text = re.sub(postcode_pattern, '', text)
200
- # text = remove_hyphens(text)
201
- # text = re.sub(html_pattern, '', text)
202
- # text = re.sub(email_pattern, '', text)
203
- # text = re.sub(nbsp_pattern, '', text)
204
- # #text = re.sub(warning_pattern, '', text)
205
- # #text = stem_sentence(text)
206
- # text = get_lemma_text(text)
207
- # text = ' '.join(text)
208
- # # Uncomment the next line if you want to remove numbers as well
209
- # # text = re.sub(num_pattern, '', text)
210
- # clean_texts.append(text)
211
-
212
- # i += 1
213
- # return clean_texts
214
-
215
 
216
  def initial_clean(texts , progress=gr.Progress()):
217
  texts = pl.Series(texts)#[]
218
 
219
- #i = 1
220
- #progress(0, desc="Cleaning texts")
221
- #for text in progress.tqdm(texts, desc = "Cleaning data", unit = "rows"):
222
- #print("Cleaning row: ", i)
223
  text = texts.str.replace_all(email_start_pattern_regex, '')
224
  text = text.str.replace_all(email_end_pattern_regex, '')
225
- #text = re.sub(postcode_pattern, '', text)
226
- #text = remove_hyphens(text)
227
  text = text.str.replace_all(html_pattern_regex, '')
228
  text = text.str.replace_all(email_pattern_regex, '')
229
- #text = re.sub(nbsp_pattern, '', text)
230
- #text = re.sub(warning_pattern, '', text)
231
- #text = stem_sentence(text)
232
- #text = get_lemma_text(text)
233
- #text = ' '.join(text)
234
- # Uncomment the next line if you want to remove numbers as well
235
- # text = re.sub(num_pattern, '', text)
236
- #clean_texts.append(text)
237
-
238
- #i += 1
239
 
240
  text = text.to_list()
241
 
242
  return text
243
 
244
-
245
- # Sample execution
246
- #sample_texts = [
247
- # "Hello, this is a test email. kind regards, John",
248
- # "<div>Email content here</div> many thanks, Jane",
249
- # "caution: this email originated from outside of the organization. do not click links or open attachments unless you recognize the sender and know the content is safe.",
250
- # "john.doe123@example.com",
251
- # "Address: 1234 Elm St, AB12 3CD"
252
- #]
253
-
254
- #initial_clean(sample_texts)
255
-
256
-
257
- # +
258
-
259
- all_names = [x.lower() for x in list(nltk.corpus.names.words())]
260
-
261
  def remove_hyphens(text_text):
262
  return re.sub(r'(\w+)-(\w+)-?(\w)?', r'\1 \2 \3', text_text)
263
 
264
- # tokenize text
265
- def tokenize_text(text_text):
266
- TOKEN_PATTERN = r'\s+'
267
- regex_wt = nltk.RegexpTokenizer(pattern=TOKEN_PATTERN, gaps=True)
268
- word_tokens = regex_wt.tokenize(text_text)
269
- return word_tokens
270
 
271
  def remove_characters_after_tokenization(tokens):
272
  pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
@@ -276,80 +111,22 @@ def remove_characters_after_tokenization(tokens):
276
  def convert_to_lowercase(tokens):
277
  return [token.lower() for token in tokens if token.isalpha()]
278
 
279
- def remove_stopwords(tokens, custom_stopwords):
280
- stopword_list = nltk.corpus.stopwords.words('english')
281
- stopword_list += my_stop_words
282
- filtered_tokens = [token for token in tokens if token not in stopword_list]
283
- return filtered_tokens
284
-
285
- def remove_names(tokens):
286
- stopword_list = list(nltk.corpus.names.words())
287
- stopword_list = [x.lower() for x in stopword_list]
288
- filtered_tokens = [token for token in tokens if token not in stopword_list]
289
- return filtered_tokens
290
-
291
-
292
-
293
  def remove_short_tokens(tokens):
294
  return [token for token in tokens if len(token) > 3]
295
 
296
- def keep_only_words_in_wordnet(tokens):
297
- return [token for token in tokens if wn.synsets(token)]
298
-
299
- def apply_lemmatize(tokens, wnl=WordNetLemmatizer()):
300
-
301
- def lem_word(word):
302
-
303
- if len(word) > 3: out_word = wnl.lemmatize(word)
304
- else: out_word = word
305
-
306
- return out_word
307
-
308
- return [lem_word(token) for token in tokens]
309
-
310
-
311
- # +
312
- ### Do the cleaning
313
-
314
- def cleanTexttexts(texts):
315
- clean_texts = []
316
- for text in texts:
317
- #text = remove_email_start(text)
318
- #text = remove_email_end(text)
319
- text = remove_hyphens(text)
320
- text = cleanhtml(text)
321
- text = cleanemail(text)
322
- text = cleanpostcode(text)
323
- text = cleannum(text)
324
- #text = cleanwarning(text)
325
- text_i = tokenize_text(text)
326
- text_i = remove_characters_after_tokenization(text_i)
327
- #text_i = remove_names(text_i)
328
- text_i = convert_to_lowercase(text_i)
329
- #text_i = remove_stopwords(text_i, my_stop_words)
330
- text_i = get_lemma(text_i)
331
- #text_i = remove_short_tokens(text_i)
332
- text_i = keep_only_words_in_wordnet(text_i)
333
-
334
- text_i = apply_lemmatize(text_i)
335
- clean_texts.append(text_i)
336
- return clean_texts
337
-
338
-
339
- # -
340
 
341
  def remove_dups_text(data_samples_ready, data_samples_clean, data_samples):
342
  # Identify duplicates in the data: https://stackoverflow.com/questions/44191465/efficiently-identify-duplicates-in-large-list-500-000
343
  # Only identifies the second duplicate
344
 
345
  seen = set()
346
- dupes = []
347
 
348
  for i, doi in enumerate(data_samples_ready):
349
  if doi not in seen:
350
  seen.add(doi)
351
  else:
352
- dupes.append(i)
353
  #data_samples_ready[dupes[0:]]
354
 
355
  # To see a specific duplicated value you know the position of
 
1
  # ## Some functions to clean text
2
 
 
 
 
 
 
3
  import re
4
  import string
5
  import polars as pl
 
 
 
 
6
 
7
  # Add calendar months onto stop words
8
  import calendar
9
+ #from tqdm import tqdm
10
  import gradio as gr
11
 
 
 
 
 
 
12
  # Adding custom words to the stopwords
13
  custom_words = []
14
  my_stop_words = custom_words
 
21
  cal_month = [i for i in cal_month if i]
22
  #print(cal_month)
23
  custom_words.extend(cal_month)
 
 
 
 
 
 
24
 
 
 
 
 
 
25
 
26
+ # #### Some of my cleaning functions
27
  email_start_pattern_regex = r'.*importance:|.*subject:'
28
  email_end_pattern_regex = r'kind regards.*|many thanks.*|sincerely.*'
29
  html_pattern_regex = r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});|\xa0|&nbsp;'
 
43
  warning_pattern = re.compile(warning_pattern_regex)
44
  nbsp_pattern = re.compile(nbsp_pattern_regex)
45
 
46
+ # def stem_sentence(sentence):
47
 
48
+ # words = sentence.split()
49
+ # stemmed_words = [stemmer.stem(word).lower().rstrip("'") for word in words]
50
+ # return stemmed_words
51
 
52
+ # def stem_sentences(sentences, progress=gr.Progress()):
53
+ # """Stem each sentence in a list of sentences."""
54
+ # stemmed_sentences = [stem_sentence(sentence) for sentence in progress.tqdm(sentences)]
55
+ # return stemmed_sentences
56
 
57
+ # def get_lemma_text(text):
58
+ # # Tokenize the input string into words
59
+ # tokens = word_tokenize(text)
60
 
61
+ # lemmas = []
62
+ # for word in tokens:
63
+ # if len(word) > 3:
64
+ # lemma = wn.morphy(word)
65
+ # else:
66
+ # lemma = None
67
 
68
+ # if lemma is None:
69
+ # lemmas.append(word)
70
+ # else:
71
+ # lemmas.append(lemma)
72
+ # return lemmas
73
 
74
+ # def get_lemma_tokens(tokens):
75
  # Tokenize the input string into words
76
 
77
+ # lemmas = []
78
+ # for word in tokens:
79
+ # if len(word) > 3:
80
+ # lemma = wn.morphy(word)
81
+ # else:
82
+ # lemma = None
83
 
84
+ # if lemma is None:
85
+ # lemmas.append(word)
86
+ # else:
87
+ # lemmas.append(lemma)
88
+ # return lemmas
 
 
 
 
 
 
 
89
 
90
  def initial_clean(texts , progress=gr.Progress()):
91
  texts = pl.Series(texts)#[]
92
 
 
 
 
 
93
  text = texts.str.replace_all(email_start_pattern_regex, '')
94
  text = text.str.replace_all(email_end_pattern_regex, '')
 
 
95
  text = text.str.replace_all(html_pattern_regex, '')
96
  text = text.str.replace_all(email_pattern_regex, '')
 
 
 
 
 
 
 
 
 
 
97
 
98
  text = text.to_list()
99
 
100
  return text
101
 
 
 
 
 
 
 
102
  def remove_hyphens(text_text):
103
  return re.sub(r'(\w+)-(\w+)-?(\w)?', r'\1 \2 \3', text_text)
104
 
 
 
 
 
 
 
105
 
106
  def remove_characters_after_tokenization(tokens):
107
  pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
 
111
  def convert_to_lowercase(tokens):
112
  return [token.lower() for token in tokens if token.isalpha()]
113
 
 
 
 
 
 
114
  def remove_short_tokens(tokens):
115
  return [token for token in tokens if len(token) > 3]
116
 
 
 
 
 
 
 
117
 
118
  def remove_dups_text(data_samples_ready, data_samples_clean, data_samples):
119
  # Identify duplicates in the data: https://stackoverflow.com/questions/44191465/efficiently-identify-duplicates-in-large-list-500-000
120
  # Only identifies the second duplicate
121
 
122
  seen = set()
123
+ dups = []
124
 
125
  for i, doi in enumerate(data_samples_ready):
126
  if doi not in seen:
127
  seen.add(doi)
128
  else:
129
+ dups.append(i)
130
  #data_samples_ready[dupes[0:]]
131
 
132
  # To see a specific duplicated value you know the position of
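A small illustrative run of the polars-based `initial_clean` above (the input strings are invented; this assumes `polars` and `gradio` are installed so the module imports cleanly):

```python
# Illustrative only: shows the kind of cleaning initial_clean performs.
from search_funcs.clean_funcs import initial_clean

raw_texts = [
    "subject: repairs update <b>please read</b> contact someone@example.com",
    "thanks for your help  kind regards, a long sign-off that gets stripped",
]
cleaned = initial_clean(raw_texts)
print(cleaned)  # HTML tags, email addresses and email sign-offs removed
```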
search_funcs/{ingest_text.py β†’ convert_files_to_parquet.py} RENAMED
File without changes
search_funcs/helper_functions.py ADDED
@@ -0,0 +1,148 @@
 
 
 
 
 
 
1
+ import os
2
+ import re
3
+ import pandas as pd
4
+ import gradio as gr
5
+
6
+ import os
7
+ import shutil
8
+
9
+ import os
10
+ import shutil
11
+ import getpass
12
+ import gzip
13
+ import pickle
14
+
15
+ # Attempt to delete content of gradio temp folder
16
+ def get_temp_folder_path():
17
+ username = getpass.getuser()
18
+ return os.path.join('C:\\Users', username, 'AppData\\Local\\Temp\\gradio')
19
+
20
+ def empty_folder(directory_path):
21
+ if not os.path.exists(directory_path):
22
+ #print(f"The directory {directory_path} does not exist. No temporary files from previous app use found to delete.")
23
+ return
24
+
25
+ for filename in os.listdir(directory_path):
26
+ file_path = os.path.join(directory_path, filename)
27
+ try:
28
+ if os.path.isfile(file_path) or os.path.islink(file_path):
29
+ os.unlink(file_path)
30
+ elif os.path.isdir(file_path):
31
+ shutil.rmtree(file_path)
32
+ except Exception as e:
33
+ #print(f'Failed to delete {file_path}. Reason: {e}')
34
+ print('')
35
+
36
+
37
+
38
+
39
+ def get_file_path_end(file_path):
40
+ # First, get the basename of the file (e.g., "example.txt" from "/path/to/example.txt")
41
+ basename = os.path.basename(file_path)
42
+
43
+ # Then, split the basename and its extension and return only the basename without the extension
44
+ filename_without_extension, _ = os.path.splitext(basename)
45
+
46
+ #print(filename_without_extension)
47
+
48
+ return filename_without_extension
49
+
50
+ def get_file_path_end_with_ext(file_path):
51
+ match = re.search(r'(.*[\/\\])?(.+)$', file_path)
52
+
53
+ filename_end = match.group(2) if match else ''
54
+
55
+ return filename_end
56
+
57
+ def detect_file_type(filename):
58
+ """Detect the file type based on its extension."""
59
+ if (filename.endswith('.csv')) | (filename.endswith('.csv.gz')) | (filename.endswith('.zip')):
60
+ return 'csv'
61
+ elif filename.endswith('.xlsx'):
62
+ return 'xlsx'
63
+ elif filename.endswith('.parquet'):
64
+ return 'parquet'
65
+ elif filename.endswith('.pkl.gz'):
66
+ return 'pkl.gz'
67
+ else:
68
+ raise ValueError("Unsupported file type.")
69
+
70
+ def read_file(filename):
71
+ """Read the file based on its detected type."""
72
+ file_type = detect_file_type(filename)
73
+
74
+ print("Loading in file")
75
+
76
+ if file_type == 'csv':
77
+ file = pd.read_csv(filename, low_memory=False).reset_index().drop(["index", "Unnamed: 0"], axis=1, errors="ignore")
78
+ elif file_type == 'xlsx':
79
+ file = pd.read_excel(filename).reset_index().drop(["index", "Unnamed: 0"], axis=1, errors="ignore")
80
+ elif file_type == 'parquet':
81
+ file = pd.read_parquet(filename).reset_index().drop(["index", "Unnamed: 0"], axis=1, errors="ignore")
82
+ elif file_type == 'pkl.gz':
83
+ with gzip.open(filename, 'rb') as file:
84
+ file = pickle.load(file)
85
+ #file = pd.read_pickle(filename)
86
+
87
+ print("File load complete")
88
+
89
+ return file
90
+
91
+ def put_columns_in_df(in_file, in_bm25_column):
92
+ '''
93
+ When file is loaded, update the column dropdown choices and change 'clean data' dropdown option to 'no'.
94
+ '''
95
+
96
+ file_list = [string.name for string in in_file]
97
+
98
+ #print(file_list)
99
+
100
+ data_file_names = [string for string in file_list if "tokenised" not in string and "embeddings" not in string]
101
+ data_file_name = data_file_names[0]
102
+
103
+ new_choices = []
104
+ concat_choices = []
105
+
106
+
107
+ df = read_file(data_file_name)
108
+
109
+ if "pkl" not in data_file_name:
110
+
111
+ new_choices = list(df.columns)
112
+
113
+ else: new_choices = ["page_contents"] + list(df[0].metadata.keys()) #["Documents"]
114
+ #print(new_choices)
115
+
116
+ concat_choices.extend(new_choices)
117
+
118
+ return gr.Dropdown(choices=concat_choices), gr.Dropdown(value="No", choices = ["Yes", "No"]), gr.Dropdown(choices=concat_choices), df
119
+
120
+ def put_columns_in_join_df(in_file, in_bm25_column):
121
+ '''
122
+ When file is loaded, update the column dropdown choices and change 'clean data' dropdown option to 'no'.
123
+ '''
124
+
125
+ print("in_bm25_column: ", in_bm25_column)
126
+
127
+ new_choices = []
128
+ concat_choices = []
129
+
130
+
131
+ df = read_file(in_file.name)
132
+ new_choices = list(df.columns)
133
+
134
+ print(new_choices)
135
+
136
+ concat_choices.extend(new_choices)
137
+
138
+ return gr.Dropdown(choices=concat_choices)
139
+
140
+ def dummy_function(gradio_component):
141
+ """
142
+ A dummy function that exists just so that dropdown updates work correctly.
143
+ """
144
+ return None
145
+
146
+ def display_info(info_component):
147
+ gr.Info(info_component)
148
+
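A quick illustration of the file helpers defined above (the parquet file name is hypothetical):

```python
# Illustrative use of the loading helpers; 'example_data.parquet' is a made-up local file.
from search_funcs.helper_functions import detect_file_type, get_file_path_end, read_file

path = "example_data.parquet"
print(detect_file_type(path))   # 'parquet'
print(get_file_path_end(path))  # 'example_data'

df = read_file(path)            # pandas DataFrame for csv / xlsx / parquet inputs
print(df.shape)
```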
search_funcs/semantic_functions.py ADDED
@@ -0,0 +1,422 @@
 
 
 
 
 
 
1
+ import os
2
+ import time
3
+ import pandas as pd
4
+ from typing import Type
5
+ import gradio as gr
6
+ import numpy as np
7
+ from datetime import datetime
8
+ import accelerate
9
+
10
+ today_rev = datetime.now().strftime("%Y%m%d")
11
+
12
+ from transformers import AutoModel
13
+
14
+ from torch import cuda, backends, tensor, mm
15
+ from search_funcs.helper_functions import read_file
16
+
17
+ # Check for torch cuda
18
+ print("Is CUDA enabled? ", cuda.is_available())
19
+ print("Is a CUDA device available on this computer?", backends.cudnn.enabled)
20
+ if cuda.is_available():
21
+ torch_device = "cuda"
22
+ os.system("nvidia-smi")
23
+
24
+ else:
25
+ torch_device = "cpu"
26
+
27
+ print("Device used is: ", torch_device)
28
+
29
+ #from search_funcs.helper_functions import get_file_path_end
30
+
31
+ PandasDataFrame = Type[pd.DataFrame]
32
+
33
+ # Load embeddings
34
+ # Pinning a Jina revision for security purposes: https://www.baseten.co/blog/pinning-ml-model-revisions-for-compatibility-and-security/
35
+ # Save Jina model locally as described here: https://huggingface.co/jinaai/jina-embeddings-v2-base-en/discussions/29
36
+ embeddings_name = "jinaai/jina-embeddings-v2-small-en"
37
+ local_embeddings_location = "model/jina/"
38
+ revision_choice = "b811f03af3d4d7ea72a7c25c802b21fc675a5d99"
39
+
40
+ try:
41
+ embeddings_model = AutoModel.from_pretrained(local_embeddings_location, revision = revision_choice, trust_remote_code=True,local_files_only=True, device_map="auto")
42
+ except:
43
+ embeddings_model = AutoModel.from_pretrained(embeddings_name, revision = revision_choice, trust_remote_code=True, device_map="auto")
44
+
45
+
46
+ # Chroma support is currently deprecated
47
+ # Import Chroma and instantiate a client. The default Chroma client is ephemeral, meaning it will not save to disk.
48
+ #import chromadb
49
+ #from chromadb.config import Settings
50
+ #from typing_extensions import Protocol
51
+ #from chromadb import Documents, EmbeddingFunction, Embeddings
52
+
53
+ # Remove Chroma database file. If it exists as it can cause issues
54
+ #chromadb_file = "chroma.sqlite3"
55
+
56
+ #if os.path.isfile(chromadb_file):
57
+ # os.remove(chromadb_file)
58
+ def get_file_path_end(file_path):
59
+ # First, get the basename of the file (e.g., "example.txt" from "/path/to/example.txt")
60
+ basename = os.path.basename(file_path)
61
+
62
+ # Then, split the basename and its extension and return only the basename without the extension
63
+ filename_without_extension, _ = os.path.splitext(basename)
64
+
65
+ #print(filename_without_extension)
66
+
67
+ return filename_without_extension
68
+
69
+ def load_embeddings(embeddings_name = embeddings_name):
70
+ '''
71
+ Load embeddings model and create a global variable based on it.
72
+ '''
73
+
74
+ # Import Chroma and instantiate a client. The default Chroma client is ephemeral, meaning it will not save to disk.
75
+
76
+ #else:
77
+ embeddings_func = AutoModel.from_pretrained(embeddings_name, trust_remote_code=True, device_map="auto")
78
+
79
+ global embeddings
80
+
81
+ embeddings = embeddings_func
82
+
83
+ return embeddings
84
+
85
+ def docs_to_jina_embed_np_array(docs_out, in_file, return_intermediate_files = "No", embeddings_super_compress = "No", embeddings = embeddings_model, progress=gr.Progress()):
86
+ '''
87
+ Takes a list of Langchain documents, embeds the text with the Jina model and returns the embeddings as a NumPy array (optionally saved to a .npz file).
88
+ '''
89
+
90
+ print(f"> Total split documents: {len(docs_out)}")
91
+
92
+ #print(docs_out)
93
+
94
+ page_contents = [doc.page_content for doc in docs_out]
95
+
96
+ ## Load in pre-embedded file if exists
97
+ file_list = [string.name for string in in_file]
98
+
99
+ #print(file_list)
100
+
101
+ embeddings_file_names = [string for string in file_list if "embedding" in string]
102
+ data_file_names = [string for string in file_list if "tokenised" not in string]
103
+ data_file_name = data_file_names[0]
104
+ data_file_name_no_ext = get_file_path_end(data_file_name)
105
+
106
+ out_message = "Document processing complete. Ready to search."
107
+
108
+ if embeddings_file_names:
109
+ print("Loading embeddings from file.")
110
+ embeddings_out = np.load(embeddings_file_names[0])['arr_0']
111
+
112
+ # If embedding files have 'super_compress' in the title, they have been multiplied by 100 before save
113
+ if "super_compress" in embeddings_file_names[0]:
114
+ embeddings_out /= 100
115
+
116
+ # print("embeddings loaded: ", embeddings_out)
117
+
118
+ if not embeddings_file_names:
119
+ tic = time.perf_counter()
120
+ print("Starting to embed documents.")
121
+ #embeddings_list = []
122
+ #for page in progress.tqdm(page_contents, desc = "Preparing search index", unit = "rows"):
123
+ # embeddings_list.append(embeddings.encode(sentences=page, max_length=1024).tolist())
124
+
125
+ embeddings_out = embeddings.encode(sentences=page_contents, max_length=1024, show_progress_bar = True, batch_size = 32) # For Jina embeddings
126
+ #embeddings_list = embeddings.encode(sentences=page_contents, normalize_embeddings=True).tolist() # For BGE embeddings
127
+ #embeddings_list = embeddings.encode(sentences=page_contents).tolist() # For minilm
128
+
129
+
130
+
131
+ toc = time.perf_counter()
132
+ time_out = f"The embedding took {toc - tic:0.1f} seconds"
133
+ print(time_out)
134
+
135
+ # If you want to save your files for next time
136
+ if return_intermediate_files == "Yes":
137
+ if embeddings_super_compress == "No":
138
+ semantic_search_file_name = data_file_name_no_ext + '_' + 'semantic_search_embeddings.npz'
139
+ np.savez_compressed(semantic_search_file_name, embeddings_out)
140
+ else:
141
+ semantic_search_file_name = data_file_name_no_ext + '_' + 'semantic_search_embeddings_super_compress.npz'
142
+ embeddings_out_round = np.round(embeddings_out, 3)
143
+ embeddings_out_round *= 100 # Rounding not currently used
144
+ np.savez_compressed(semantic_search_file_name, embeddings_out_round)
145
+
146
+ return out_message, embeddings_out, semantic_search_file_name
147
+
148
+ return out_message, embeddings_out, None
149
+
150
+ print(out_message)
151
+
152
+ return out_message, embeddings_out, None#, None
153
+
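The `.npz` handling above amounts to the following round trip, shown here with random numbers standing in for real Jina embeddings and an invented file name:

```python
# Sketch of the embedding save/load round trip used in docs_to_jina_embed_np_array.
import numpy as np

embeddings_out = np.random.rand(1000, 512).astype(np.float32)

# 'Super compress' option: round to 3 dp and scale by 100 before saving
np.savez_compressed("example_semantic_search_embeddings_super_compress.npz",
                    np.round(embeddings_out, 3) * 100)

loaded = np.load("example_semantic_search_embeddings_super_compress.npz")["arr_0"]
loaded /= 100  # undo the scaling on load, as the loading branch above does
```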
154
+ def process_data_from_scores_df(df_docs, in_join_file, out_passages, vec_score_cut_off, vec_weight, orig_df_col, in_join_column, search_df_join_column):
155
+
156
+ def create_docs_keep_from_df(df):
157
+ dict_out = {'ids' : [df['ids']],
158
+ 'documents': [df['documents']],
159
+ 'metadatas': [df['metadatas']],
160
+ 'distances': [round(df['distances'].astype(float), 4)],
161
+ 'embeddings': None
162
+ }
163
+ return dict_out
164
+
165
+ # Prepare the DataFrame by transposing
166
+ #df_docs = df#.apply(lambda x: x.explode()).reset_index(drop=True)
167
+
168
+ # Keep only documents with a certain score
169
+
170
+ #print(df_docs)
171
+
172
+ docs_scores = df_docs["distances"] #.astype(float)
173
+
174
+ # Only keep sources that are sufficiently relevant (i.e. similarity search score below threshold below)
175
+ score_more_limit = df_docs.loc[docs_scores > vec_score_cut_off, :]
176
+ #docs_keep = create_docs_keep_from_df(score_more_limit) #list(compress(docs, score_more_limit))
177
+
178
+ #print(docs_keep)
179
+
180
+ if score_more_limit.empty:
181
+ return pd.DataFrame()
182
+
183
+ # Only keep sources that are at least 100 characters long
184
+ docs_len = score_more_limit["documents"].str.len() >= 100
185
+
186
+ #print(docs_len)
187
+
188
+ length_more_limit = score_more_limit.loc[docs_len == True, :] #pd.Series(docs_len) >= 100
189
+ #docs_keep = create_docs_keep_from_df(length_more_limit) #list(compress(docs_keep, length_more_limit))
190
+
191
+ #print(length_more_limit)
192
+
193
+ if length_more_limit.empty:
194
+ return pd.DataFrame()
195
+
196
+ length_more_limit['ids'] = length_more_limit['ids'].astype(int)
197
+
198
+ #length_more_limit.to_csv("length_more_limit.csv", index = None)
199
+
200
+ # Explode the 'metadatas' dictionary into separate columns
201
+ df_metadata_expanded = length_more_limit['metadatas'].apply(pd.Series)
202
+
203
+ #print(length_more_limit)
204
+ #print(df_metadata_expanded)
205
+
206
+ # Concatenate the original DataFrame with the expanded metadata DataFrame
207
+ results_df_out = pd.concat([length_more_limit.drop('metadatas', axis=1), df_metadata_expanded], axis=1)
208
+
209
+ results_df_out = results_df_out.rename(columns={"documents":orig_df_col})
210
+
211
+ results_df_out = results_df_out.drop(["page_section", "row", "source", "id"], axis=1, errors="ignore")
212
+ results_df_out['distances'] = round(results_df_out['distances'].astype(float), 3)
213
+
214
+ # Join back to original df
215
+ # results_df_out = orig_df.merge(length_more_limit[['ids', 'distances']], left_index = True, right_on = "ids", how="inner").sort_values("distances")
216
+
217
+ # Join on additional files
218
+ if in_join_file:
219
+ join_filename = in_join_file.name
220
+
221
+ # Import data
222
+ join_df = read_file(join_filename)
223
+ join_df[in_join_column] = join_df[in_join_column].astype(str).str.replace(r"\.0$","", regex=True)
224
+
225
+ # Duplicates dropped so as not to expand out dataframe
226
+ join_df = join_df.drop_duplicates(in_join_column)
227
+
228
+ results_df_out[search_df_join_column] = results_df_out[search_df_join_column].astype(str).str.replace(r"\.0$","", regex=True)
229
+
230
+ results_df_out = results_df_out.merge(join_df,left_on=search_df_join_column, right_on=in_join_column, how="left").drop(in_join_column, axis=1)
231
+
232
+ return results_df_out
233
+
234
+ def jina_simple_retrieval(new_question_kworded:str, vectorstore, docs, orig_df_col:str, k_val:int, out_passages:int,
235
+ vec_score_cut_off:float, vec_weight:float, in_join_file = None, in_join_column = None, search_df_join_column = None, device = torch_device, embeddings = embeddings_model, progress=gr.Progress()): # ,vectorstore, embeddings
236
+
237
+ # print("vectorstore loaded: ", vectorstore)
238
+
239
+ # Convert it to a PyTorch tensor and transfer to GPU
240
+ vectorstore_tensor = tensor(vectorstore).to(device)
241
+
242
+ # Load the sentence transformer model and move it to GPU
243
+ embeddings = embeddings.to(device)
244
+
245
+ # Encode the query using the sentence transformer and convert to a PyTorch tensor
246
+ query = embeddings.encode(new_question_kworded)
247
+ query_tensor = tensor(query).to(device)
248
+
249
+ if query_tensor.dim() == 1:
250
+ query_tensor = query_tensor.unsqueeze(0) # Reshape to 2D with one row
251
+
252
+ # Normalize the query tensor and vectorstore tensor
253
+ query_norm = query_tensor / query_tensor.norm(dim=1, keepdim=True)
254
+ vectorstore_norm = vectorstore_tensor / vectorstore_tensor.norm(dim=1, keepdim=True)
255
+
256
+ # Calculate cosine similarities (batch processing)
257
+ cosine_similarities = mm(query_norm, vectorstore_norm.T)
258
+
259
+ # Flatten the tensor to a 1D array
260
+ cosine_similarities = cosine_similarities.flatten()
261
+
262
+ # Convert to a NumPy array if it's still a PyTorch tensor
263
+ cosine_similarities = cosine_similarities.cpu().numpy()
264
+
265
+ # Create a Pandas Series
266
+ cosine_similarities_series = pd.Series(cosine_similarities)
267
+
268
+ # Pull out relevent info from docs
269
+ page_contents = [doc.page_content for doc in docs]
270
+ page_meta = [doc.metadata for doc in docs]
271
+ ids_range = range(0,len(page_contents))
272
+ ids = [str(element) for element in ids_range]
273
+
274
+ df_docs = pd.DataFrame(data={"ids": ids,
275
+ "documents": page_contents,
276
+ "metadatas":page_meta,
277
+ "distances":cosine_similarities_series}).sort_values("distances", ascending=False).iloc[0:k_val,:]
278
+
279
+
280
+ results_df_out = process_data_from_scores_df(df_docs, in_join_file, out_passages, vec_score_cut_off, vec_weight, orig_df_col, in_join_column, search_df_join_column)
281
+
282
+ # If nothing found, return error message
283
+ if results_df_out.empty:
284
+ return 'No result found!', None
285
+
286
+ results_df_name = "semantic_search_result_" + today_rev + ".csv"
287
+ results_df_out.to_csv(results_df_name, index= None)
288
+ results_first_text = results_df_out.iloc[0, 1]
289
+
290
+ return results_first_text, results_df_name
291
+
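The cosine-similarity step in `jina_simple_retrieval` can be exercised on its own with random tensors, as a sanity check of the normalise-then-matmul approach used above:

```python
# Standalone sketch of the cosine-similarity calculation, with random tensors
# standing in for real query/document embeddings.
from torch import mm, rand

query_tensor = rand(1, 512)           # one query embedding
vectorstore_tensor = rand(1000, 512)  # embeddings for 1,000 documents

query_norm = query_tensor / query_tensor.norm(dim=1, keepdim=True)
vectorstore_norm = vectorstore_tensor / vectorstore_tensor.norm(dim=1, keepdim=True)

cosine_similarities = mm(query_norm, vectorstore_norm.T).flatten()
top_k = cosine_similarities.argsort(descending=True)[:5]
print(top_k, cosine_similarities[top_k])
```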
292
+ # Deprecated Chroma functions - kept just in case needed in future.
293
+
294
+ def docs_to_chroma_save_deprecated(docs_out, embeddings = embeddings_model, progress=gr.Progress()):
295
+ '''
296
+ Takes a Langchain document class and saves it into a Chroma sqlite file. Not currently used.
297
+ '''
298
+
299
+ print(f"> Total split documents: {len(docs_out)}")
300
+
301
+ #print(docs_out)
302
+
303
+ page_contents = [doc.page_content for doc in docs_out]
304
+ page_meta = [doc.metadata for doc in docs_out]
305
+ ids_range = range(0,len(page_contents))
306
+ ids = [str(element) for element in ids_range]
307
+
308
+ tic = time.perf_counter()
309
+ #embeddings_list = []
310
+ #for page in progress.tqdm(page_contents, desc = "Preparing search index", unit = "rows"):
311
+ # embeddings_list.append(embeddings.encode(sentences=page, max_length=1024).tolist())
312
+
313
+ embeddings_list = embeddings.encode(sentences=page_contents, max_length=256, show_progress_bar = True, batch_size = 32).tolist() # For Jina embeddings
314
+ #embeddings_list = embeddings.encode(sentences=page_contents, normalize_embeddings=True).tolist() # For BGE embeddings
315
+ #embeddings_list = embeddings.encode(sentences=page_contents).tolist() # For minilm
316
+
317
+ toc = time.perf_counter()
318
+ time_out = f"The embedding took {toc - tic:0.1f} seconds"
319
+
320
+ #pd.Series(embeddings_list).to_csv("embeddings_out.csv")
321
+
322
+ # Jina tiny
323
+ # This takes about 300 seconds for 240,000 records = 800 / second, 1024 max length
324
+ # For 50k records:
325
+ # 61 seconds at 1024 max length
326
+ # 55 seconds at 512 max length
327
+ # 43 seconds at 256 max length
328
+ # 31 seconds at 128 max length
329
+
330
+ # The embedding took 1372.5 seconds at 256 max length for 655,020 case notes
331
+
332
+ # BGE small
333
+ # 96 seconds for 50k records at 512 length
334
+
335
+ # all-MiniLM-L6-v2
336
+ # 42.5 seconds at (256?) max length
337
+
338
+ # paraphrase-MiniLM-L3-v2
339
+ # 22 seconds for 128 max length
340
+
341
+
342
+ print(time_out)
343
+
344
+ chroma_tic = time.perf_counter()
345
+
346
+ # Create a new Chroma collection to store the documents and metadata. We don't need to specify an embedding function, and the default will be used.
347
+ client = chromadb.PersistentClient(path="./last_year", settings=Settings(
348
+ anonymized_telemetry=False))
349
+
350
+ try:
351
+ print("Deleting existing collection.")
352
+ #collection = client.get_collection(name="my_collection")
353
+ client.delete_collection(name="my_collection")
354
+ print("Creating new collection.")
355
+ collection = client.create_collection(name="my_collection")
356
+ except:
357
+ print("Creating new collection.")
358
+ collection = client.create_collection(name="my_collection")
359
+
360
+ # Max batch size is about 40,000, so add records in batches of that size in a loop
361
+ def create_batch_ranges(in_list, batch_size=40000):
362
+ total_rows = len(in_list)
363
+ ranges = []
364
+
365
+ for start in range(0, total_rows, batch_size):
366
+ end = min(start + batch_size, total_rows)
367
+ ranges.append(range(start, end))
368
+
369
+ return ranges
370
+
371
+ batch_ranges = create_batch_ranges(embeddings_list)
372
+ print(batch_ranges)
373
+
374
+ for row_range in progress.tqdm(batch_ranges, desc = "Creating vector database", unit = "batches of 40,000 rows"):
375
+
376
+ collection.add(
377
+ documents = page_contents[row_range.start:row_range.stop],
378
+ embeddings = embeddings_list[row_range.start:row_range.stop],
379
+ metadatas = page_meta[row_range.start:row_range.stop],
380
+ ids = ids[row_range.start:row_range.stop])
381
+ #print("Here")
382
+
383
+ # print(collection.count())
384
+
385
+
386
+ #chatf.vectorstore = vectorstore_func
387
+
388
+ chroma_toc = time.perf_counter()
389
+
390
+ chroma_time_out = f"Loading to Chroma db took {chroma_toc - chroma_tic:0.1f} seconds"
391
+ print(chroma_time_out)
392
+
393
+ out_message = "Document processing complete"
394
+
395
+ return out_message, collection
396
+
397
+ def chroma_retrieval_deprecated(new_question_kworded:str, vectorstore, docs, orig_df_col:str, k_val:int, out_passages:int,
398
+ vec_score_cut_off:float, vec_weight:float, in_join_file = None, in_join_column = None, search_df_join_column = None, embeddings = embeddings_model): # ,vectorstore, embeddings
399
+
400
+ query = embeddings.encode(new_question_kworded).tolist()
401
+
402
+ docs = vectorstore.query(
403
+ query_embeddings=query,
404
+ n_results= k_val # No practical limit on number of responses returned
405
+ #where={"metadata_field": "is_equal_to_this"},
406
+ #where_document={"$contains":"search_string"}
407
+ )
408
+
409
+ df_docs = pd.DataFrame(data={'ids': docs['ids'][0],
410
+ 'documents': docs['documents'][0],
411
+ 'metadatas':docs['metadatas'][0],
412
+ 'distances':docs['distances'][0]#,
413
+ #'embeddings': docs['embeddings']
414
+ })
415
+
416
+ results_df_out = process_data_from_scores_df(df_docs, in_join_file, out_passages, vec_score_cut_off, vec_weight, orig_df_col, in_join_column, search_df_join_column)
417
+
418
+ results_df_name = "semantic_search_result.csv"
419
+ results_df_out.to_csv(results_df_name, index= None)
420
+ results_first_text = results_df_out[orig_df_col].iloc[0]
421
+
422
+ return results_first_text, results_df_name
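
In case the deprecated Chroma path is revived, here is a hedged end-to-end sketch of the batched-add-then-query pattern the two functions above use; the collection name, batch size, document count and embedding model are illustrative assumptions.

```python
# Hedged sketch of the batched add and query pattern used by the deprecated
# Chroma functions above. All names and sizes are illustrative placeholders.
import chromadb
from chromadb.config import Settings
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
documents = [f"Example passage number {i}" for i in range(100_000)]
ids = [str(i) for i in range(len(documents))]
embeddings_list = model.encode(documents).tolist()

client = chromadb.PersistentClient(path="./example_db", settings=Settings(anonymized_telemetry=False))
collection = client.get_or_create_collection(name="example_collection")

batch_size = 40_000  # assumed safe maximum per add() call
for start in range(0, len(documents), batch_size):
    end = min(start + batch_size, len(documents))
    collection.add(documents=documents[start:end],
                   embeddings=embeddings_list[start:end],
                   ids=ids[start:end])

# Query the collection with an embedded search phrase
query_embedding = model.encode("nature and wildlife").tolist()
results = collection.query(query_embeddings=[query_embedding], n_results=5)
print(results["documents"][0])
```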
search_funcs/{ingest.py → semantic_ingest_functions.py} RENAMED
@@ -4,27 +4,17 @@ import os
4
  import time
5
  import re
6
  import ast
 
7
  import pandas as pd
8
  import gradio as gr
9
  from typing import Type, List, Literal
10
- from langchain.text_splitter import RecursiveCharacterTextSplitter
11
 
12
  from pydantic import BaseModel, Field
13
 
14
  # Creating an alias for pandas DataFrame using Type
15
  PandasDataFrame = Type[pd.DataFrame]
16
 
17
- # class Document(BaseModel):
18
- # """Class for storing a piece of text and associated metadata. Implementation adapted from Langchain code: https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/documents/base.py"""
19
-
20
- # page_content: str
21
- # """String text."""
22
- # metadata: dict = Field(default_factory=dict)
23
- # """Arbitrary metadata about the page content (e.g., source, relationships to other
24
- # documents, etc.).
25
- # """
26
- # type: Literal["Document"] = "Document"
27
-
28
  class Document(BaseModel):
29
  """Class for storing a piece of text and associated metadata. Implementation adapted from Langchain code: https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/documents/base.py"""
30
 
@@ -36,25 +26,30 @@ class Document(BaseModel):
36
  """
37
  type: Literal["Document"] = "Document"
38
 
 
39
  split_strat = ["\n\n", "\n", ". ", "! ", "? "]
40
- chunk_size = 500
41
  chunk_overlap = 0
42
  start_index = True
43
 
 
 
 
 
44
  ## Parse files
45
- def determine_file_type(file_path):
46
- """
47
- Determine the file type based on its extension.
48
 
49
- Parameters:
50
- file_path (str): Path to the file.
51
 
52
- Returns:
53
- str: File extension (e.g., '.pdf', '.docx', '.txt', '.html').
54
- """
55
- return os.path.splitext(file_path)[1].lower()
56
 
57
- def parse_file(file_paths, text_column='text'):
58
  """
59
  Accepts a list of file paths, determines each file's type based on its extension,
60
  and passes it to the relevant parsing function.
@@ -87,16 +82,16 @@ def parse_file(file_paths, text_column='text'):
87
  file_names = []
88
 
89
  for file_path in file_paths:
90
- print(file_path.name)
91
  #file = open(file_path.name, 'r')
92
  #print(file)
93
- file_extension = determine_file_type(file_path.name)
94
  if file_extension in extension_to_parser:
95
  parsed_contents[file_path.name] = extension_to_parser[file_extension](file_path.name)
96
  else:
97
  parsed_contents[file_path.name] = f"Unsupported file type: {file_extension}"
98
 
99
- filename_end = get_file_path_end(file_path.name)
100
 
101
  file_names.append(filename_end)
102
 
@@ -117,7 +112,7 @@ def text_regex_clean(text):
117
 
118
  return text
119
 
120
- def parse_csv_or_excel(file_path, text_column = "text"):
121
  """
122
  Read in a CSV or Excel file.
123
 
@@ -133,91 +128,50 @@ def parse_csv_or_excel(file_path, text_column = "text"):
133
 
134
  file_list = [string.name for string in file_path]
135
 
136
- print(file_list)
137
 
138
- data_file_names = [string for string in file_list if "tokenised" not in string]
139
 
 
140
 
141
  #for file_path in file_paths:
142
- file_extension = determine_file_type(data_file_names[0])
143
- file_name = get_file_path_end(data_file_names[0])
144
- file_names = [file_name]
145
-
146
- print(file_extension)
147
-
148
- if file_extension == ".csv":
149
- df = pd.read_csv(data_file_names[0], low_memory=False)
150
- if text_column not in df.columns: return pd.DataFrame(), ['Please choose a valid column name']
151
- df['source'] = file_name
152
- df['page_section'] = ""
153
- elif file_extension == ".xlsx":
154
- df = pd.read_excel(data_file_names[0], engine='openpyxl')
155
- if text_column not in df.columns: return pd.DataFrame(), ['Please choose a valid column name']
156
- df['source'] = file_name
157
- df['page_section'] = ""
158
- elif file_extension == ".parquet":
159
- df = pd.read_parquet(data_file_names[0])
160
- if text_column not in df.columns: return pd.DataFrame(), ['Please choose a valid column name']
161
- df['source'] = file_name
162
- df['page_section'] = ""
163
- else:
164
- print(f"Unsupported file type: {file_extension}")
165
- return pd.DataFrame(), ['Please choose a valid file type']
166
 
 
 
 
 
167
  message = "Loaded in file. Now converting to document format."
168
  print(message)
169
 
170
- return df, file_names, message
171
 
172
- def get_file_path_end(file_path):
173
- match = re.search(r'(.*[\/\\])?(.+)$', file_path)
174
-
175
- filename_end = match.group(2) if match else ''
176
-
177
- return filename_end
178
 
179
  # +
180
  # Convert parsed text to docs
181
  # -
182
 
183
- def text_to_docs(text_dict: dict, chunk_size: int = chunk_size) -> List[Document]:
184
- """
185
- Converts the output of parse_file (a dictionary of file paths to content)
186
- to a list of Documents with metadata.
187
- """
188
-
189
- doc_sections = []
190
- parent_doc_sections = []
191
-
192
- for file_path, content in text_dict.items():
193
- ext = os.path.splitext(file_path)[1].lower()
194
-
195
- # Depending on the file extension, handle the content
196
- # if ext == '.pdf':
197
- # docs, page_docs = pdf_text_to_docs(content, chunk_size)
198
- # elif ext in ['.html', '.htm', '.txt', '.docx']:
199
- # docs = html_text_to_docs(content, chunk_size)
200
- if ext in ['.csv', '.xlsx']:
201
- docs, page_docs = csv_excel_text_to_docs(content, chunk_size)
202
- else:
203
- print(f"Unsupported file type {ext} for {file_path}. Skipping.")
204
- continue
205
-
206
-
207
- filename_end = get_file_path_end(file_path)
208
-
209
- #match = re.search(r'(.*[\/\\])?(.+)$', file_path)
210
- #filename_end = match.group(2) if match else ''
211
-
212
- # Add filename as metadata
213
- for doc in docs: doc.metadata["source"] = filename_end
214
- #for parent_doc in parent_docs: parent_doc.metadata["source"] = filename_end
215
-
216
- doc_sections.extend(docs)
217
- #parent_doc_sections.extend(parent_docs)
218
-
219
- return doc_sections#, page_docs
220
-
221
  def write_out_metadata_as_string(metadata_in):
222
  # If metadata_in is a single dictionary, wrap it in a list
223
  if isinstance(metadata_in, dict):
@@ -228,74 +182,39 @@ def write_out_metadata_as_string(metadata_in):
228
 
229
  def combine_metadata_columns(df, cols):
230
 
231
- df['metadatas'] = "{"
232
- df['blank_column'] = ""
233
 
234
  for n, col in enumerate(cols):
235
  df[col] = df[col].astype(str).str.replace('"',"'").str.replace('\n', ' ').str.replace('\r', ' ').str.replace('\r\n', ' ').str.cat(df['blank_column'].astype(str), sep="")
236
 
237
- df['metadatas'] = df['metadatas'] + '"' + cols[n] + '": "' + df[col] + '", '
238
-
239
-
240
- df['metadatas'] = (df['metadatas'] + "}").str.replace(', }', '}')
241
 
242
- return df['metadatas']
243
 
244
- def csv_excel_text_to_docs(df, text_column='text', chunk_size=None) -> List[Document]:
245
- """Converts a DataFrame's content to a list of Documents with metadata."""
246
-
247
- #print(df.head())
248
-
249
- print("Converting to documents.")
250
-
251
- doc_sections = []
252
- df[text_column] = df[text_column].astype(str) # Ensure column is a string column
253
 
254
- # For each row in the dataframe
255
- for idx, row in df.iterrows():
256
- # Extract the text content for the document
257
- doc_content = row[text_column]
258
-
259
- # Generate metadata containing other columns' data
260
- metadata = {"row": idx + 1}
261
- for col, value in row.items():
262
- if col != text_column:
263
- metadata[col] = value
264
-
265
- metadata_string = write_out_metadata_as_string(metadata)[0]
266
-
267
- # If chunk_size is provided, split the text into chunks
268
- if chunk_size:
269
- # Assuming you have a text splitter function similar to the PDF handling
270
- text_splitter = RecursiveCharacterTextSplitter(
271
- chunk_size=chunk_size,
272
- chunk_overlap=chunk_overlap,
273
- split_strat=split_strat,
274
- start_index=start_index
275
- ) #Other arguments as required by the splitter
276
-
277
- sections = text_splitter.split_text(doc_content)
278
-
279
-
280
- # For each section, create a Document object
281
- for i, section in enumerate(sections):
282
- section = '. '.join([metadata_string, section])
283
- doc = Document(page_content=section,
284
- metadata={**metadata, "section": i, "row_section": f"{metadata['row']}-{i}"})
285
- doc_sections.append(doc)
286
-
287
- #print("Chunking currently disabled")
288
-
289
- else:
290
- # If no chunk_size is provided, create a single Document object for the row
291
- #doc_content = '. '.join([metadata_string, doc_content])
292
- doc = Document(page_content=doc_content, metadata=metadata)
293
- doc_sections.append(doc)
294
 
295
- message = "Data converted to document format. Now creating/loading document embeddings."
296
- print(message)
 
 
297
 
298
- return doc_sections, message
 
 
 
 
 
 
 
 
 
 
 
 
 
 
299
 
300
  def clean_line_breaks(text):
301
  # Replace \n and \r\n with a space
@@ -322,14 +241,106 @@ def parse_metadata(row):
322
  # Handle the error or log it
323
  return None # or some default value
324
 
325
- def csv_excel_text_to_docs(df, text_column='text', chunk_size=None, progress=gr.Progress()) -> List[Document]:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
326
  """Converts a DataFrame's content to a list of dictionaries in the 'Document' format, containing page_content and associated metadata."""
327
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
328
  ingest_tic = time.perf_counter()
329
 
330
  doc_sections = []
331
  df[text_column] = df[text_column].astype(str).str.strip() # Ensure column is a string column
332
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
333
  cols = [col for col in df.columns if col != text_column]
334
 
335
  df["metadata"] = combine_metadata_columns(df, cols)
@@ -341,71 +352,75 @@ def csv_excel_text_to_docs(df, text_column='text', chunk_size=None, progress=gr.
341
  #doc_sections = df[["page_content", "metadata"]].to_dict(orient='records')
342
  #doc_sections = [Document(**row) for row in df[["page_content", "metadata"]].to_dict(orient='records')]
343
 
 
344
  # Create a list of Document objects
345
  doc_sections = [Document(page_content=row['page_content'],
346
  metadata= parse_metadata(row["metadata"]))
347
- for index, row in progress.tqdm(df.iterrows(), desc = "Splitting up text", unit = "rows")]
348
-
349
  ingest_toc = time.perf_counter()
350
 
351
  ingest_time_out = f"Preparing documents took {ingest_toc - ingest_tic:0.1f} seconds"
352
  print(ingest_time_out)
353
 
354
- return doc_sections, "Finished splitting documents"
355
-
356
- # # Functions for working with documents after loading them back in
357
-
358
- def pull_out_data(series):
 
 
 
 
359
 
360
- # define a lambda function to convert each string into a tuple
361
- to_tuple = lambda x: eval(x)
362
 
363
- # apply the lambda function to each element of the series
364
- series_tup = series.apply(to_tuple)
365
 
366
- series_tup_content = list(zip(*series_tup))[1]
 
367
 
368
- series = pd.Series(list(series_tup_content))#.str.replace("^Main post content", "", regex=True).str.strip()
 
 
369
 
370
- return series
 
371
 
372
- def docs_from_csv(df):
 
373
 
374
- import ast
375
-
376
- documents = []
377
-
378
- page_content = pull_out_data(df["0"])
379
- metadatas = pull_out_data(df["1"])
380
-
381
- for x in range(0,len(df)):
382
- new_doc = Document(page_content=page_content[x], metadata=metadatas[x])
383
- documents.append(new_doc)
384
-
385
- return documents
386
-
387
- def docs_from_lists(docs, metadatas):
388
-
389
- documents = []
390
-
391
- for x, doc in enumerate(docs):
392
- new_doc = Document(page_content=doc, metadata=metadatas[x])
393
- documents.append(new_doc)
394
-
395
- return documents
396
 
397
- def docs_elements_from_csv_save(docs_path="documents.csv"):
398
 
399
- documents = pd.read_csv(docs_path)
 
 
 
 
400
 
401
- docs_out = docs_from_csv(documents)
 
 
 
402
 
403
- out_df = pd.DataFrame(docs_out)
 
 
404
 
405
- docs_content = pull_out_data(out_df[0].astype(str))
 
406
 
407
- docs_meta = pull_out_data(out_df[1].astype(str))
 
 
408
 
409
- doc_sources = [d['source'] for d in docs_meta]
 
 
 
 
410
 
411
- return out_df, docs_content, docs_meta, doc_sources
 
 
4
  import time
5
  import re
6
  import ast
7
+ import gzip
8
  import pandas as pd
9
  import gradio as gr
10
  from typing import Type, List, Literal
11
+ #from langchain.text_splitter import RecursiveCharacterTextSplitter
12
 
13
  from pydantic import BaseModel, Field
14
 
15
  # Creating an alias for pandas DataFrame using Type
16
  PandasDataFrame = Type[pd.DataFrame]
17
 
 
 
 
 
 
 
 
 
 
 
 
18
  class Document(BaseModel):
19
  """Class for storing a piece of text and associated metadata. Implementation adapted from Langchain code: https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/documents/base.py"""
20
 
 
26
  """
27
  type: Literal["Document"] = "Document"
28
 
29
+ # Constants for chunking - not currently used
30
  split_strat = ["\n\n", "\n", ". ", "! ", "? "]
31
+ chunk_size = 512
32
  chunk_overlap = 0
33
  start_index = True
34
 
35
+ from search_funcs.helper_functions import get_file_path_end_with_ext, detect_file_type, get_file_path_end
36
+ from search_funcs.bm25_functions import save_prepared_bm25_data
37
+ from search_funcs.clean_funcs import initial_clean
38
+
39
  ## Parse files
40
+ # def detect_file_type(file_path):
41
+ # """
42
+ # Determine the file type based on its extension.
43
 
44
+ # Parameters:
45
+ # file_path (str): Path to the file.
46
 
47
+ # Returns:
48
+ # str: File extension (e.g., '.pdf', '.docx', '.txt', '.html').
49
+ # """
50
+ # return os.path.splitext(file_path)[1].lower()
51
 
52
+ def parse_file_not_used(file_paths, text_column='text'):
53
  """
54
  Accepts a list of file paths, determines each file's type based on its extension,
55
  and passes it to the relevant parsing function.
 
82
  file_names = []
83
 
84
  for file_path in file_paths:
85
+ #print(file_path.name)
86
  #file = open(file_path.name, 'r')
87
  #print(file)
88
+ file_extension = detect_file_type(file_path.name)
89
  if file_extension in extension_to_parser:
90
  parsed_contents[file_path.name] = extension_to_parser[file_extension](file_path.name)
91
  else:
92
  parsed_contents[file_path.name] = f"Unsupported file type: {file_extension}"
93
 
94
+ filename_end = get_file_path_end_with_ext(file_path.name)
95
 
96
  file_names.append(filename_end)
97
 
 
112
 
113
  return text
114
 
115
+ def parse_csv_or_excel(file_path, data_state, text_column = "text"):
116
  """
117
  Read in a CSV or Excel file.
118
 
 
128
 
129
  file_list = [string.name for string in file_path]
130
 
131
+ #print(file_list)
132
 
133
+ data_file_names = [string for string in file_list if "tokenised" not in string and "embeddings" not in string]
134
 
135
+ data_file_name = data_file_names[0]
136
 
137
  #for file_path in file_paths:
138
+ file_name = get_file_path_end_with_ext(data_file_name)
139
+
140
+ #print(file_extension)
141
+
142
+ # if file_extension == "csv":
143
+ # df = pd.read_csv(data_file_names[0], low_memory=False)
144
+ # if text_column not in df.columns: return pd.DataFrame(), ['Please choose a valid column name']
145
+ # df['source'] = file_name
146
+ # df['page_section'] = ""
147
+ # elif file_extension == "xlsx":
148
+ # df = pd.read_excel(data_file_names[0], engine='openpyxl')
149
+ # if text_column not in df.columns: return pd.DataFrame(), ['Please choose a valid column name']
150
+ # df['source'] = file_name
151
+ # df['page_section'] = ""
152
+ # elif file_extension == "parquet":
153
+ # df = pd.read_parquet(data_file_names[0])
154
+ # if text_column not in df.columns: return pd.DataFrame(), ['Please choose a valid column name']
155
+ # df['source'] = file_name
156
+ # df['page_section'] = ""
157
+ # else:
158
+ # print(f"Unsupported file type: {file_extension}")
159
+ # return pd.DataFrame(), ['Please choose a valid file type']
 
 
160
 
161
+ df = data_state
162
+ #df['source'] = file_name
163
+ #df['page_section'] = ""
164
+
165
  message = "Loaded in file. Now converting to document format."
166
  print(message)
167
 
168
+ return df, file_name, message
169
 
 
 
 
 
 
 
170
 
171
  # +
172
  # Convert parsed text to docs
173
  # -
174
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
175
  def write_out_metadata_as_string(metadata_in):
176
  # If metadata_in is a single dictionary, wrap it in a list
177
  if isinstance(metadata_in, dict):
 
182
 
183
  def combine_metadata_columns(df, cols):
184
 
185
+ df['metadata'] = '{'
186
+ df['blank_column'] = ''
187
 
188
  for n, col in enumerate(cols):
189
  df[col] = df[col].astype(str).str.replace('"',"'").str.replace('\n', ' ').str.replace('\r', ' ').str.replace('\r\n', ' ').str.cat(df['blank_column'].astype(str), sep="")
190
 
191
+ df['metadata'] = df['metadata'] + '"' + cols[n] + '": "' + df[col] + '", '
 
 
 
192
 
 
193
 
194
+ df['metadata'] = (df['metadata'] + "}").str.replace(', }', '}').str.replace('", }"', '}')
 
 
 
 
 
 
 
 
195
 
196
+ return df['metadata']
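
A toy round trip of the metadata handling above: build the per-row '{"col": "value", ...}' string with the function just defined, then parse it back into a dict (parse_metadata is assumed to rely on ast.literal_eval, as suggested by the ast import). The data is made up for illustration.

```python
# Toy round trip for the metadata string built by combine_metadata_columns above.
# parse_metadata is assumed to use ast.literal_eval; the example data is illustrative.
import ast
import pandas as pd

example_df = pd.DataFrame({"text": ["some passage"], "author": ["A. Person"], "year": ["2023"]})
metadata_series = combine_metadata_columns(example_df, ["author", "year"])

print(metadata_series.iloc[0])                    # '{"author": "A. Person", "year": "2023"}'
print(ast.literal_eval(metadata_series.iloc[0]))  # {'author': 'A. Person', 'year': '2023'}
```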
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
197
 
198
+ def split_string_into_chunks(input_string, max_length, split_symbols):
199
+ # Check if input_string or split_symbols are empty
200
+ if not input_string or not split_symbols:
201
+ return [input_string]
202
 
203
+ chunks = []
204
+ current_chunk = ""
205
+
206
+ for char in input_string:
207
+ current_chunk += char
208
+ if len(current_chunk) >= max_length or char in split_symbols:
209
+ # Add the current chunk to the chunks list
210
+ chunks.append(current_chunk)
211
+ current_chunk = ""
212
+
213
+ # Adding any remaining part of the string
214
+ if current_chunk:
215
+ chunks.append(current_chunk)
216
+
217
+ return chunks
218
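
A quick toy run of split_string_into_chunks (the text and the single-character split symbols below are illustrative inputs, not values used elsewhere in the app):

```python
# Toy usage of split_string_into_chunks. The function emits a chunk whenever it
# hits a split symbol or the current chunk reaches max_length characters.
text = "First sentence. Second sentence! Third sentence? Then one much longer clause with no punctuation"
chunks = split_string_into_chunks(text, max_length=30, split_symbols=[".", "!", "?"])
print(chunks)
# Chunks end at '.', '!' or '?', or are cut at 30 characters when no symbol appears.
```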
 
219
  def clean_line_breaks(text):
220
  # Replace \n and \r\n with a space
 
241
  # Handle the error or log it
242
  return None # or some default value
243
 
244
+ # def csv_excel_text_to_docs_deprecated(df, text_column='text', chunk_size=None) -> List[Document]:
245
+ # """Converts a DataFrame's content to a list of Documents with metadata."""
246
+
247
+ # print("Converting to documents.")
248
+
249
+ # doc_sections = []
250
+ # df[text_column] = df[text_column].astype(str) # Ensure column is a string column
251
+
252
+ # # For each row in the dataframe
253
+ # for idx, row in df.iterrows():
254
+ # # Extract the text content for the document
255
+ # doc_content = row[text_column]
256
+
257
+ # # Generate metadata containing other columns' data
258
+ # metadata = {"row": idx + 1}
259
+ # for col, value in row.items():
260
+ # if col != text_column:
261
+ # metadata[col] = value
262
+
263
+ # metadata_string = write_out_metadata_as_string(metadata)[0]
264
+
265
+ # # If chunk_size is provided, split the text into chunks
266
+ # if chunk_size:
267
+ # sections = split_string_into_chunks(doc_content, chunk_size, split_strat)
268
+
269
+ # # Langchain usage deprecated
270
+ # # text_splitter = RecursiveCharacterTextSplitter(
271
+ # # chunk_size=chunk_size,
272
+ # # chunk_overlap=chunk_overlap,
273
+ # # split_strat=split_strat,
274
+ # # start_index=start_index
275
+ # # ) #Other arguments as required by the splitter
276
+
277
+ # # sections = text_splitter.split_text(doc_content)
278
+
279
+ # # For each section, create a Document object
280
+ # for i, section in enumerate(sections):
281
+ # section = '. '.join([metadata_string, section])
282
+ # doc = Document(page_content=section,
283
+ # metadata={**metadata, "section": i, "row_section": f"{metadata['row']}-{i}"})
284
+ # doc_sections.append(doc)
285
+
286
+ # else:
287
+ # # If no chunk_size is provided, create a single Document object for the row
288
+ # #doc_content = '. '.join([metadata_string, doc_content])
289
+ # doc = Document(page_content=doc_content, metadata=metadata)
290
+ # doc_sections.append(doc)
291
+
292
+ # message = "Data converted to document format. Now creating/loading document embeddings."
293
+ # print(message)
294
+
295
+ # return doc_sections, message
296
+
297
+ def csv_excel_text_to_docs(df, in_file, text_column='text', clean = "No", return_intermediate_files = "No", chunk_size=None, progress=gr.Progress()) -> List[Document]:
298
  """Converts a DataFrame's content to a list of dictionaries in the 'Document' format, containing page_content and associated metadata."""
299
 
300
+ file_list = [string.name for string in in_file]
301
+
302
+ data_file_names = [string for string in file_list if "tokenised" not in string and "embeddings" not in string]
303
+ data_file_name = data_file_names[0]
304
+
305
+ # If the file already contains prepared documents, load them in directly
306
+ if "prepared_docs" in data_file_name:
307
+ print("Loading in documents from file.")
308
+
309
+ #print(df[0:5])
310
+ #section_series = df.iloc[:,0]
311
+ #section_series = "{" + section_series + "}"
312
+
313
+ doc_sections = df
314
+
315
+ print(doc_sections[0])
316
+
317
+ # Convert each element in the Series to a Document instance
318
+ #doc_sections = section_series.apply(lambda x: Document(**x))
319
+
320
+ return doc_sections, "Finished preparing documents"
321
+ # df = document_to_dataframe(df.iloc[:,0])
322
+
323
  ingest_tic = time.perf_counter()
324
 
325
  doc_sections = []
326
  df[text_column] = df[text_column].astype(str).str.strip() # Ensure column is a string column
327
 
328
+ if clean == "Yes":
329
+ clean_tic = time.perf_counter()
330
+ print("Starting data clean.")
331
+
332
+ df = df.drop_duplicates(text_column)
333
+
334
+ df[text_column] = initial_clean(df[text_column])
335
+ df_list = list(df[text_column])
336
+
337
+ # Save to file if you have cleaned the data
338
+ out_file_name, text_column = save_prepared_bm25_data(data_file_name, df_list, df, text_column)
339
+
340
+ clean_toc = time.perf_counter()
341
+ clean_time_out = f"Cleaning the text took {clean_toc - clean_tic:0.1f} seconds."
342
+ print(clean_time_out)
343
+
344
  cols = [col for col in df.columns if col != text_column]
345
 
346
  df["metadata"] = combine_metadata_columns(df, cols)
 
352
  #doc_sections = df[["page_content", "metadata"]].to_dict(orient='records')
353
  #doc_sections = [Document(**row) for row in df[["page_content", "metadata"]].to_dict(orient='records')]
354
 
355
+
356
  # Create a list of Document objects
357
  doc_sections = [Document(page_content=row['page_content'],
358
  metadata= parse_metadata(row["metadata"]))
359
+ for index, row in progress.tqdm(df.iterrows(), desc = "Splitting up text", unit = "rows")]
360
+
361
  ingest_toc = time.perf_counter()
362
 
363
  ingest_time_out = f"Preparing documents took {ingest_toc - ingest_tic:0.1f} seconds"
364
  print(ingest_time_out)
365
 
366
+ if return_intermediate_files == "Yes":
367
+ data_file_out_name_no_ext = get_file_path_end(data_file_name)
368
+ file_name = data_file_out_name_no_ext + "_cleaned"
369
+ #print(doc_sections)
370
+ #page_content_series_string = pd.Series(doc_sections).astype(str)
371
+ #page_content_series_string = page_content_series_string.str.replace(" type='Document'", "").str.replace("' metadata=", "', 'metadata':").str.replace("page_content=", "{'page_content':")
372
+ #page_content_series_string = page_content_series_string + "}"
373
+ #print(page_content_series_string[0])
374
+ #metadata_series_string = pd.Series(doc_sections[1]).astype(str)
375
 
376
+ import pickle
 
377
 
378
+ if clean == "No":
379
+ #pd.DataFrame(data = {"Documents":page_content_series_string}).to_parquet(file_name + "_prepared_docs.parquet")
380
 
381
+ with gzip.open(file_name + "_prepared_docs.pkl.gz", 'wb') as file:
382
+ pickle.dump(doc_sections, file)
383
 
384
+ #pd.Series(doc_sections).to_pickle(file_name + "_prepared_docs.pkl")
385
+ elif clean == "Yes":
386
+ #pd.DataFrame(data = {"Documents":page_content_series_string}).to_parquet(file_name + "_prepared_docs_clean.parquet")
387
 
388
+ with gzip.open(file_name + "_prepared_docs_clean.pkl.gz", 'wb') as file:
389
+ pickle.dump(doc_sections, file)
390
 
391
+ #pd.Series(doc_sections).to_pickle(file_name + "_prepared_docs_clean.pkl")
392
+ print("Documents saved to file.")
393
 
394
+ return doc_sections, "Finished preparing documents."
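
To reuse the intermediate file written above without re-processing, it can be reloaded with gzip and pickle; the file name here is an illustrative example of the "<data file>_cleaned_prepared_docs.pkl.gz" naming pattern used by this function.

```python
# Hedged sketch of reloading a prepared-documents file saved by csv_excel_text_to_docs.
# The file name below is illustrative; it follows the pattern used above.
import gzip
import pickle

with gzip.open("my_data_cleaned_prepared_docs.pkl.gz", "rb") as file:
    doc_sections = pickle.load(file)

print(len(doc_sections), "documents loaded")
print(doc_sections[0].page_content[:100])
```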
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
395
 
 
396
 
397
+ def document_to_dataframe(documents):
398
+ '''
399
+ Convert a list of objects in Document format into a pandas dataframe
400
+ '''
401
+ rows = []
402
 
403
+ for doc in documents:
404
+ # Convert Document to dictionary and extract metadata
405
+ doc_dict = doc.dict()
406
+ metadata = doc_dict.pop('metadata')
407
 
408
+ # Add the page_content and type to the metadata
409
+ metadata['page_content'] = doc_dict['page_content']
410
+ metadata['type'] = doc_dict['type']
411
 
412
+ # Add to the list of rows
413
+ rows.append(metadata)
414
 
415
+ # Create a DataFrame from the list of rows
416
+ df = pd.DataFrame(rows)
417
+ return df
418
 
419
+ # Example usage
420
+ #documents = [
421
+ # Document(page_content="Example content 1", metadata={"author": "Author 1", "year": 2021}),
422
+ # Document(page_content="Example content 2", metadata={"author": "Author 2", "year": 2022})
423
+ #]
424
 
425
+ #df = document_to_dataframe(documents)
426
+ #df