seanpedrickcase committed on
Commit
99d6fba
•
1 Parent(s): d3b1ac5

Many changes to code organisation. More efficient searches from using intermediate outputs. Version 0.1

.gitignore CHANGED
@@ -13,7 +13,11 @@
13
  *.ipynb
14
  *.npy
15
  *.npz
 
 
16
  build/*
17
  dist/*
18
  __pycache__/*
19
- db/*
 
 
 
13
  *.ipynb
14
  *.npy
15
  *.npz
16
+ *.pkl
17
+ *.pkl.gz
18
  build/*
19
  dist/*
20
  __pycache__/*
21
+ db/*
22
+ experiments/*
23
+ model/*
README.md CHANGED
@@ -10,9 +10,10 @@ pinned: false
10
  license: apache-2.0
11
  ---
12
 
13
- Keyword search over your data. This is an adaptation of fast_bm25 (https://github.com/Inspirateur/Fast-BM25) to search over tabular data with a Gradio UI interface.
14
 
15
  # Guide
 
16
 
17
  1. Load in your tabular data file (.csv, .parquet, .xlsx - first sheet).
18
  2. Wait a few seconds for the file to upload, then in the dropdown menu below 'Enter the name of the text column...' choose the column from the data file that you want to search.
@@ -21,17 +22,36 @@ Keyword search over your data. This is an adaptation of fast_bm25 (https://githu
21
  5. Hit search text. You may have to wait depending on the size of the data you are searching.
22
  6. You will receive back 1. the top search result and 2. a csv of the search results found in the text ordered by relevance, joined onto the original columns from your data source.
23
 
24
  # Advanced options
25
- The search should perform well with default options, so you shouldn't need to change things here.
26
 
27
  ## Data load / save options
28
- Toggle 'Clean text during load...' to true if you want to remove html tags and lemmatise the text, i.e. remove the ends of words to retain the core of the word e.g. searched or searches becomes search. Early testing suggests that cleaning takes some time, and does not seem to improve quality of search results.
 
 
29
 
30
- ## Search options
 
 
31
  Here are a few options to modify the BM25 search parameters. If you want more information on what each parameter does, click the relevant info button to the right of the sliders.
32
 
 
 
 
33
  ## Join on additional dataframes to results
34
- I was asked to include a feature to join on additional data to the search results. This could be useful for example if you have tabular text data associated with a person ID, and after searching you would like to join on information associated with this person to aid with post-search filtering/analysis.
35
 
36
  To do this:
37
  1. Load in the tabular data you want to join in the box (.csv, .parquet, .xlsx - first sheet).
 
10
  license: apache-2.0
11
  ---
12
 
13
+ Search through long-form text fields in your tabular data, either for exact, specific terms (Keyword search) or for thematic, 'fuzzy' matches (Semantic search).
14
 
15
  # Guide
16
+ ## Keyword search
17
 
18
  1. Load in your tabular data file (.csv, .parquet, .xlsx - first sheet).
19
  2. Wait a few seconds for the file to upload, then in the dropdown menu below 'Enter the name of the text column...' choose the column from the data file that you want to search.
 
22
  5. Hit search text. You may have to wait depending on the size of the data you are searching.
23
  6. You will receive back 1. the top search result and 2. a csv of the search results found in the text ordered by relevance, joined onto the original columns from your data source.
24
 
25
+ ## Semantic search
26
+
27
+ This search type enables you to search for broader themes (e.g. happiness, nature); the search will pick out text passages that relate to these themes even if they don't contain the exact words.
28
+
29
+ 1. Load in your tabular data file (.csv, .parquet, .xlsx - first sheet).
30
+ 2. Wait a few seconds for the file to upload, then in the dropdown menu below 'Enter the name of the text column...' choose the column from the data file that you want to search.
31
+ 3. Hit 'Load data'. The 'Load progress' text box will let you know when the file is ready.
32
+ 4. In the 'Enter semantic search query here' area below this, type in the terms you would like to search for.
33
+ 5. Press 'Start semantic search'. You may have to wait depending on the size of the data you are searching.
34
+ 6. You will receive back (1) the top search result and (2) a csv of the search results ordered by relevance, joined onto the original columns from your data source.
35
+
36
+
37
  # Advanced options
38
+ The search should perform well with default options, so you shouldn't need to change things here. More details on each parameter are provided below.
39
 
40
  ## Data load / save options
41
+ Toggle 'Clean text during load...' to "Yes" if you want to remove html tags and lemmatise the text, i.e. strip word endings to retain the core of the word (e.g. 'searched' or 'searches' becomes 'search'). Early testing suggests that cleaning takes some time and does not seem to improve the quality of search results.
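For illustration, lemmatisation of this kind can be sketched with spaCy (a rough example only; the app's own cleaning functions may differ in detail):

```python
# Minimal sketch of lemmatisation with spaCy; assumes en_core_web_sm is installed.
import spacy

nlp = spacy.load("en_core_web_sm")

def lemmatise(text: str) -> str:
    # Reduce each word to its base form, e.g. 'searched' or 'searches' -> 'search'
    return " ".join(token.lemma_ for token in nlp(text))

print(lemmatise("searched searches"))
```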
42
+
43
+ 'Return intermediate files', when set to "Yes", will save a tokenised text file (for keyword search) or an embedded text file (for semantic search) during data preparation. These files can then be loaded in next time alongside the data files to save preparation time in future search sessions.
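For example, the tokenised output for keyword search is stored as a single-column parquet file that can simply be read back in on a later session (a simplified sketch; the exact file name the app writes may differ):

```python
import pandas as pd

# After the first preparation run, the tokenised corpus (a list of token lists) is saved...
corpus = [["keyword", "search", "example"], ["another", "row", "of", "text"]]
pd.DataFrame(data={"Corpus": corpus}).to_parquet("my_data_keyword_search_tokenised_data.parquet")

# ...and on a later session it can be loaded back instead of re-tokenising the text
tokenised_df = pd.read_parquet("my_data_keyword_search_tokenised_data.parquet")
corpus = tokenised_df.iloc[:, 0].tolist()
```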
44
 
45
+ 'Round embeddings to three dp...' will reduce the precision of the embedding outputs to three decimal places and multiply all values by 100, reducing the size of the output numpy array by about 50%. It seems to have minimal effect on search results according to simple comparisons, but I cannot guarantee this!
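A rough sketch of the kind of reduction described (illustrative only; the exact scaling and file names used by the app may differ):

```python
import numpy as np

embeddings = np.random.rand(1000, 512).astype(np.float32)  # stand-in for real embeddings

# Round to three decimal places and scale by 100, cutting down the number of distinct values
embeddings_small = np.round(embeddings, 3) * 100

# Fewer distinct values means the compressed .npz file ends up considerably smaller
np.savez_compressed("semantic_search_embeddings.npz", embeddings)
np.savez_compressed("semantic_search_embeddings_compressed.npz", embeddings_small)
```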
46
+
47
+ ## Keyword search options
48
  Here are a few options to modify the BM25 search parameters. If you want more information on what each parameter does, click the relevant info button to the right of the sliders.
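In terms of the underlying code, these sliders roughly map onto the parameters of the BM25 class used for keyword search (illustrative values shown):

```python
from search_funcs.bm25_functions import BM25

# The corpus is a list of token lists, one per row of the chosen text column
corpus = [["keyword", "search", "example"], ["another", "row", "of", "text"]]

# k1 controls term-frequency saturation, b controls document-length normalisation,
# and alpha sets an IDF cut-off below which terms are ignored
bm25 = BM25(corpus, k1=1.5, b=0.75, alpha=-5)
```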
49
 
50
+ ## Semantic search options
51
+ The only option here currently is the minimum similarity score needed for a result to be included. The default works quite well; in my experience, anything above 0.85 tends to return no results.
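Behind the scenes, the threshold is applied to cosine similarity scores between the embedded query and each embedded text passage, along these lines (a simplified sketch, not the app's exact code):

```python
import numpy as np

def filter_by_similarity(query_vec, doc_vecs, min_score=0.7):
    # Normalise, then take dot products to get cosine similarities
    query_norm = query_vec / np.linalg.norm(query_vec)
    doc_norms = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    similarities = doc_norms @ query_norm
    # Keep only passages scoring above the minimum threshold
    keep = similarities > min_score
    return np.flatnonzero(keep), similarities[keep]

# Example with random stand-in vectors
doc_vecs = np.random.rand(100, 512)
query_vec = np.random.rand(512)
indices, scores = filter_by_similarity(query_vec, doc_vecs, min_score=0.7)
```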
52
+
53
  ## Join on additional dataframes to results
54
+ Join additional data onto the search results. This could be useful, for example, if you have tabular text data associated with a person ID, and after searching you would like to join on information associated with that person to aid with post-search filtering/analysis (the underlying join is sketched at the end of this section).
55
 
56
  To do this:
57
  1. Load in the tabular data you want to join in the box (.csv, .parquet, .xlsx - first sheet).
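In pandas terms, the join the app performs is essentially a de-duplicated left join on the chosen key columns, roughly as follows (column names here are illustrative):

```python
import pandas as pd

results_df = pd.DataFrame({"person_id": ["1", "2"], "search_text": ["first result", "second result"]})
join_df = pd.DataFrame({"id": ["1", "2"], "team": ["A", "B"]})

# Duplicate keys are dropped first so the join does not expand the results
join_df = join_df.drop_duplicates("id")

results_joined = results_df.merge(join_df, left_on="person_id", right_on="id", how="left").drop("id", axis=1)
```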
app.py CHANGED
@@ -1,707 +1,17 @@
1
- import nltk
2
- from typing import TypeVar
3
- nltk.download('names')
4
- nltk.download('stopwords')
5
- nltk.download('wordnet')
6
- nltk.download('punkt')
7
-
8
- from search_funcs.fast_bm25 import BM25
9
- from search_funcs.clean_funcs import initial_clean, get_lemma_tokens#, stem_sentence
10
- from nltk import word_tokenize
11
- #from sentence_transformers import SentenceTransformer
12
-
13
- # Try SpaCy alternative tokeniser
14
-
15
- PandasDataFrame = TypeVar('pd.core.frame.DataFrame')
16
 
17
  import gradio as gr
18
  import pandas as pd
19
- import numpy as np
20
- import os
21
- import time
22
- import math
23
- from itertools import islice
24
- from chromadb.config import Settings
25
-
26
- from transformers import AutoModel
27
-
28
- # Load the SpaCy mode
29
- from spacy.cli import download
30
- import spacy
31
- spacy.prefer_gpu()
32
-
33
- #os.system("python -m spacy download en_core_web_sm")
34
- try:
35
- nlp = spacy.load("en_core_web_sm")
36
- except:
37
- download("en_core_web_sm")
38
- nlp = spacy.load("en_core_web_sm")
39
-
40
-
41
- # model = AutoModel.from_pretrained('./model_and_tokenizer/int8-model.onnx', use_embedding_runtime=True)
42
- # sentence_embeddings = model.generate(engine_input)['last_hidden_state:0']
43
-
44
- # print("Sentence embeddings:", sentence_embeddings)
45
-
46
- import search_funcs.ingest as ing
47
- #import search_funcs.chatfuncs as chatf
48
-
49
- # Import Chroma and instantiate a client. The default Chroma client is ephemeral, meaning it will not save to disk.
50
- import chromadb
51
- #from typing_extensions import Protocol
52
- #from chromadb import Documents, EmbeddingFunction, Embeddings
53
-
54
- from torch import cuda, backends, tensor, mm
55
-
56
- # Check for torch cuda
57
- print(cuda.is_available())
58
- print(backends.cudnn.enabled)
59
- if cuda.is_available():
60
- torch_device = "cuda"
61
- os.system("nvidia-smi")
62
-
63
- else:
64
- torch_device = "cpu"
65
-
66
- # Remove Chroma database file. If it exists as it can cause issues
67
- chromadb_file = "chroma.sqlite3"
68
-
69
- if os.path.isfile(chromadb_file):
70
- os.remove(chromadb_file)
71
-
72
-
73
- def load_embeddings(embeddings_name = "jinaai/jina-embeddings-v2-small-en"):
74
- '''
75
- Load embeddings model and create a global variable based on it.
76
- '''
77
-
78
- # Import Chroma and instantiate a client. The default Chroma client is ephemeral, meaning it will not save to disk.
79
-
80
- #else:
81
- embeddings_func = AutoModel.from_pretrained(embeddings_name, trust_remote_code=True, device_map="auto")
82
-
83
- global embeddings
84
-
85
- embeddings = embeddings_func
86
-
87
- return embeddings
88
-
89
- # Load embeddings
90
- embeddings_name = "jinaai/jina-embeddings-v2-small-en"
91
- embeddings_model = AutoModel.from_pretrained(embeddings_name, trust_remote_code=True, device_map="auto")
92
- #embeddings_model = SentenceTransformer("BAAI/bge-small-en-v1.5")
93
- #embeddings_model = SentenceTransformer("paraphrase-MiniLM-L3-v2")
94
-
95
- #tokenizer = AutoTokenizer.from_pretrained(embeddings_name, device_map = "auto")#to(torch_device) # From Jina
96
- # Construction 2 - from SpaCy - https://spacy.io/api/tokenizer
97
-
98
-
99
- #from spacy.lang.en import English
100
- #nlp = #English()
101
- # Create a Tokenizer with the default settings for English
102
- # including punctuation rules and exceptions
103
- tokenizer = nlp.tokenizer
104
-
105
- embeddings = embeddings_model#load_embeddings(embeddings_name)
106
-
107
-
108
- def prepare_input_data(in_file, text_column, clean="No", progress=gr.Progress()):
109
-
110
- file_list = [string.name for string in in_file]
111
-
112
- print(file_list)
113
-
114
- data_file_names = [string for string in file_list if "tokenised" not in string]
115
-
116
- df = read_file(data_file_names[0])
117
-
118
- ## Load in pre-tokenised corpus if exists
119
- tokenised_df = pd.DataFrame()
120
-
121
- tokenised_file_names = [string for string in file_list if "tokenised" in string]
122
-
123
- if tokenised_file_names:
124
- tokenised_df = read_file(tokenised_file_names[0])
125
- print("Tokenised df is: ", tokenised_df.head())
126
-
127
- #df = pd.read_parquet(file_in.name)
128
- df_list = list(df[text_column].astype(str).str.lower())
129
-
130
- # def get_total_batches(my_list, batch_size):
131
- # return math.ceil(len(my_list) / batch_size)
132
-
133
- # def batch(iterable, batch_size):
134
- # iterator = iter(iterable)
135
- # for first in iterator:
136
- # yield [first] + list(islice(iterator, batch_size - 1))
137
-
138
- batch_size = 256
139
-
140
- tic = time.perf_counter()
141
-
142
- if clean == "Yes":
143
- df_list_clean = initial_clean(df_list)
144
-
145
- # Save to file if you have cleaned the data
146
- out_file_name = save_prepared_data(in_file, df_list_clean, df, text_column)
147
-
148
-
149
- # Tokenize texts in batches
150
- if not tokenised_df.empty:
151
- corpus = tokenised_df.iloc[:,0].tolist()
152
- print("Corpus is: ", corpus[0:5])
153
-
154
- else:
155
- corpus = []
156
- for doc in tokenizer.pipe(progress.tqdm(df_list_clean, desc = "Tokenising text", unit = "rows"), batch_size=batch_size):
157
- corpus.append([token.text for token in doc])
158
-
159
- else:
160
-
161
- print(df_list[0])
162
-
163
- # Tokenize texts in batches
164
- if not tokenised_df.empty:
165
- corpus = tokenised_df.iloc[:,0].tolist()
166
- print("Corpus is: ", corpus[0:5])
167
-
168
- else:
169
-
170
- corpus = []
171
- for doc in tokenizer.pipe(progress.tqdm(df_list, desc = "Tokenising text", unit = "rows"), batch_size=batch_size):
172
- corpus.append([token.text for token in doc])
173
-
174
- out_file_name = None
175
-
176
- print(corpus[0])
177
-
178
-
179
- toc = time.perf_counter()
180
- tokenizer_time_out = f"Tokenising the text took {toc - tic:0.1f} seconds"
181
-
182
- print("Finished data clean. " + tokenizer_time_out)
183
-
184
- if len(df_list) >= 20:
185
- message = "Data loaded"
186
- else:
187
- message = "Data loaded. Warning: dataset may be too short to get consistent search results."
188
-
189
- tokenised_data_file_name = "keyword_search_tokenised_data.parquet"
190
- pd.DataFrame(data={"Corpus":corpus}).to_parquet(tokenised_data_file_name)
191
-
192
- return corpus, message, df, out_file_name, tokenised_data_file_name
193
-
194
- def get_file_path_end(file_path):
195
- # First, get the basename of the file (e.g., "example.txt" from "/path/to/example.txt")
196
- basename = os.path.basename(file_path)
197
-
198
- # Then, split the basename and its extension and return only the basename without the extension
199
- filename_without_extension, _ = os.path.splitext(basename)
200
-
201
- print(filename_without_extension)
202
-
203
- return filename_without_extension
204
-
205
- def save_prepared_data(in_file, prepared_text_list, in_df, in_bm25_column):
206
-
207
- # Check if the list and the dataframe have the same length
208
- if len(prepared_text_list) != len(in_df):
209
- raise ValueError("The length of 'prepared_text_list' and 'in_df' must match.")
210
-
211
- file_end = ".parquet"
212
-
213
- file_name = get_file_path_end(in_file.name) + "_cleaned" + file_end
214
-
215
- prepared_text_df = pd.DataFrame(data={in_bm25_column + "_cleaned":prepared_text_list})
216
-
217
- # Drop original column from input file to reduce file size
218
- in_df = in_df.drop(in_bm25_column, axis = 1)
219
-
220
- prepared_df = pd.concat([in_df, prepared_text_df], axis = 1)
221
-
222
- if file_end == ".csv":
223
- prepared_df.to_csv(file_name)
224
- elif file_end == ".parquet":
225
- prepared_df.to_parquet(file_name)
226
- else: file_name = None
227
-
228
-
229
- return file_name
230
-
231
- def prepare_bm25(corpus, k1=1.5, b = 0.75, alpha=-5):
232
- #bm25.save("saved_df_bm25")
233
- #bm25 = BM25.load(re.sub(r'\.pkl$', '', file_in.name))
234
-
235
- print("Preparing BM25 corpus")
236
-
237
- global bm25
238
- bm25 = BM25(corpus, k1=k1, b=b, alpha=alpha)
239
-
240
- message = "Search parameters loaded."
241
-
242
- print(message)
243
-
244
- return message
245
-
246
- def convert_query_to_tokens(free_text_query, clean="No"):
247
- '''
248
- Split open text query into tokens and then lemmatise to get the core of the word
249
- '''
250
-
251
- if clean=="Yes":
252
- split_query = word_tokenize(free_text_query.lower())
253
- out_query = get_lemma_tokens(split_query)
254
- #out_query = stem_sentence(free_text_query)
255
- else:
256
- split_query = word_tokenize(free_text_query.lower())
257
- out_query = split_query
258
-
259
- return out_query
260
-
261
- def bm25_search(free_text_query, in_no_search_results, original_data, text_column, clean = "No", in_join_file = None, in_join_column = "", search_df_join_column = ""):
262
-
263
- # Prepare query
264
- if (clean == "Yes") | (text_column.endswith("_cleaned")):
265
- token_query = convert_query_to_tokens(free_text_query, clean="Yes")
266
- else:
267
- token_query = convert_query_to_tokens(free_text_query, clean="No")
268
-
269
- print(token_query)
270
-
271
- # Perform search
272
- print("Searching")
273
-
274
- results_index, results_text, results_scores = bm25.extract_documents_and_scores(token_query, bm25.corpus, n=in_no_search_results) #bm25.corpus #original_data[text_column]
275
- if not results_index:
276
- return "No search results found", None, token_query
277
-
278
- print("Search complete")
279
-
280
- # Prepare results and export
281
- joined_texts = [' '.join(inner_list) for inner_list in results_text]
282
- results_df = pd.DataFrame(data={"index": results_index,
283
- "search_text": joined_texts,
284
- "search_score_abs": results_scores})
285
- results_df['search_score_abs'] = abs(round(results_df['search_score_abs'], 2))
286
- results_df_out = results_df[['index', 'search_text', 'search_score_abs']].merge(original_data,left_on="index", right_index=True, how="left")#.drop("index", axis=1)
287
-
288
- # Join on additional files
289
- if in_join_file:
290
- join_filename = in_join_file.name
291
-
292
- # Import data
293
- join_df = read_file(join_filename)
294
- join_df[in_join_column] = join_df[in_join_column].astype(str).str.replace("\.0$","", regex=True)
295
- results_df_out[search_df_join_column] = results_df_out[search_df_join_column].astype(str).str.replace("\.0$","", regex=True)
296
-
297
- # Duplicates dropped so as not to expand out dataframe
298
- join_df = join_df.drop_duplicates(in_join_column)
299
-
300
- results_df_out = results_df_out.merge(join_df,left_on=search_df_join_column, right_on=in_join_column, how="left").drop(in_join_column, axis=1)
301
-
302
- # Reorder results by score
303
- results_df_out = results_df_out.sort_values('search_score_abs', ascending=False)
304
-
305
- # Out file
306
- results_df_name = "search_result.csv"
307
- results_df_out.to_csv(results_df_name, index= None)
308
- results_first_text = results_df_out[text_column].iloc[0]
309
-
310
- print("Returning results")
311
-
312
- return results_first_text, results_df_name, token_query
313
-
314
- def detect_file_type(filename):
315
- """Detect the file type based on its extension."""
316
- if (filename.endswith('.csv')) | (filename.endswith('.csv.gz')) | (filename.endswith('.zip')):
317
- return 'csv'
318
- elif filename.endswith('.xlsx'):
319
- return 'xlsx'
320
- elif filename.endswith('.parquet'):
321
- return 'parquet'
322
- else:
323
- raise ValueError("Unsupported file type.")
324
-
325
- def read_file(filename):
326
- """Read the file based on its detected type."""
327
- file_type = detect_file_type(filename)
328
-
329
- if file_type == 'csv':
330
- return pd.read_csv(filename, low_memory=False).reset_index().drop(["index", "Unnamed: 0"], axis=1, errors="ignore")
331
- elif file_type == 'xlsx':
332
- return pd.read_excel(filename).reset_index().drop(["index", "Unnamed: 0"], axis=1, errors="ignore")
333
- elif file_type == 'parquet':
334
- return pd.read_parquet(filename).reset_index().drop(["index", "Unnamed: 0"], axis=1, errors="ignore")
335
-
336
- def put_columns_in_df(in_file, in_bm25_column):
337
- '''
338
- When file is loaded, update the column dropdown choices and change 'clean data' dropdown option to 'no'.
339
- '''
340
-
341
- file_list = [string.name for string in in_file]
342
-
343
- print(file_list)
344
-
345
- data_file_names = [string for string in file_list if "tokenised" not in string]
346
-
347
- new_choices = []
348
- concat_choices = []
349
-
350
-
351
- df = read_file(data_file_names[0])
352
- new_choices = list(df.columns)
353
-
354
- #print(new_choices)
355
-
356
- concat_choices.extend(new_choices)
357
-
358
- return gr.Dropdown(choices=concat_choices), gr.Dropdown(value="No", choices = ["Yes", "No"]),\
359
- gr.Dropdown(choices=concat_choices)
360
-
361
- def put_columns_in_join_df(in_file, in_bm25_column):
362
- '''
363
- When file is loaded, update the column dropdown choices and change 'clean data' dropdown option to 'no'.
364
- '''
365
-
366
- print("in_bm25_column")
367
-
368
- new_choices = []
369
- concat_choices = []
370
-
371
-
372
- df = read_file(in_file.name)
373
- new_choices = list(df.columns)
374
 
375
- print(new_choices)
376
-
377
- concat_choices.extend(new_choices)
378
-
379
- return gr.Dropdown(choices=concat_choices)
380
-
381
- def dummy_function(gradio_component):
382
- """
383
- A dummy function that exists just so that dropdown updates work correctly.
384
- """
385
- return None
386
-
387
- def display_info(info_component):
388
- gr.Info(info_component)
389
-
390
- def docs_to_chroma_save(docs_out, embeddings = embeddings, progress=gr.Progress()):
391
- '''
392
- Takes a Langchain document class and saves it into a Chroma sqlite file.
393
- '''
394
-
395
- print(f"> Total split documents: {len(docs_out)}")
396
-
397
- #print(docs_out)
398
-
399
- page_contents = [doc.page_content for doc in docs_out]
400
- page_meta = [doc.metadata for doc in docs_out]
401
- ids_range = range(0,len(page_contents))
402
- ids = [str(element) for element in ids_range]
403
-
404
- tic = time.perf_counter()
405
- #embeddings_list = []
406
- #for page in progress.tqdm(page_contents, desc = "Preparing search index", unit = "rows"):
407
- # embeddings_list.append(embeddings.encode(sentences=page, max_length=1024).tolist())
408
-
409
- embeddings_list = embeddings.encode(sentences=page_contents, max_length=256, show_progress_bar = True, batch_size = 32).tolist() # For Jina embeddings
410
- #embeddings_list = embeddings.encode(sentences=page_contents, normalize_embeddings=True).tolist() # For BGE embeddings
411
- #embeddings_list = embeddings.encode(sentences=page_contents).tolist() # For minilm
412
-
413
- toc = time.perf_counter()
414
- time_out = f"The embedding took {toc - tic:0.1f} seconds"
415
-
416
- #pd.Series(embeddings_list).to_csv("embeddings_out.csv")
417
-
418
- # Jina tiny
419
- # This takes about 300 seconds for 240,000 records = 800 / second, 1024 max length
420
- # For 50k records:
421
- # 61 seconds at 1024 max length
422
- # 55 seconds at 512 max length
423
- # 43 seconds at 256 max length
424
- # 31 seconds at 128 max length
425
-
426
- # The embedding took 1372.5 seconds at 256 max length for 655,020 case notes
427
-
428
- # BGE small
429
- # 96 seconds for 50k records at 512 length
430
-
431
- # all-MiniLM-L6-v2
432
- # 42.5 seconds at (256?) max length
433
-
434
- # paraphrase-MiniLM-L3-v2
435
- # 22 seconds for 128 max length
436
-
437
-
438
- print(time_out)
439
-
440
- chroma_tic = time.perf_counter()
441
-
442
- # Create a new Chroma collection to store the documents and metadata. We don't need to specify an embedding fuction, and the default will be used.
443
- client = chromadb.PersistentClient(path="./last_year", settings=Settings(
444
- anonymized_telemetry=False))
445
-
446
- try:
447
- print("Deleting existing collection.")
448
- #collection = client.get_collection(name="my_collection")
449
- client.delete_collection(name="my_collection")
450
- print("Creating new collection.")
451
- collection = client.create_collection(name="my_collection")
452
- except:
453
- print("Creating new collection.")
454
- collection = client.create_collection(name="my_collection")
455
-
456
- # Match batch size is about 40,000, so add that amount in a loop
457
- def create_batch_ranges(in_list, batch_size=40000):
458
- total_rows = len(in_list)
459
- ranges = []
460
-
461
- for start in range(0, total_rows, batch_size):
462
- end = min(start + batch_size, total_rows)
463
- ranges.append(range(start, end))
464
-
465
- return ranges
466
-
467
- batch_ranges = create_batch_ranges(embeddings_list)
468
- print(batch_ranges)
469
-
470
- for row_range in progress.tqdm(batch_ranges, desc = "Creating vector database", unit = "batches of 40,000 rows"):
471
-
472
- collection.add(
473
- documents = page_contents[row_range[0]:row_range[-1]],
474
- embeddings = embeddings_list[row_range[0]:row_range[-1]],
475
- metadatas = page_meta[row_range[0]:row_range[-1]],
476
- ids = ids[row_range[0]:row_range[-1]])
477
-
478
- print(collection.count())
479
-
480
- #chatf.vectorstore = vectorstore_func
481
-
482
- chroma_toc = time.perf_counter()
483
-
484
- chroma_time_out = f"Loading to Chroma db took {chroma_toc - chroma_tic:0.1f} seconds"
485
- print(chroma_time_out)
486
-
487
- out_message = "Document processing complete"
488
-
489
- return out_message, collection
490
-
491
- def docs_to_np_array(docs_out, in_file, embeddings = embeddings, progress=gr.Progress()):
492
- '''
493
- Takes a Langchain document class and saves it into a Chroma sqlite file.
494
- '''
495
-
496
- print(f"> Total split documents: {len(docs_out)}")
497
-
498
- #print(docs_out)
499
-
500
- page_contents = [doc.page_content for doc in docs_out]
501
-
502
-
503
- ## Load in pre-embedded file if exists
504
- file_list = [string.name for string in in_file]
505
-
506
- #print(file_list)
507
-
508
- embeddings_file_names = [string for string in file_list if "embedding" in string]
509
-
510
- out_message = "Document processing complete. Ready to search."
511
-
512
- if embeddings_file_names:
513
- embeddings_out = np.load(embeddings_file_names[0])['arr_0']
514
- print("embeddings loaded: ", embeddings_out)
515
-
516
- if not embeddings_file_names:
517
- tic = time.perf_counter()
518
- #embeddings_list = []
519
- #for page in progress.tqdm(page_contents, desc = "Preparing search index", unit = "rows"):
520
- # embeddings_list.append(embeddings.encode(sentences=page, max_length=1024).tolist())
521
-
522
- embeddings_out = embeddings.encode(sentences=page_contents, max_length=1024, show_progress_bar = True, batch_size = 32) # For Jina embeddings
523
- #embeddings_list = embeddings.encode(sentences=page_contents, normalize_embeddings=True).tolist() # For BGE embeddings
524
- #embeddings_list = embeddings.encode(sentences=page_contents).tolist() # For minilm
525
-
526
- print(embeddings_out)
527
- embeddings_out_round = np.round(embeddings_out, 4)
528
-
529
- toc = time.perf_counter()
530
- time_out = f"The embedding took {toc - tic:0.1f} seconds"
531
-
532
- semantic_search_file_name = 'semantic_search_embeddings.npz'
533
- semantic_search_rounded_file_name = 'semantic_search_embeddings_rounded.npz'
534
-
535
- np.savez_compressed(semantic_search_file_name, embeddings_out)
536
- np.savez_compressed(semantic_search_rounded_file_name, embeddings_out_round)
537
-
538
- return out_message, embeddings_out, semantic_search_file_name, semantic_search_rounded_file_name
539
-
540
- print(out_message)
541
-
542
- return out_message, embeddings_out, None, None
543
-
544
- def process_data_from_scores_df(df_docs, in_join_file, out_passages, vec_score_cut_off, vec_weight, orig_df_col, in_join_column, search_df_join_column):
545
-
546
- def create_docs_keep_from_df(df):
547
- dict_out = {'ids' : [df['ids']],
548
- 'documents': [df['documents']],
549
- 'metadatas': [df['metadatas']],
550
- 'distances': [round(df['distances'].astype(float), 3)],
551
- 'embeddings': None
552
- }
553
- return dict_out
554
-
555
- # Prepare the DataFrame by transposing
556
- #df_docs = df#.apply(lambda x: x.explode()).reset_index(drop=True)
557
-
558
- # Keep only documents with a certain score
559
-
560
- #print(df_docs)
561
-
562
- docs_scores = df_docs["distances"] #.astype(float)
563
-
564
- # Only keep sources that are sufficiently relevant (i.e. similarity search score below threshold below)
565
- score_more_limit = df_docs.loc[docs_scores > vec_score_cut_off, :]
566
- #docs_keep = create_docs_keep_from_df(score_more_limit) #list(compress(docs, score_more_limit))
567
-
568
- #print(docs_keep)
569
-
570
- if score_more_limit.empty:
571
- return 'No result found!', None
572
-
573
- # Only keep sources that are at least 100 characters long
574
- docs_len = score_more_limit["documents"].str.len() >= 100
575
-
576
- #print(docs_len)
577
-
578
- length_more_limit = score_more_limit.loc[docs_len == True, :] #pd.Series(docs_len) >= 100
579
- #docs_keep = create_docs_keep_from_df(length_more_limit) #list(compress(docs_keep, length_more_limit))
580
-
581
- #print(length_more_limit)
582
-
583
- if length_more_limit.empty:
584
- return 'No result found!', None
585
-
586
- length_more_limit['ids'] = length_more_limit['ids'].astype(int)
587
-
588
- #length_more_limit.to_csv("length_more_limit.csv", index = None)
589
-
590
- # Explode the 'metadatas' dictionary into separate columns
591
- df_metadata_expanded = length_more_limit['metadatas'].apply(pd.Series)
592
-
593
- #print(length_more_limit)
594
- #print(df_metadata_expanded)
595
-
596
- # Concatenate the original DataFrame with the expanded metadata DataFrame
597
- results_df_out = pd.concat([length_more_limit.drop('metadatas', axis=1), df_metadata_expanded], axis=1)
598
-
599
- results_df_out = results_df_out.rename(columns={"documents":orig_df_col})
600
-
601
- results_df_out = results_df_out.drop(["page_section", "row", "source", "id"], axis=1, errors="ignore")
602
- results_df_out['distances'] = round(results_df_out['distances'].astype(float), 3)
603
-
604
- # Join back to original df
605
- # results_df_out = orig_df.merge(length_more_limit[['ids', 'distances']], left_index = True, right_on = "ids", how="inner").sort_values("distances")
606
-
607
- # Join on additional files
608
- if in_join_file:
609
- join_filename = in_join_file.name
610
-
611
- # Import data
612
- join_df = read_file(join_filename)
613
- join_df[in_join_column] = join_df[in_join_column].astype(str).str.replace("\.0$","", regex=True)
614
-
615
- # Duplicates dropped so as not to expand out dataframe
616
- join_df = join_df.drop_duplicates(in_join_column)
617
-
618
- results_df_out[search_df_join_column] = results_df_out[search_df_join_column].astype(str).str.replace("\.0$","", regex=True)
619
-
620
- results_df_out = results_df_out.merge(join_df,left_on=search_df_join_column, right_on=in_join_column, how="left").drop(in_join_column, axis=1)
621
-
622
- return results_df_out
623
-
624
- def jina_simple_retrieval(new_question_kworded, vectorstore, docs, orig_df_col:str, k_val:int, out_passages:int,
625
- vec_score_cut_off:float, vec_weight:float, in_join_file = None, in_join_column = None, search_df_join_column = None, device = torch_device, embeddings = embeddings, progress=gr.Progress()): # ,vectorstore, embeddings
626
-
627
- print("vectorstore loaded: ", vectorstore)
628
-
629
- # Convert it to a PyTorch tensor and transfer to GPU
630
- vectorstore_tensor = tensor(vectorstore).to(device)
631
-
632
- # Load the sentence transformer model and move it to GPU
633
- embeddings = embeddings.to(device)
634
-
635
- # Encode the query using the sentence transformer and convert to a PyTorch tensor
636
- query = embeddings.encode(new_question_kworded)
637
- query_tensor = tensor(query).to(device)
638
-
639
- if query_tensor.dim() == 1:
640
- query_tensor = query_tensor.unsqueeze(0) # Reshape to 2D with one row
641
-
642
- # Normalize the query tensor and vectorstore tensor
643
- query_norm = query_tensor / query_tensor.norm(dim=1, keepdim=True)
644
- vectorstore_norm = vectorstore_tensor / vectorstore_tensor.norm(dim=1, keepdim=True)
645
-
646
- # Calculate cosine similarities (batch processing)
647
- cosine_similarities = mm(query_norm, vectorstore_norm.T)
648
-
649
- # Flatten the tensor to a 1D array
650
- cosine_similarities = cosine_similarities.flatten()
651
-
652
- # Convert to a NumPy array if it's still a PyTorch tensor
653
- cosine_similarities = cosine_similarities.cpu().numpy()
654
-
655
- # Create a Pandas Series
656
- cosine_similarities_series = pd.Series(cosine_similarities)
657
-
658
- # Pull out relevent info from docs
659
- page_contents = [doc.page_content for doc in docs]
660
- page_meta = [doc.metadata for doc in docs]
661
- ids_range = range(0,len(page_contents))
662
- ids = [str(element) for element in ids_range]
663
-
664
- df_docs = pd.DataFrame(data={"ids": ids,
665
- "documents": page_contents,
666
- "metadatas":page_meta,
667
- "distances":cosine_similarities_series}).sort_values("distances", ascending=False).iloc[0:k_val,:]
668
-
669
-
670
- results_df_out = process_data_from_scores_df(df_docs, in_join_file, out_passages, vec_score_cut_off, vec_weight, orig_df_col, in_join_column, search_df_join_column)
671
-
672
- results_df_name = "semantic_search_result.csv"
673
- results_df_out.to_csv(results_df_name, index= None)
674
- results_first_text = results_df_out.iloc[0, 1]
675
-
676
- return results_first_text, results_df_name
677
-
678
- def chroma_retrieval(new_question_kworded:str, vectorstore, docs, orig_df_col:str, k_val:int, out_passages:int,
679
- vec_score_cut_off:float, vec_weight:float, in_join_file = None, in_join_column = None, search_df_join_column = None): # ,vectorstore, embeddings
680
-
681
- query = embeddings.encode(new_question_kworded).tolist()
682
-
683
- docs = vectorstore.query(
684
- query_embeddings=query,
685
- n_results= k_val # No practical limit on number of responses returned
686
- #where={"metadata_field": "is_equal_to_this"},
687
- #where_document={"$contains":"search_string"}
688
- )
689
-
690
- df_docs = pd.DataFrame(data={'ids': docs['ids'][0],
691
- 'documents': docs['documents'][0],
692
- 'metadatas':docs['metadatas'][0],
693
- 'distances':docs['distances'][0]#,
694
- #'embeddings': docs['embeddings']
695
- })
696
-
697
- results_df_out = process_data_from_scores_df(df_docs, in_join_file, out_passages, vec_score_cut_off, vec_weight, orig_df_col, in_join_column, search_df_join_column)
698
-
699
- results_df_name = "semantic_search_result.csv"
700
- results_df_out.to_csv(results_df_name, index= None)
701
- results_first_text = results_df_out[orig_df_col].iloc[0]
702
-
703
- return results_first_text, results_df_name
704
 
 
 
 
705
 
706
  ## Gradio app - BM25 search
707
  block = gr.Blocks(theme = gr.themes.Base())
@@ -716,7 +26,6 @@ with block:
716
 
717
  k_val = gr.State(9999)
718
  out_passages = gr.State(9999)
719
- vec_score_cut_off = gr.State(0.7)
720
  vec_weight = gr.State(1)
721
 
722
  docs_keep_as_doc_state = gr.State()
@@ -740,11 +49,17 @@ depends on factors such as the type of documents or queries. Information taken f
740
 
741
  gr.Markdown(
742
  """
743
- # Fast text search
744
- Enter a text query below to search through a text data column and find relevant terms. It will only find terms containing the exact text you enter. Your data should contain at least 20 entries for the search to consistently return results.
745
  """)
746
 
747
  with gr.Tab(label="Keyword search"):
 
 
 
 
 
 
748
  with gr.Row():
749
  current_source = gr.Textbox(label="Current data source(s)", value="None")
750
 
@@ -760,7 +75,7 @@ depends on factors such as the type of documents or queries. Information taken f
760
  with gr.Accordion(label = "Search data", open=True):
761
  with gr.Row():
762
  keyword_query = gr.Textbox(label="Enter your search term")
763
- mod_query = gr.Textbox(label="Cleaned search term (the terms that are passed to the search engine)")
764
 
765
  keyword_search_button = gr.Button(value="Search text")
766
 
@@ -768,12 +83,18 @@ depends on factors such as the type of documents or queries. Information taken f
768
  output_single_text = gr.Textbox(label="Top result")
769
  output_file = gr.File(label="File output")
770
 
771
- with gr.Tab("Fuzzy/semantic search"):
 
 
 
 
 
 
772
  with gr.Row():
773
  current_source_semantic = gr.Textbox(label="Current data source(s)", value="None")
774
 
775
  with gr.Accordion("Load in data", open = True):
776
- in_semantic_file = gr.File(label="Upload data file for semantic search", file_count= 'multiple', file_types = ['.parquet', '.csv', '.npy', '.npz'])
777
 
778
  with gr.Row():
779
  in_semantic_column = gr.Dropdown(label="Enter the name of the text column in the data file to search")
@@ -789,11 +110,13 @@ depends on factors such as the type of documents or queries. Information taken f
789
  semantic_output_file = gr.File(label="File output")
790
 
791
  with gr.Tab(label="Advanced options"):
792
- with gr.Accordion(label="Data load / save options", open = False):
793
- #with gr.Row():
794
- in_clean_data = gr.Dropdown(label = "Clean text during load (remove tags, stem words). This will take some time!", value="No", choices=["Yes", "No"])
 
 
795
  #save_clean_data_button = gr.Button(value = "Save loaded data to file", scale = 1)
796
- with gr.Accordion(label="Search options", open = False):
797
  with gr.Row():
798
  in_k1 = gr.Slider(label = "k1 value", value = 1.5, minimum = 0.1, maximum = 5, step = 0.1, scale = 3)
799
  in_k1_button = gr.Button(value = "k1 value info", scale = 1)
@@ -808,6 +131,8 @@ depends on factors such as the type of documents or queries. Information taken f
808
  in_no_search_results_button = gr.Button(value = "Search results number info", scale = 1)
809
  with gr.Row():
810
  in_search_param_button = gr.Button(value="Load search parameters (Need to click this if you changed anything above)")
 
 
811
  with gr.Accordion(label = "Join on additional dataframes to results", open = False):
812
  in_join_file = gr.File(label="Upload your data to join here")
813
  in_join_column = gr.Dropdown(label="Column to join in new data frame")
@@ -823,29 +148,28 @@ depends on factors such as the type of documents or queries. Information taken f
823
 
824
  ### BM25 SEARCH ###
825
  # Update dropdowns upon initial file load
826
- in_bm25_file.upload(put_columns_in_df, inputs=[in_bm25_file, in_bm25_column], outputs=[in_bm25_column, in_clean_data, search_df_join_column])
827
  in_join_file.upload(put_columns_in_join_df, inputs=[in_join_file, in_join_column], outputs=[in_join_column])
828
 
829
  # Load in BM25 data
830
- load_bm25_data_button.click(fn=prepare_input_data, inputs=[in_bm25_file, in_bm25_column, in_clean_data], outputs=[corpus_state, load_finished_message, data_state, output_file, output_file]).\
831
- then(fn=prepare_bm25, inputs=[corpus_state, in_k1, in_b, in_alpha], outputs=[load_finished_message]).\
832
- then(fn=put_columns_in_df, inputs=[in_bm25_file, in_bm25_column], outputs=[in_bm25_column, in_clean_data, search_df_join_column])
833
 
834
  # BM25 search functions on click or enter
835
- keyword_search_button.click(fn=bm25_search, inputs=[keyword_query, in_no_search_results, data_state, in_bm25_column, in_clean_data, in_join_file, in_join_column, search_df_join_column], outputs=[output_single_text, output_file, mod_query], api_name="keyword")
836
- keyword_query.submit(fn=bm25_search, inputs=[keyword_query, in_no_search_results, data_state, in_bm25_column, in_clean_data, in_join_file, in_join_column, search_df_join_column], outputs=[output_single_text, output_file, mod_query])
837
 
838
  ### SEMANTIC SEARCH ###
839
  # Load in a csv/excel file for semantic search
840
- in_semantic_file.upload(put_columns_in_df, inputs=[in_semantic_file, in_semantic_column], outputs=[in_semantic_column, in_clean_data, search_df_join_column])
841
- load_semantic_data_button.click(ing.parse_csv_or_excel, inputs=[in_semantic_file, in_semantic_column], outputs=[ingest_text, current_source_semantic, semantic_load_progress]).\
842
- then(ing.csv_excel_text_to_docs, inputs=[ingest_text, in_semantic_column], outputs=[ingest_docs, semantic_load_progress]).\
843
- then(docs_to_np_array, inputs=[ingest_docs, in_semantic_file], outputs=[semantic_load_progress, vectorstore_state, semantic_output_file, semantic_output_file])
844
 
845
  # Semantic search query
846
- semantic_submit.click(jina_simple_retrieval, inputs=[semantic_query, vectorstore_state, ingest_docs, in_semantic_column, k_val, out_passages, vec_score_cut_off, vec_weight, in_join_file, in_join_column, search_df_join_column], outputs=[semantic_output_single_text, semantic_output_file], api_name="semantic")
847
-
848
- semantic_query.submit(jina_simple_retrieval, inputs=[semantic_query, vectorstore_state, ingest_docs, in_semantic_column, k_val, out_passages, vec_score_cut_off, vec_weight, in_join_file, in_join_column, search_df_join_column], outputs=[semantic_output_single_text, semantic_output_file])
849
 
850
  # Dummy functions just to get dropdowns to work correctly with Gradio 3.50
851
  in_bm25_column.change(dummy_function, in_bm25_column, None)
 
1
+ from typing import Type
2
+ from search_funcs.bm25_functions import prepare_bm25_input_data, prepare_bm25, bm25_search
3
+ from search_funcs.semantic_ingest_functions import parse_csv_or_excel, csv_excel_text_to_docs
4
+ from search_funcs.semantic_functions import docs_to_jina_embed_np_array, jina_simple_retrieval
5
+ from search_funcs.helper_functions import dummy_function, display_info, put_columns_in_df, put_columns_in_join_df, get_temp_folder_path, empty_folder
 
6
 
7
  import gradio as gr
8
  import pandas as pd
9
 
10
+ PandasDataFrame = Type[pd.DataFrame]
11
 
12
+ # Attempt to delete temporary files generated by previous use of the app (as the files can be very big!)
13
+ temp_folder_path = get_temp_folder_path()
14
+ empty_folder(temp_folder_path)
15
 
16
  ## Gradio app - BM25 search
17
  block = gr.Blocks(theme = gr.themes.Base())
 
26
 
27
  k_val = gr.State(9999)
28
  out_passages = gr.State(9999)
 
29
  vec_weight = gr.State(1)
30
 
31
  docs_keep_as_doc_state = gr.State()
 
49
 
50
  gr.Markdown(
51
  """
52
+ # Data text search
53
+ Search through long-form text fields in your tabular data, either for exact, specific terms (Keyword search) or for thematic, 'fuzzy' matches (Semantic search). More instructions are provided in the relevant tabs below.
54
  """)
55
 
56
  with gr.Tab(label="Keyword search"):
57
+ gr.Markdown(
58
+ """
59
+ **Exact term keyword search**
60
+
61
+ 1. Load in a data file (ideally a file with '_cleaned' at the end of the name), with (optionally) the '...tokenised_data.parquet' file in the same folder to save loading time. 2. Select the field in your data to search. Ideally this will have the suffix '_cleaned' to show that html tags have been removed. 3. Wait for the data file to be prepared for search. 4. Enter the search term in the relevant box below and press Enter/click on 'Search text'. 5. Your search results will be saved in a csv file and will be presented in the 'File output' area below.
62
+ """)
63
  with gr.Row():
64
  current_source = gr.Textbox(label="Current data source(s)", value="None")
65
 
 
75
  with gr.Accordion(label = "Search data", open=True):
76
  with gr.Row():
77
  keyword_query = gr.Textbox(label="Enter your search term")
78
+ #mod_query = gr.Textbox(label="Cleaned search term (the terms that are passed to the search engine)")
79
 
80
  keyword_search_button = gr.Button(value="Search text")
81
 
 
83
  output_single_text = gr.Textbox(label="Top result")
84
  output_file = gr.File(label="File output")
85
 
86
+ with gr.Tab("Semantic search"):
87
+ gr.Markdown(
88
+ """
89
+ **Thematic/semantic search**
90
+
91
+ This search type enables you to search for broader themes (e.g. happiness, nature); the search will pick out text passages that relate to these themes even if they don't contain the exact words. 1. Load in a data file (ideally a file with '_cleaned' at the end of the name), with (optionally) the 'semantic_search_embeddings.npz' file in the same folder to save loading time. 2. Select the field in your data to search. Ideally this will have the suffix '_cleaned' to show that html tags have been removed. 3. Wait for the data file to be prepared for search. 4. Enter the search term in the 'Enter semantic search query here' box below and press Enter/click on 'Start semantic search'. 5. Your search results will be saved in a csv file and will be presented in the 'File output' area below.
92
+ """)
93
  with gr.Row():
94
  current_source_semantic = gr.Textbox(label="Current data source(s)", value="None")
95
 
96
  with gr.Accordion("Load in data", open = True):
97
+ in_semantic_file = gr.File(label="Upload data file for semantic search", file_count= 'multiple', file_types = ['.parquet', '.csv', '.npy', '.npz', '.pkl', '.pkl.gz'])
98
 
99
  with gr.Row():
100
  in_semantic_column = gr.Dropdown(label="Enter the name of the text column in the data file to search")
 
110
  semantic_output_file = gr.File(label="File output")
111
 
112
  with gr.Tab(label="Advanced options"):
113
+ with gr.Accordion(label="Data load / save options", open = True):
114
+ with gr.Row():
115
+ in_clean_data = gr.Dropdown(label = "Clean text during load (remove html tags). For large files this may take some time!", value="No", choices=["Yes", "No"])
116
+ return_intermediate_files = gr.Dropdown(label = "Return intermediate processing files from file preparation. Files can be loaded in to save processing time in future.", value="No", choices=["Yes", "No"])
117
+ embedding_super_compress = gr.Dropdown(label = "Round embeddings to three dp for smaller files with less accuracy.", value="No", choices=["Yes", "No"])
118
  #save_clean_data_button = gr.Button(value = "Save loaded data to file", scale = 1)
119
+ with gr.Accordion(label="Keyword search options", open = False):
120
  with gr.Row():
121
  in_k1 = gr.Slider(label = "k1 value", value = 1.5, minimum = 0.1, maximum = 5, step = 0.1, scale = 3)
122
  in_k1_button = gr.Button(value = "k1 value info", scale = 1)
 
131
  in_no_search_results_button = gr.Button(value = "Search results number info", scale = 1)
132
  with gr.Row():
133
  in_search_param_button = gr.Button(value="Load search parameters (Need to click this if you changed anything above)")
134
+ with gr.Accordion(label="Semantic search options", open = False):
135
+ semantic_min_distance = gr.Slider(label = "Minimum distance score for search result to be included", value = 0.7, minimum=0, maximum=0.95, step=0.01)
136
  with gr.Accordion(label = "Join on additional dataframes to results", open = False):
137
  in_join_file = gr.File(label="Upload your data to join here")
138
  in_join_column = gr.Dropdown(label="Column to join in new data frame")
 
148
 
149
  ### BM25 SEARCH ###
150
  # Update dropdowns upon initial file load
151
+ in_bm25_file.upload(put_columns_in_df, inputs=[in_bm25_file, in_bm25_column], outputs=[in_bm25_column, in_clean_data, search_df_join_column, data_state])
152
  in_join_file.upload(put_columns_in_join_df, inputs=[in_join_file, in_join_column], outputs=[in_join_column])
153
 
154
  # Load in BM25 data
155
+ load_bm25_data_button.click(fn=prepare_bm25_input_data, inputs=[in_bm25_file, in_bm25_column, data_state, in_clean_data, return_intermediate_files], outputs=[corpus_state, load_finished_message, data_state, output_file, output_file, current_source]).\
156
+ then(fn=prepare_bm25, inputs=[corpus_state, in_k1, in_b, in_alpha], outputs=[load_finished_message])#.\
157
+ #then(fn=put_columns_in_df, inputs=[in_bm25_file, in_bm25_column], outputs=[in_bm25_column, in_clean_data, search_df_join_column])
158
 
159
  # BM25 search functions on click or enter
160
+ keyword_search_button.click(fn=bm25_search, inputs=[keyword_query, in_no_search_results, data_state, in_bm25_column, in_clean_data, in_join_file, in_join_column, search_df_join_column], outputs=[output_single_text, output_file], api_name="keyword")
161
+ keyword_query.submit(fn=bm25_search, inputs=[keyword_query, in_no_search_results, data_state, in_bm25_column, in_clean_data, in_join_file, in_join_column, search_df_join_column], outputs=[output_single_text, output_file])
162
 
163
  ### SEMANTIC SEARCH ###
164
  # Load in a csv/excel file for semantic search
165
+ in_semantic_file.upload(put_columns_in_df, inputs=[in_semantic_file, in_semantic_column], outputs=[in_semantic_column, in_clean_data, search_df_join_column, data_state])
166
+ load_semantic_data_button.click(parse_csv_or_excel, inputs=[in_semantic_file, data_state, in_semantic_column], outputs=[ingest_text, current_source_semantic, semantic_load_progress]).\
167
+ then(csv_excel_text_to_docs, inputs=[ingest_text, in_semantic_file, in_semantic_column, in_clean_data, return_intermediate_files], outputs=[ingest_docs, semantic_load_progress]).\
168
+ then(docs_to_jina_embed_np_array, inputs=[ingest_docs, in_semantic_file, return_intermediate_files, embedding_super_compress], outputs=[semantic_load_progress, vectorstore_state, semantic_output_file])
169
 
170
  # Semantic search query
171
+ semantic_submit.click(jina_simple_retrieval, inputs=[semantic_query, vectorstore_state, ingest_docs, in_semantic_column, k_val, out_passages, semantic_min_distance, vec_weight, in_join_file, in_join_column, search_df_join_column], outputs=[semantic_output_single_text, semantic_output_file], api_name="semantic")
172
+ semantic_query.submit(jina_simple_retrieval, inputs=[semantic_query, vectorstore_state, ingest_docs, in_semantic_column, k_val, out_passages, semantic_min_distance, vec_weight, in_join_file, in_join_column, search_df_join_column], outputs=[semantic_output_single_text, semantic_output_file])
 
173
 
174
  # Dummy functions just to get dropdowns to work correctly with Gradio 3.50
175
  in_bm25_column.change(dummy_function, in_bm25_column, None)
hook-en_core_web_sm.py ADDED
@@ -0,0 +1,8 @@
 
1
+ from PyInstaller.utils.hooks import collect_data_files
2
+
3
+ hiddenimports = [
4
+ 'en_core_web_sm'
5
+ ]
6
+
7
+ # Use collect_data_files to find data files. Replace 'en_core_web_sm' with the correct package name if it's different.
8
+ datas = collect_data_files('en_core_web_sm')
hook-gradio.py CHANGED
@@ -1,8 +1,7 @@
1
  from PyInstaller.utils.hooks import collect_data_files
2
 
3
  hiddenimports = [
4
- 'gradio',
5
- # Add any other submodules that PyInstaller doesn't detect
6
  ]
7
 
8
  # Use collect_data_files to find data files. Replace 'gradio' with the correct package name if it's different.
 
1
  from PyInstaller.utils.hooks import collect_data_files
2
 
3
  hiddenimports = [
4
+ 'gradio'
 
5
  ]
6
 
7
  # Use collect_data_files to find data files. Replace 'gradio' with the correct package name if it's different.
how_to_create_exe_dist.txt CHANGED
@@ -4,18 +4,26 @@
4
 
5
  3. cd to this folder. Install packages from requirements.txt using 'pip install -r requirements.txt'
6
 
 
 
7
  4. In file explorer, navigate to the miniconda/envs/new_env/Lib/site-packages/gradio-client/ folder
8
 
9
  5. Copy types.json from the gradio_client folder to the folder containing the data_text_search.py file
10
 
11
- 6. pip install pyinstaller
 
 
 
 
12
 
13
- 7. In command line, cd to this folder. Then run the following 'python -m PyInstaller --additional-hooks-dir=. --hidden-import pyarrow.vendored.version --add-data="types.json;gradio_client" --clean --onefile --clean --name DataSearchApp data_text_search.py'
 
14
 
15
- 8. A 'dist' folder will be created with the executable inside along with all dependencies('dist\data_text_search').
 
16
 
17
- 9. In file explorer, navigate to the miniconda/envs/new_env/Lib/site-packages/gradio/ folder. Copy the entire folder. Paste this into the new distributable subfolder 'dist\data_text_search\_internal'
18
 
19
- 10. In 'dist\data_text_search' try double clicking on the .exe file. After a short delay, the command prompt should inform you about the ip address of the app that is now running. Copy the ip address, but do not close this window.
20
 
21
  11. In an Internet browser, navigate to the indicated IP address. The app should now be running in your browser window.
 
4
 
5
  3. cd to this folder. Install packages from requirements.txt using 'pip install -r requirements.txt'
6
 
7
+ NOTE: to ensure that spaCy models are loaded into the program correctly via requirements.txt, follow this guide: https://spacy.io/usage/models#models-download
8
+
9
  4. In file explorer, navigate to the miniconda/envs/new_env/Lib/site-packages/gradio-client/ folder
10
 
11
  5. Copy types.json from the gradio_client folder to the folder containing the data_text_search.py file
12
 
13
+ 6. If necessary, create hook- files to tell pyinstaller to include specific packages in the exe build. Examples are provided for gradio and en_core_web_sm (a spaCy model).
14
+
15
+ 7. pip install pyinstaller
16
+
17
+ 8. In the command line, cd to the folder that contains app.py. Then run one of the following:
18
 
19
+ For one single file:
20
+ python -m PyInstaller --additional-hooks-dir=. --hidden-import pyarrow.vendored.version --add-data="types.json;gradio_client" --add-data "model;model" --onefile --clean --noconfirm --upx-dir="C:\Program Files\UPX\upx-4.2.2-win64" --name DataSearchApp_0.1 app.py
21
 
22
+ For a small exe with a folder of dependencies:
23
+ python -m PyInstaller --additional-hooks-dir=. --hidden-import pyarrow.vendored.version --add-data="types.json;gradio_client" --add-data "model;model" --clean --noconfirm --upx-dir="C:\Program Files\UPX\upx-4.2.2-win64" --name DataSearchApp_0.1 app.py
24
 
25
+ 9. A 'dist' folder will be created with the executable inside, along with all dependencies ('dist\data_text_search').
26
 
27
+ 10. In 'dist\data_text_search' try double clicking on the .exe file. After a short delay, the command prompt should inform you about the IP address of the app that is now running. Copy the IP address. **Do not close this window!**
28
 
29
  11. In an Internet browser, navigate to the indicated IP address. The app should now be running in your browser window.
requirements.txt CHANGED
@@ -1,13 +1,9 @@
1
- pandas
2
- nltk
3
- pyarrow
4
- openpyxl
5
- transformers
6
- langchain
7
- chromadb
8
- torch
9
- accelerate
10
- sentence-transformers
11
- spacy
12
- polars
13
  gradio==3.50.0
 
1
+ pandas==2.1.4
2
+ polars==0.20.3
3
+ pyarrow==14.0.2
4
+ openpyxl==3.1.2
5
+ transformers==4.32.1
6
+ torch==2.1.2
7
+ spacy==3.7.2
8
+ en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
 
 
 
 
9
  gradio==3.50.0
search_funcs/{fast_bm25.py β†’ bm25_functions.py} RENAMED
@@ -3,14 +3,44 @@ import heapq
3
  import math
4
  import pickle
5
  import sys
 
 
6
  from numpy import inf
7
  import gradio as gr
8
 
9
  PARAM_K1 = 1.5
10
  PARAM_B = 0.75
11
  IDF_CUTOFF = -inf
12
 
13
- # Built off https://github.com/Inspirateur/Fast-BM25
14
 
15
  class BM25:
16
  """Fast Implementation of Best Matching 25 ranking function.
@@ -196,3 +226,201 @@ class BM25:
196
  def load(filename):
197
  with open(f"{filename}.pkl", "rb") as fsave:
198
  return pickle.load(fsave)
3
  import math
4
  import pickle
5
  import sys
6
+ import time
7
+ import pandas as pd
8
  from numpy import inf
9
  import gradio as gr
10
 
11
+ from datetime import datetime
12
+
13
+ today_rev = datetime.now().strftime("%Y%m%d")
14
+
15
+ from search_funcs.clean_funcs import initial_clean # get_lemma_tokens, stem_sentence
16
+ from search_funcs.helper_functions import read_file, get_file_path_end_with_ext, get_file_path_end
17
+
18
+ # Load the SpaCy model
19
+ from spacy.cli import download
20
+ import spacy
21
+ spacy.prefer_gpu()
22
+
23
+ #os.system("python -m spacy download en_core_web_sm")
24
+ try:
25
+ import en_core_web_sm
26
+ nlp = en_core_web_sm.load()
27
+ print("Successfully imported spaCy model")
28
+ #nlp = spacy.load("en_core_web_sm")
29
+ #print(nlp._path)
30
+ except:
31
+ download("en_core_web_sm")
32
+ nlp = spacy.load("en_core_web_sm")
33
+ print("Successfully imported spaCy model")
34
+ #print(nlp._path)
35
+
36
+ # including punctuation rules and exceptions
37
+ tokenizer = nlp.tokenizer
38
+
39
  PARAM_K1 = 1.5
40
  PARAM_B = 0.75
41
  IDF_CUTOFF = -inf
42
 
43
+ # Class built off https://github.com/Inspirateur/Fast-BM25
44
 
45
  class BM25:
46
  """Fast Implementation of Best Matching 25 ranking function.
 
226
  def load(filename):
227
  with open(f"{filename}.pkl", "rb") as fsave:
228
  return pickle.load(fsave)
229
+
230
+ # The following functions are my own work
231
+
232
+ def prepare_bm25_input_data(in_file, text_column, data_state, clean="No", return_intermediate_files = "No", progress=gr.Progress()):
233
+
234
+ file_list = [string.name for string in in_file]
235
+
236
+ #print(file_list)
237
+
238
+ data_file_names = [string for string in file_list if "tokenised" not in string and "embeddings" not in string]
239
+
240
+ data_file_name = data_file_names[0]
241
+
242
+ df = data_state #read_file(data_file_name)
243
+ data_file_out_name = get_file_path_end_with_ext(data_file_name)
244
+ data_file_out_name_no_ext = get_file_path_end(data_file_name)
245
+
246
+ ## Load in pre-tokenised corpus if exists
247
+ tokenised_df = pd.DataFrame()
248
+
249
+ tokenised_file_names = [string for string in file_list if "tokenised" in string]
250
+
251
+ if tokenised_file_names:
252
+ tokenised_df = read_file(tokenised_file_names[0])
253
+ #print("Tokenised df is: ", tokenised_df.head())
254
+
255
+ #df = pd.read_parquet(file_in.name)
256
+
257
+ df[text_column] = df[text_column].astype(str).str.lower()
258
+
259
+ if clean == "Yes":
260
+ clean_tic = time.perf_counter()
261
+ print("Starting data clean.")
262
+
263
+ df = df.drop_duplicates(text_column)
264
+ df_list = list(df[text_column])
265
+ df_list = initial_clean(df_list)
266
+
267
+ # Save to file if you have cleaned the data
268
+ out_file_name, text_column = save_prepared_bm25_data(data_file_name, df_list, df, text_column)
269
+
270
+ clean_toc = time.perf_counter()
271
+ clean_time_out = f"Cleaning the text took {clean_toc - clean_tic:0.1f} seconds."
272
+ print(clean_time_out)
273
+
274
+ else:
275
+ # Don't clean or save file to disk
276
+ df_list = list(df[text_column])
277
+ print("No data cleaning performed.")
278
+ out_file_name = None
279
+
280
+ # Tokenise data. If tokenised df already exists, no need to do anything
281
+
282
+ if not tokenised_df.empty:
283
+ corpus = tokenised_df.iloc[:,0].tolist()
284
+ print("Tokenised data loaded from file.")
285
+ #print("Corpus is: ", corpus[0:5])
286
+
287
+ # If a tokenised file doesn't already exist, tokenise the texts in batches
288
+ else:
289
+ tokeniser_tic = time.perf_counter()
290
+ corpus = []
291
+ batch_size = 256
292
+ for doc in tokenizer.pipe(progress.tqdm(df_list, desc = "Tokenising text", unit = "rows"), batch_size=batch_size):
293
+ corpus.append([token.text for token in doc])
294
+
295
+ tokeniser_toc = time.perf_counter()
296
+ tokenizer_time_out = f"Tokenising the text took {tokeniser_toc - tokeniser_tic:0.1f} seconds."
297
+ print(tokenizer_time_out)
298
+
299
+
300
+ if len(df_list) >= 20:
301
+ message = "Data loaded"
302
+ else:
303
+ message = "Data loaded. Warning: dataset may be too short to get consistent search results."
304
+
305
+ if return_intermediate_files == "Yes":
306
+ tokenised_data_file_name = data_file_out_name_no_ext + "_" + "keyword_search_tokenised_data.parquet"
307
+ pd.DataFrame(data={"Corpus":corpus}).to_parquet(tokenised_data_file_name)
308
+
309
+ return corpus, message, df, out_file_name, tokenised_data_file_name, data_file_out_name
310
+
311
+ return corpus, message, df, out_file_name, None, data_file_out_name # tokenised_data_file_name
312
+
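For reference, the batched tokenisation step above reduces to the following standalone sketch (assuming `en_core_web_sm` is already installed; the sample texts are invented):

```python
# Minimal sketch of the batched spaCy tokenisation used in prepare_bm25_input_data.
import spacy

nlp = spacy.load("en_core_web_sm")
tokenizer = nlp.tokenizer  # tokenizer only, as in the module above

df_list = ["first short document", "a second, slightly longer document"]
corpus = []
for doc in tokenizer.pipe(df_list, batch_size=256):
    corpus.append([token.text for token in doc])

print(corpus)
```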
313
+ def save_prepared_bm25_data(in_file_name, prepared_text_list, in_df, in_bm25_column):
314
+
315
+ # Check if the list and the dataframe have the same length
316
+ if len(prepared_text_list) != len(in_df):
317
+ raise ValueError("The length of 'prepared_text_list' and 'in_df' must match.")
318
+
319
+ file_end = ".parquet"
320
+
321
+ file_name = get_file_path_end(in_file_name) + "_cleaned" + file_end
322
+
323
+ new_text_column = in_bm25_column + "_cleaned"
324
+ prepared_text_df = pd.DataFrame(data={new_text_column:prepared_text_list})
325
+
326
+ # Drop original column from input file to reduce file size
327
+ in_df = in_df.drop(in_bm25_column, axis = 1)
328
+
329
+ prepared_df = pd.concat([in_df, prepared_text_df], axis = 1)
330
+
331
+ if file_end == ".csv":
332
+ prepared_df.to_csv(file_name)
333
+ elif file_end == ".parquet":
334
+ prepared_df.to_parquet(file_name)
335
+ else: file_name = None
336
+
337
+ return file_name, new_text_column
338
+
339
+ def prepare_bm25(corpus, k1=1.5, b = 0.75, alpha=-5):
340
+ #bm25.save("saved_df_bm25")
341
+ #bm25 = BM25.load(re.sub(r'\.pkl$', '', file_in.name))
342
+
343
+ print("Preparing BM25 corpus")
344
+
345
+ global bm25
346
+ bm25 = BM25(corpus, k1=k1, b=b, alpha=alpha)
347
+
348
+ message = "Search parameters loaded."
349
+
350
+ print(message)
351
+
352
+ return message
353
+
354
+ def convert_bm25_query_to_tokens(free_text_query, clean="No"):
355
+ '''
356
+ Split the free-text query into tokens using the spaCy tokenizer. The 'clean' option currently has no effect on the output.
357
+ '''
358
+
359
+ if clean=="Yes":
360
+ split_query = tokenizer(free_text_query.lower())
361
+ out_query = [token.text for token in split_query]
362
+ #out_query = stem_sentence(out_query)
363
+ else:
364
+ split_query = tokenizer(free_text_query.lower())
365
+ out_query = [token.text for token in split_query]
366
+
367
+ print("Search query out is:", out_query)
368
+
369
+ if isinstance(out_query,str):
370
+ print("Converting string")
371
+ out_query = [out_query]
372
+
373
+ return out_query
374
+
375
+ def bm25_search(free_text_query, in_no_search_results, original_data, text_column, clean = "No", in_join_file = None, in_join_column = "", search_df_join_column = ""):
376
+
377
+ # Prepare query
378
+ if (clean == "Yes") | (text_column.endswith("_cleaned")):
379
+ token_query = convert_bm25_query_to_tokens(free_text_query, clean="Yes")
380
+ else:
381
+ token_query = convert_bm25_query_to_tokens(free_text_query, clean="No")
382
+
383
+ #print(token_query)
384
+
385
+ # Perform search
386
+ print("Searching")
387
+
388
+ results_index, results_text, results_scores = bm25.extract_documents_and_scores(token_query, bm25.corpus, n=in_no_search_results) #bm25.corpus #original_data[text_column]
389
+ if not results_index:
390
+ return "No search results found", None, token_query
391
+
392
+ print("Search complete")
393
+
394
+ # Prepare results and export
395
+ joined_texts = [' '.join(inner_list) for inner_list in results_text]
396
+ results_df = pd.DataFrame(data={"index": results_index,
397
+ "search_text": joined_texts,
398
+ "search_score_abs": results_scores})
399
+ results_df['search_score_abs'] = abs(round(results_df['search_score_abs'], 2))
400
+ results_df_out = results_df[['index', 'search_text', 'search_score_abs']].merge(original_data,left_on="index", right_index=True, how="left")#.drop("index", axis=1)
401
+
402
+ # Join on additional files
403
+ if in_join_file:
404
+ join_filename = in_join_file.name
405
+
406
+ # Import data
407
+ join_df = read_file(join_filename)
408
+ join_df[in_join_column] = join_df[in_join_column].astype(str).str.replace(r"\.0$","", regex=True)
409
+ results_df_out[search_df_join_column] = results_df_out[search_df_join_column].astype(str).str.replace(r"\.0$","", regex=True)
410
+
411
+ # Duplicates dropped so as not to expand out dataframe
412
+ join_df = join_df.drop_duplicates(in_join_column)
413
+
414
+ results_df_out = results_df_out.merge(join_df,left_on=search_df_join_column, right_on=in_join_column, how="left").drop(in_join_column, axis=1)
415
+
416
+ # Reorder results by score
417
+ results_df_out = results_df_out.sort_values('search_score_abs', ascending=False)
418
+
419
+ # Out file
420
+ results_df_name = "keyword_search_result_" + today_rev + ".csv"
421
+ results_df_out.to_csv(results_df_name, index= None)
422
+ results_first_text = results_df_out[text_column].iloc[0]
423
+
424
+ print("Returning results")
425
+
426
+ return results_first_text, results_df_name, token_query
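Putting the pieces together, a rough sketch of calling this module outside the Gradio interface might look as follows. The function names, the `k1`/`b`/`alpha` parameters and the `extract_documents_and_scores` call are taken from the code above; the toy corpus and query are purely illustrative, and importing the module assumes the spaCy model can be loaded.

```python
# Rough usage sketch (not in the repo) of the keyword search functions above.
import search_funcs.bm25_functions as bm25_funcs

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["dogs", "chase", "cats"],
    ["a", "report", "about", "housing", "repairs"],
]

bm25_funcs.prepare_bm25(corpus, k1=1.5, b=0.75, alpha=-5)  # builds the module-level bm25 object
query = bm25_funcs.convert_bm25_query_to_tokens("housing repairs", clean="No")
index, text, score = bm25_funcs.bm25.extract_documents_and_scores(query, bm25_funcs.bm25.corpus, n=2)
print(index, score)
```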
search_funcs/chatfuncs.py DELETED
@@ -1,393 +0,0 @@
1
- import re
2
- import os
3
- from typing import TypeVar, List
4
- import pandas as pd
5
-
6
-
7
- # Model packages
8
- import torch.cuda
9
-
10
- # Alternative model sources
11
- #from dataclasses import asdict, dataclass
12
-
13
- # Langchain functions
14
- from langchain.text_splitter import RecursiveCharacterTextSplitter
15
- from langchain.docstore.document import Document
16
-
17
- # For keyword extraction (not currently used)
18
- #import nltk
19
- #nltk.download('wordnet')
20
- from nltk.corpus import stopwords
21
- from nltk.tokenize import RegexpTokenizer
22
- from nltk.stem import WordNetLemmatizer
23
-
24
- # For Name Entity Recognition model
25
- #from span_marker import SpanMarkerModel # Not currently used
26
-
27
-
28
- import gradio as gr
29
-
30
- torch.cuda.empty_cache()
31
-
32
- PandasDataFrame = TypeVar('pd.core.frame.DataFrame')
33
-
34
- embeddings = None # global variable setup
35
- vectorstore = None # global variable setup
36
- model_type = None # global variable setup
37
-
38
- max_memory_length = 0 # How long should the memory of the conversation last?
39
-
40
- full_text = "" # Define dummy source text (full text) just to enable highlight function to load
41
-
42
- model = [] # Define empty list for model functions to run
43
- tokenizer = [] # Define empty list for model functions to run
44
-
45
- ## Highlight text constants
46
- hlt_chunk_size = 12
47
- hlt_strat = [" ", ". ", "! ", "? ", ": ", "\n\n", "\n", ", "]
48
- hlt_overlap = 4
49
-
50
- ## Initialise NER model ##
51
- ner_model = []#SpanMarkerModel.from_pretrained("tomaarsen/span-marker-mbert-base-multinerd") # Not currently used
52
-
53
-
54
- # Currently set gpu_layers to 0 even with cuda due to persistent bugs in implementation with cuda
55
- if torch.cuda.is_available():
56
- torch_device = "cuda"
57
- gpu_layers = 0
58
- else:
59
- torch_device = "cpu"
60
- gpu_layers = 0
61
-
62
- print("Running on device:", torch_device)
63
- threads = 6 #torch.get_num_threads()
64
- print("CPU threads:", threads)
65
-
66
- # Vectorstore funcs
67
-
68
- # Prompt functions
69
-
70
- def write_out_metadata_as_string(metadata_in):
71
- metadata_string = [f"{' '.join(f'{k}: {v}' for k, v in d.items() if k != 'page_section')}" for d in metadata_in] # ['metadata']
72
- return metadata_string
73
-
74
-
75
- def determine_file_type(file_path):
76
- """
77
- Determine the file type based on its extension.
78
-
79
- Parameters:
80
- file_path (str): Path to the file.
81
-
82
- Returns:
83
- str: File extension (e.g., '.pdf', '.docx', '.txt', '.html').
84
- """
85
- return os.path.splitext(file_path)[1].lower()
86
-
87
-
88
- def create_doc_df(docs_keep_out):
89
- # Extract content and metadata from 'winning' passages.
90
- content=[]
91
- meta=[]
92
- meta_url=[]
93
- page_section=[]
94
- score=[]
95
-
96
- doc_df = pd.DataFrame()
97
-
98
-
99
-
100
- for item in docs_keep_out:
101
- content.append(item[0].page_content)
102
- meta.append(item[0].metadata)
103
- meta_url.append(item[0].metadata['source'])
104
-
105
- file_extension = determine_file_type(item[0].metadata['source'])
106
- if (file_extension != ".csv") & (file_extension != ".xlsx"):
107
- page_section.append(item[0].metadata['page_section'])
108
- else: page_section.append("")
109
- score.append(item[1])
110
-
111
- # Create df from 'winning' passages
112
-
113
- doc_df = pd.DataFrame(list(zip(content, meta, page_section, meta_url, score)),
114
- columns =['page_content', 'metadata', 'page_section', 'meta_url', 'score'])
115
-
116
- docs_content = doc_df['page_content'].astype(str)
117
- doc_df['full_url'] = "https://" + doc_df['meta_url']
118
-
119
- return doc_df
120
-
121
-
122
- def get_expanded_passages(vectorstore, docs, width):
123
-
124
- """
125
- Extracts expanded passages based on given documents and a width for context.
126
-
127
- Parameters:
128
- - vectorstore: The primary data source.
129
- - docs: List of documents to be expanded.
130
- - width: Number of documents to expand around a given document for context.
131
-
132
- Returns:
133
- - expanded_docs: List of expanded Document objects.
134
- - doc_df: DataFrame representation of expanded_docs.
135
- """
136
-
137
- from collections import defaultdict
138
-
139
- def get_docs_from_vstore(vectorstore):
140
- vector = vectorstore.docstore._dict
141
- return list(vector.items())
142
-
143
- def extract_details(docs_list):
144
- docs_list_out = [tup[1] for tup in docs_list]
145
- content = [doc.page_content for doc in docs_list_out]
146
- meta = [doc.metadata for doc in docs_list_out]
147
- return ''.join(content), meta[0], meta[-1]
148
-
149
- def get_parent_content_and_meta(vstore_docs, width, target):
150
- #target_range = range(max(0, target - width), min(len(vstore_docs), target + width + 1))
151
- target_range = range(max(0, target), min(len(vstore_docs), target + width + 1)) # Now only selects extra passages AFTER the found passage
152
- parent_vstore_out = [vstore_docs[i] for i in target_range]
153
-
154
- content_str_out, meta_first_out, meta_last_out = [], [], []
155
- for _ in parent_vstore_out:
156
- content_str, meta_first, meta_last = extract_details(parent_vstore_out)
157
- content_str_out.append(content_str)
158
- meta_first_out.append(meta_first)
159
- meta_last_out.append(meta_last)
160
- return content_str_out, meta_first_out, meta_last_out
161
-
162
- def merge_dicts_except_source(d1, d2):
163
- merged = {}
164
- for key in d1:
165
- if key != "source":
166
- merged[key] = str(d1[key]) + " to " + str(d2[key])
167
- else:
168
- merged[key] = d1[key] # or d2[key], based on preference
169
- return merged
170
-
171
- def merge_two_lists_of_dicts(list1, list2):
172
- return [merge_dicts_except_source(d1, d2) for d1, d2 in zip(list1, list2)]
173
-
174
- # Step 1: Filter vstore_docs
175
- vstore_docs = get_docs_from_vstore(vectorstore)
176
- doc_sources = {doc.metadata['source'] for doc, _ in docs}
177
- vstore_docs = [(k, v) for k, v in vstore_docs if v.metadata.get('source') in doc_sources]
178
-
179
- # Step 2: Group by source and proceed
180
- vstore_by_source = defaultdict(list)
181
- for k, v in vstore_docs:
182
- vstore_by_source[v.metadata['source']].append((k, v))
183
-
184
- expanded_docs = []
185
- for doc, score in docs:
186
- search_source = doc.metadata['source']
187
-
188
-
189
- #if file_type == ".csv" | file_type == ".xlsx":
190
- # content_str, meta_first, meta_last = get_parent_content_and_meta(vstore_by_source[search_source], 0, search_index)
191
-
192
- #else:
193
- search_section = doc.metadata['page_section']
194
- parent_vstore_meta_section = [doc.metadata['page_section'] for _, doc in vstore_by_source[search_source]]
195
- search_index = parent_vstore_meta_section.index(search_section) if search_section in parent_vstore_meta_section else -1
196
-
197
- content_str, meta_first, meta_last = get_parent_content_and_meta(vstore_by_source[search_source], width, search_index)
198
- meta_full = merge_two_lists_of_dicts(meta_first, meta_last)
199
-
200
- expanded_doc = (Document(page_content=content_str[0], metadata=meta_full[0]), score)
201
- expanded_docs.append(expanded_doc)
202
-
203
- doc_df = pd.DataFrame()
204
-
205
- doc_df = create_doc_df(expanded_docs) # Assuming you've defined the 'create_doc_df' function elsewhere
206
-
207
- return expanded_docs, doc_df
208
-
209
- def highlight_found_text(search_text: str, full_text: str, hlt_chunk_size:int=hlt_chunk_size, hlt_strat:List=hlt_strat, hlt_overlap:int=hlt_overlap) -> str:
210
- """
211
- Highlights occurrences of search_text within full_text.
212
-
213
- Parameters:
214
- - search_text (str): The text to be searched for within full_text.
215
- - full_text (str): The text within which search_text occurrences will be highlighted.
216
-
217
- Returns:
218
- - str: A string with occurrences of search_text highlighted.
219
-
220
- Example:
221
- >>> highlight_found_text("world", "Hello, world! This is a test. Another world awaits.")
222
- 'Hello, <mark style="color:black;">world</mark>! This is a test. Another <mark style="color:black;">world</mark> awaits.'
223
- """
224
-
225
- def extract_text_from_input(text, i=0):
226
- if isinstance(text, str):
227
- return text.replace(" ", " ").strip()
228
- elif isinstance(text, list):
229
- return text[i][0].replace(" ", " ").strip()
230
- else:
231
- return ""
232
-
233
- def extract_search_text_from_input(text):
234
- if isinstance(text, str):
235
- return text.replace(" ", " ").strip()
236
- elif isinstance(text, list):
237
- return text[-1][1].replace(" ", " ").strip()
238
- else:
239
- return ""
240
-
241
- full_text = extract_text_from_input(full_text)
242
- search_text = extract_search_text_from_input(search_text)
243
-
244
-
245
-
246
- text_splitter = RecursiveCharacterTextSplitter(
247
- chunk_size=hlt_chunk_size,
248
- separators=hlt_strat,
249
- chunk_overlap=hlt_overlap,
250
- )
251
- sections = text_splitter.split_text(search_text)
252
-
253
- found_positions = {}
254
- for x in sections:
255
- text_start_pos = 0
256
- while text_start_pos != -1:
257
- text_start_pos = full_text.find(x, text_start_pos)
258
- if text_start_pos != -1:
259
- found_positions[text_start_pos] = text_start_pos + len(x)
260
- text_start_pos += 1
261
-
262
- # Combine overlapping or adjacent positions
263
- sorted_starts = sorted(found_positions.keys())
264
- combined_positions = []
265
- if sorted_starts:
266
- current_start, current_end = sorted_starts[0], found_positions[sorted_starts[0]]
267
- for start in sorted_starts[1:]:
268
- if start <= (current_end + 10):
269
- current_end = max(current_end, found_positions[start])
270
- else:
271
- combined_positions.append((current_start, current_end))
272
- current_start, current_end = start, found_positions[start]
273
- combined_positions.append((current_start, current_end))
274
-
275
- # Construct pos_tokens
276
- pos_tokens = []
277
- prev_end = 0
278
- for start, end in combined_positions:
279
- if end-start > 15: # Only combine if there is a significant amount of matched text. Avoids picking up single words like 'and' etc.
280
- pos_tokens.append(full_text[prev_end:start])
281
- pos_tokens.append('<mark style="color:black;">' + full_text[start:end] + '</mark>')
282
- prev_end = end
283
- pos_tokens.append(full_text[prev_end:])
284
-
285
- return "".join(pos_tokens)
286
-
287
-
288
- # # Chat history functions
289
-
290
- def clear_chat(chat_history_state, sources, chat_message, current_topic):
291
- chat_history_state = []
292
- sources = ''
293
- chat_message = ''
294
- current_topic = ''
295
-
296
- return chat_history_state, sources, chat_message, current_topic
297
-
298
-
299
- # Keyword functions
300
-
301
- def remove_q_stopwords(question): # Remove stopwords from question. Not used at the moment
302
- # Prepare keywords from question by removing stopwords
303
- text = question.lower()
304
-
305
- # Remove numbers
306
- text = re.sub('[0-9]', '', text)
307
-
308
- tokenizer = RegexpTokenizer(r'\w+')
309
- text_tokens = tokenizer.tokenize(text)
310
- #text_tokens = word_tokenize(text)
311
- tokens_without_sw = [word for word in text_tokens if not word in stopwords]
312
-
313
- # Remove duplicate words while preserving order
314
- ordered_tokens = set()
315
- result = []
316
- for word in tokens_without_sw:
317
- if word not in ordered_tokens:
318
- ordered_tokens.add(word)
319
- result.append(word)
320
-
321
-
322
-
323
- new_question_keywords = ' '.join(result)
324
- return new_question_keywords
325
-
326
- def remove_q_ner_extractor(question):
327
-
328
- predict_out = ner_model.predict(question)
329
-
330
-
331
-
332
- predict_tokens = [' '.join(v for k, v in d.items() if k == 'span') for d in predict_out]
333
-
334
- # Remove duplicate words while preserving order
335
- ordered_tokens = set()
336
- result = []
337
- for word in predict_tokens:
338
- if word not in ordered_tokens:
339
- ordered_tokens.add(word)
340
- result.append(word)
341
-
342
-
343
-
344
- new_question_keywords = ' '.join(result).lower()
345
- return new_question_keywords
346
-
347
- def apply_lemmatize(text, wnl=WordNetLemmatizer()):
348
-
349
- def prep_for_lemma(text):
350
-
351
- # Remove numbers
352
- text = re.sub('[0-9]', '', text)
353
- print(text)
354
-
355
- tokenizer = RegexpTokenizer(r'\w+')
356
- text_tokens = tokenizer.tokenize(text)
357
- #text_tokens = word_tokenize(text)
358
-
359
- return text_tokens
360
-
361
- tokens = prep_for_lemma(text)
362
-
363
- def lem_word(word):
364
-
365
- if len(word) > 3: out_word = wnl.lemmatize(word)
366
- else: out_word = word
367
-
368
- return out_word
369
-
370
- return [lem_word(token) for token in tokens]
371
-
372
- def keybert_keywords(text, n, kw_model):
373
- tokens_lemma = apply_lemmatize(text)
374
- lemmatised_text = ' '.join(tokens_lemma)
375
-
376
- keywords_text = KeyBERT(model=kw_model).extract_keywords(lemmatised_text, stop_words='english', top_n=n,
377
- keyphrase_ngram_range=(1, 1))
378
- keywords_list = [item[0] for item in keywords_text]
379
-
380
- return keywords_list
381
-
382
- # Gradio functions
383
- def turn_off_interactivity(user_message, history):
384
- return gr.update(value="", interactive=False), history + [[user_message, None]]
385
-
386
- def restore_interactivity():
387
- return gr.update(interactive=True)
388
-
389
- def update_message(dropdown_value):
390
- return gr.Textbox.update(value=dropdown_value)
391
-
392
- def hide_block():
393
- return gr.Radio.update(visible=False)
 
 
 
 
 
 
search_funcs/clean_funcs.py CHANGED
@@ -1,51 +1,14 @@
1
  # ## Some functions to clean text
2
 
3
- # ### Some other suggested cleaning approaches
4
- #
5
- # #### From here: https://shravan-kuchkula.github.io/topic-modeling/#interactive-plot-showing-results-of-k-means-clustering-lda-topic-modeling-and-sentiment-analysis
6
- #
7
- # - remove_hyphens
8
- # - tokenize_text
9
- # - remove_special_characters
10
- # - convert to lower case
11
- # - remove stopwords
12
- # - lemmatize the token
13
- # - remove short tokens
14
- # - keep only words in wordnet
15
- # - I ADDED ON - creating custom stopwords list
16
-
17
- # +
18
- # Create a custom stop words list
19
- import nltk
20
  import re
21
  import string
22
  import polars as pl
23
- from nltk.stem import WordNetLemmatizer
24
- from nltk.stem import PorterStemmer
25
- from nltk.corpus import wordnet as wn
26
- from nltk import word_tokenize
27
 
28
  # Add calendar months onto stop words
29
  import calendar
30
- from tqdm import tqdm
31
  import gradio as gr
32
 
33
- stemmer = PorterStemmer()
34
-
35
-
36
- nltk.download('stopwords')
37
- nltk.download('wordnet')
38
-
39
- #nltk.download('words')
40
- #nltk.download('names')
41
-
42
- #nltk.corpus.words.words('en')
43
-
44
- #from sklearn.feature_extraction import text
45
- # Adding common names to stopwords
46
-
47
- all_names = [x.lower() for x in list(nltk.corpus.names.words())]
48
-
49
  # Adding custom words to the stopwords
50
  custom_words = []
51
  my_stop_words = custom_words
@@ -58,72 +21,9 @@ cal_month = [x.lower() for x in cal_month]
58
  cal_month = [i for i in cal_month if i]
59
  #print(cal_month)
60
  custom_words.extend(cal_month)
61
-
62
- #my_stop_words = frozenset(text.ENGLISH_STOP_WORDS.union(custom_words).union(all_names))
63
- #custom_stopwords = my_stop_words
64
- # -
65
-
66
- # #### Some of my cleaning functions
67
- '''
68
- # +
69
- # Remove all html elements from the text. Inspired by this: https://stackoverflow.com/questions/9662346/python-code-to-remove-html-tags-from-a-string
70
-
71
- def remove_email_start(text):
72
- cleanr = re.compile('.*importance:|.*subject:')
73
- cleantext = re.sub(cleanr, '', text)
74
- return cleantext
75
-
76
- def remove_email_end(text):
77
- cleanr = re.compile('kind regards.*|many thanks.*|sincerely.*')
78
- cleantext = re.sub(cleanr, '', text)
79
- return cleantext
80
-
81
- def cleanhtml(text):
82
- cleanr = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});|\xa0')
83
- cleantext = re.sub(cleanr, '', text)
84
- return cleantext
85
 
86
- ## The above doesn't work when there is no > at the end of the string to match the initial <. Trying this: <[^>]+> but needs work: https://stackoverflow.com/questions/2013124/regex-matching-up-to-the-first-occurrence-of-a-character
87
-
88
- # Remove all email addresses and numbers from the text
89
-
90
- def cleanemail(text):
91
- cleanr = re.compile('\S*@\S*\s?|\xa0')
92
- cleantext = re.sub(cleanr, '', text)
93
- return cleantext
94
-
95
- def cleannum(text):
96
- cleanr = re.compile(r'[0-9]+')
97
- cleantext = re.sub(cleanr, '', text)
98
- return cleantext
99
-
100
- def cleanpostcode(text):
101
- cleanr = re.compile(r'(\b(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]? ?[0-9][A-Z]{2})|((GIR ?0A{2})\b$)|(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]? ?[0-9]{1}?)$)|(\b(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]?)\b$)')
102
- cleantext = re.sub(cleanr, '', text)
103
- return cleantext
104
-
105
- def cleanwarning(text):
106
- cleanr = re.compile('caution: this email originated from outside of the organization. do not click links or open attachments unless you recognize the sender and know the content is safe.')
107
- cleantext = re.sub(cleanr, '', text)
108
- return cleantext
109
-
110
-
111
- # -
112
-
113
- def initial_clean(texts):
114
- clean_texts = []
115
- for text in texts:
116
- text = remove_email_start(text)
117
- text = remove_email_end(text)
118
- text = cleanpostcode(text)
119
- text = remove_hyphens(text)
120
- text = cleanhtml(text)
121
- text = cleanemail(text)
122
- #text = cleannum(text)
123
- clean_texts.append(text)
124
- return clean_texts
125
- '''
126
 
 
127
  email_start_pattern_regex = r'.*importance:|.*subject:'
128
  email_end_pattern_regex = r'kind regards.*|many thanks.*|sincerely.*'
129
  html_pattern_regex = r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});|\xa0|&nbsp;'
@@ -143,130 +43,65 @@ postcode_pattern = re.compile(postcode_pattern_regex)
143
  warning_pattern = re.compile(warning_pattern_regex)
144
  nbsp_pattern = re.compile(nbsp_pattern_regex)
145
 
146
- def stem_sentence(sentence):
147
 
148
- words = sentence.split()
149
- stemmed_words = [stemmer.stem(word).lower().rstrip("'") for word in words]
150
- return stemmed_words
151
 
152
- def stem_sentences(sentences, progress=gr.Progress()):
153
- """Stem each sentence in a list of sentences."""
154
- stemmed_sentences = [stem_sentence(sentence) for sentence in progress.tqdm(sentences)]
155
- return stemmed_sentences
156
 
157
- def get_lemma_text(text):
158
- # Tokenize the input string into words
159
- tokens = word_tokenize(text)
160
 
161
- lemmas = []
162
- for word in tokens:
163
- if len(word) > 3:
164
- lemma = wn.morphy(word)
165
- else:
166
- lemma = None
167
 
168
- if lemma is None:
169
- lemmas.append(word)
170
- else:
171
- lemmas.append(lemma)
172
- return lemmas
173
 
174
- def get_lemma_tokens(tokens):
175
  # Tokenize the input string into words
176
 
177
- lemmas = []
178
- for word in tokens:
179
- if len(word) > 3:
180
- lemma = wn.morphy(word)
181
- else:
182
- lemma = None
183
 
184
- if lemma is None:
185
- lemmas.append(word)
186
- else:
187
- lemmas.append(lemma)
188
- return lemmas
189
-
190
- # def initial_clean(texts , progress=gr.Progress()):
191
- # clean_texts = []
192
-
193
- # i = 1
194
- # #progress(0, desc="Cleaning texts")
195
- # for text in progress.tqdm(texts, desc = "Cleaning data", unit = "rows"):
196
- # #print("Cleaning row: ", i)
197
- # text = re.sub(email_start_pattern, '', text)
198
- # text = re.sub(email_end_pattern, '', text)
199
- # text = re.sub(postcode_pattern, '', text)
200
- # text = remove_hyphens(text)
201
- # text = re.sub(html_pattern, '', text)
202
- # text = re.sub(email_pattern, '', text)
203
- # text = re.sub(nbsp_pattern, '', text)
204
- # #text = re.sub(warning_pattern, '', text)
205
- # #text = stem_sentence(text)
206
- # text = get_lemma_text(text)
207
- # text = ' '.join(text)
208
- # # Uncomment the next line if you want to remove numbers as well
209
- # # text = re.sub(num_pattern, '', text)
210
- # clean_texts.append(text)
211
-
212
- # i += 1
213
- # return clean_texts
214
-
215
 
216
  def initial_clean(texts , progress=gr.Progress()):
217
  texts = pl.Series(texts)#[]
218
 
219
- #i = 1
220
- #progress(0, desc="Cleaning texts")
221
- #for text in progress.tqdm(texts, desc = "Cleaning data", unit = "rows"):
222
- #print("Cleaning row: ", i)
223
  text = texts.str.replace_all(email_start_pattern_regex, '')
224
  text = text.str.replace_all(email_end_pattern_regex, '')
225
- #text = re.sub(postcode_pattern, '', text)
226
- #text = remove_hyphens(text)
227
  text = text.str.replace_all(html_pattern_regex, '')
228
  text = text.str.replace_all(email_pattern_regex, '')
229
- #text = re.sub(nbsp_pattern, '', text)
230
- #text = re.sub(warning_pattern, '', text)
231
- #text = stem_sentence(text)
232
- #text = get_lemma_text(text)
233
- #text = ' '.join(text)
234
- # Uncomment the next line if you want to remove numbers as well
235
- # text = re.sub(num_pattern, '', text)
236
- #clean_texts.append(text)
237
-
238
- #i += 1
239
 
240
  text = text.to_list()
241
 
242
  return text
243
 
244
-
245
- # Sample execution
246
- #sample_texts = [
247
- # "Hello, this is a test email. kind regards, John",
248
- # "<div>Email content here</div> many thanks, Jane",
249
- # "caution: this email originated from outside of the organization. do not click links or open attachments unless you recognize the sender and know the content is safe.",
250
- # "john.doe123@example.com",
251
- # "Address: 1234 Elm St, AB12 3CD"
252
- #]
253
-
254
- #initial_clean(sample_texts)
255
-
256
-
257
- # +
258
-
259
- all_names = [x.lower() for x in list(nltk.corpus.names.words())]
260
-
261
  def remove_hyphens(text_text):
262
  return re.sub(r'(\w+)-(\w+)-?(\w)?', r'\1 \2 \3', text_text)
263
 
264
- # tokenize text
265
- def tokenize_text(text_text):
266
- TOKEN_PATTERN = r'\s+'
267
- regex_wt = nltk.RegexpTokenizer(pattern=TOKEN_PATTERN, gaps=True)
268
- word_tokens = regex_wt.tokenize(text_text)
269
- return word_tokens
270
 
271
  def remove_characters_after_tokenization(tokens):
272
  pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
@@ -276,80 +111,22 @@ def remove_characters_after_tokenization(tokens):
276
  def convert_to_lowercase(tokens):
277
  return [token.lower() for token in tokens if token.isalpha()]
278
 
279
- def remove_stopwords(tokens, custom_stopwords):
280
- stopword_list = nltk.corpus.stopwords.words('english')
281
- stopword_list += my_stop_words
282
- filtered_tokens = [token for token in tokens if token not in stopword_list]
283
- return filtered_tokens
284
-
285
- def remove_names(tokens):
286
- stopword_list = list(nltk.corpus.names.words())
287
- stopword_list = [x.lower() for x in stopword_list]
288
- filtered_tokens = [token for token in tokens if token not in stopword_list]
289
- return filtered_tokens
290
-
291
-
292
-
293
  def remove_short_tokens(tokens):
294
  return [token for token in tokens if len(token) > 3]
295
 
296
- def keep_only_words_in_wordnet(tokens):
297
- return [token for token in tokens if wn.synsets(token)]
298
-
299
- def apply_lemmatize(tokens, wnl=WordNetLemmatizer()):
300
-
301
- def lem_word(word):
302
-
303
- if len(word) > 3: out_word = wnl.lemmatize(word)
304
- else: out_word = word
305
-
306
- return out_word
307
-
308
- return [lem_word(token) for token in tokens]
309
-
310
-
311
- # +
312
- ### Do the cleaning
313
-
314
- def cleanTexttexts(texts):
315
- clean_texts = []
316
- for text in texts:
317
- #text = remove_email_start(text)
318
- #text = remove_email_end(text)
319
- text = remove_hyphens(text)
320
- text = cleanhtml(text)
321
- text = cleanemail(text)
322
- text = cleanpostcode(text)
323
- text = cleannum(text)
324
- #text = cleanwarning(text)
325
- text_i = tokenize_text(text)
326
- text_i = remove_characters_after_tokenization(text_i)
327
- #text_i = remove_names(text_i)
328
- text_i = convert_to_lowercase(text_i)
329
- #text_i = remove_stopwords(text_i, my_stop_words)
330
- text_i = get_lemma(text_i)
331
- #text_i = remove_short_tokens(text_i)
332
- text_i = keep_only_words_in_wordnet(text_i)
333
-
334
- text_i = apply_lemmatize(text_i)
335
- clean_texts.append(text_i)
336
- return clean_texts
337
-
338
-
339
- # -
340
 
341
  def remove_dups_text(data_samples_ready, data_samples_clean, data_samples):
342
  # Identify duplicates in the data: https://stackoverflow.com/questions/44191465/efficiently-identify-duplicates-in-large-list-500-000
343
  # Only identifies the second duplicate
344
 
345
  seen = set()
346
- dupes = []
347
 
348
  for i, doi in enumerate(data_samples_ready):
349
  if doi not in seen:
350
  seen.add(doi)
351
  else:
352
- dupes.append(i)
353
  #data_samples_ready[dupes[0:]]
354
 
355
  # To see a specific duplicated value you know the position of
 
1
  # ## Some functions to clean text
2
 
 
 
 
 
 
3
  import re
4
  import string
5
  import polars as pl
 
 
 
 
6
 
7
  # Add calendar months onto stop words
8
  import calendar
9
+ #from tqdm import tqdm
10
  import gradio as gr
11
 
 
 
 
 
 
12
  # Adding custom words to the stopwords
13
  custom_words = []
14
  my_stop_words = custom_words
 
21
  cal_month = [i for i in cal_month if i]
22
  #print(cal_month)
23
  custom_words.extend(cal_month)
 
 
 
 
 
 
24
 
 
 
 
 
 
25
 
26
+ # #### Some of my cleaning functions
27
  email_start_pattern_regex = r'.*importance:|.*subject:'
28
  email_end_pattern_regex = r'kind regards.*|many thanks.*|sincerely.*'
29
  html_pattern_regex = r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});|\xa0|&nbsp;'
 
43
  warning_pattern = re.compile(warning_pattern_regex)
44
  nbsp_pattern = re.compile(nbsp_pattern_regex)
45
 
46
+ # def stem_sentence(sentence):
47
 
48
+ # words = sentence.split()
49
+ # stemmed_words = [stemmer.stem(word).lower().rstrip("'") for word in words]
50
+ # return stemmed_words
51
 
52
+ # def stem_sentences(sentences, progress=gr.Progress()):
53
+ # """Stem each sentence in a list of sentences."""
54
+ # stemmed_sentences = [stem_sentence(sentence) for sentence in progress.tqdm(sentences)]
55
+ # return stemmed_sentences
56
 
57
+ # def get_lemma_text(text):
58
+ # # Tokenize the input string into words
59
+ # tokens = word_tokenize(text)
60
 
61
+ # lemmas = []
62
+ # for word in tokens:
63
+ # if len(word) > 3:
64
+ # lemma = wn.morphy(word)
65
+ # else:
66
+ # lemma = None
67
 
68
+ # if lemma is None:
69
+ # lemmas.append(word)
70
+ # else:
71
+ # lemmas.append(lemma)
72
+ # return lemmas
73
 
74
+ # def get_lemma_tokens(tokens):
75
  # Tokenize the input string into words
76
 
77
+ # lemmas = []
78
+ # for word in tokens:
79
+ # if len(word) > 3:
80
+ # lemma = wn.morphy(word)
81
+ # else:
82
+ # lemma = None
83
 
84
+ # if lemma is None:
85
+ # lemmas.append(word)
86
+ # else:
87
+ # lemmas.append(lemma)
88
+ # return lemmas
 
 
 
 
 
 
 
89
 
90
  def initial_clean(texts , progress=gr.Progress()):
91
  texts = pl.Series(texts)#[]
92
 
 
 
 
 
93
  text = texts.str.replace_all(email_start_pattern_regex, '')
94
  text = text.str.replace_all(email_end_pattern_regex, '')
 
 
95
  text = text.str.replace_all(html_pattern_regex, '')
96
  text = text.str.replace_all(email_pattern_regex, '')
 
 
 
 
 
 
 
 
 
 
97
 
98
  text = text.to_list()
99
 
100
  return text
101
 
 
 
 
 
 
 
102
  def remove_hyphens(text_text):
103
  return re.sub(r'(\w+)-(\w+)-?(\w)?', r'\1 \2 \3', text_text)
104
 
 
 
 
 
 
 
105
 
106
  def remove_characters_after_tokenization(tokens):
107
  pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
 
111
  def convert_to_lowercase(tokens):
112
  return [token.lower() for token in tokens if token.isalpha()]
113
 
 
 
 
 
 
114
  def remove_short_tokens(tokens):
115
  return [token for token in tokens if len(token) > 3]
116
 
 
 
 
 
 
 
117
 
118
  def remove_dups_text(data_samples_ready, data_samples_clean, data_samples):
119
  # Identify duplicates in the data: https://stackoverflow.com/questions/44191465/efficiently-identify-duplicates-in-large-list-500-000
120
  # Only identifies the second duplicate
121
 
122
  seen = set()
123
+ dups = []
124
 
125
  for i, doi in enumerate(data_samples_ready):
126
  if doi not in seen:
127
  seen.add(doi)
128
  else:
129
+ dups.append(i)
130
  #data_samples_ready[dupes[0:]]
131
 
132
  # To see a specific duplicated value you know the position of
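A small illustrative run of the polars-based `initial_clean` above (the input strings are invented; this assumes `polars` and `gradio` are installed so the module imports cleanly):

```python
# Illustrative only: shows the kind of cleaning initial_clean performs.
from search_funcs.clean_funcs import initial_clean

raw_texts = [
    "subject: repairs update <b>please read</b> contact someone@example.com",
    "thanks for your help  kind regards, a long sign-off that gets stripped",
]
cleaned = initial_clean(raw_texts)
print(cleaned)  # HTML tags, email addresses and email sign-offs removed
```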
search_funcs/{ingest_text.py β†’ convert_files_to_parquet.py} RENAMED
File without changes
search_funcs/helper_functions.py ADDED
@@ -0,0 +1,148 @@
 
 
 
 
 
 
1
+ import os
2
+ import re
3
+ import pandas as pd
4
+ import gradio as gr
5
+
6
+ import os
7
+ import shutil
8
+
9
+ import os
10
+ import shutil
11
+ import getpass
12
+ import gzip
13
+ import pickle
14
+
15
+ # Attempt to delete content of gradio temp folder
16
+ def get_temp_folder_path():
17
+ username = getpass.getuser()
18
+ return os.path.join('C:\\Users', username, 'AppData\\Local\\Temp\\gradio')
19
+
20
+ def empty_folder(directory_path):
21
+ if not os.path.exists(directory_path):
22
+ #print(f"The directory {directory_path} does not exist. No temporary files from previous app use found to delete.")
23
+ return
24
+
25
+ for filename in os.listdir(directory_path):
26
+ file_path = os.path.join(directory_path, filename)
27
+ try:
28
+ if os.path.isfile(file_path) or os.path.islink(file_path):
29
+ os.unlink(file_path)
30
+ elif os.path.isdir(file_path):
31
+ shutil.rmtree(file_path)
32
+ except Exception as e:
33
+ #print(f'Failed to delete {file_path}. Reason: {e}')
34
+ print('')
35
+
36
+
37
+
38
+
39
+ def get_file_path_end(file_path):
40
+ # First, get the basename of the file (e.g., "example.txt" from "/path/to/example.txt")
41
+ basename = os.path.basename(file_path)
42
+
43
+ # Then, split the basename and its extension and return only the basename without the extension
44
+ filename_without_extension, _ = os.path.splitext(basename)
45
+
46
+ #print(filename_without_extension)
47
+
48
+ return filename_without_extension
49
+
50
+ def get_file_path_end_with_ext(file_path):
51
+ match = re.search(r'(.*[\/\\])?(.+)$', file_path)
52
+
53
+ filename_end = match.group(2) if match else ''
54
+
55
+ return filename_end
56
+
57
+ def detect_file_type(filename):
58
+ """Detect the file type based on its extension."""
59
+ if (filename.endswith('.csv')) | (filename.endswith('.csv.gz')) | (filename.endswith('.zip')):
60
+ return 'csv'
61
+ elif filename.endswith('.xlsx'):
62
+ return 'xlsx'
63
+ elif filename.endswith('.parquet'):
64
+ return 'parquet'
65
+ elif filename.endswith('.pkl.gz'):
66
+ return 'pkl.gz'
67
+ else:
68
+ raise ValueError("Unsupported file type.")
69
+
70
+ def read_file(filename):
71
+ """Read the file based on its detected type."""
72
+ file_type = detect_file_type(filename)
73
+
74
+ print("Loading in file")
75
+
76
+ if file_type == 'csv':
77
+ file = pd.read_csv(filename, low_memory=False).reset_index().drop(["index", "Unnamed: 0"], axis=1, errors="ignore")
78
+ elif file_type == 'xlsx':
79
+ file = pd.read_excel(filename).reset_index().drop(["index", "Unnamed: 0"], axis=1, errors="ignore")
80
+ elif file_type == 'parquet':
81
+ file = pd.read_parquet(filename).reset_index().drop(["index", "Unnamed: 0"], axis=1, errors="ignore")
82
+ elif file_type == 'pkl.gz':
83
+ with gzip.open(filename, 'rb') as file:
84
+ file = pickle.load(file)
85
+ #file = pd.read_pickle(filename)
86
+
87
+ print("File load complete")
88
+
89
+ return file
90
+
91
+ def put_columns_in_df(in_file, in_bm25_column):
92
+ '''
93
+ When file is loaded, update the column dropdown choices and change 'clean data' dropdown option to 'no'.
94
+ '''
95
+
96
+ file_list = [string.name for string in in_file]
97
+
98
+ #print(file_list)
99
+
100
+ data_file_names = [string for string in file_list if "tokenised" not in string and "embeddings" not in string]
101
+ data_file_name = data_file_names[0]
102
+
103
+ new_choices = []
104
+ concat_choices = []
105
+
106
+
107
+ df = read_file(data_file_name)
108
+
109
+ if "pkl" not in data_file_name:
110
+
111
+ new_choices = list(df.columns)
112
+
113
+ else: new_choices = ["page_contents"] + list(df[0].metadata.keys()) #["Documents"]
114
+ #print(new_choices)
115
+
116
+ concat_choices.extend(new_choices)
117
+
118
+ return gr.Dropdown(choices=concat_choices), gr.Dropdown(value="No", choices = ["Yes", "No"]), gr.Dropdown(choices=concat_choices), df
119
+
120
+ def put_columns_in_join_df(in_file, in_bm25_column):
121
+ '''
122
+ When file is loaded, update the column dropdown choices and change 'clean data' dropdown option to 'no'.
123
+ '''
124
+
125
+ print("in_bm25_column: ", in_bm25_column)
126
+
127
+ new_choices = []
128
+ concat_choices = []
129
+
130
+
131
+ df = read_file(in_file.name)
132
+ new_choices = list(df.columns)
133
+
134
+ print(new_choices)
135
+
136
+ concat_choices.extend(new_choices)
137
+
138
+ return gr.Dropdown(choices=concat_choices)
139
+
140
+ def dummy_function(gradio_component):
141
+ """
142
+ A dummy function that exists just so that dropdown updates work correctly.
143
+ """
144
+ return None
145
+
146
+ def display_info(info_component):
147
+ gr.Info(info_component)
148
+
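A quick illustration of the file helpers defined above (the parquet file name is hypothetical):

```python
# Illustrative use of the loading helpers; 'example_data.parquet' is a made-up local file.
from search_funcs.helper_functions import detect_file_type, get_file_path_end, read_file

path = "example_data.parquet"
print(detect_file_type(path))   # 'parquet'
print(get_file_path_end(path))  # 'example_data'

df = read_file(path)            # pandas DataFrame for csv / xlsx / parquet inputs
print(df.shape)
```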
search_funcs/semantic_functions.py ADDED
@@ -0,0 +1,422 @@
 
 
 
 
 
 
1
+ import os
2
+ import time
3
+ import pandas as pd
4
+ from typing import Type
5
+ import gradio as gr
6
+ import numpy as np
7
+ from datetime import datetime
8
+ import accelerate
9
+
10
+ today_rev = datetime.now().strftime("%Y%m%d")
11
+
12
+ from transformers import AutoModel
13
+
14
+ from torch import cuda, backends, tensor, mm
15
+ from search_funcs.helper_functions import read_file
16
+
17
+ # Check for torch cuda
18
+ print("Is CUDA enabled? ", cuda.is_available())
19
+ print("Is a CUDA device available on this computer?", backends.cudnn.enabled)
20
+ if cuda.is_available():
21
+ torch_device = "cuda"
22
+ os.system("nvidia-smi")
23
+
24
+ else:
25
+ torch_device = "cpu"
26
+
27
+ print("Device used is: ", torch_device)
28
+
29
+ #from search_funcs.helper_functions import get_file_path_end
30
+
31
+ PandasDataFrame = Type[pd.DataFrame]
32
+
33
+ # Load embeddings
34
+ # Pinning a Jina revision for security purposes: https://www.baseten.co/blog/pinning-ml-model-revisions-for-compatibility-and-security/
35
+ # Save Jina model locally as described here: https://huggingface.co/jinaai/jina-embeddings-v2-base-en/discussions/29
36
+ embeddings_name = "jinaai/jina-embeddings-v2-small-en"
37
+ local_embeddings_location = "model/jina/"
38
+ revision_choice = "b811f03af3d4d7ea72a7c25c802b21fc675a5d99"
39
+
40
+ try:
41
+ embeddings_model = AutoModel.from_pretrained(local_embeddings_location, revision = revision_choice, trust_remote_code=True,local_files_only=True, device_map="auto")
42
+ except:
43
+ embeddings_model = AutoModel.from_pretrained(embeddings_name, revision = revision_choice, trust_remote_code=True, device_map="auto")
44
+
45
+
46
+ # Chroma support is currently deprecated
47
+ # Import Chroma and instantiate a client. The default Chroma client is ephemeral, meaning it will not save to disk.
48
+ #import chromadb
49
+ #from chromadb.config import Settings
50
+ #from typing_extensions import Protocol
51
+ #from chromadb import Documents, EmbeddingFunction, Embeddings
52
+
53
+ # Remove Chroma database file. If it exists as it can cause issues
54
+ #chromadb_file = "chroma.sqlite3"
55
+
56
+ #if os.path.isfile(chromadb_file):
57
+ # os.remove(chromadb_file)
58
+ def get_file_path_end(file_path):
59
+ # First, get the basename of the file (e.g., "example.txt" from "/path/to/example.txt")
60
+ basename = os.path.basename(file_path)
61
+
62
+ # Then, split the basename and its extension and return only the basename without the extension
63
+ filename_without_extension, _ = os.path.splitext(basename)
64
+
65
+ #print(filename_without_extension)
66
+
67
+ return filename_without_extension
68
+
69
+ def load_embeddings(embeddings_name = embeddings_name):
70
+ '''
71
+ Load embeddings model and create a global variable based on it.
72
+ '''
73
+
74
+ # Import Chroma and instantiate a client. The default Chroma client is ephemeral, meaning it will not save to disk.
75
+
76
+ #else:
77
+ embeddings_func = AutoModel.from_pretrained(embeddings_name, trust_remote_code=True, device_map="auto")
78
+
79
+ global embeddings
80
+
81
+ embeddings = embeddings_func
82
+
83
+ return embeddings
84
+
85
+ def docs_to_jina_embed_np_array(docs_out, in_file, return_intermediate_files = "No", embeddings_super_compress = "No", embeddings = embeddings_model, progress=gr.Progress()):
86
+ '''
87
+ Takes a list of Langchain documents, embeds the text with the Jina model and returns the embeddings as a NumPy array (optionally saved to a .npz file).
88
+ '''
89
+
90
+ print(f"> Total split documents: {len(docs_out)}")
91
+
92
+ #print(docs_out)
93
+
94
+ page_contents = [doc.page_content for doc in docs_out]
95
+
96
+ ## Load in pre-embedded file if exists
97
+ file_list = [string.name for string in in_file]
98
+
99
+ #print(file_list)
100
+
101
+ embeddings_file_names = [string for string in file_list if "embedding" in string]
102
+ data_file_names = [string for string in file_list if "tokenised" not in string]
103
+ data_file_name = data_file_names[0]
104
+ data_file_name_no_ext = get_file_path_end(data_file_name)
105
+
106
+ out_message = "Document processing complete. Ready to search."
107
+
108
+ if embeddings_file_names:
109
+ print("Loading embeddings from file.")
110
+ embeddings_out = np.load(embeddings_file_names[0])['arr_0']
111
+
112
+ # If embedding files have 'super_compress' in the title, they have been multiplied by 100 before save
113
+ if "super_compress" in embeddings_file_names[0]:
114
+ embeddings_out /= 100
115
+
116
+ # print("embeddings loaded: ", embeddings_out)
117
+
118
+ if not embeddings_file_names:
119
+ tic = time.perf_counter()
120
+ print("Starting to embed documents.")
121
+ #embeddings_list = []
122
+ #for page in progress.tqdm(page_contents, desc = "Preparing search index", unit = "rows"):
123
+ # embeddings_list.append(embeddings.encode(sentences=page, max_length=1024).tolist())
124
+
125
+ embeddings_out = embeddings.encode(sentences=page_contents, max_length=1024, show_progress_bar = True, batch_size = 32) # For Jina embeddings
126
+ #embeddings_list = embeddings.encode(sentences=page_contents, normalize_embeddings=True).tolist() # For BGE embeddings
127
+ #embeddings_list = embeddings.encode(sentences=page_contents).tolist() # For minilm
128
+
129
+
130
+
131
+ toc = time.perf_counter()
132
+ time_out = f"The embedding took {toc - tic:0.1f} seconds"
133
+ print(time_out)
134
+
135
+ # If you want to save your files for next time
136
+ if return_intermediate_files == "Yes":
137
+ if embeddings_super_compress == "No":
138
+ semantic_search_file_name = data_file_name_no_ext + '_' + 'semantic_search_embeddings.npz'
139
+ np.savez_compressed(semantic_search_file_name, embeddings_out)
140
+ else:
141
+ semantic_search_file_name = data_file_name_no_ext + '_' + 'semantic_search_embeddings_super_compress.npz'
142
+ embeddings_out_round = np.round(embeddings_out, 3)
143
+ embeddings_out_round *= 100 # Rounding not currently used
144
+ np.savez_compressed(semantic_search_file_name, embeddings_out_round)
145
+
146
+ return out_message, embeddings_out, semantic_search_file_name
147
+
148
+ return out_message, embeddings_out, None
149
+
150
+ print(out_message)
151
+
152
+ return out_message, embeddings_out, None#, None
153
+
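The `.npz` handling above amounts to the following round trip, shown here with random numbers standing in for real Jina embeddings and an invented file name:

```python
# Sketch of the embedding save/load round trip used in docs_to_jina_embed_np_array.
import numpy as np

embeddings_out = np.random.rand(1000, 512).astype(np.float32)

# 'Super compress' option: round to 3 dp and scale by 100 before saving
np.savez_compressed("example_semantic_search_embeddings_super_compress.npz",
                    np.round(embeddings_out, 3) * 100)

loaded = np.load("example_semantic_search_embeddings_super_compress.npz")["arr_0"]
loaded /= 100  # undo the scaling on load, as the loading branch above does
```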
154
+ def process_data_from_scores_df(df_docs, in_join_file, out_passages, vec_score_cut_off, vec_weight, orig_df_col, in_join_column, search_df_join_column):
155
+
156
+ def create_docs_keep_from_df(df):
157
+ dict_out = {'ids' : [df['ids']],
158
+ 'documents': [df['documents']],
159
+ 'metadatas': [df['metadatas']],
160
+ 'distances': [round(df['distances'].astype(float), 4)],
161
+ 'embeddings': None
162
+ }
163
+ return dict_out
164
+
165
+ # Prepare the DataFrame by transposing
166
+ #df_docs = df#.apply(lambda x: x.explode()).reset_index(drop=True)
167
+
168
+ # Keep only documents with a certain score
169
+
170
+ #print(df_docs)
171
+
172
+ docs_scores = df_docs["distances"] #.astype(float)
173
+
174
+ # Only keep sources that are sufficiently relevant (i.e. similarity search score below threshold below)
175
+ score_more_limit = df_docs.loc[docs_scores > vec_score_cut_off, :]
176
+ #docs_keep = create_docs_keep_from_df(score_more_limit) #list(compress(docs, score_more_limit))
177
+
178
+ #print(docs_keep)
179
+
180
+ if score_more_limit.empty:
181
+ return pd.DataFrame()
182
+
183
+ # Only keep sources that are at least 100 characters long
184
+ docs_len = score_more_limit["documents"].str.len() >= 100
185
+
186
+ #print(docs_len)
187
+
188
+ length_more_limit = score_more_limit.loc[docs_len == True, :] #pd.Series(docs_len) >= 100
189
+ #docs_keep = create_docs_keep_from_df(length_more_limit) #list(compress(docs_keep, length_more_limit))
190
+
191
+ #print(length_more_limit)
192
+
193
+ if length_more_limit.empty:
194
+ return pd.DataFrame()
195
+
196
+ length_more_limit['ids'] = length_more_limit['ids'].astype(int)
197
+
198
+ #length_more_limit.to_csv("length_more_limit.csv", index = None)
199
+
200
+ # Explode the 'metadatas' dictionary into separate columns
201
+ df_metadata_expanded = length_more_limit['metadatas'].apply(pd.Series)
202
+
203
+ #print(length_more_limit)
204
+ #print(df_metadata_expanded)
205
+
206
+ # Concatenate the original DataFrame with the expanded metadata DataFrame
207
+ results_df_out = pd.concat([length_more_limit.drop('metadatas', axis=1), df_metadata_expanded], axis=1)
208
+
209
+ results_df_out = results_df_out.rename(columns={"documents":orig_df_col})
210
+
211
+ results_df_out = results_df_out.drop(["page_section", "row", "source", "id"], axis=1, errors="ignore")
212
+ results_df_out['distances'] = round(results_df_out['distances'].astype(float), 3)
213
+
214
+ # Join back to original df
215
+ # results_df_out = orig_df.merge(length_more_limit[['ids', 'distances']], left_index = True, right_on = "ids", how="inner").sort_values("distances")
216
+
217
+ # Join on additional files
218
+ if in_join_file:
219
+ join_filename = in_join_file.name
220
+
221
+ # Import data
222
+ join_df = read_file(join_filename)
223
+ join_df[in_join_column] = join_df[in_join_column].astype(str).str.replace(r"\.0$","", regex=True)
224
+
225
+ # Duplicates dropped so as not to expand out dataframe
226
+ join_df = join_df.drop_duplicates(in_join_column)
227
+
228
+ results_df_out[search_df_join_column] = results_df_out[search_df_join_column].astype(str).str.replace(r"\.0$","", regex=True)
229
+
230
+ results_df_out = results_df_out.merge(join_df,left_on=search_df_join_column, right_on=in_join_column, how="left").drop(in_join_column, axis=1)
231
+
232
+ return results_df_out
233
+
234
+ def jina_simple_retrieval(new_question_kworded:str, vectorstore, docs, orig_df_col:str, k_val:int, out_passages:int,
235
+ vec_score_cut_off:float, vec_weight:float, in_join_file = None, in_join_column = None, search_df_join_column = None, device = torch_device, embeddings = embeddings_model, progress=gr.Progress()): # ,vectorstore, embeddings
236
+
237
+ # print("vectorstore loaded: ", vectorstore)
238
+
239
+ # Convert it to a PyTorch tensor and transfer to GPU
240
+ vectorstore_tensor = tensor(vectorstore).to(device)
241
+
242
+ # Load the sentence transformer model and move it to GPU
243
+ embeddings = embeddings.to(device)
244
+
245
+ # Encode the query using the sentence transformer and convert to a PyTorch tensor
246
+ query = embeddings.encode(new_question_kworded)
247
+ query_tensor = tensor(query).to(device)
248
+
249
+ if query_tensor.dim() == 1:
250
+ query_tensor = query_tensor.unsqueeze(0) # Reshape to 2D with one row
251
+
252
+ # Normalize the query tensor and vectorstore tensor
253
+ query_norm = query_tensor / query_tensor.norm(dim=1, keepdim=True)
254
+ vectorstore_norm = vectorstore_tensor / vectorstore_tensor.norm(dim=1, keepdim=True)
255
+
256
+ # Calculate cosine similarities (batch processing)
257
+ cosine_similarities = mm(query_norm, vectorstore_norm.T)
258
+
259
+ # Flatten the tensor to a 1D array
260
+ cosine_similarities = cosine_similarities.flatten()
261
+
262
+ # Convert to a NumPy array if it's still a PyTorch tensor
263
+ cosine_similarities = cosine_similarities.cpu().numpy()
264
+
265
+ # Create a Pandas Series
266
+ cosine_similarities_series = pd.Series(cosine_similarities)
267
+
268
+ # Pull out relevent info from docs
269
+ page_contents = [doc.page_content for doc in docs]
270
+ page_meta = [doc.metadata for doc in docs]
271
+ ids_range = range(0,len(page_contents))
272
+ ids = [str(element) for element in ids_range]
273
+
274
+ df_docs = pd.DataFrame(data={"ids": ids,
275
+ "documents": page_contents,
276
+ "metadatas":page_meta,
277
+ "distances":cosine_similarities_series}).sort_values("distances", ascending=False).iloc[0:k_val,:]
278
+
279
+
280
+ results_df_out = process_data_from_scores_df(df_docs, in_join_file, out_passages, vec_score_cut_off, vec_weight, orig_df_col, in_join_column, search_df_join_column)
281
+
282
+ # If nothing found, return error message
283
+ if results_df_out.empty:
284
+ return 'No result found!', None
285
+
286
+ results_df_name = "semantic_search_result_" + today_rev + ".csv"
287
+ results_df_out.to_csv(results_df_name, index= None)
288
+ results_first_text = results_df_out.iloc[0, 1]
289
+
290
+ return results_first_text, results_df_name
291
+
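The cosine-similarity step in `jina_simple_retrieval` can be exercised on its own with random tensors, as a sanity check of the normalise-then-matmul approach used above:

```python
# Standalone sketch of the cosine-similarity calculation, with random tensors
# standing in for real query/document embeddings.
from torch import mm, rand

query_tensor = rand(1, 512)           # one query embedding
vectorstore_tensor = rand(1000, 512)  # embeddings for 1,000 documents

query_norm = query_tensor / query_tensor.norm(dim=1, keepdim=True)
vectorstore_norm = vectorstore_tensor / vectorstore_tensor.norm(dim=1, keepdim=True)

cosine_similarities = mm(query_norm, vectorstore_norm.T).flatten()
top_k = cosine_similarities.argsort(descending=True)[:5]
print(top_k, cosine_similarities[top_k])
```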
292
+ # Deprecated Chroma functions - kept just in case needed in future.
293
+
294
+ def docs_to_chroma_save_deprecated(docs_out, embeddings = embeddings_model, progress=gr.Progress()):
295
+ '''
296
+ Takes a Langchain document class and saves it into a Chroma sqlite file. Not currently used.
297
+ '''
298
+
299
+ print(f"> Total split documents: {len(docs_out)}")
300
+
301
+ #print(docs_out)
302
+
303
+ page_contents = [doc.page_content for doc in docs_out]
304
+ page_meta = [doc.metadata for doc in docs_out]
305
+ ids_range = range(0,len(page_contents))
306
+ ids = [str(element) for element in ids_range]
307
+
308
+ tic = time.perf_counter()
309
+ #embeddings_list = []
310
+ #for page in progress.tqdm(page_contents, desc = "Preparing search index", unit = "rows"):
311
+ # embeddings_list.append(embeddings.encode(sentences=page, max_length=1024).tolist())
312
+
313
+ embeddings_list = embeddings.encode(sentences=page_contents, max_length=256, show_progress_bar = True, batch_size = 32).tolist() # For Jina embeddings
314
+ #embeddings_list = embeddings.encode(sentences=page_contents, normalize_embeddings=True).tolist() # For BGE embeddings
315
+ #embeddings_list = embeddings.encode(sentences=page_contents).tolist() # For minilm
316
+
317
+ toc = time.perf_counter()
318
+ time_out = f"The embedding took {toc - tic:0.1f} seconds"
319
+
320
+ #pd.Series(embeddings_list).to_csv("embeddings_out.csv")
321
+
322
+ # Jina tiny
323
+ # This takes about 300 seconds for 240,000 records = 800 / second, 1024 max length
324
+ # For 50k records:
325
+ # 61 seconds at 1024 max length
326
+ # 55 seconds at 512 max length
327
+ # 43 seconds at 256 max length
328
+ # 31 seconds at 128 max length
329
+
330
+ # The embedding took 1372.5 seconds at 256 max length for 655,020 case notes
331
+
332
+ # BGE small
333
+ # 96 seconds for 50k records at 512 length
334
+
335
+ # all-MiniLM-L6-v2
336
+ # 42.5 seconds at (256?) max length
337
+
338
+ # paraphrase-MiniLM-L3-v2
339
+ # 22 seconds for 128 max length
340
+
341
+
342
+ print(time_out)
343
+
344
+ chroma_tic = time.perf_counter()
345
+
346
+ # Create a new Chroma collection to store the documents and metadata. We don't need to specify an embedding function, and the default will be used.
347
+ client = chromadb.PersistentClient(path="./last_year", settings=Settings(
348
+ anonymized_telemetry=False))
349
+
350
+ try:
351
+ print("Deleting existing collection.")
352
+ #collection = client.get_collection(name="my_collection")
353
+ client.delete_collection(name="my_collection")
354
+ print("Creating new collection.")
355
+ collection = client.create_collection(name="my_collection")
356
+ except:
357
+ print("Creating new collection.")
358
+ collection = client.create_collection(name="my_collection")
359
+
360
+ # Max batch size is about 40,000, so add records in batches of that size in a loop
361
+ def create_batch_ranges(in_list, batch_size=40000):
362
+ total_rows = len(in_list)
363
+ ranges = []
364
+
365
+ for start in range(0, total_rows, batch_size):
366
+ end = min(start + batch_size, total_rows)
367
+ ranges.append(range(start, end))
368
+
369
+ return ranges
370
+
371
+ batch_ranges = create_batch_ranges(embeddings_list)
372
+ print(batch_ranges)
373
+
374
+ for row_range in progress.tqdm(batch_ranges, desc = "Creating vector database", unit = "batches of 40,000 rows"):
375
+
376
+ collection.add(
377
+ documents = page_contents[row_range.start:row_range.stop],
378
+ embeddings = embeddings_list[row_range.start:row_range.stop],
379
+ metadatas = page_meta[row_range.start:row_range.stop],
380
+ ids = ids[row_range.start:row_range.stop])
381
+ #print("Here")
382
+
383
+ # print(collection.count())
384
+
385
+
386
+ #chatf.vectorstore = vectorstore_func
387
+
388
+ chroma_toc = time.perf_counter()
389
+
390
+ chroma_time_out = f"Loading to Chroma db took {chroma_toc - chroma_tic:0.1f} seconds"
391
+ print(chroma_time_out)
392
+
393
+ out_message = "Document processing complete"
394
+
395
+ return out_message, collection
396
+
397
+ def chroma_retrieval_deprecated(new_question_kworded:str, vectorstore, docs, orig_df_col:str, k_val:int, out_passages:int,
398
+ vec_score_cut_off:float, vec_weight:float, in_join_file = None, in_join_column = None, search_df_join_column = None, embeddings = embeddings_model): # ,vectorstore, embeddings
399
+
400
+ query = embeddings.encode(new_question_kworded).tolist()
401
+
402
+ docs = vectorstore.query(
403
+ query_embeddings=query,
404
+ n_results= k_val # No practical limit on number of responses returned
405
+ #where={"metadata_field": "is_equal_to_this"},
406
+ #where_document={"$contains":"search_string"}
407
+ )
408
+
409
+ df_docs = pd.DataFrame(data={'ids': docs['ids'][0],
410
+ 'documents': docs['documents'][0],
411
+ 'metadatas':docs['metadatas'][0],
412
+ 'distances':docs['distances'][0]#,
413
+ #'embeddings': docs['embeddings']
414
+ })
415
+
416
+ results_df_out = process_data_from_scores_df(df_docs, in_join_file, out_passages, vec_score_cut_off, vec_weight, orig_df_col, in_join_column, search_df_join_column)
417
+
418
+ results_df_name = "semantic_search_result.csv"
419
+ results_df_out.to_csv(results_df_name, index= None)
420
+ results_first_text = results_df_out[orig_df_col].iloc[0]
421
+
422
+ return results_first_text, results_df_name
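
In case the deprecated Chroma path is revived, here is a hedged end-to-end sketch of the batched-add-then-query pattern the two functions above use; the collection name, batch size, document count and embedding model are illustrative assumptions.

```python
# Hedged sketch of the batched add and query pattern used by the deprecated
# Chroma functions above. All names and sizes are illustrative placeholders.
import chromadb
from chromadb.config import Settings
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
documents = [f"Example passage number {i}" for i in range(100_000)]
ids = [str(i) for i in range(len(documents))]
embeddings_list = model.encode(documents).tolist()

client = chromadb.PersistentClient(path="./example_db", settings=Settings(anonymized_telemetry=False))
collection = client.get_or_create_collection(name="example_collection")

batch_size = 40_000  # assumed safe maximum per add() call
for start in range(0, len(documents), batch_size):
    end = min(start + batch_size, len(documents))
    collection.add(documents=documents[start:end],
                   embeddings=embeddings_list[start:end],
                   ids=ids[start:end])

# Query the collection with an embedded search phrase
query_embedding = model.encode("nature and wildlife").tolist()
results = collection.query(query_embeddings=[query_embedding], n_results=5)
print(results["documents"][0])
```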
search_funcs/{ingest.py → semantic_ingest_functions.py} RENAMED
@@ -4,27 +4,17 @@ import os
4
  import time
5
  import re
6
  import ast
 
7
  import pandas as pd
8
  import gradio as gr
9
  from typing import Type, List, Literal
10
- from langchain.text_splitter import RecursiveCharacterTextSplitter
11
 
12
  from pydantic import BaseModel, Field
13
 
14
  # Creating an alias for pandas DataFrame using Type
15
  PandasDataFrame = Type[pd.DataFrame]
16
 
17
- # class Document(BaseModel):
18
- # """Class for storing a piece of text and associated metadata. Implementation adapted from Langchain code: https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/documents/base.py"""
19
-
20
- # page_content: str
21
- # """String text."""
22
- # metadata: dict = Field(default_factory=dict)
23
- # """Arbitrary metadata about the page content (e.g., source, relationships to other
24
- # documents, etc.).
25
- # """
26
- # type: Literal["Document"] = "Document"
27
-
28
  class Document(BaseModel):
29
  """Class for storing a piece of text and associated metadata. Implementation adapted from Langchain code: https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/documents/base.py"""
30
 
@@ -36,25 +26,30 @@ class Document(BaseModel):
36
  """
37
  type: Literal["Document"] = "Document"
38
 
 
39
  split_strat = ["\n\n", "\n", ". ", "! ", "? "]
40
- chunk_size = 500
41
  chunk_overlap = 0
42
  start_index = True
43
 
 
 
 
 
44
  ## Parse files
45
- def determine_file_type(file_path):
46
- """
47
- Determine the file type based on its extension.
48
 
49
- Parameters:
50
- file_path (str): Path to the file.
51
 
52
- Returns:
53
- str: File extension (e.g., '.pdf', '.docx', '.txt', '.html').
54
- """
55
- return os.path.splitext(file_path)[1].lower()
56
 
57
- def parse_file(file_paths, text_column='text'):
58
  """
59
  Accepts a list of file paths, determines each file's type based on its extension,
60
  and passes it to the relevant parsing function.
@@ -87,16 +82,16 @@ def parse_file(file_paths, text_column='text'):
87
  file_names = []
88
 
89
  for file_path in file_paths:
90
- print(file_path.name)
91
  #file = open(file_path.name, 'r')
92
  #print(file)
93
- file_extension = determine_file_type(file_path.name)
94
  if file_extension in extension_to_parser:
95
  parsed_contents[file_path.name] = extension_to_parser[file_extension](file_path.name)
96
  else:
97
  parsed_contents[file_path.name] = f"Unsupported file type: {file_extension}"
98
 
99
- filename_end = get_file_path_end(file_path.name)
100
 
101
  file_names.append(filename_end)
102
 
@@ -117,7 +112,7 @@ def text_regex_clean(text):
117
 
118
  return text
119
 
120
- def parse_csv_or_excel(file_path, text_column = "text"):
121
  """
122
  Read in a CSV or Excel file.
123
 
@@ -133,91 +128,50 @@ def parse_csv_or_excel(file_path, text_column = "text"):
133
 
134
  file_list = [string.name for string in file_path]
135
 
136
- print(file_list)
137
 
138
- data_file_names = [string for string in file_list if "tokenised" not in string]
139
 
 
140
 
141
  #for file_path in file_paths:
142
- file_extension = determine_file_type(data_file_names[0])
143
- file_name = get_file_path_end(data_file_names[0])
144
- file_names = [file_name]
145
-
146
- print(file_extension)
147
-
148
- if file_extension == ".csv":
149
- df = pd.read_csv(data_file_names[0], low_memory=False)
150
- if text_column not in df.columns: return pd.DataFrame(), ['Please choose a valid column name']
151
- df['source'] = file_name
152
- df['page_section'] = ""
153
- elif file_extension == ".xlsx":
154
- df = pd.read_excel(data_file_names[0], engine='openpyxl')
155
- if text_column not in df.columns: return pd.DataFrame(), ['Please choose a valid column name']
156
- df['source'] = file_name
157
- df['page_section'] = ""
158
- elif file_extension == ".parquet":
159
- df = pd.read_parquet(data_file_names[0])
160
- if text_column not in df.columns: return pd.DataFrame(), ['Please choose a valid column name']
161
- df['source'] = file_name
162
- df['page_section'] = ""
163
- else:
164
- print(f"Unsupported file type: {file_extension}")
165
- return pd.DataFrame(), ['Please choose a valid file type']
166
 
 
 
 
 
167
  message = "Loaded in file. Now converting to document format."
168
  print(message)
169
 
170
- return df, file_names, message
171
 
172
- def get_file_path_end(file_path):
173
- match = re.search(r'(.*[\/\\])?(.+)$', file_path)
174
-
175
- filename_end = match.group(2) if match else ''
176
-
177
- return filename_end
178
 
179
  # +
180
  # Convert parsed text to docs
181
  # -
182
 
183
- def text_to_docs(text_dict: dict, chunk_size: int = chunk_size) -> List[Document]:
184
- """
185
- Converts the output of parse_file (a dictionary of file paths to content)
186
- to a list of Documents with metadata.
187
- """
188
-
189
- doc_sections = []
190
- parent_doc_sections = []
191
-
192
- for file_path, content in text_dict.items():
193
- ext = os.path.splitext(file_path)[1].lower()
194
-
195
- # Depending on the file extension, handle the content
196
- # if ext == '.pdf':
197
- # docs, page_docs = pdf_text_to_docs(content, chunk_size)
198
- # elif ext in ['.html', '.htm', '.txt', '.docx']:
199
- # docs = html_text_to_docs(content, chunk_size)
200
- if ext in ['.csv', '.xlsx']:
201
- docs, page_docs = csv_excel_text_to_docs(content, chunk_size)
202
- else:
203
- print(f"Unsupported file type {ext} for {file_path}. Skipping.")
204
- continue
205
-
206
-
207
- filename_end = get_file_path_end(file_path)
208
-
209
- #match = re.search(r'(.*[\/\\])?(.+)$', file_path)
210
- #filename_end = match.group(2) if match else ''
211
-
212
- # Add filename as metadata
213
- for doc in docs: doc.metadata["source"] = filename_end
214
- #for parent_doc in parent_docs: parent_doc.metadata["source"] = filename_end
215
-
216
- doc_sections.extend(docs)
217
- #parent_doc_sections.extend(parent_docs)
218
-
219
- return doc_sections#, page_docs
220
-
221
  def write_out_metadata_as_string(metadata_in):
222
  # If metadata_in is a single dictionary, wrap it in a list
223
  if isinstance(metadata_in, dict):
@@ -228,74 +182,39 @@ def write_out_metadata_as_string(metadata_in):
228
 
229
  def combine_metadata_columns(df, cols):
230
 
231
- df['metadatas'] = "{"
232
- df['blank_column'] = ""
233
 
234
  for n, col in enumerate(cols):
235
  df[col] = df[col].astype(str).str.replace('"',"'").str.replace('\n', ' ').str.replace('\r', ' ').str.replace('\r\n', ' ').str.cat(df['blank_column'].astype(str), sep="")
236
 
237
- df['metadatas'] = df['metadatas'] + '"' + cols[n] + '": "' + df[col] + '", '
238
-
239
-
240
- df['metadatas'] = (df['metadatas'] + "}").str.replace(', }', '}')
241
 
242
- return df['metadatas']
243
 
244
- def csv_excel_text_to_docs(df, text_column='text', chunk_size=None) -> List[Document]:
245
- """Converts a DataFrame's content to a list of Documents with metadata."""
246
-
247
- #print(df.head())
248
-
249
- print("Converting to documents.")
250
-
251
- doc_sections = []
252
- df[text_column] = df[text_column].astype(str) # Ensure column is a string column
253
 
254
- # For each row in the dataframe
255
- for idx, row in df.iterrows():
256
- # Extract the text content for the document
257
- doc_content = row[text_column]
258
-
259
- # Generate metadata containing other columns' data
260
- metadata = {"row": idx + 1}
261
- for col, value in row.items():
262
- if col != text_column:
263
- metadata[col] = value
264
-
265
- metadata_string = write_out_metadata_as_string(metadata)[0]
266
-
267
- # If chunk_size is provided, split the text into chunks
268
- if chunk_size:
269
- # Assuming you have a text splitter function similar to the PDF handling
270
- text_splitter = RecursiveCharacterTextSplitter(
271
- chunk_size=chunk_size,
272
- chunk_overlap=chunk_overlap,
273
- split_strat=split_strat,
274
- start_index=start_index
275
- ) #Other arguments as required by the splitter
276
-
277
- sections = text_splitter.split_text(doc_content)
278
-
279
-
280
- # For each section, create a Document object
281
- for i, section in enumerate(sections):
282
- section = '. '.join([metadata_string, section])
283
- doc = Document(page_content=section,
284
- metadata={**metadata, "section": i, "row_section": f"{metadata['row']}-{i}"})
285
- doc_sections.append(doc)
286
-
287
- #print("Chunking currently disabled")
288
-
289
- else:
290
- # If no chunk_size is provided, create a single Document object for the row
291
- #doc_content = '. '.join([metadata_string, doc_content])
292
- doc = Document(page_content=doc_content, metadata=metadata)
293
- doc_sections.append(doc)
294
 
295
- message = "Data converted to document format. Now creating/loading document embeddings."
296
- print(message)
 
 
297
 
298
- return doc_sections, message
 
 
 
 
 
 
 
 
 
 
 
 
 
 
299
 
300
  def clean_line_breaks(text):
301
  # Replace \n and \r\n with a space
@@ -322,14 +241,106 @@ def parse_metadata(row):
322
  # Handle the error or log it
323
  return None # or some default value
324
 
325
- def csv_excel_text_to_docs(df, text_column='text', chunk_size=None, progress=gr.Progress()) -> List[Document]:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
326
  """Converts a DataFrame's content to a list of dictionaries in the 'Document' format, containing page_content and associated metadata."""
327
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
328
  ingest_tic = time.perf_counter()
329
 
330
  doc_sections = []
331
  df[text_column] = df[text_column].astype(str).str.strip() # Ensure column is a string column
332
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
333
  cols = [col for col in df.columns if col != text_column]
334
 
335
  df["metadata"] = combine_metadata_columns(df, cols)
@@ -341,71 +352,75 @@ def csv_excel_text_to_docs(df, text_column='text', chunk_size=None, progress=gr.
341
  #doc_sections = df[["page_content", "metadata"]].to_dict(orient='records')
342
  #doc_sections = [Document(**row) for row in df[["page_content", "metadata"]].to_dict(orient='records')]
343
 
 
344
  # Create a list of Document objects
345
  doc_sections = [Document(page_content=row['page_content'],
346
  metadata= parse_metadata(row["metadata"]))
347
- for index, row in progress.tqdm(df.iterrows(), desc = "Splitting up text", unit = "rows")]
348
-
349
  ingest_toc = time.perf_counter()
350
 
351
  ingest_time_out = f"Preparing documents took {ingest_toc - ingest_tic:0.1f} seconds"
352
  print(ingest_time_out)
353
 
354
- return doc_sections, "Finished splitting documents"
355
-
356
- # # Functions for working with documents after loading them back in
357
-
358
- def pull_out_data(series):
 
 
 
 
359
 
360
- # define a lambda function to convert each string into a tuple
361
- to_tuple = lambda x: eval(x)
362
 
363
- # apply the lambda function to each element of the series
364
- series_tup = series.apply(to_tuple)
365
 
366
- series_tup_content = list(zip(*series_tup))[1]
 
367
 
368
- series = pd.Series(list(series_tup_content))#.str.replace("^Main post content", "", regex=True).str.strip()
 
 
369
 
370
- return series
 
371
 
372
- def docs_from_csv(df):
 
373
 
374
- import ast
375
-
376
- documents = []
377
-
378
- page_content = pull_out_data(df["0"])
379
- metadatas = pull_out_data(df["1"])
380
-
381
- for x in range(0,len(df)):
382
- new_doc = Document(page_content=page_content[x], metadata=metadatas[x])
383
- documents.append(new_doc)
384
-
385
- return documents
386
-
387
- def docs_from_lists(docs, metadatas):
388
-
389
- documents = []
390
-
391
- for x, doc in enumerate(docs):
392
- new_doc = Document(page_content=doc, metadata=metadatas[x])
393
- documents.append(new_doc)
394
-
395
- return documents
396
 
397
- def docs_elements_from_csv_save(docs_path="documents.csv"):
398
 
399
- documents = pd.read_csv(docs_path)
 
 
 
 
400
 
401
- docs_out = docs_from_csv(documents)
 
 
 
402
 
403
- out_df = pd.DataFrame(docs_out)
 
 
404
 
405
- docs_content = pull_out_data(out_df[0].astype(str))
 
406
 
407
- docs_meta = pull_out_data(out_df[1].astype(str))
 
 
408
 
409
- doc_sources = [d['source'] for d in docs_meta]
 
 
 
 
410
 
411
- return out_df, docs_content, docs_meta, doc_sources
 
 
4
  import time
5
  import re
6
  import ast
7
+ import gzip
8
  import pandas as pd
9
  import gradio as gr
10
  from typing import Type, List, Literal
11
+ #from langchain.text_splitter import RecursiveCharacterTextSplitter
12
 
13
  from pydantic import BaseModel, Field
14
 
15
  # Creating an alias for pandas DataFrame using Type
16
  PandasDataFrame = Type[pd.DataFrame]
17
 
 
 
 
 
 
 
 
 
 
 
 
18
  class Document(BaseModel):
19
  """Class for storing a piece of text and associated metadata. Implementation adapted from Langchain code: https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/documents/base.py"""
20
 
 
26
  """
27
  type: Literal["Document"] = "Document"
28
 
29
+ # Constants for chunking - not currently used
30
  split_strat = ["\n\n", "\n", ". ", "! ", "? "]
31
+ chunk_size = 512
32
  chunk_overlap = 0
33
  start_index = True
34
 
35
+ from search_funcs.helper_functions import get_file_path_end_with_ext, detect_file_type, get_file_path_end
36
+ from search_funcs.bm25_functions import save_prepared_bm25_data
37
+ from search_funcs.clean_funcs import initial_clean
38
+
39
  ## Parse files
40
+ # def detect_file_type(file_path):
41
+ # """
42
+ # Determine the file type based on its extension.
43
 
44
+ # Parameters:
45
+ # file_path (str): Path to the file.
46
 
47
+ # Returns:
48
+ # str: File extension (e.g., '.pdf', '.docx', '.txt', '.html').
49
+ # """
50
+ # return os.path.splitext(file_path)[1].lower()
51
 
52
+ def parse_file_not_used(file_paths, text_column='text'):
53
  """
54
  Accepts a list of file paths, determines each file's type based on its extension,
55
  and passes it to the relevant parsing function.
 
82
  file_names = []
83
 
84
  for file_path in file_paths:
85
+ #print(file_path.name)
86
  #file = open(file_path.name, 'r')
87
  #print(file)
88
+ file_extension = detect_file_type(file_path.name)
89
  if file_extension in extension_to_parser:
90
  parsed_contents[file_path.name] = extension_to_parser[file_extension](file_path.name)
91
  else:
92
  parsed_contents[file_path.name] = f"Unsupported file type: {file_extension}"
93
 
94
+ filename_end = get_file_path_end_with_ext(file_path.name)
95
 
96
  file_names.append(filename_end)
97
 
 
112
 
113
  return text
114
 
115
+ def parse_csv_or_excel(file_path, data_state, text_column = "text"):
116
  """
117
  Read in a CSV or Excel file.
118
 
 
128
 
129
  file_list = [string.name for string in file_path]
130
 
131
+ #print(file_list)
132
 
133
+ data_file_names = [string for string in file_list if "tokenised" not in string and "embeddings" not in string]
134
 
135
+ data_file_name = data_file_names[0]
136
 
137
  #for file_path in file_paths:
138
+ file_name = get_file_path_end_with_ext(data_file_name)
139
+
140
+ #print(file_extension)
141
+
142
+ # if file_extension == "csv":
143
+ # df = pd.read_csv(data_file_names[0], low_memory=False)
144
+ # if text_column not in df.columns: return pd.DataFrame(), ['Please choose a valid column name']
145
+ # df['source'] = file_name
146
+ # df['page_section'] = ""
147
+ # elif file_extension == "xlsx":
148
+ # df = pd.read_excel(data_file_names[0], engine='openpyxl')
149
+ # if text_column not in df.columns: return pd.DataFrame(), ['Please choose a valid column name']
150
+ # df['source'] = file_name
151
+ # df['page_section'] = ""
152
+ # elif file_extension == "parquet":
153
+ # df = pd.read_parquet(data_file_names[0])
154
+ # if text_column not in df.columns: return pd.DataFrame(), ['Please choose a valid column name']
155
+ # df['source'] = file_name
156
+ # df['page_section'] = ""
157
+ # else:
158
+ # print(f"Unsupported file type: {file_extension}")
159
+ # return pd.DataFrame(), ['Please choose a valid file type']
 
 
160
 
161
+ df = data_state
162
+ #df['source'] = file_name
163
+ #df['page_section'] = ""
164
+
165
  message = "Loaded in file. Now converting to document format."
166
  print(message)
167
 
168
+ return df, file_name, message
169
 
 
 
 
 
 
 
170
 
171
  # +
172
  # Convert parsed text to docs
173
  # -
174
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
175
  def write_out_metadata_as_string(metadata_in):
176
  # If metadata_in is a single dictionary, wrap it in a list
177
  if isinstance(metadata_in, dict):
 
182
 
183
  def combine_metadata_columns(df, cols):
184
 
185
+ df['metadata'] = '{'
186
+ df['blank_column'] = ''
187
 
188
  for n, col in enumerate(cols):
189
  df[col] = df[col].astype(str).str.replace('"',"'").str.replace('\n', ' ').str.replace('\r', ' ').str.replace('\r\n', ' ').str.cat(df['blank_column'].astype(str), sep="")
190
 
191
+ df['metadata'] = df['metadata'] + '"' + cols[n] + '": "' + df[col] + '", '
 
 
 
192
 
 
193
 
194
+ df['metadata'] = (df['metadata'] + "}").str.replace(', }', '}').str.replace('", }"', '}')
 
 
 
 
 
 
 
 
195
 
196
+ return df['metadata']
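
A toy round trip of the metadata handling above: build the per-row '{"col": "value", ...}' string with the function just defined, then parse it back into a dict (parse_metadata is assumed to rely on ast.literal_eval, as suggested by the ast import). The data is made up for illustration.

```python
# Toy round trip for the metadata string built by combine_metadata_columns above.
# parse_metadata is assumed to use ast.literal_eval; the example data is illustrative.
import ast
import pandas as pd

example_df = pd.DataFrame({"text": ["some passage"], "author": ["A. Person"], "year": ["2023"]})
metadata_series = combine_metadata_columns(example_df, ["author", "year"])

print(metadata_series.iloc[0])                    # '{"author": "A. Person", "year": "2023"}'
print(ast.literal_eval(metadata_series.iloc[0]))  # {'author': 'A. Person', 'year': '2023'}
```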
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
197
 
198
+ def split_string_into_chunks(input_string, max_length, split_symbols):
199
+ # Check if input_string or split_symbols are empty
200
+ if not input_string or not split_symbols:
201
+ return [input_string]
202
 
203
+ chunks = []
204
+ current_chunk = ""
205
+
206
+ for char in input_string:
207
+ current_chunk += char
208
+ if len(current_chunk) >= max_length or char in split_symbols:
209
+ # Add the current chunk to the chunks list
210
+ chunks.append(current_chunk)
211
+ current_chunk = ""
212
+
213
+ # Adding any remaining part of the string
214
+ if current_chunk:
215
+ chunks.append(current_chunk)
216
+
217
+ return chunks
218
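
A quick toy run of split_string_into_chunks (the text and the single-character split symbols below are illustrative inputs, not values used elsewhere in the app):

```python
# Toy usage of split_string_into_chunks. The function emits a chunk whenever it
# hits a split symbol or the current chunk reaches max_length characters.
text = "First sentence. Second sentence! Third sentence? Then one much longer clause with no punctuation"
chunks = split_string_into_chunks(text, max_length=30, split_symbols=[".", "!", "?"])
print(chunks)
# Chunks end at '.', '!' or '?', or are cut at 30 characters when no symbol appears.
```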
 
219
  def clean_line_breaks(text):
220
  # Replace \n and \r\n with a space
 
241
  # Handle the error or log it
242
  return None # or some default value
243
 
244
+ # def csv_excel_text_to_docs_deprecated(df, text_column='text', chunk_size=None) -> List[Document]:
245
+ # """Converts a DataFrame's content to a list of Documents with metadata."""
246
+
247
+ # print("Converting to documents.")
248
+
249
+ # doc_sections = []
250
+ # df[text_column] = df[text_column].astype(str) # Ensure column is a string column
251
+
252
+ # # For each row in the dataframe
253
+ # for idx, row in df.iterrows():
254
+ # # Extract the text content for the document
255
+ # doc_content = row[text_column]
256
+
257
+ # # Generate metadata containing other columns' data
258
+ # metadata = {"row": idx + 1}
259
+ # for col, value in row.items():
260
+ # if col != text_column:
261
+ # metadata[col] = value
262
+
263
+ # metadata_string = write_out_metadata_as_string(metadata)[0]
264
+
265
+ # # If chunk_size is provided, split the text into chunks
266
+ # if chunk_size:
267
+ # sections = split_string_into_chunks(doc_content, chunk_size, split_strat)
268
+
269
+ # # Langchain usage deprecated
270
+ # # text_splitter = RecursiveCharacterTextSplitter(
271
+ # # chunk_size=chunk_size,
272
+ # # chunk_overlap=chunk_overlap,
273
+ # # split_strat=split_strat,
274
+ # # start_index=start_index
275
+ # # ) #Other arguments as required by the splitter
276
+
277
+ # # sections = text_splitter.split_text(doc_content)
278
+
279
+ # # For each section, create a Document object
280
+ # for i, section in enumerate(sections):
281
+ # section = '. '.join([metadata_string, section])
282
+ # doc = Document(page_content=section,
283
+ # metadata={**metadata, "section": i, "row_section": f"{metadata['row']}-{i}"})
284
+ # doc_sections.append(doc)
285
+
286
+ # else:
287
+ # # If no chunk_size is provided, create a single Document object for the row
288
+ # #doc_content = '. '.join([metadata_string, doc_content])
289
+ # doc = Document(page_content=doc_content, metadata=metadata)
290
+ # doc_sections.append(doc)
291
+
292
+ # message = "Data converted to document format. Now creating/loading document embeddings."
293
+ # print(message)
294
+
295
+ # return doc_sections, message
296
+
297
+ def csv_excel_text_to_docs(df, in_file, text_column='text', clean = "No", return_intermediate_files = "No", chunk_size=None, progress=gr.Progress()) -> List[Document]:
298
  """Converts a DataFrame's content to a list of dictionaries in the 'Document' format, containing page_content and associated metadata."""
299
 
300
+ file_list = [string.name for string in in_file]
301
+
302
+ data_file_names = [string for string in file_list if "tokenised" not in string and "embeddings" not in string]
303
+ data_file_name = data_file_names[0]
304
+
305
+ # If the file already contains prepared documents, load them in directly
306
+ if "prepared_docs" in data_file_name:
307
+ print("Loading in documents from file.")
308
+
309
+ #print(df[0:5])
310
+ #section_series = df.iloc[:,0]
311
+ #section_series = "{" + section_series + "}"
312
+
313
+ doc_sections = df
314
+
315
+ print(doc_sections[0])
316
+
317
+ # Convert each element in the Series to a Document instance
318
+ #doc_sections = section_series.apply(lambda x: Document(**x))
319
+
320
+ return doc_sections, "Finished preparing documents"
321
+ # df = document_to_dataframe(df.iloc[:,0])
322
+
323
  ingest_tic = time.perf_counter()
324
 
325
  doc_sections = []
326
  df[text_column] = df[text_column].astype(str).str.strip() # Ensure column is a string column
327
 
328
+ if clean == "Yes":
329
+ clean_tic = time.perf_counter()
330
+ print("Starting data clean.")
331
+
332
+ df = df.drop_duplicates(text_column)
333
+
334
+ df[text_column] = initial_clean(df[text_column])
335
+ df_list = list(df[text_column])
336
+
337
+ # Save to file if you have cleaned the data
338
+ out_file_name, text_column = save_prepared_bm25_data(data_file_name, df_list, df, text_column)
339
+
340
+ clean_toc = time.perf_counter()
341
+ clean_time_out = f"Cleaning the text took {clean_toc - clean_tic:0.1f} seconds."
342
+ print(clean_time_out)
343
+
344
  cols = [col for col in df.columns if col != text_column]
345
 
346
  df["metadata"] = combine_metadata_columns(df, cols)
 
352
  #doc_sections = df[["page_content", "metadata"]].to_dict(orient='records')
353
  #doc_sections = [Document(**row) for row in df[["page_content", "metadata"]].to_dict(orient='records')]
354
 
355
+
356
  # Create a list of Document objects
357
  doc_sections = [Document(page_content=row['page_content'],
358
  metadata= parse_metadata(row["metadata"]))
359
+ for index, row in progress.tqdm(df.iterrows(), desc = "Splitting up text", unit = "rows")]
360
+
361
  ingest_toc = time.perf_counter()
362
 
363
  ingest_time_out = f"Preparing documents took {ingest_toc - ingest_tic:0.1f} seconds"
364
  print(ingest_time_out)
365
 
366
+ if return_intermediate_files == "Yes":
367
+ data_file_out_name_no_ext = get_file_path_end(data_file_name)
368
+ file_name = data_file_out_name_no_ext + "_cleaned"
369
+ #print(doc_sections)
370
+ #page_content_series_string = pd.Series(doc_sections).astype(str)
371
+ #page_content_series_string = page_content_series_string.str.replace(" type='Document'", "").str.replace("' metadata=", "', 'metadata':").str.replace("page_content=", "{'page_content':")
372
+ #page_content_series_string = page_content_series_string + "}"
373
+ #print(page_content_series_string[0])
374
+ #metadata_series_string = pd.Series(doc_sections[1]).astype(str)
375
 
376
+ import pickle
 
377
 
378
+ if clean == "No":
379
+ #pd.DataFrame(data = {"Documents":page_content_series_string}).to_parquet(file_name + "_prepared_docs.parquet")
380
 
381
+ with gzip.open(file_name + "_prepared_docs.pkl.gz", 'wb') as file:
382
+ pickle.dump(doc_sections, file)
383
 
384
+ #pd.Series(doc_sections).to_pickle(file_name + "_prepared_docs.pkl")
385
+ elif clean == "Yes":
386
+ #pd.DataFrame(data = {"Documents":page_content_series_string}).to_parquet(file_name + "_prepared_docs_clean.parquet")
387
 
388
+ with gzip.open(file_name + "_prepared_docs_clean.pkl.gz", 'wb') as file:
389
+ pickle.dump(doc_sections, file)
390
 
391
+ #pd.Series(doc_sections).to_pickle(file_name + "_prepared_docs_clean.pkl")
392
+ print("Documents saved to file.")
393
 
394
+ return doc_sections, "Finished preparing documents."
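
To reuse the intermediate file written above without re-processing, it can be reloaded with gzip and pickle; the file name here is an illustrative example of the "<data file>_cleaned_prepared_docs.pkl.gz" naming pattern used by this function.

```python
# Hedged sketch of reloading a prepared-documents file saved by csv_excel_text_to_docs.
# The file name below is illustrative; it follows the pattern used above.
import gzip
import pickle

with gzip.open("my_data_cleaned_prepared_docs.pkl.gz", "rb") as file:
    doc_sections = pickle.load(file)

print(len(doc_sections), "documents loaded")
print(doc_sections[0].page_content[:100])
```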
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
395
 
 
396
 
397
+ def document_to_dataframe(documents):
398
+ '''
399
+ Convert a list of objects in Document format into a pandas dataframe
400
+ '''
401
+ rows = []
402
 
403
+ for doc in documents:
404
+ # Convert Document to dictionary and extract metadata
405
+ doc_dict = doc.dict()
406
+ metadata = doc_dict.pop('metadata')
407
 
408
+ # Add the page_content and type to the metadata
409
+ metadata['page_content'] = doc_dict['page_content']
410
+ metadata['type'] = doc_dict['type']
411
 
412
+ # Add to the list of rows
413
+ rows.append(metadata)
414
 
415
+ # Create a DataFrame from the list of rows
416
+ df = pd.DataFrame(rows)
417
+ return df
418
 
419
+ # Example usage
420
+ #documents = [
421
+ # Document(page_content="Example content 1", metadata={"author": "Author 1", "year": 2021}),
422
+ # Document(page_content="Example content 2", metadata={"author": "Author 2", "year": 2022})
423
+ #]
424
 
425
+ #df = document_to_dataframe(documents)
426
+ #df